
"Meta released details about its Generative Ads Model (GEM), a foundation model designed to improve ads recommendation across its platforms. The model addresses core challenges in recommendation systems (RecSys) by processing billions of daily user-ad interactions where meaningful signals such as clicks and conversions are very sparse. GEM tackles the complexity of learning from diverse ads data including advertiser goals, creative formats, measurement signals, and user behaviors across multiple delivery channels."
"The company built the system using three approaches: model scaling with advanced architecture, post-training techniques for knowledge transfer, and enhanced training infrastructure that uses thousands of GPUs with advanced parallelism to support the computational demands of large-scale foundation model training. Meta re-engineered its training stack to support GEM at a scale comparable to modern large language models. The company employs multi-dimensional parallelism strategies tailored to different model components."
"Dense model parts use Hybrid Sharded Distributed Parallel (HSDP) to optimize memory usage and reduce communication costs across thousands of GPUs. Sparse components, primarily large embedding tables for user and item features, use a two-dimensional approach combining data parallelism and model parallelism. Meta implemented several GPU-level optimizations to reduce training bottlenecks. These include a custom in-house GPU kernel designed for variable-length user sequences, graph-level compilation in PyTorch 2.0 that automates activation checkpointing and operator fusion, and memory compression techniques such as FP8 quantization for activations."
GEM is a foundation model built to improve ads recommendation by learning from billions of sparse user-ad interactions across multiple delivery channels. The model ingests diverse ads data, including advertiser goals, creative formats, measurement signals, and user behaviors. Development proceeded along three fronts: scaling the model architecture, applying post-training techniques for knowledge transfer, and enhancing training infrastructure to run on thousands of GPUs with advanced parallelism. Dense layers use Hybrid Sharded Distributed Parallel (HSDP) to save memory and cut communication costs, while sparse embedding tables combine data and model parallelism. GPU-level optimizations include a custom kernel for variable-length user sequences, PyTorch 2.0 graph compilation, FP8 activation quantization, and NCCLX communication collectives to avoid compute-communication contention.
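Two of the GPU-level optimizations named above can be approximated with stock PyTorch 2.x APIs: graph compilation via torch.compile (which enables operator fusion and graph-level recomputation decisions) and an FP8 cast as a crude stand-in for activation memory compression. The sketch below is illustrative only; `TinyBlock`, the shapes, and the unscaled FP8 round-trip are assumptions, and Meta's custom kernels and NCCLX collectives are not publicly available as code.

```python
# Minimal sketch, assuming PyTorch >= 2.1: torch.compile for graph-level
# fusion, plus an FP8 (e4m3) cast illustrating activation memory compression.
import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Hypothetical slice of the dense network."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))


device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyBlock().to(device)

# Graph-level compilation: TorchInductor captures the module as a graph and
# fuses elementwise ops into the surrounding matmuls.
compiled = torch.compile(model)

with torch.no_grad():
    x = torch.randn(32, 512, device=device)
    y = compiled(x)

    # Activation compression idea: keep a saved intermediate in FP8 and
    # upcast only when it is needed again (production systems also apply
    # per-tensor scaling factors before casting).
    act = torch.relu(model.fc1(x))
    act_fp8 = act.to(torch.float8_e4m3fn)   # roughly half the size of fp16/bf16
    restored = act_fp8.to(act.dtype)
```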
#generative-ads-model #recommendation-systems #large-scale-distributed-training #gpu-optimizations #embeddings
Read at InfoQ