Meta has unveiled the Llama 4 family of AI models, introducing Llama 4 Scout and Llama 4 Maverick. These models represent a significant leap forward in open-source generative AI, combining multimodality and a Mixture of Experts (MoE) architecture for enhanced performance and efficiency.
Llama 4 Scout is a multimodal model with 17 billion active parameters and 16 experts. It is designed to be efficient, fitting on a single NVIDIA H100 GPU, and offers an industry-leading context window of 10 million tokens. This large context window allows Scout to perform tasks such as multi-document summarization, parsing user activity for personalized tasks, and reasoning over vast codebases. Meta claims that Llama 4 Scout outperforms models such as Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a range of benchmarks. It is pre-trained and post-trained with a 256K context length, which gives it strong length-generalization capability. A key architectural innovation is the use of interleaved attention layers without positional embeddings, combined with inference-time temperature scaling of attention to improve length generalization. This model is available on Workers AI.
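Meta has not published the exact formulation of its attention temperature scaling, but the general idea can be sketched: as the context grows far beyond the training length, softmax attention tends to flatten, so the attention logits are sharpened by a length-dependent factor at inference time. The sketch below is illustrative only; the logarithmic form, the `ref_len` of 8192, and the `beta` coefficient are assumptions, not Meta's actual parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_temperature(q, k, v, seq_len, ref_len=8192, beta=0.1):
    """Scaled dot-product attention with a length-dependent temperature.

    Logits are sharpened as seq_len grows beyond ref_len, counteracting
    the flattening of softmax over very long contexts. Illustrative
    sketch only: beta, ref_len, and the log form are assumptions.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)            # standard attention logits
    temp = 1.0 + beta * np.log(max(seq_len / ref_len, 1.0))
    return softmax(scores * temp) @ v        # sharpened softmax, then values

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = rng.normal(size=(3, n, d))
out = attention_with_temperature(q, k, v, seq_len=1_000_000)
print(out.shape)  # (16, 8)
```

At the training length the temperature is 1.0 and this reduces to ordinary scaled dot-product attention; the scaling only kicks in for longer sequences.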
Llama 4 Maverick, which also has 17 billion active parameters, uses a larger set of 128 experts in its MoE architecture. The model is designed for a best-in-class performance-to-cost ratio and excels at image and text understanding across 12 languages. Meta positions Maverick as a workhorse for general assistant and chat applications, highlighting its precise image understanding and creative writing. According to Meta, Maverick beats GPT-4o and Gemini 2.0 Flash across several benchmarks and achieves results comparable to DeepSeek v3 on reasoning and coding while using fewer than half the active parameters. An experimental chat version of Llama 4 Maverick achieves an Elo score of 1417 on LMArena.
Both Llama 4 Scout and Llama 4 Maverick are the first open-weight, natively multimodal models built using a Mixture of Experts (MoE) architecture. In MoE models, only a fraction of the total parameters are activated for each token, making them more compute-efficient for both training and inference. Llama 4 Maverick, for example, has 17 billion active parameters out of 400 billion total. Its MoE layers use 128 routed experts plus a shared expert.
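The compute saving comes from routing: for each token, a small router picks a routed expert (plus the always-on shared expert), so only a thin slice of the total parameters is multiplied through per token — which is how Maverick activates 17B of its 400B parameters. A toy sketch of top-1 routing with a shared expert follows; the shapes, the softmax gating, and top-1 selection are illustrative assumptions, not Meta's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 64, 128                       # toy hidden size; 128 routed experts

# One tiny linear "expert" per slot, plus a shared expert every token uses.
routed = rng.normal(size=(n_experts, d, d)) * 0.02
shared = rng.normal(size=(d, d)) * 0.02
router = rng.normal(size=(d, n_experts)) * 0.02   # router projection

def moe_layer(x):
    """Top-1 MoE layer: each token runs the shared expert plus its single
    highest-scoring routed expert, so only ~2 of 129 experts' parameters
    are touched per token (toy sketch)."""
    logits = x @ router                      # (tokens, n_experts) router scores
    choice = logits.argmax(axis=-1)          # top-1 routed expert per token
    gate = np.exp(logits - logits.max(-1, keepdims=True))
    gate = gate / gate.sum(-1, keepdims=True)    # softmax gate weights
    out = x @ shared                         # shared expert, always active
    for t, e in enumerate(choice):
        out[t] += gate[t, e] * (x[t] @ routed[e])    # add chosen expert's output
    return out

tokens = rng.normal(size=(4, d))
y = moe_layer(tokens)
print(y.shape)  # (4, 64)
```

Note that all 129 expert weight matrices still exist in memory (the "total" parameter count); routing only reduces the compute and activation cost per token, not the storage footprint.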
Meta has also previewed Llama 4 Behemoth, a larger model with 288 billion active parameters that is still in training. The company claims that Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks like MATH-500 and GPQA Diamond. This model is intended to serve as a teacher model for the other Llama 4 models. CEO Mark Zuckerberg has also mentioned that there will be a Llama 4 Reasoning model coming in the next month.
The Llama 4 models are available on various platforms, including Meta AI (for WhatsApp, Messenger, and Instagram Direct), the Llama website, and Hugging Face. Amazon Web Services (AWS) has announced the availability of Llama 4 Scout and Llama 4 Maverick on Amazon SageMaker JumpStart, with availability as fully managed, serverless models in Amazon Bedrock coming soon. NVIDIA has optimized both Llama 4 Scout and Llama 4 Maverick for NVIDIA TensorRT-LLM and will package the Llama 4 models as NVIDIA NIM microservices for easy deployment on any GPU-accelerated infrastructure.
The models were trained on diverse datasets, including text, images, and videos, using techniques like MetaP and FP8 precision to boost quality and efficiency. They support over 200 languages and are compatible with platforms like WhatsApp, Messenger, and Instagram Direct.