Mistral AI has released a suite of open-source models under the Mistral 3 banner, spanning everything from models that can run on a mobile device or drone up to multi-GPU datacenter beasts. While the French company does not share its training data, the decision to open source the models under the Apache 2.0 license is notable. "Open sourcing our models is about empowering the developer community and really putting AI in people's hands, allowing them to own their AI future," Mistral said.
IBM attributes those improved characteristics versus larger models to its hybrid architecture, which combines a small number of standard transformer-style attention layers with a majority of Mamba layers, specifically Mamba-2. With nine Mamba blocks for every one transformer block, Granite gets linear scaling with context length for the Mamba portions (versus quadratic scaling in transformers), plus the local contextual dependencies of transformer attention (important for in-context learning and few-shot prompting).
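To make the 9:1 interleaving concrete, here is a minimal sketch of how such a hybrid stack could be assembled. This is not IBM's Granite code; `MambaBlock` is a stand-in for a real Mamba-2 state-space block (whose cost grows linearly with sequence length), and the layer sizes are arbitrary.

```python
# Illustrative sketch of a hybrid stack: 9 Mamba-style blocks per attention block.
# MambaBlock and AttentionBlock are placeholders, not the real Granite/Mamba-2 modules.
import torch
import torch.nn as nn


class MambaBlock(nn.Module):
    """Stand-in for a Mamba-2 block; a real one runs a selective state-space scan
    whose cost scales linearly with sequence length."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        return x + self.proj(x)                # per-token update keeps the sketch runnable


class AttentionBlock(nn.Module):
    """Standard self-attention; cost scales quadratically with sequence length."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return x + out


class HybridStack(nn.Module):
    """Interleave blocks at the 9-Mamba : 1-attention ratio described above."""
    def __init__(self, d_model: int, n_groups: int, mamba_per_attn: int = 9):
        super().__init__()
        layers = []
        for _ in range(n_groups):
            layers += [MambaBlock(d_model) for _ in range(mamba_per_attn)]
            layers.append(AttentionBlock(d_model))
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


x = torch.randn(2, 128, 512)                   # (batch, seq_len, d_model)
model = HybridStack(d_model=512, n_groups=4)
print(model(x).shape)                          # torch.Size([2, 128, 512])
```

Because only one block in ten pays the quadratic attention cost, most of the stack's compute and memory grows linearly as the context window gets longer, which is the trade-off the hybrid design is after.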
According to Microsoft, MAI-1-preview is an in-house mixture-of-experts model that was pre-trained and post-trained on 15,000 Nvidia H100 GPUs, a more modest infrastructure than the 100,000-H100 cluster sizes reportedly used for model development by some rivals. However, with an eye to ramping up performance, Microsoft AI is now running MAI-1-preview on Nvidia's more powerful GB200 cluster, the company said.
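Microsoft has not published MAI-1-preview's architectural details, but the mixture-of-experts idea it names is straightforward: a router sends each token to only a few "expert" sub-networks, so the model can hold many parameters while activating only a fraction of them per token. The sketch below shows generic top-k routing; the expert count, k, and layer sizes are arbitrary illustrations, not MAI-1-preview's actual configuration.

```python
# Generic top-k mixture-of-experts routing sketch (not Microsoft's MAI-1-preview design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)       # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = self.router(x)                            # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the k best experts per token
        weights = F.softmax(weights, dim=-1)               # normalize the kept scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out


layer = MoELayer(d_model=256)
tokens = torch.randn(16, 256)
print(layer(tokens).shape)                                 # torch.Size([16, 256])
```

The practical appeal is efficiency: with 8 experts and k=2, each token touches only a quarter of the expert parameters, which is part of why MoE models can be trained on smaller GPU fleets than dense models of comparable total size.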