New IBM Granite 4 Models to Reduce AI Costs with Inference-Efficient Hybrid Mamba-2 Architecture
Briefly

"IBM attributes those improved characteristics vs. larger models to its hybrid architecture that combines a small amount of standard transformer-style attention layers with a majority of Mamba layers, more specifically Mamba-2. With 9 Mamba blocks per 1 Transformer block, Granite gets linear scaling vs. context length for the Mamba parts (vs. quadratic scaling in transformers), plus local contextual dependencies from transformer attention (important for in-context learning or few-shot prompting)."
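The scaling claim in the quote above can be illustrated with a back-of-the-envelope sketch. This is not IBM's code; the cost formulas are the standard asymptotic estimates for full self-attention versus a fixed-state recurrent (Mamba-style) block, with illustrative dimension sizes:

```python
# Rough per-sequence cost comparison (illustrative, not Granite's implementation):
# full self-attention scores every token against every other token, so its cost
# grows quadratically with context length n; a Mamba-2 state-space block updates
# a fixed-size state once per token, so its cost grows only linearly.

def attention_cost(n: int, d: int = 64) -> int:
    """Approximate operation count for full self-attention: O(n^2 * d)."""
    return n * n * d

def mamba_cost(n: int, d_state: int = 64) -> int:
    """Approximate operation count for a state-space scan: O(n * d_state)."""
    return n * d_state

for n in (1_000, 10_000, 100_000):
    ratio = attention_cost(n) / mamba_cost(n)
    print(f"context length {n:>7}: attention is ~{ratio:,.0f}x costlier")
```

With matching head/state dimensions the ratio is simply the context length itself, which is why the Mamba-heavy 9:1 layer mix keeps long-context memory and compute growth close to linear.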
"Additionally, Granite being a mixture-of-experts system, only a subset of the weights is used in any forward pass. This also contributes to keeping inference cost lower. Granite ships three model variants with the hybrid architecture, conveniently called Micro, Tiny, and Small, to cater to different use cases. At one end, Micro (3B parameters) addresses high-volume, low-complexity…"
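The mixture-of-experts point, that only a subset of weights participates in each forward pass, can be sketched as a top-k router. All names and sizes here are hypothetical for illustration; this is not Granite's actual routing code:

```python
# Minimal sketch of mixture-of-experts routing (illustrative names and sizes):
# a router scores all experts per token and activates only the top-k of them,
# so only a fraction of the total expert weights is used in a forward pass.

NUM_EXPERTS = 64        # hypothetical expert count
TOP_K = 8               # hypothetical number of active experts per token
PARAMS_PER_EXPERT = 50_000_000  # hypothetical weights per expert

def route(token_scores: list[float], k: int = TOP_K) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(token_scores)), key=lambda i: -token_scores[i])[:k]

# Example: pretend router scores for one token.
scores = [(i * 37 % 64) / 64 for i in range(NUM_EXPERTS)]
active = route(scores)

active_params = len(active) * PARAMS_PER_EXPERT
total_params = NUM_EXPERTS * PARAMS_PER_EXPERT
print(f"active experts per token: {len(active)} of {NUM_EXPERTS}")
print(f"fraction of expert weights used: {active_params / total_params:.1%}")
```

With 8 of 64 experts active, only 12.5% of the expert weights run per token, which is one of the levers behind the lower inference cost the article describes.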
"IBM recently announced the Granite 4.0 family of small language models. The model family aims to deliver faster speeds and significantly lower operational costs at acceptable accuracy vs. larger models. Granite 4.0 features a new hybrid Mamba/transformer architecture that largely reduces memory requirements, enabling Granite to run on significantly cheaper GPUs and at significantly reduced costs."
Granite 4.0 is a family of small language models optimized for faster inference and much lower operational cost while preserving acceptable accuracy compared to larger models. The models use a hybrid Mamba/transformer architecture that substantially reduces memory requirements, enabling deployment on cheaper GPUs and reducing expenses. The hybrid approach mixes a few transformer-style attention layers with predominantly Mamba-2 blocks, producing linear scaling for most operations with respect to context length while retaining local attention for in-context learning. Granite also uses a mixture-of-experts design so only subsets of weights run per forward pass. Three variants address differing throughput and complexity needs.
Read at InfoQ