The phi-3-mini model is a transformer decoder architecture with a default context length of 4K, extended to 128K in its long-context variant, phi-3-mini-128K. For compatibility with open-source projects, it is built upon a block structure similar to that of Llama-2.
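As a rough illustration of this kind of block structure, the sketch below shows a generic pre-norm decoder block in PyTorch. The layer sizes, the use of nn.MultiheadAttention, and the plain GELU MLP are simplifications for readability, not phi-3-mini's actual configuration (Llama-2 blocks, for instance, use rotary position embeddings and a gated MLP, both omitted here for brevity).

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm, as used in Llama-style blocks."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class DecoderBlock(nn.Module):
    """Pre-norm decoder block: self-attention and MLP, each wrapped in a
    residual connection. Sizes below are illustrative only."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = RMSNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, attn_mask=None):
        # x: (batch, seq, d_model)
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + h
        return x + self.mlp(self.mlp_norm(x))
```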
The phi-3-small model, with 7 billion parameters, uses the tiktoken tokenizer for better multilingual tokenization and follows a standard decoder architecture with 32 heads and a hidden size of 4096.
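For reference, the snippet below loads a standard tiktoken encoding and tokenizes mixed-language text; the cl100k_base encoding is only a stand-in for illustration, not necessarily the vocabulary phi-3-small ships with.

```python
import tiktoken

# Illustrative only: cl100k_base is one of tiktoken's standard encodings,
# used here as a stand-in for phi-3-small's actual vocabulary.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("多语言 multilingual tokenization テスト")
print(len(tokens), enc.decode(tokens))
```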
In phi-3-small, we switched from GELU activation to GEGLU for improved performance and training stability, and tuned hyperparameters using Maximal Update Parametrization (muP).
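A GEGLU feed-forward block replaces the single GELU non-linearity with a gated product, GELU(x W_g) ⊙ (x W_v), followed by a down-projection. A minimal PyTorch sketch, with illustrative dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLUFeedForward(nn.Module):
    """Gated-GELU feed-forward: GELU(x W_g) * (x W_v), then a down-projection.
    d_model and d_ff are illustrative, not phi-3-small's actual sizes."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))
```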
A novel blocksparse attention module improves training and inference speed by applying a different sparsity pattern over the KV cache for each attention head.
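The exact sparsity patterns are not specified in this passage, so the sketch below only emulates the idea with dense masking: each head keeps a few recent KV blocks plus a head-dependent strided selection of older blocks. The block size, stride schedule, and function name are assumptions for illustration, not phi-3-small's implementation, which uses an optimized kernel rather than a dense mask.

```python
import torch
import torch.nn.functional as F

def blocksparse_attention(q, k, v, block_size=64, local_blocks=2):
    """Dense emulation of per-head blocksparse attention.

    q, k, v: (n_heads, seq_len, head_dim). Each head keeps the most recent
    `local_blocks` KV blocks plus a head-dependent stride of older blocks;
    the pattern is illustrative, not the one used in phi-3-small.
    """
    n_heads, seq_len, head_dim = q.shape
    pos = torch.arange(seq_len)
    blk = pos // block_size                              # block index of each position
    causal = pos.unsqueeze(1) >= pos.unsqueeze(0)        # token-level causal mask
    mask = torch.zeros(n_heads, seq_len, seq_len, dtype=torch.bool)
    for h in range(n_heads):
        stride = 2 + h % 4                               # sparsity pattern varies by head
        diff = blk.unsqueeze(1) - blk.unsqueeze(0)       # query block minus key block
        keep = (diff < local_blocks) | (diff % stride == 0)
        mask[h] = keep & causal
    scores = (q @ k.transpose(-1, -2)) / head_dim ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: 8 heads over a 256-token sequence.
q = k = v = torch.randn(8, 256, 64)
out = blocksparse_attention(q, k, v)                     # (8, 256, 64)
```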