Omnivision's architecture integrates three core components: the Qwen2.5-0.5B language model for text processing, the SigLIP-400M vision encoder for image encoding, and an MLP projection layer that maps visual embeddings into the language model's embedding space.
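The flow between the three components can be sketched as follows. This is a toy illustration, not the real implementation: the hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B) and the two-layer ReLU projector are assumptions, and random weights stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

VISION_DIM = 1152   # SigLIP-400M hidden size (assumed)
LM_DIM = 896        # Qwen2.5-0.5B hidden size (assumed)

def vision_encoder(image: np.ndarray, n_patches: int = 729) -> np.ndarray:
    """Stand-in for SigLIP: one embedding per image patch."""
    return rng.standard_normal((n_patches, VISION_DIM))

def mlp_projector(patches: np.ndarray) -> np.ndarray:
    """Hypothetical two-layer MLP mapping vision features into LM space."""
    w1 = rng.standard_normal((VISION_DIM, LM_DIM)) * 0.02
    w2 = rng.standard_normal((LM_DIM, LM_DIM)) * 0.02
    return np.maximum(patches @ w1, 0.0) @ w2  # ReLU between the layers

image = np.zeros((384, 384, 3))                    # dummy input image
text_embeds = rng.standard_normal((12, LM_DIM))    # 12 prompt tokens

# Visual tokens are projected, then concatenated with the text tokens
# to form the sequence the language model actually consumes.
visual_tokens = mlp_projector(vision_encoder(image))
lm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(lm_input.shape)  # 729 visual + 12 text tokens, each in LM space
```

The key design point is that only the small MLP bridges the two pretrained components, so the vision encoder and language model can each start from strong off-the-shelf checkpoints.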
By reducing the number of image tokens ninefold, from 729 to 81, Omnivision cuts latency and computational requirements, allowing it to generate image captions in under two seconds on a MacBook M4 Pro.
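A minimal sketch of that ninefold reduction, assuming it is achieved by reshaping groups of nine patch embeddings into one wider token before projection (the exact grouping and the dimensions below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
patch_embeds = rng.standard_normal((729, 1152))  # SigLIP patch tokens (dims assumed)

# Fold every 9 patch embeddings into a single, 9x-wider token:
grouped = patch_embeds.reshape(81, 9 * 1152)

# A (hypothetical) projector then maps each wide token into LM space,
# so the language model sees 81 visual tokens instead of 729:
w = rng.standard_normal((9 * 1152, 896)) * 0.02
visual_tokens = grouped @ w
print(visual_tokens.shape)  # 9x fewer tokens for the LM to attend over
```

Since self-attention cost grows with sequence length, shrinking the visual prefix from 729 to 81 tokens is where most of the latency saving comes from.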
The training pipeline consists of three stages: pretraining to build foundational vision-language alignment, supervised fine-tuning for instruction following and context interpretation, and Direct Preference Optimization (DPO) to enhance precision and minimize inaccuracies.
In the DPO stage, Omnivision is trained on high-quality preference data that contrasts preferred responses with rejected ones, mitigating hallucinations and improving the accuracy and reliability of its predictions.
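The standard DPO objective behind that stage can be written down in a few lines. This is a generic sketch of the published DPO loss, not Omnivision's training code; the log-probability values in the usage example are made-up numbers.

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probs under
    the trained policy and a frozen reference model."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): minimized when the policy favors the
    # chosen (non-hallucinated) response more strongly than the reference.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Illustrative values: the policy already prefers the chosen response
# more than the reference does, so the loss falls below log(2):
loss = dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-18.0)
print(loss)
```

Because the reference model anchors the comparison, the policy is rewarded for shifting probability toward preferred answers without drifting arbitrarily far from its supervised-fine-tuned starting point.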