Rapid advances in Large Language Models (LLMs) have propelled the development of LLM-based vision-language models with strong question-answering and visual-comprehension capabilities.
A common limitation of open-source Vision-Language Models (VLMs) is their substantial computational demand: their language backbones typically range from 7B to 65B parameters, posing significant deployment challenges.
Google's Gemini family includes compact vision-language models such as Gemini Nano, with 1.8B/3.25B parameters, tailored for smartphones, though neither the models nor their training data are open-sourced.
Our paper explores and demonstrates the effectiveness of pairing vision-language models with smaller, open-source language models, assessing their potential and efficiency across various applications.
#vision-language-models #large-language-models #computational-challenges #mobile-applications #open-source