The article discusses a novel multi-modal retrieval system that uses large language models (LLMs) to match speech and text across 102 languages. Unlike previous systems that rely on speech data during pre-training, this approach exploits the LLMs' inherent text-understanding capabilities, achieving significant gains on retrieval metrics such as Recall@1. The method generalizes remarkably well even to languages absent from initial training, indicating its potential for cross-lingual applications and for extending multilingual coverage with the assistance of machine translation.
The proposed multi-modal retrieval system leverages large language models to match speech and text in 102 languages, enhancing performance without needing paired speech data during pre-training.
Our approach yields a 10% absolute improvement in Recall@1, demonstrating the model's ability to generalize to languages unseen during training.
Integrating multilingual text understanding into our model enables effective cross-lingual matching, demonstrating its robustness and the value of machine translation as an augmenting tool.
Traditional methods typically rely on speech data directly, whereas our system capitalizes on capabilities the LLMs have already acquired, facilitating broader accessibility and integration.
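To make the Recall@1 metric concrete: in speech-to-text retrieval, each speech query is scored against every candidate text, and Recall@1 is the fraction of queries whose top-ranked text is the correct paired one. The sketch below illustrates this with toy NumPy embeddings; the shared embedding space, the noise model, and the function name `recall_at_1` are illustrative assumptions, not the paper's actual pipeline or encoders.

```python
import numpy as np

def recall_at_1(speech_emb, text_emb):
    """Fraction of speech queries whose nearest text embedding
    (by cosine similarity) is the correct index-aligned pair."""
    # L2-normalize rows so a dot product equals cosine similarity
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = s @ t.T                 # (n_speech, n_text) similarity matrix
    best = sims.argmax(axis=1)     # top-ranked text index per speech query
    return float((best == np.arange(len(s))).mean())

# Toy data: texts are paired index-for-index with speech utterances;
# speech embeddings are noisy copies of the text embeddings, mimicking
# a well-aligned shared cross-modal space.
rng = np.random.default_rng(0)
text = rng.normal(size=(100, 64))
speech = text + 0.1 * rng.normal(size=text.shape)
print(recall_at_1(speech, text))
```

With low noise the score approaches 1.0; as the modalities drift apart, it drops, which is what the reported 10% absolute improvement is measured against.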
#large-language-models #multi-modal-retrieval #cross-lingual-matching #speech-and-text-recognition #machine-translation