
"In a Tuesday post on its Engineering blog, four Grab staffers explained that the company needs to accurately extract information from ID cards, driver's licenses, and registration certificates for compliance chores like know-your-customer checks. Grab tried Optical Character Recognition (OCR) systems, but its chosen tech "struggled with the variety of document templates it had to process." It's 2025, so the org investigated whether large language models could solve its problem."
""While powerful proprietary Large Language Models (LLMs) were an option, they often fell short in understanding [South East Asian] SEA languages, produced errors, hallucinations, and had high latency," the post reveals. "On the other hand, open-sourced Vision LLMs were more efficient but not accurate enough for production." The company decided building its own Vision LLM - a model that vectorizes images so a large language model can extract text - was its best option."
Grab operates a superapp offering ride-sharing, food delivery, shopping, and financial services across multiple Southeast Asian countries that use non-Latin scripts. The company must extract information from ID cards, driver's licenses, and registration certificates to meet compliance and know-your-customer requirements. Conventional OCR systems struggled with diverse document templates. Proprietary LLMs underperformed on Southeast Asian languages, produced errors and hallucinations, and exhibited high latency. Open-source vision LLMs were efficient but lacked production-grade accuracy. Grab chose to build its own Vision LLM and selected Alibaba Cloud's Qwen2-VL 2B for fine-tuning due to its small size and tokenizer support for Thai and Vietnamese.
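For a sense of what that choice looks like in practice, here is a minimal sketch of prompting the public Qwen/Qwen2-VL-2B-Instruct checkpoint for document field extraction via Hugging Face transformers. Grab's production model is a privately fine-tuned variant, so the model ID, prompt wording, and JSON field names below are illustrative assumptions, not the company's actual setup.

```python
# Minimal sketch: zero-shot field extraction with the public
# Qwen/Qwen2-VL-2B-Instruct model. Grab fine-tuned its own variant
# on SEA documents; the prompt and field names here are hypothetical.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("id_card.jpg")  # hypothetical input document

# One chat turn pairing the document image with an extraction prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": (
            "Extract the full name, ID number, and date of birth "
            "from this document. Reply with JSON only."
        )},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    text=[prompt], images=[image], return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the model's answer is decoded.
answer = processor.batch_decode(
    [out[len(ins):] for ins, out in zip(inputs.input_ids, generated)],
    skip_special_tokens=True,
)[0]
print(answer)
```

A 2B-parameter model is small enough to keep inference latency and serving cost down, which matters when the check runs on every onboarding flow; per the post, tokenizer coverage of Thai and Vietnamese was the other deciding factor.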
Read at The Register