
FineTranslations was created by translating non-English content from the FineWeb2 corpus into English using Gemma3 27B, with the full data generation pipeline designed to be reproducible and publicly documented. The dataset is primarily intended to improve machine translation, particularly in the English→X direction, where performance remains weaker for many lower-resource languages. By starting from text originally written in non-English languages and translating it into English, FineTranslations provides large-scale parallel data suitable for fine-tuning existing translation models.

Beyond translation, Hugging Face reports that the resulting English corpus retains substantial cultural and contextual information from the source languages. In internal experiments, models trained on the translated English text achieved performance comparable to those trained on the original FineWeb dataset, suggesting that FineTranslations can also serve as a high-quality supplement for English-only model pretraining. The dataset is sourced from FineWeb2, which aggregates multilingual web content from CommonCrawl snapshots collected between 2013 and 2024.
In total, FineTranslations is a large-scale multilingual parallel dataset with more than one trillion tokens across English and over 500 languages. Language subsets from FineWeb2 were filtered to those with a bible_wiki_ratio below 0.5, and up to 50 billion tokens were processed per language. Quality classifiers from FineWeb2-HQ were applied where available; otherwise documents were sampled randomly. Translation was performed at scale using the datatrove framework.
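The per-language selection step described above can be sketched roughly as follows. The thresholds (bible_wiki_ratio below 0.5, a 50-billion-token cap per language) come from the article, but all function and field names here are illustrative assumptions, not the actual datatrove pipeline API:

```python
import random

# Assumption: the article's 50B-token cap per language, expressed as a constant.
TOKEN_CAP = 50_000_000_000

def eligible(language_stats):
    """A language subset qualifies when its bible/wiki share is below 0.5.

    `language_stats` is a hypothetical dict with a "bible_wiki_ratio" field.
    """
    return language_stats["bible_wiki_ratio"] < 0.5

def select_documents(docs, quality_scores=None, token_cap=TOKEN_CAP, seed=0):
    """Pick documents for one language until the token cap is reached.

    When quality scores (e.g. from a FineWeb2-HQ classifier) are available,
    the highest-scoring documents are taken first; otherwise documents are
    sampled in a random order, mirroring the fallback the article describes.
    """
    if quality_scores is not None:
        order = sorted(range(len(docs)),
                       key=lambda i: quality_scores[i], reverse=True)
    else:
        rng = random.Random(seed)
        order = list(range(len(docs)))
        rng.shuffle(order)

    selected, total = [], 0
    for i in order:
        if total + docs[i]["tokens"] > token_cap:
            break
        selected.append(docs[i])
        total += docs[i]["tokens"]
    return selected
```

This is only a sketch of the selection logic under stated assumptions; the real pipeline runs these steps as distributed datatrove jobs over FineWeb2 shards.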
Read at InfoQ