⚙️ Trained on 100B tokens from the HPLT v2 dataset
🌍 Covers EU languages + others
⚙️ Based on LLaMA, trained on #LUMI
📈 Useful for evaluation
Downloads + more info at openeurollm.eu/blog/hplt-oe...
Part of the [**MaLA Corpus**](huggingface.co/collections/...), this dataset from [OPUS](opus.nlpl.eu) (cutoff Oct 2024) features **16,829 language pairs** with deduplication, normalization, and noise filtering.
- Ayodele Awokoya
- Wilker Aziz
- Marta Costa-Jussa
- Barry Haddow
- Amit Moryossef
- Sara Papi
- Jörg Tiedemann
- Marco Turchi
See here: www.nodalida-bhlt2025.eu/proceedings
See you soon in Tallinn!
#NLP #NLProc #nodalida #baltichlt