🎯 Goal: see how well LLMs understand these languages.
👉 Participate here (in Italian 🇮🇹):
- Piedmontese: quest.ms.mff.cuni.cz/crowd-transl...
- Neapolitan: quest.ms.mff.cuni.cz/crowd-transl...
Anyone can join, no need to be fluent!
We reviewed 31 models and 21 benchmarks. There's a tension between language neutrality (same results across languages) and cultural awareness (context matters differently across cultures).
Find out what your model knows about Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦!
👉 Textual and visual questions, answers, and human judgment on model outputs!
huggingface.co/datasets/ufa...
www.arxiv.org/abs/2507.22752
by @jlibovicky.bsky.social, @jindrahelcl.bsky.social, @andrei-a-manea.bsky.social
Questions that foreigners don't know the answer to + human judgment on question generation
Thanks to Felix Friedrich, @kathaem.bsky.social and all co-authors - it was fun to work on this together!
aclanthology.org/2025.acl-lon...
by Felix Friedrich, @kathaem.bsky.social, Patrick Schramowski, @mbrackaiml.bsky.social, @jlibovicky.bsky.social, @kerstingaiml.bsky.social, Alex Fraser
TokShop explores tokenization across all data modalities. Topics include: subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.
tokenization-workshop.github.io
We're thrilled to announce the first-ever Tokenization Workshop (TokShop) at #ICML2025 @icmlconf.bsky.social! 🎉
Submissions are open for work on tokenization across all areas of machine learning.
📅 Submission deadline: May 30, 2025
🔗 tokenization-workshop.github.io
Read it here: arxiv.org/abs/2502.06468
👉 @jlibovicky.bsky.social proposes better tokenization for #LLMs and machine translation
👉 Veronika Kolářová will study syntactic features of Czech non-verbal predicates
➕ Dominik Macháček receives Postdoc Individual Fellowship! 💪