Jindřich Libovický
jlibovicky.bsky.social
Jindřich Libovický
@jlibovicky.bsky.social
Researcher at Charles University | multilingual natural language processing, machine translation
Attenzione! 🇮🇹 Know Piedmontese or Neapolitan speakers? @gianlucavico.bsky.social is collecting crowd-sourced translations to evaluate LLM performance on these regional languages. Partecipate!
We’re collecting crowd-sourced translations in Piedmontese and Neapolitan.
🎯 Goal: see how well LLMs understand these languages.
👉 Participate here (in IT🇮🇹):
- Piedmontese: quest.ms.mff.cuni.cz/crowd-transl...
- Neapolitan: quest.ms.mff.cuni.cz/crowd-transl...
Anyone can join, no need to be fluent!
Welcome to CrowdTranslation
quest.ms.mff.cuni.cz
November 10, 2025 at 2:36 PM
With @andrei-a-manea.bsky.social, we posted a survey on multilingual vision-language models 👉 arxiv.org/pdf/2509.22123
We reviewed 31 models+21 benchmarks. There's a tension between language neutrality (same results across languages) & cultural awareness (context matters differently across cultures)
arxiv.org
October 21, 2025 at 1:30 PM
So proud of my PhD student @andrei-a-manea.bsky.social for his first first-author publication! 🎉 He presented this work last week at TSD. Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders arxiv.org/pdf/2504.21681
September 1, 2025 at 3:38 PM
🧵 We're releasing CUS-QA - a new benchmark for testing LLMs on regional knowledge!
Find out what your model knows about Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦!
👉 Textual and visual questions, answers, and human judgment on model outputs!
huggingface.co/datasets/ufa...
www.arxiv.org/abs/2507.22752
ufal/cus-qa · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
August 25, 2025 at 8:06 AM
Stay tuned, we will release the dataset soon...
CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset arxiv.org/abs/2507.22752
by @jlibovicky.bsky.social , ‪@jindrahelcl.bsky.social, @andrei-a-manea.bsky.social
Question that foreigners don't know the answer to + human judgment on question generation
August 1, 2025 at 4:49 PM
Reposted by Jindřich Libovický
We need to have poster fights at the end of every conference.
July 29, 2025 at 7:01 PM
Just presented MAGBIG, a new dataset and evaluation methodology for gender bias in multilingual text-to-image generation. Grammatical gender matters when studying these biases across languages!
Thanks to Felix Friedrich, @kathaem.bsky.social and all co-authors - it was fun to work on this together!
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes
aclanthology.org/2025.acl-lon...
by Felix Friedrich, @kathaem.bsky.social, Patrick Schramowski, @mbrackaiml.bsky.social , @jlibovicky.bsky.social, @kerstingaiml.bsky.social, Alex Fraser
July 28, 2025 at 1:14 PM
This week I am at #ACL2025NLP in Vienna 🎡🇦🇹. Find me 🕵️ or message 💌 me if you want to chat about multilinguality or tokenization. Stop 🛑 by our poster on gender bias in text-to-image generation on Monday aclanthology.org/2025.acl-lon...
July 27, 2025 at 7:24 AM
Reposted by Jindřich Libovický
TokShop @ #ICML2025 got way more submissions than expected! 📈 We could really use a few more reviewers to help out. If you have the capacity to review a #tokenization paper by Saturday, please fill out this form: forms.gle/32A6sQHQrMSb... 🙏
TokShop 2025
Registering interest in all things tokenization at TokShop @ ICML 2025 (July 18) Consider joining the Google group for future updates! https://groups.google.com/g/tokshop
forms.gle
June 2, 2025 at 4:40 PM
Reposted by Jindřich Libovický
📣 Call for Paper Alert: TokShop @ ICML 2025
TokShop explores tokenization across all data modalities. Topics include: subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.
ICML 2025 Workshop TokShop
Welcome to the OpenReview homepage for ICML 2025 Workshop TokShop
openreview.net
May 14, 2025 at 1:31 PM
Reposted by Jindřich Libovický
Got a tokenization paper that just didn't make the cut for ICML? Submit it to the Tokenization Workshop TokShop at #ICML2025 -- we'd love to see it there!
tokenization-workshop.github.io
Tokenization Workshop @ ICML 2025
tokenization-workshop.github.io
May 4, 2025 at 7:27 PM
Attending #NAACL2025 virtually. Since 2022, I've been training a classifier on papers I read to tackle the arXiv madness. Ran it on the NAACL proceedings for my personalized watch list. 🤓📺 However, it's far from perfect: Multilingual cultural awareness is great, but where is tokenization? 🤷
April 30, 2025 at 12:50 PM
We're organizing ✨Tokenization Workhop✨ TokShop❗ Join us at @icmlconf.bsky.social in July in Vancouver 🇨🇦. Follow @tokshop.bsky.social for updates! Submit your paper by May 30.
🚨 NEW WORKSHOP ALERT 🚨

We're thrilled to announce the first-ever Tokenization Workshop (TokShop) at #ICML2025 @icmlconf.bsky.social! 🎉

Submissions are open for work on tokenization across all areas of machine learning.

📅 Submission deadline: May 30, 2025
🔗 tokenization-workshop.github.io
Tokenization Workshop @ ICML 2025
tokenization-workshop.github.io
April 15, 2025 at 5:37 PM
Random take on the #TuringTest: Rather than testing machine intelligence, it can be a measure of societal awareness about #AI capabilities. The real objective isn't creating a machine that passes but educating people to think critically and avoid being deceived, so the machines do not pass the test.
April 4, 2025 at 7:37 PM
Summaries of pre-prints that I noticed and liked on arXiv in March are now on my blog jlibovicky.github.io//2025/04/02/...
Highlights from Machine Translation and Multilinguality in March 2025
EuroBERT: Scaling Multilingual Encoders for European Languages
jlibovicky.github.io
April 2, 2025 at 9:06 PM
Our paper 'Beyond Literal Token Overlap: Token Alignability for Multilinguality' will be at #NAACL2025! We show that token alignability is a stronger predictor of cross-lingual transfer than literal token overlap.

Read it here: arxiv.org/abs/2502.06468
March 10, 2025 at 3:48 PM
Short notes about what pre-prints I noticed in December and January are now on my blog: jlibovicky.github.io/2025/02/07/M...
Highlights from Machine Translation and Multilinguality in December 2024 and January 2025
MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
jlibovicky.github.io
February 7, 2025 at 7:29 PM
Join Mu-SHROOM 🍄, a SemEval 2025 shared task on detecting hallucination spans in multilingual LLM outputs! 🌍 Includes Czech with regional Czech questions 🇨🇿. Do you think you can spot when something isn’t true? 🤔 Try it out! 👉 helsinki-nlp.github.io/shroom #SemEval2025 #NLP
Welcome to SemEval-2025 Task-3 — Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
helsinki-nlp.github.io
January 14, 2025 at 3:56 PM
Happy holidays! 🎄🎅🤩🎁
December 24, 2024 at 1:36 PM
Highlights from multilingual #NLP and machine translation papers I found on arXiv in November are now on my blog: jlibovicky.github.io/2024/12/05/M...
Highlights from Machine Translation and Multilinguality in November 2024
Mitigating Metric Bias in Minimum Bayes Risk Decoding
jlibovicky.github.io
December 6, 2024 at 5:07 PM
This is going to be fun! 🤓 We have three years to spend 6.5M CZK on improving multilingual tokenization. The goal is to make subwords more alignable across languages and help languages that suffer from over-segmentation with current models.
Good news! 🥳 GAČR will fund two of our projects:
👉 @jlibovicky.bsky.social proposes to better tokenization for #LLMs and machine translation
👉 Veronika Kolářová will study syntactic features of Czech non-verbal predicates
➕ Dominik Macháček receives Postdoc Individual Fellowship! 💪
December 3, 2024 at 5:53 PM
Just shared my takeaways from #EMNLP2024 on my blog: jlibovicky.github.io//2024/11/21/...
Notes from EMNLP 2024
Last week, I was at EMNLP in Miami, and here are a few notes about what I saw at the conference.
jlibovicky.github.io
November 21, 2024 at 11:36 AM
Reposted by Jindřich Libovický
Hello Blue Sky! 👋 This is the official account of the Institute of Formal and Applied Linguistics (ÚFAL for short) at the Faculty of Mathematics and Physics, Charles University in Prague 🇨🇿. Here, we will share news from the life of the institute and our members.
November 19, 2024 at 7:09 PM
Hello, Blue Sky 👋 🦋 Looking forward to reading and sometimes also posting about #NLP and related stuff! 🤓
November 19, 2024 at 5:35 PM