Niyati Bafna
@niyatibafna.bsky.social
PhD student @jhuclsp. Previously @AIatMeta, @InriaParisNLP, @EM_LCT | #NLProc
Accepted at ACL main! Come chat about dialectal MT at our poster today at 4 pm.
Also, check out this largely bug-free package for generating your own synthetic dialectal data:
pypi.org/project/dial...
Dialects lie on continua of (structured) linguistic variation, right? And we can’t collect data for every point on the continuum...🤔
📢 Check out DialUp, a technique to make your MT model robust to the dialect continua of its training languages, including unseen dialects.
arxiv.org/abs/2501.16581
July 29, 2025 at 12:14 PM
Reposted by Niyati Bafna
You have a budget to human-evaluate 100 inputs to your models, but your dataset is 10,000 inputs. Do not just pick 100 randomly!🙅

We can do better. "How to Select Datapoints for Efficient Human Evaluation of NLG Models?" shows how.🕵️
(random is still a devilishly good baseline)
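To make the budget problem concrete: with 10,000 inputs and a budget of 100, uniform random sampling is the baseline, and one generic alternative is to stratify picks across an automatic metric score so the human-evaluated subset covers easy and hard inputs alike. A minimal sketch (this stratification idea and all function names are illustrative assumptions, not the paper's actual selection method):

```python
import random

def select_random(items, budget, seed=0):
    """Baseline: pick `budget` datapoints uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(items, budget)

def select_stratified(items, scores, budget):
    """Illustrative alternative (not the paper's method): rank items by
    an automatic metric score and take evenly spaced picks, so the
    subset spans the full easy-to-hard range."""
    ranked = sorted(range(len(items)), key=lambda i: scores[i])
    step = len(ranked) / budget
    return [items[ranked[int(k * step)]] for k in range(budget)]

# Example: 10,000 inputs, budget of 100.
dataset = list(range(10_000))
subset = select_random(dataset, 100)
```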
July 15, 2025 at 1:03 PM
🔈When LLMs solve tasks with a mid-to-low resource input or target language, their output quality is poor. We know that. But can we put our finger on what breaks inside the LLM? We introduce the 💥 translation barrier hypothesis 💥 for failed multilingual generation with LLMs. arxiv.org/abs/2506.22724
July 4, 2025 at 5:05 PM
We know that speech LID systems flunk on accented speech. But why? And what can we do about it? 🤔
Our work arxiv.org/abs/2506.00628 (Interspeech '25) finds that *accent-language confusion* is an important culprit, ties it to the length of the features the model relies on, and proposes a fix.
June 7, 2025 at 5:27 PM
Presented DialUp (MT, dialect continua, robustness, etc.; arxiv.org/abs/2501.16581) to some new people this week! Thanks Hale and @schmidtsciences.bsky.social for inviting me up to New York 🥯

Saw some magnolias too :)
April 11, 2025 at 12:50 AM
Dialects lie on continua of (structured) linguistic variation, right? And we can’t collect data for every point on the continuum...🤔
📢 Check out DialUp, a technique to make your MT model robust to the dialect continua of its training languages, including unseen dialects.
arxiv.org/abs/2501.16581
February 27, 2025 at 2:44 AM