Niyati Bafna
@niyatibafna.bsky.social
PhD student @jhuclsp. Previously @AIatMeta, @InriaParisNLP, @EM_LCT | #NLProc
In general, intermediate accuracy stays high even for LRL targets, but final accuracy drops quickly. And so TLP is high for most target languages (>50%), except for low-resource *source* languages, where task-solving itself fails before we ever get to translation.
July 4, 2025 at 5:05 PM
We then quantify *translation loss proportion*: the proportion of failure cases that had successful task-solving but failed translation (see paper for less hand-waviness). We look at intermediate task-solving accuracy (over all layers), final accuracy, and TLP.
July 4, 2025 at 5:05 PM
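A minimal sketch of how TLP could be computed from per-example correctness flags (the function and variable names here are illustrative, not from the paper):

```python
def translation_loss_proportion(intermediate_correct, final_correct):
    """Fraction of failure cases where the task was solved internally
    (correct at some intermediate layer, in any language) but the final
    on-target output was still wrong."""
    failures = [i for i, ok in enumerate(final_correct) if not ok]
    if not failures:
        return 0.0
    translation_failures = sum(1 for i in failures if intermediate_correct[i])
    return translation_failures / len(failures)

# Toy example: 4 failures, 3 of which had successful intermediate task-solving
inter = [True, True, True, False, True, True]
final = [True, False, False, False, True, False]
print(translation_loss_proportion(inter, final))  # 0.75
```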
What languages does task-solving occur in? We look at the distribution over languages of correct intermediate outputs and see that 1) English dominates 2) But other supported HRLs have a considerable combined presence! Also, this mix looks largely the same regardless of target language.
July 4, 2025 at 5:05 PM
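A sketch of how one might tally that language mix over the correct intermediate readouts, using an off-the-shelf text-LID tool (langid is just one convenient choice here, not necessarily the one used in the paper):

```python
from collections import Counter

import langid  # pip install langid; any text-LID tool would do

def intermediate_language_mix(correct_intermediate_outputs):
    """correct_intermediate_outputs: decoded strings from intermediate
    layers that were judged correct for the task, in whatever language."""
    counts = Counter(langid.classify(text)[0] for text in correct_intermediate_outputs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}
```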
We visualize the task-solving→translation pipeline, showing that intermediate layers have high *off-target* accuracy (task-solving), which gets converted (via translation) to *on-target* accuracy near the final layers, at least for HRL targets. For LRL target languages, translation fails, resulting in bad outputs.
July 4, 2025 at 5:05 PM
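A rough sketch of the kind of layer-wise readout (logit-lens style) behind such a picture; the model name and exact readout are illustrative assumptions, not the paper's precise setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; any multilingual decoder-only LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

prompt = "Task prompt asking for an answer in the target language ..."
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Read out the last position of every layer through the unembedding
# ("logit lens"; the .model.norm attribute path is Llama-specific):
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[:, -1]))
    print(layer, tok.decode(logits.argmax(-1)))
```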
This hypothesis says that 1) Multilingual generation uses a model-internal task-solving→translation cascade. 2) Failure of the translation stage *despite task-solving success* is a large part of the problem. That is, the model often solves the task but fails to articulate the answer.
July 4, 2025 at 5:05 PM
🔈When LLMs solve tasks with a mid-to-low resource input or target language, their output quality is poor. We know that. But can we put our finger on what breaks inside the LLM? We introduce the 💥 translation barrier hypothesis 💥 for failed multilingual generation with LLMs. arxiv.org/abs/2506.22724
July 4, 2025 at 5:05 PM
This module by itself shows very little accent-language confusion. In combination with the ECAPA-TDNN model, it shows large improvements on LID for L2-accented speech in English, French, and German, and minimal degradation on mainstream accented speech.
June 7, 2025 at 5:27 PM
Okay, so how do we fix this problem? We investigate using a module that incorporates long-range information to help out. We look at two representations of the input: as a sequence of phones and a sequence of discretised SSL representations. And we put a classifier on top.
June 7, 2025 at 5:27 PM
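A minimal sketch of the kind of sequence classifier this describes, here over discretised SSL unit IDs (the architecture and sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn

class UnitSequenceLID(nn.Module):
    """Language-ID classifier over a sequence of discretised SSL units
    (or phone IDs), so that long-range context can inform the decision."""
    def __init__(self, num_units=500, num_langs=100, dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_units, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, num_langs)

    def forward(self, unit_ids):              # (batch, seq_len) int tensor
        x = self.embed(unit_ids)              # (batch, seq_len, dim)
        h, _ = self.encoder(x)                # (batch, seq_len, 2*dim)
        return self.head(h.mean(dim=1))       # pool over time -> (batch, num_langs)

# Toy usage: 2 utterances of 200 discretised units each
logits = UnitSequenceLID()(torch.randint(0, 500, (2, 200)))
print(logits.shape)  # torch.Size([2, 100])
```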
To test this, we look at *block permutation invariance*, i.e., the length of the ordered (as opposed to unordered) spans of input features that SOTA models rely on; our experiments indicate that they use features describing only about 1-2 phones.
June 7, 2025 at 5:27 PM
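One way to probe this, sketched below: shuffle fixed-size blocks of the input feature sequence and check how large the blocks must be before accuracy recovers. The block sizes and evaluation hook are illustrative, not the paper's exact protocol:

```python
import numpy as np

def permute_blocks(features, block_size, rng=np.random.default_rng(0)):
    """Split a (time, dim) feature sequence into contiguous blocks of
    `block_size` frames and shuffle the block order, keeping each block's
    internal (ordered) structure intact."""
    n_blocks = len(features) // block_size
    blocks = [features[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    order = rng.permutation(n_blocks)
    tail = features[n_blocks * block_size:]
    return np.concatenate([blocks[i] for i in order] + [tail])

# If LID accuracy on permuted inputs matches unpermuted accuracy already at
# block sizes spanning ~1-2 phones, the model is not using longer ordered context.
feats = np.random.randn(1000, 80)
for block_size in (5, 20, 80):
    shuffled = permute_blocks(feats, block_size)
    # accuracy = evaluate_lid(model, shuffled)  # hypothetical evaluation hook
```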
Accent-language confusion: The mis-recognition of L2-accented speech as the L1 substrate or a related language. For example, when Indonesian-accented English is classified as Indonesian, Malay, etc. A large part of model error on L2-accented speech follows this pattern!
June 7, 2025 at 5:27 PM
We know that speech LID systems flunk on accented speech. But why? And what can we do about it? 🤔
Our work arxiv.org/abs/2506.00628 (Interspeech '25) finds that *accent-language confusion* is an important culprit, ties it to the length of feature that the model relies on, and proposes a fix.
June 7, 2025 at 5:27 PM
Presented DialUp (MT, dialect continua, robustness, etc.; arxiv.org/abs/2501.16581) to some new people this week! Thanks Hale and @schmidtsciences.bsky.social for inviting me up to New York 🥯

Saw some magnolias too :)
April 11, 2025 at 12:50 AM
Similarity between the dialect and the HRL also matters! If these are too close or too far, DialUp doesn’t help much. This makes sense: in the first case, the dialect probably already does well, and in the second, it probably has a larger number of non-cognates that DialUp can’t help with.
February 27, 2025 at 2:44 AM
Dialects with low baseline performance benefit more: high-performing dialects have less to gain.
A decision tree trained on features describing language similarity, baseline performance, and language resourcedness tells us that baseline performance is the most important factor in explaining gains.
February 27, 2025 at 2:44 AM
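A small sketch of the kind of analysis this refers to, with made-up stand-in numbers purely for illustration (the real analysis uses the dialects and features from the paper):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per dialect; columns mirror the feature types named above.
feature_names = ["similarity_to_hrl", "baseline_bleu", "resourcedness"]
X = np.array([[0.9, 28.0, 2],
              [0.6,  7.0, 0],
              [0.7, 12.0, 1],
              [0.4,  3.0, 0]])
y = np.array([1.5, 11.0, 8.0, 2.0])  # hypothetical BLEU gains from DialUp

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
for name, importance in zip(feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
```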
Swapping out function words is much more useful than swapping out content words! Function words change unpredictably, and cause issues for models. Don’t ignore these!
Most lexicons do, which is why we had to curate our own function word lexicons using statistical alignment on small bitext.
February 27, 2025 at 2:44 AM
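A rough sketch of extracting such a function-word lexicon from word-aligned bitext; the input format (alignments e.g. from fast_align or eflomal) and the helper names are assumptions, not the paper's exact pipeline:

```python
from collections import Counter, defaultdict

def function_word_lexicon(sentence_pairs, alignments, hrl_function_words):
    """sentence_pairs: list of (dialect_tokens, hrl_tokens);
    alignments: per-pair lists of (dialect_idx, hrl_idx) links."""
    counts = defaultdict(Counter)
    for (dial_sent, hrl_sent), links in zip(sentence_pairs, alignments):
        for d_i, h_i in links:
            hrl_word = hrl_sent[h_i]
            if hrl_word in hrl_function_words:
                counts[dial_sent[d_i]][hrl_word] += 1
    # Keep the most frequently aligned HRL function word for each dialect word.
    return {d: c.most_common(1)[0][0] for d, c in counts.items()}
```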
M→D consistently gains on all languages. D→M helps as much or more on average, but has much larger variance, often damaging high-resource dialect performance. This makes sense: if the model already knows words in some input dialect, introducing code-switching by swapping them out can only hurt.
February 27, 2025 at 2:44 AM
We test our methods on 49 closely-related languages of 6 families, with two models, including seen and unseen dialects over a range of resourcedness.
DialUp helps a lot for some families and languages! 9 languages, mostly from the Indic and Romance families, show gains of 10+ BLEU points with M2M.
February 27, 2025 at 2:44 AM
D→M is an inference-time trick that assumes you know the input dialect. It uses bilingual lexicons to swap out input words for HRL words that the model is familiar with.
We separate out content and function words, because these behave differently in how they vary between dialects.
February 27, 2025 at 2:44 AM
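A minimal sketch of the D→M swap at inference time (plain lexicon lookup; how the actual method handles morphology, ambiguity, etc. may differ):

```python
def dialect_to_mainstream(tokens, function_lexicon, content_lexicon):
    """Replace dialect words with HRL equivalents the model is familiar with,
    using separate function-word and content-word lexicons."""
    out = []
    for tok in tokens:
        key = tok.lower()
        if key in function_lexicon:
            out.append(function_lexicon[key])
        elif key in content_lexicon:
            out.append(content_lexicon[key])
        else:
            out.append(tok)  # leave unknown words untouched
    return out

# Toy usage with a hypothetical one-entry lexicon
print(dialect_to_mainstream(["dialect_word", "stays"], {"dialect_word": "hrl_word"}, {}))
```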
Now we can use HRL bitext to generate synthetic parallel data in various artificial dialects that have the same *mechanisms of linguistic divergence* from the HRL as real dialects.
We finetune on this data, and evaluate on actual dialects. That’s M→D.
February 27, 2025 at 2:44 AM
Remember when we said we could simulate artificial dialects of a language using phonological, morphological, and lexical perturbations?
(In this paper: aclanthology.org/2024.emnlp-m...)
Briefly: we add linguistically motivated noise on top of HRL text, in order to mimic dialectal variation.
February 27, 2025 at 2:44 AM
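A toy sketch of the kind of linguistically motivated noise meant here; the perturbation types, rules, and rates are illustrative placeholders, not the paper's actual perturbation functions:

```python
import random

def perturb_to_artificial_dialect(tokens, lexical_swaps, phon_rules, p=0.3, seed=0):
    """Apply lexical swaps and simple character-level 'sound change' rules to
    HRL text to mimic dialectal variation (morphological noise omitted here)."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in lexical_swaps and rng.random() < p:
            tok = lexical_swaps[tok]            # lexical substitution
        for src, tgt in phon_rules:
            if rng.random() < p:
                tok = tok.replace(src, tgt)     # phonological perturbation
        out.append(tok)
    return out

print(perturb_to_artificial_dialect(["vamos", "a", "la", "casa"],
                                    {"casa": "kaza"}, [("v", "b")], p=0.8))
```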
DialUp has two components: M→D adapts your model to the data, by fine-tuning it on simulated dialectal variation, and D→M takes your dialectal data closer to model expectations, by swapping out vocabulary for HRL equivalents.
February 27, 2025 at 2:44 AM
Dialects lie on continua of (structured) linguistic variation, right? And we can’t collect data for every point on the continuum...🤔
📢 Check out DialUp, a technique to make your MT model robust to the dialect continua of its training languages, including unseen dialects.
arxiv.org/abs/2501.16581
February 27, 2025 at 2:44 AM