Paul Lerner
@lernerp.bsky.social
Postdoc @mlia_isir@sciences.re (Sorbonne Université, CNRS, ISIR)
/ Teacher @ aivancity
/ Teaching Assistant @ Sorbonne Université

https://paullerner.github.io/
Reposted by Paul Lerner
Accepted to a Workshop (1/2):

"Self-Retrieval from Distant Contexts for Document-Level Machine Translation", accepted to the Conference on Machine Translation (WMT25), from @ziqianpeng.bsky.social, @rachelbawden.bsky.social, @yvofr.bsky.social
October 28, 2025 at 8:57 AM
There are many directions this could go in, depending on your profile: multilingual, low-resource languages, interpretability. The internship may lead to a PhD, provided we get funding!
November 6, 2025 at 9:07 AM
As we found in aclanthology.org/2025.coling-..., BPE-based LLMs (i.e. pretty much all LLMs) do not handle prefixation well
Unlike “Likely”, “Unlike” is Unlikely: BPE-based Segmentation hurts Morphological Derivations in LLMs
Paul Lerner, François Yvon. Proceedings of the 31st International Conference on Computational Linguistics. 2025.
aclanthology.org
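For intuition, here is a minimal sketch of the segmentation issue, assuming GPT-2's BPE tokenizer; the word list is illustrative, not the paper's actual evaluation data.

```python
# Minimal sketch: how a BPE tokenizer segments base vs. prefixed words.
# Assumption: GPT-2's vocabulary; any BPE-based tokenizer could be substituted.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for word in ["likely", "unlikely", "official", "unofficial", "sense", "nonsense"]:
    # GPT-2's BPE is sensitive to a leading space, so show both in-context variants.
    print(f"{word:12s} {tok.tokenize(word)}  |  {tok.tokenize(' ' + word)}")
```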
November 6, 2025 at 9:06 AM
Basically the idea is to extend www.pnas.org/doi/10.1073/... to see how well LLMs model competition between affixes, not only suffixes (e.g. -ity vs. -ness) but also prefixes (e.g. un- vs. non-)
Derivational morphology reveals analogical generalization in large language models | PNAS
What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most s...
www.pnas.org
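As a rough illustration of what "competition between prefixes" could look like in practice, here is a hedged sketch comparing the total log-probability a causal LM assigns to minimal pairs that differ only in their negation prefix. The model name, carrier sentence and word pairs are assumptions for illustration, not the protocol of the PNAS paper or of our work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # assumption: any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def total_logprob(sentence: str) -> float:
    """Total log-probability of a sentence under the causal LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per predicted token
    return -loss.item() * (ids.size(1) - 1)  # undo the mean to get a total

# Which negation prefix does the model find more probable for the same base?
for un_form, non_form in [("unofficial", "nonofficial"), ("unspecific", "nonspecific")]:
    s_un, s_non = f"It was very {un_form}.", f"It was very {non_form}."
    print(f"{un_form}: {total_logprob(s_un):.2f}  vs  {non_form}: {total_logprob(s_non):.2f}")
```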
November 6, 2025 at 9:04 AM
work done with @yvofr.bsky.social as part of the Democratic Commons programme, many thanks to our colleagues at Make, Sciences Po, and Sorbonne! about.make.org/democratic-c...
about.make.org
October 23, 2025 at 4:16 PM
Here's what one example of the dataset looks like; there are 72,234 just like this one (I miss my multimodal days, when there were pictures in my papers)
October 23, 2025 at 4:09 PM
I aimed for a Pythonic library; have a look at the example notebook colab.research.google.com/github/PaulL...
Google Colab
colab.research.google.com
October 15, 2025 at 5:27 PM
🤔 ppllm is benchmarked against:
- a vLLM-based implementation: 4.15 times faster!
- a naive Hugging Face implementation, which does not sort texts by length (see the sketch below): 4.61 times faster!
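A toy sketch of why sorting texts by length before batching matters (my assumption about where the naive baseline loses time): batches of similar-length texts are padded to a much shorter maximum, so fewer padding tokens are wasted.

```python
# Toy illustration (not ppllm's code): padding cost with and without length sorting.
texts = ["short", "a somewhat longer sentence here", "tiny", "another fairly long example text"]

def padded_tokens(batch):
    """Total tokens once every text is padded to the longest text in the batch."""
    lengths = [len(t.split()) for t in batch]  # crude word count stands in for tokens
    return max(lengths) * len(batch)

def total_padded(texts, batch_size=2, sort=False):
    if sort:
        texts = sorted(texts, key=lambda t: len(t.split()))
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    return sum(padded_tokens(b) for b in batches)

print("unsorted:", total_padded(texts))             # pads short texts next to long ones
print("sorted:  ", total_padded(texts, sort=True))  # groups similar lengths together
```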
October 15, 2025 at 5:26 PM
🤔 ppllm implements windowed PPL, which makes it possible to compute the PPL of arbitrarily long texts.
It aims to be feature complete for many information-theoretic metrics, including perplexity (PPL), surprisal, and bits per character (BPC), and their word-level counterparts.
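Here is a hedged sketch of what windowed (strided) perplexity computes, written with plain Hugging Face transformers rather than ppllm's own API; the model name, window size and stride are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # assumption: any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def windowed_ppl(text: str, max_length: int = 1024, stride: int = 512) -> float:
    """Score arbitrarily long texts by sliding a fixed-size window over the tokens,
    counting each token's negative log-likelihood only once."""
    input_ids = tok(text, return_tensors="pt").input_ids
    seq_len = input_ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for start in range(0, seq_len, stride):
        end = min(start + max_length, seq_len)
        n_new = end - prev_end          # tokens not yet scored in a previous window
        window = input_ids[:, start:end]
        targets = window.clone()
        targets[:, :-n_new] = -100      # earlier tokens serve as context only
        with torch.no_grad():
            loss = model(window, labels=targets).loss  # mean NLL over scored tokens
        nll_sum += loss.item() * n_new
        n_scored += n_new
        prev_end = end
        if end == seq_len:
            break
    return float(torch.exp(torch.tensor(nll_sum / n_scored)))

print(windowed_ppl("A very long document. " * 500))
```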
October 15, 2025 at 5:26 PM
Work done with Laurène Cave,
@haldaume3.bsky.social, Léo Labat, Gaël Lejeune, Pierre-Antoine Lequeu,
@bpiwowar.bsky.social, Nazanin Shafiabadi and @yvofr.bsky.social, read the paper here talnarchives.atala.org/ateliers/202...
Any feedback is appreciated :)
talnarchives.atala.org
July 7, 2025 at 8:02 AM
Reposted by Paul Lerner
For the EALM Workshop
"On Assessing the Political Biases of Multilingual Large Language Models" by @lernerp.bsky.social Laurène Cave, @haldaume3.bsky.social Léo Labat, Gaël Lejeune, Pierre-Antoine Lequeu, @bpiwowar.bsky.social Nazanin Shafiabadi and yvofr.bsky.social, collaborated with the STIH lab
June 10, 2025 at 6:39 PM
CNRS provides plmlatex.math.cnrs.fr, which covers most of the features. I guess it's not that complicated to host (the software is open source)
An easy-to-use online LaTeX editor. No installation, real-time collaboration, version control, hundreds of LaTeX document templates, and more.
plmlatex.math.cnrs.fr
May 6, 2025 at 1:04 PM