Shayne Longpre
@shaynelongpre.bsky.social
PhD @ MIT. Prev: Google DeepMind, Apple, Stanford. 🇨🇦 Interests: AI/ML/NLP, Data-centric AI, transparency & societal impact
This work provides the scientific foundation for democratizing scaling laws beyond English.

Full paper: arxiv.org/pdf/2510.22037

Huge thanks to my brilliant co-authors: Sneha, Niklas, I-Hung, Isaac, Sandy, Sercan, Chen-Yu, and Sayna!
October 28, 2025 at 2:03 PM
Q4: When should you pretrain from scratch vs finetune a multilingual checkpoint?

🌟Answer: We found compute-optimal crossover points for every model size.

Rough rule of thumb: finetune if your compute budget C is < 10^10 × N^1.54; otherwise, pretrain.

8/
October 28, 2025 at 2:03 PM
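The rule of thumb in this post can be sketched as a small helper. This is only an illustration of the inequality as stated: the function names are hypothetical, and the post does not give units for C or N, so the inputs are assumed to be in whatever units the paper uses.

```python
def finetune_threshold(n_params: float) -> float:
    """Compute budget below which the rule of thumb favors finetuning.

    Implements 10^10 * N^1.54 as quoted in the post. Units for N and
    for the returned budget are not stated there, so they are assumed
    to match the paper's conventions.
    """
    return 1e10 * n_params ** 1.54


def should_finetune(compute_budget: float, n_params: float) -> bool:
    """Return True if the budget is strictly below the crossover point."""
    return compute_budget < finetune_threshold(n_params)


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the paper.
    n = 1e9
    for budget in (1e20, 1e24, 1e28):
        choice = "finetune" if should_finetune(budget, n) else "pretrain"
        print(f"N={n:.0e}, C={budget:.0e}: {choice}")
```

Because the exponent on N is greater than 1, the crossover budget grows faster than model size, so under this rule larger models stay in the "finetune" regime over a wider range of budgets.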
Remarkably, this means 32% less data per language due to positive cross-lingual transfer—but you still need more total compute.

The curse is real but quantifiable: ϕ=0.11 (capacity penalty), ψ=-0.04 (data benefit from transfer).

7/
October 28, 2025 at 2:03 PM
Q3: How much do you need to scale when adding languages? (The "curse of multilinguality")

🌟Answer: We derived closed-form equations! To go from K to 4K languages while maintaining performance: scale data by 2.74×, model size by 1.4×.

6/
October 28, 2025 at 2:03 PM
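A quick arithmetic check of what these factors imply, as a sketch. The per-language figure follows directly from the numbers in the post; the total-compute figure relies on an assumption the thread does not state, namely that compute scales with the product of model size and data (C ∝ N·D). The variable and function names are illustrative.

```python
LANGUAGE_FACTOR = 4.0  # going from K to 4K languages
DATA_FACTOR = 2.74     # total data multiplier quoted in the post
MODEL_FACTOR = 1.4     # model size multiplier quoted in the post


def per_language_data_ratio(data_factor: float, language_factor: float) -> float:
    """Data per language after scaling, relative to before."""
    return data_factor / language_factor


def compute_ratio(data_factor: float, model_factor: float) -> float:
    """Total compute multiplier, ASSUMING compute is proportional to N * D."""
    return data_factor * model_factor


if __name__ == "__main__":
    ratio = per_language_data_ratio(DATA_FACTOR, LANGUAGE_FACTOR)
    print(f"data per language: {ratio:.3f}x ({(1 - ratio) * 100:.1f}% less)")
    print(f"total compute (assuming C ∝ N·D): "
          f"{compute_ratio(DATA_FACTOR, MODEL_FACTOR):.3f}x")
```

With the rounded factors, each language gets about 0.685× the data it had before, which is in line with the roughly 32% reduction quoted in the thread, while total data and total compute both go up.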
🌟Key insight:🌟 shared script beats shared language family for positive transfer!

Languages sharing writing systems (e.g., Latin) show dramatically better transfer (mean: -0.23) vs different scripts (mean: -0.39).

Also important: transfer is often asymmetric—A helping B ≠ B helping A.

5/
October 28, 2025 at 2:03 PM
Q2: Which languages actually help each other during training? And how much?

🌟Answer: We measure this empirically. We built a 38×38 transfer matrix, or 1,444 language pairs—the largest such resource to date.

We highlight the top 5 most beneficial source languages for each target language.

4/
October 28, 2025 at 2:03 PM
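Picking the most beneficial source languages from such a matrix could look like the sketch below. Everything here is a placeholder: the language labels A to D and their scores are made up for illustration and are not values from the paper, and the code assumes a higher (less negative) score means more beneficial transfer, in line with how the thread compares scores.

```python
from typing import Dict, List

# transfer[source][target] = transfer score from source to target.
# Toy values only; NOT results from the paper.
TransferMatrix = Dict[str, Dict[str, float]]

toy_transfer: TransferMatrix = {
    "A": {"B": -0.10, "C": -0.40, "D": -0.30},
    "B": {"A": -0.25, "C": -0.35, "D": -0.20},
    "C": {"A": -0.45, "B": -0.50, "D": -0.15},
    "D": {"A": -0.30, "B": -0.22, "C": -0.05},
}


def top_sources(transfer: TransferMatrix, target: str, k: int = 5) -> List[str]:
    """Return the k source languages with the highest score for `target`.

    ASSUMES higher score = more beneficial transfer.
    """
    candidates = [
        (source, row[target])
        for source, row in transfer.items()
        if source != target and target in row
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [source for source, _ in candidates[:k]]


if __name__ == "__main__":
    for target in toy_transfer:
        print(target, "<-", top_sources(toy_transfer, target, k=2))
```

Storing the matrix as source-to-target entries, rather than as an unordered pair, keeps the two directions separate, which matters when A helping B differs from B helping A.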
ATLAS models cross-lingual transfer explicitly: separating (1) target language data, (2) beneficial transfer languages, and (3) other languages.

Without modeling transfer, existing laws fail in multilingual settings.

3/
October 28, 2025 at 2:03 PM
Q1: Can we build a scaling law that generalizes to unseen model sizes (N), data amounts (D), AND language mixtures (M)?

🌟Answer: Yes! ATLAS outperforms prior work with R²(N)=0.88 vs 0.68, and R²(M)=0.82 vs 0.69 for mixture generalization.

2/
October 28, 2025 at 2:03 PM
Good question. @scasper.bsky.social would know best?
October 21, 2025 at 4:11 PM
@seungonekim.bsky.social, who led the effort, is one of the best young AI researchers I’ve ever worked with.

He has done some of the best research on fine-grained, scalable, and human-aligned LLM-as-a-judge evaluation.

➡️ Flask
➡️ Prometheus 1 & 2
➡️ Multilingual Prometheus
➡️ KMMLU
➡️ BigGen Bench
May 6, 2025 at 1:50 PM
arxiv.org
May 6, 2025 at 1:50 PM
Also, check out the MIT Tech Review article: www.technologyreview.com/2024/12/18/1...

Thank you to the team and advisors!

🧵/
This is where the data to build AI comes from
New findings show how the sources of data are concentrating power in the hands of the most powerful tech companies.
www.technologyreview.com
April 14, 2025 at 3:28 PM
We analyzed 4,000 datasets, 800+ sources, 600+ languages, & 67 countries.

Most surprising to me: despite some growth in language/geographic coverage, representation hasn’t significantly improved in a decade.

Check out the paper: arxiv.org/pdf/2412.17847

2/
April 14, 2025 at 3:28 PM
Reposted by Shayne Longpre
Panel 1: Regulating AI in a Time of Democratic Upheaval starts in approximately 5 minutes.

Panelists: @atoosakz.bsky.social, @randomwalker.bsky.social, @alondra.bsky.social, and Deirdre K. Mulligan.
Moderator: @shaynelongpre.bsky.social.
#AIDemocraticFreedoms
April 10, 2025 at 1:44 PM
🪲AI bug bounties: As AI systems are given more control/autonomy, the surface area for possible flaws grows. Organizations will increasingly rely on community help to identify and address vulnerabilities across languages and application stacks.

🧵/
April 9, 2025 at 3:25 PM