Full paper: arxiv.org/pdf/2510.22037
Huge thanks to my brilliant co-authors: Sneha, Niklas, I-Hung, Isaac, Sandy, Sercan, Chen-Yu, and Sayna!
🌟Answer: We found compute-optimal crossover points for every model size.
Rough rule of thumb: finetune if your compute budget C < 10^10 × N^1.54; otherwise pretrain.
8/
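A minimal sketch of that rule, assuming C is in FLOPs and N is the parameter count (constant and exponent as quoted; see the paper for the exact fit):

```python
def should_finetune(compute_budget: float, model_params: float) -> bool:
    """Finetune below the crossover C* = 1e10 * N**1.54, pretrain above."""
    crossover = 1e10 * model_params ** 1.54
    return compute_budget < crossover

# Example: a 1B-parameter model with a 1e22-FLOP budget -> finetune.
print("finetune" if should_finetune(1e22, 1e9) else "pretrain")
```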
The curse of multilinguality is real but quantifiable: ϕ=0.11 (capacity penalty), ψ=-0.04 (data benefit from transfer).
7/
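One illustrative way such exponents could enter a Chinchilla-style law (an assumed form for intuition, not necessarily the paper's exact parameterization): the language count K rescales effective capacity and effective data.

```python
PHI = 0.11   # capacity penalty: more languages shrink effective capacity
PSI = -0.04  # data benefit: transfer slightly inflates effective data

def effective_capacity(n_params: float, n_langs: int) -> float:
    return n_params * n_langs ** -PHI

def effective_data(n_tokens: float, n_langs: int) -> float:
    return n_tokens * n_langs ** -PSI  # negative PSI => grows with n_langs

# With parameters and tokens fixed, scale from 1 to 64 languages:
for k in (1, 8, 64):
    print(k, f"{effective_capacity(1e9, k):.2e}", f"{effective_data(1e11, k):.2e}")
```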
🌟Answer: We derived closed-form equations! To go from K to 4K languages while maintaining performance: scale data by 2.74×, model size by 1.4×.
6/
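Quick back-of-envelope check, assuming those equations are power laws in the language multiplier m (the quoted 4× case pins down the exponents):

```python
import math

a = math.log(2.74) / math.log(4)  # data exponent, ~0.73
b = math.log(1.4) / math.log(4)   # model-size exponent, ~0.24

for m in (2, 4, 16):
    print(f"{m}x languages -> data x{m**a:.2f}, model x{m**b:.2f}")
# 4x languages -> data x2.74, model x1.40 (the quoted numbers)
```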
Languages sharing a writing system (e.g., Latin script) show dramatically better transfer (mean: -0.23) than languages with different scripts (mean: -0.39).
Also important: transfer is often asymmetric (A helping B ≠ B helping A).
5/
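A toy sketch of both points, with made-up scores chosen to echo the quoted means (the real matrix and score definition are in the paper); keying by ordered (source, target) pairs makes the asymmetry explicit:

```python
from statistics import mean

script = {"en": "Latin", "fr": "Latin", "ru": "Cyrillic", "zh": "Han"}
transfer = {  # transfer[(source, target)]: note (a, b) != (b, a)
    ("en", "fr"): -0.21, ("fr", "en"): -0.25,
    ("en", "ru"): -0.37, ("ru", "en"): -0.41,
    ("zh", "fr"): -0.43, ("fr", "zh"): -0.35,
}
same = [s for (a, b), s in transfer.items() if script[a] == script[b]]
diff = [s for (a, b), s in transfer.items() if script[a] != script[b]]
print(f"same-script mean: {mean(same):.2f}, cross-script mean: {mean(diff):.2f}")
```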
🌟Answer: We measure this empirically. We built a 38×38 transfer matrix, or 1,444 language pairs—the largest such resource to date.
We highlight the top 5 most beneficial source languages for each target language.
4/
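A hypothetical sketch of the top-5 extraction, assuming T[s, t] scores how much source s helps target t and higher is better (flip the sort if the paper's convention is reversed); the matrix and names below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
langs = [f"lang{i:02d}" for i in range(38)]  # placeholder names
T = rng.normal(size=(38, 38))                # stand-in for the real matrix

top5 = {}
for t, target in enumerate(langs):
    scores = T[:, t].copy()
    scores[t] = -np.inf                      # exclude self-transfer
    best = np.argsort(scores)[::-1][:5]      # five most helpful sources
    top5[target] = [langs[s] for s in best]

print(top5[langs[0]])
```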
Without modeling cross-lingual transfer, existing scaling laws fail in multilingual settings.
3/
🌟Answer: Yes! ATLAS outperforms prior work: R²(N)=0.88 vs 0.68 for scaling over model size, and R²(M)=0.82 vs 0.69 for mixture generalization.
2/
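For reference, a generic sketch of the R² computation, assuming the standard definition applied to held-out points:

```python
import numpy as np

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """1 - SS_res / SS_tot: fraction of held-out variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# e.g. r_squared(held_out_losses, scaling_law_predictions)
```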
He has done some of the best research on fine-grained, scalable, and human-aligned LLM-as-a-judge evaluation.
➡️ Flask
➡️ Prometheus 1 & 2
➡️ Multilingual Prometheus
➡️ KMMLU
➡️ BigGen Bench
Thank you to the team and advisors!
🧵/
Most surprising to me is that despite some growth in language/geographic coverage, representation hasn't significantly improved in a decade.
Check out the paper: arxiv.org/pdf/2412.17847
2/
Panelists: @atoosakz.bsky.social, @randomwalker.bsky.social, @alondra.bsky.social, and Deirdre K. Mulligan.
Moderator: @shaynelongpre.bsky.social.
#AIDemocraticFreedoms
🧵/