Lightnews — Scholar-powered news

Prasanna Mayilvahanan

@prasannamayil.bsky.social

26 followers 35 following 8 posts

PhD student in ML at MPI-IS. Prev Apple.

Interested in robustness at scale and reasoning.

Posts Replies Media Videos

Prasanna Mayilvahanan

@prasannamayil.bsky.social

📉 In contrast, architecture, model size, context length, and optimizer settings have negligible impact. This suggests architectures can be freely optimized for efficiency, while data curation is the real key to strong generalization. 5/8

February 18, 2025 at 2:09 PM

Prasanna Mayilvahanan

@prasannamayil.bsky.social

📊 Key finding: The choice of pretraining data and tokenizer has the largest impact on scaling trends. Even switching from Llama (Transformer) to Mamba (State-Space Model) barely changes loss-to-loss relationships! 4/8

February 18, 2025 at 2:09 PM

Prasanna Mayilvahanan

@prasannamayil.bsky.social

We systematically vary pretraining data, tokenizer, architecture (Llama vs. Mamba), model size, context length, and optimizer settings—evaluating over 6000 model checkpoints—to uncover the true drivers of loss-to-loss scaling laws. 3/8

February 18, 2025 at 2:09 PM

Prasanna Mayilvahanan

@prasannamayil.bsky.social

New preprint out! 🎉

How does LLM training loss translate to downstream performance?

We show that pretraining data and tokenizer shape loss-to-loss scaling, while architecture and other factors play a surprisingly minor role!
brendel-group.github.io/llm-line/ 🧵1/8

February 18, 2025 at 2:09 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news