Ameya P.
@bayesiankitten.bsky.social
Postdoctoral Researcher @ Bethgelab, University of Tübingen
Benchmarking | LLM Agents | Data-Centric ML | Continual Learning | Unlearning
drimpossible.github.io
Reposted by Ameya P.
🚀 A new era in European #AIresearch begins!
ELLIOT is a €25M #HorizonEurope project launching July 2025 to build open, trustworthy Multimodal Generalist Foundation Models.
30 partners, 12 countries, EU values.
🔗 Press release: apigateway.agilitypr.com/distribution...
June 24, 2025 at 6:46 AM
Reposted by Ameya P.
🚀 Never miss a beat in science again!
📬 Scholar Inbox is your personal assistant for staying up to date with your literature. It includes: visual summaries, collections, search and a conference planner.
Check out our white paper: arxiv.org/abs/2504.08385
#OpenScience #AI #RecommenderSystems
April 14, 2025 at 11:04 AM
Reposted by Ameya P.
🧵1/ 🚨 New paper: A Sober Look at Progress in Language Model Reasoning
We re-evaluate recent SFT and RL models for mathematical reasoning and find most gains vanish under rigorous, multi-seed, standardized evaluation.
📊 bethgelab.github.io/sober-reason...
📄 arxiv.org/abs/2504.07086
April 10, 2025 at 3:36 PM
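The multi-seed point is easy to demonstrate. A synthetic illustration (not the paper's actual numbers): two "models" with identical true accuracy on a 100-question benchmark can still differ by several points on any single run, and only averaging over seeds reveals the gap is noise.

```python
import numpy as np

# Synthetic illustration: two "models" with IDENTICAL true accuracy (60%) on a
# 100-question benchmark. Single runs can still differ by several points.
n_questions, true_acc, n_seeds = 100, 0.60, 10

def run_eval(seed):
    """One evaluation run; per-question correctness is stochastic (seeds, sampling)."""
    r = np.random.default_rng(seed)
    return (r.random(n_questions) < true_acc).mean()

single_seed_gap = run_eval(1) - run_eval(2)

accs_a = [run_eval(s) for s in range(n_seeds)]
accs_b = [run_eval(100 + s) for s in range(n_seeds)]
print(f"single-seed 'gain': {single_seed_gap:+.2f}")
print(f"multi-seed: {np.mean(accs_a):.2f}±{np.std(accs_a):.2f} "
      f"vs {np.mean(accs_b):.2f}±{np.std(accs_b):.2f}")
```

With a few hundred questions or fewer, reporting mean ± std across seeds is the cheap fix.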
Reposted by Ameya P.
Hochlehnert, Bhatnagar, Udandarao, Albanie, Prabhu, Bethge: A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility https://arxiv.org/abs/2504.07086 https://arxiv.org/pdf/2504.07086 https://arxiv.org/html/2504.07086
April 10, 2025 at 6:08 AM
Great work! A much-needed upgrade for continual learning datasets—excited to see progress on long-timespan tasks beyond classification. Deets below👇
🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing.
But what would it actually take to support this in practice at the scale and speed the real world demands?
We explore this question and really push the limits of lifelong knowledge editing in the wild.
👇
April 10, 2025 at 5:32 AM
Deadline extended to March 19 for the EVAL-FoMo workshop @cvprconference.bsky.social! We welcome submissions (incl. published papers) analyzing emerging capabilities & limits in visual foundation models.
Details: sites.google.com/view/eval-fo...
#CVPR2025
March 12, 2025 at 12:20 PM
LMs excel at solving problems (~48% success) but falter at debunking them (<9% counterexample rate)!
Could form an AI Brandolini's Law: "Capability needed to refute bullshit is far larger than that needed to generate it"
AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. Can LMs falsify incorrect solutions? o3-mini (high) scores just 9% on our new benchmark REFUTE. Verification is not necessarily easier than generation 🧵
February 28, 2025 at 7:17 PM
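A toy version of the falsification setting (illustrative, not the benchmark's actual harness): given a claimed solution with a subtle bug and a trusted brute-force reference, the refuter's job is to produce an input where the two disagree. The buggy maximum-subarray solver below is a classic example.

```python
from itertools import product

def claimed_max_subarray(a):
    """Claimed solution — buggy: returns 0 for all-negative arrays."""
    best, cur = 0, 0
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

def reference_max_subarray(a):
    """Trusted brute force over all non-empty subarrays."""
    return max(sum(a[i:j]) for i in range(len(a)) for j in range(i + 1, len(a) + 1))

def find_counterexample(vals=(-2, -1, 1), max_len=3):
    """Exhaustively search small inputs for a disagreement."""
    for n in range(1, max_len + 1):
        for a in product(vals, repeat=n):
            if claimed_max_subarray(list(a)) != reference_max_subarray(list(a)):
                return list(a)
    return None

print(find_counterexample())  # a single all-negative array suffices
```

Brute-force search finds the counterexample here; REFUTE asks whether LMs can do the same for much harder claimed solutions.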
Reposted by Ameya P.
AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. Can LMs falsify incorrect solutions? o3-mini (high) scores just 9% on our new benchmark REFUTE. Verification is not necessarily easier than generation 🧵
February 28, 2025 at 6:13 PM
Reposted by Ameya P.
🚀 Call for Papers – CVPR 3rd Workshop on Multi-Modal Foundation Models (MMFM)
@cvprconference.bsky.social ! 🚀
🔍 Topics: Multi-modal learning, vision-language, audio-visual, and more!
📅 Deadline: March 14, 2025
📝 Submission: cmt3.research.microsoft.com/MMFM2025
🌐 sites.google.com/view/mmfm3rd...
February 19, 2025 at 2:16 PM
Reposted by Ameya P.
New preprint out! 🎉
How does LLM training loss translate to downstream performance?
We show that pretraining data and tokenizer shape loss-to-loss scaling, while architecture and other factors play a surprisingly minor role!
brendel-group.github.io/llm-line/ 🧵1/8
February 18, 2025 at 2:09 PM
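The "line" is a fit in log-loss space. A synthetic sketch (the 0.8 exponent and 1.5 prefactor are made up, not values from the paper): if downstream loss follows a power law in pretraining loss, a linear fit on the logs recovers it.

```python
import numpy as np

# Synthetic loss-to-loss data: downstream loss tracks pretraining loss along
# a line in log-log space. Constants are illustrative only.
pretrain_loss = np.array([3.2, 2.9, 2.6, 2.4, 2.2, 2.05])
downstream_loss = 1.5 * pretrain_loss ** 0.8

# Fit log L_down = k * log L_pre + c, i.e. L_down = e^c * L_pre^k
k, c = np.polyfit(np.log(pretrain_loss), np.log(downstream_loss), 1)
print(f"slope k = {k:.2f}, prefactor = {np.exp(c):.2f}")
```

The paper's claim, in these terms, is that the (k, c) of the fitted line depends mainly on pretraining data and tokenizer, not architecture.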
CuratedThoughts: Data curation focus for RL post-training! (Update 1) 🚀
25% of OpenThoughts-114k-math filtered — issues included proofs, missing figures, and multiple questions sharing one answer.
Check out work by @ahochlehnert.bsky.social & @hrdkbhatnagar.bsky.social below 👇
CuratedThoughts: Data Curation for RL Datasets 🚀
Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts emerged for fine-tuning & GRPO. Our deep dive found major flaws — 25% of OpenThoughts needed elimination by data curation.
Here's why 👇🧵
February 17, 2025 at 6:30 PM
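The three failure modes named above (proofs, missing figures, bundled questions) lend themselves to simple screening. A toy filter in that spirit — the heuristics below are illustrative stand-ins, not the authors' actual curation rules:

```python
import re

def keep(problem: str) -> bool:
    """Toy curation filter for verifiable-answer RL data (illustrative heuristics)."""
    if re.search(r"\b(prove|show that|proof)\b", problem, re.I):
        return False  # proofs have no single checkable final answer
    if re.search(r"\b(figure|diagram|shown below)\b", problem, re.I):
        return False  # references a figure that isn't in the text
    if problem.count("?") > 1:
        return False  # multiple questions but only one answer slot
    return True

problems = [
    "Compute 2 + 2.",
    "Prove that sqrt(2) is irrational.",
    "Using the figure below, find the shaded area.",
    "What is x? What is y?",
]
print([keep(p) for p in problems])  # → [True, False, False, False]
```

Even crude filters like these make reward signals checkable, which is the whole point for GRPO-style RL.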
Reposted by Ameya P.
Our 2nd Workshop on Emergent Visual Abilities and Limits of Foundation Models (EVAL-FoMo) is accepting submissions. We are looking forward to talks by our amazing speakers that include @saining.bsky.social, @aidanematzadeh.bsky.social, @lisadunlap.bsky.social, and @yukimasano.bsky.social. #CVPR2025
🔥 #CVPR2025 Submit your cool papers to Workshop on
Emergent Visual Abilities and Limits of Foundation Models 📷📷🧠🚀✨
sites.google.com/view/eval-fo...
Submission Deadline: March 12th!
February 13, 2025 at 4:02 PM
🔥 #CVPR2025 Submit your cool papers to Workshop on Emergent Visual Abilities and Limits of Foundation Models 📷📷🧠🚀✨
sites.google.com/view/eval-fo...
Submission Deadline: March 12th!
February 12, 2025 at 2:45 PM
LMs are used for annotation, evaluation and distillation! We identify critical issues!
LMs of a similar capability class (not model family tho!) behave similarly and this skews oversight far more than I expected.
Check the 4-in-1 mega paper below to 👀 how 👇
🚨Great Models Think Alike and this Undermines AI Oversight🚨
New paper quantifies LM similarity
(1) LLM-as-a-judge favor more similar models🤥
(2) Complementary knowledge benefits Weak-to-Strong Generalization☯️
(3) More capable models have more correlated failures 📈🙀
🧵👇
February 7, 2025 at 10:09 PM
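One way to make "models think alike" concrete is chance-adjusted agreement on errors — do two models fail on the same items more often than their individual error rates predict? A sketch in the spirit of error-consistency metrics (the paper's exact similarity measure may differ):

```python
import numpy as np

def error_agreement(preds_a, preds_b, labels):
    """Chance-adjusted agreement on *errors*: how often two models are wrong on
    the same items, beyond what independent errors at the same rates predict."""
    wrong_a = preds_a != labels
    wrong_b = preds_b != labels
    observed = np.mean(wrong_a == wrong_b)
    pa, pb = wrong_a.mean(), wrong_b.mean()
    expected = pa * pb + (1 - pa) * (1 - pb)  # agreement if errors were independent
    return (observed - expected) / (1 - expected)

labels = np.zeros(10, dtype=int)
clone = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])     # 3 errors
disjoint = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])  # 3 different errors
print(error_agreement(clone, clone, labels))     # maximal: identical errors
print(error_agreement(clone, disjoint, labels))  # negative: anti-correlated errors
```

High values of a metric like this across a capability class are what make LM-judged evaluation and LM-annotated training data risky: the judge shares the blind spots of the judged.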
New Work: RanDumb!🚀
Poster @NeurIPS, East Hall #1910- come say hi👋
Core claim: Random representations Outperform Online Continual Learning Methods!
How: We replace the deep network by a *random projection* and linear clf, yet outperform all OCL methods by huge margins [1/n]
December 13, 2024 at 6:50 PM
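The recipe is simple enough to sketch end-to-end: a fixed random-feature embedding (never trained) plus a streaming linear classifier fit in closed form. This is a toy reconstruction on synthetic data, not the paper's code — the bandwidth `gamma`, dimensions, and data are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_proj, n_classes = 20, 512, 2
gamma = 0.1  # kernel bandwidth (assumed; would be tuned in practice)

# Fixed random-Fourier-style embedding: never updated during the stream.
W = rng.normal(size=(d_in, d_proj))
b = rng.uniform(0, 2 * np.pi, size=d_proj)
embed = lambda x: np.cos(gamma * (x @ W) + b)

# Streaming ridge classifier: only these statistics are updated online.
G = np.eye(d_proj)                # regularized second-moment accumulator
C = np.zeros((d_proj, n_classes))

def draw(r):
    y = r.integers(n_classes)
    return r.normal(loc=2.0 * y, size=d_in), y  # class-dependent mean

for _ in range(500):              # single pass over the stream
    x, y = draw(rng)
    z = embed(x)
    G += np.outer(z, z)
    C[:, y] += z

weights = np.linalg.solve(G, C)   # closed-form linear classifier

test_xy = [draw(rng) for _ in range(200)]
acc = np.mean([np.argmax(embed(x) @ weights) == y for x, y in test_xy])
print(f"accuracy: {acc:.2f}")
```

No forgetting is possible by construction: the embedding never changes and the accumulated statistics are order-independent, which is exactly what makes it an awkward baseline for OCL methods to lose to.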
Reposted by Ameya P.
The Practitioner's Guide to Continual Multimodal Pretraining @dziadzio.bsky.social @confusezius.bsky.social @vishaalurao.bsky.social @bayesiankitten.bsky.social
December 12, 2024 at 2:20 AM
Breaking the 8-model merge limit was tough, but we scaled to merging 200+ models! The secret? Iterative finetuning + merging *over time*.
The time axis unlocks scalable mergeability. Merging has surprising scaling gains across size & compute budgets.
All the gory details ⬇️
📄 New Paper: "How to Merge Your Multimodal Models Over Time?"
arxiv.org/abs/2412.06712
Model merging assumes all finetuned models are available at once. But what if they need to be created over time?
We study Temporal Model Merging through the TIME framework to find out!
🧵
December 11, 2024 at 6:16 PM
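The "merge over time" loop differs from one-shot merging in one key way: each new expert is finetuned from the current merged model, and the merge is an interpolation applied at every step. A minimal sketch with a stand-in for finetuning (all names and dynamics are illustrative, not the TIME framework's actual protocols):

```python
import numpy as np

rng = np.random.default_rng(0)

def finetune(weights, task_grad, lr=0.5):
    """Stand-in for finetuning: one gradient-style step toward a task optimum."""
    return {k: w - lr * task_grad[k] for k, w in weights.items()}

def merge(old, new, alpha=0.25):
    """Temporal merge: interpolate the running merged model with the newest expert."""
    return {k: (1 - alpha) * old[k] + alpha * new[k] for k in old}

merged = {"w": np.zeros(4)}                             # base model
task_optima = [rng.normal(size=4) for _ in range(200)]  # tasks arriving over time

for opt in task_optima:
    grad = {"w": merged["w"] - opt}   # pull toward this task's optimum
    expert = finetune(merged, grad)   # finetune FROM the current merged model...
    merged = merge(merged, expert)    # ...then fold the expert back in

# The merged model stays bounded, tracking a running average of task optima
# instead of drifting to whichever task came last.
print(np.round(merged["w"], 2))
```

Because each expert starts from the merged model rather than the original base, experts stay close in weight space, which is plausibly why merging keeps working at 200+ models instead of breaking down around 8.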
How do we benchmark the vast capabilities of foundation models? Introducing ONEBench – a unifying benchmark to test them all, led by @adhirajghosh.bsky.social and @dziadzio.bsky.social! ⬇️
Sample-level benchmarks could be the new generation: reusable, recombinable, and able to evaluate lots of capabilities!
🚨Looking to test your foundation model on an arbitrary and open-ended set of capabilities, not explicitly captured by static benchmarks? 🚨
Check out ✨ONEBench✨, where we show how sample-level evaluation is the solution.
🔎 arxiv.org/abs/2412.06745
December 10, 2024 at 6:39 PM
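The sample-level idea in one toy example: store per-sample correctness records once, then recombine tagged subsets into ad-hoc benchmarks on demand. Models, tags, and results below are made up for illustration; ONEBench's actual aggregation handles heterogeneous and incomplete measurements.

```python
import numpy as np

# One shared pool of per-sample correctness records across models.
results = {
    "model_a": np.array([1, 1, 1, 0, 1, 0]),
    "model_b": np.array([0, 1, 0, 1, 1, 1]),
}
tags = np.array(["ocr", "ocr", "count", "count", "ocr", "count"])

def rank_on(capability):
    """Assemble an ad-hoc benchmark from all samples with a given tag, then rank."""
    mask = tags == capability
    scores = {m: r[mask].mean() for m, r in results.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_on("ocr"))    # model_a leads on OCR-tagged samples
print(rank_on("count"))  # model_b leads on counting-tagged samples
```

The same pool answers open-ended capability queries without designing a new static benchmark each time — that reusability is the pitch.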
Come chat with us @ NeurIPS for hot takes on the future of continual learning with foundation models!
😵💫 Continually pretraining large multimodal models to keep them up-to-date all-the-time is tough, covering everything from adapters, merging, meta-scheduling to data design and more!
So I'm really happy to present our large-scale study at #NeurIPS2024!
Come drop by to talk about all that and more!
December 10, 2024 at 6:37 PM
Reposted by Ameya P.
The list of accepted workshops for ICLR 2025 is available at openreview.net/group?id=ICL...
@iclr-conf.bsky.social
We received 120 wonderful proposals, with 40 selected as workshops.
December 3, 2024 at 4:29 PM