Adhiraj Ghosh @ACL2025
@adhirajghosh.bsky.social
ELLIS PhD, University of Tübingen | Data-centric Vision and Language @bethgelab.bsky.social
Website: adhirajghosh.github.io
Twitter: https://x.com/adhiraj_ghosh98
Excited to be in Vienna for #ACL2025 🇦🇹! You'll find @dziadzio.bsky.social and me by our ONEBench poster, so do drop by!
🗓️Wed, July 30, 11-12:30 CET
📍Hall 4/5
I’m also excited to talk about lifelong and personalised benchmarking, data curation and vision-language in general! Let’s connect!
July 27, 2025 at 10:26 PM
Finally, we probe open-ended capabilities: as a proof of concept, we define a query pool to test and generate personalised model rankings. Expanding ONEBench can only improve the reliability and scale of these queries, and we're excited to extend this framework.
More insights like these in the paper!
December 10, 2024 at 5:50 PM
Let's look under the hood! ONEBench comprises ONEBench-LLM and ONEBench-LMM: the largest pool of evaluation samples for foundation models (~50K for LLMs and ~600K for LMMs), spanning various domains and tasks. ONEBench will be continually expanded to accommodate more models and datasets.
December 10, 2024 at 5:49 PM
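For intuition, here's a minimal sketch of how a sample-level pool like this could be laid out in code. All field and variable names are my own illustration, not ONEBench's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalSample:
    """One atomic evaluation sample in a sample-level pool.

    Field names are illustrative guesses, not ONEBench's actual schema.
    """
    sample_id: str        # unique ID within the pool
    source_dataset: str   # benchmark it came from, e.g. "MMLU"
    modality: str         # "text" for the LLM pool, "image+text" for the LMM pool
    prompt: str           # input shown to the model
    metadata: dict = field(default_factory=dict)  # domain/task tags for metadata search

# Measurements live in a separate sparse table, so new models and new
# samples can be appended without re-running everything:
# (model_id, sample_id) -> score
measurements: dict[tuple[str, str], float] = {
    ("model-a", "mmlu-00001"): 1.0,
    ("model-b", "mmlu-00001"): 0.0,
}
```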
We compare our Plackett-Luce implementation to ELO and ELO-distribution-based ranking methods: it not only shows superior correlation with the aggregated mean model scores on each test set, but also stays remarkably stable under missing data points and missing measurements, even at up to 95% sparsity!
December 10, 2024 at 5:49 PM
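For the curious, here's roughly the shape of that robustness check, reproduced on synthetic data. This is a simplified illustration with made-up model strengths, using mean scores rather than the full Plackett-Luce fit:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_models, n_samples = 20, 5000

# Synthetic binary correct/incorrect outcomes: stronger models pass more samples.
strength = np.linspace(-1.0, 1.0, n_models)
scores = (rng.normal(size=(n_models, n_samples)) < strength[:, None]).astype(float)

full_means = scores.mean(axis=1)                 # dense mean model scores

for sparsity in (0.50, 0.90, 0.95):
    keep = rng.random(scores.shape) > sparsity   # drop `sparsity` of all entries
    sparse = np.where(keep, scores, np.nan)
    sparse_means = np.nanmean(sparse, axis=1)    # mean over the surviving entries
    tau, _ = kendalltau(full_means, sparse_means)
    print(f"sparsity {sparsity:.0%}: Kendall tau vs. dense ranking = {tau:.3f}")
```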
✅ ONEBench uses these rankings and aggregates them with the Plackett-Luce framework: an extremely efficient maximum-likelihood estimate over the aggregated probabilities of individual rankings, yielding an estimate of each model's strength (or utility) parameter in a ranking.
December 10, 2024 at 5:48 PM
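In case Plackett-Luce is new to you: each ranking is modelled as a sequence of choices, where the top model "beats" everything ranked below it, and we maximise the likelihood of per-model worth parameters. A minimal sketch of that MLE (my own toy code, not the ONEBench implementation):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(free_theta, rankings):
    """Plackett-Luce negative log-likelihood.

    rankings: iterable of rankings, each a list of model indices, best first.
    free_theta: log-worths of models 1..m-1; model 0 is pinned at 0 for
    identifiability (worths are only defined up to a common scale).
    """
    theta = np.concatenate(([0.0], free_theta))
    nll = 0.0
    for ranking in rankings:
        w = theta[np.asarray(ranking)]            # log-worths in ranked order
        for i in range(len(w) - 1):
            # log P(model at position i beats everything still unranked)
            nll -= w[i] - np.logaddexp.reduce(w[i:])
    return nll

# Toy per-sample rankings over 3 models; model 0 usually wins here.
rankings = [[0, 1, 2], [0, 2, 1], [1, 0, 2], [0, 1, 2], [0, 2, 1]]
res = minimize(neg_log_likelihood, x0=np.zeros(2), args=(rankings,))
theta = np.concatenate(([0.0], res.x))
worths = np.exp(theta) / np.exp(theta).sum()      # normalised model strengths
print("estimated model strengths:", worths.round(3))
```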
Given a query of interest from a practitioner, we can pick relevant samples from the data pool via top-k retrieval in embedding space or by searching sample-specific metadata.
Aggregating over the relevant samples and model performance on them, we obtain our final model rankings.
December 10, 2024 at 5:45 PM
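Here's a minimal sketch of that retrieve-then-aggregate loop. The function names are hypothetical, the embeddings are random, and mean rank position stands in for the full Plackett-Luce aggregation from earlier in the thread:

```python
import numpy as np

def aggregate_mean_rank(rankings, n_models):
    """Average rank position per model; a simple stand-in for the
    Plackett-Luce aggregation described above."""
    positions = np.zeros(n_models)
    for r in rankings:
        for pos, model in enumerate(r):
            positions[model] += pos
    return np.argsort(positions)  # best (lowest mean position) first

def personalised_ranking(query_emb, sample_embs, per_sample_ranks, n_models, k=100):
    """Retrieve the k pool samples closest to the query in embedding space,
    then aggregate the per-sample model rankings on just those samples."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = sample_embs @ q                        # cosine sim (embs pre-normalised)
    top_k = np.argsort(-sims)[:k]                 # top-k retrieval
    relevant = [per_sample_ranks[i] for i in top_k]
    return aggregate_mean_rank(relevant, n_models)

# Toy usage with random embeddings and per-sample rankings over 3 models.
rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 32))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
ranks = [list(rng.permutation(3)) for _ in range(1000)]
print(personalised_ranking(rng.normal(size=32), embs, ranks, n_models=3, k=50))
```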
❗Status quo: benchmarking on large test sets is costly. Static benchmarks fail to use truly held-out data and can't probe the ever-evolving capabilities of LLMs/VLMs.
ONEBench mitigates this by re-structuring static benchmarks to accommodate an ever-expanding pool of datasets and models.
December 10, 2024 at 5:45 PM
🚨Looking to test your foundation model on an arbitrary and open-ended set of capabilities, not explicitly captured by static benchmarks? 🚨
Check out ✨ONEBench✨, where we show how sample-level evaluation is the solution.
🔎 arxiv.org/abs/2412.06745
December 10, 2024 at 5:44 PM