Lightnews — Scholar-powered news

Besmira Nushi

@besmiranushi.bsky.social

610 followers 140 following 97 posts

AI/ML, Responsible AI @Nvidia

Besmira Nushi

@besmiranushi.bsky.social

Research led by amazing collaborators at #NVIDIA Michał Zawalski Meriem Boubdir Klaudia Bałazy Pablo Ribalta (8/N)

November 5, 2025 at 8:44 AM

Besmira Nushi

@besmiranushi.bsky.social

The work will be presented at the #NeurIPS 2025 workshop "Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling". (7/N)

🔗 Arxiv link: arxiv.org/abs/2510.270...
👩🏻‍💻 Jupyter notebook: github.com/NVIDIA-NeMo/...
⚒️ Nemo Evaluator: github.com/NVIDIA-NeMo/...

November 5, 2025 at 8:44 AM

Besmira Nushi

@besmiranushi.bsky.social

The fundamental problem with data contamination is not necessarily cheating. It is rather the fact that contamination may camouflage true generalization, which is what is needed in real-world applications. (6/N)

November 5, 2025 at 8:44 AM

Besmira Nushi

@besmiranushi.bsky.social

Beyond detecting contamination, CoDeC is also a useful research tool for better understanding the robustness and generalization properties of models and for studying their sensitivity to over-reliance on memorization patterns. (5/N)

November 5, 2025 at 8:44 AM

Besmira Nushi

@besmiranushi.bsky.social

CoDeC uses this consistent observation to detect whether a model might have been contaminated with common benchmarks. Through controlled experiments, we show that the method is a reliable detector at the benchmark level and it can be used in early training to analyze accidental contamination. (4/N)

November 5, 2025 at 8:44 AM
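
A minimal sketch of how a benchmark-level decision could be derived from per-example measurements, based only on the description in this thread. It assumes each benchmark example already has a "confidence delta" (confidence with in-context examples minus confidence without them). The function names, the use of a simple fraction as the score, and the 0.5 threshold are illustrative assumptions, not the scoring rule from the paper.

```python
from typing import Sequence


def contamination_score(deltas: Sequence[float]) -> float:
    """Fraction of examples whose confidence did NOT improve with in-context examples.

    Assumption: a delta <= 0 means the in-context examples did not help,
    which the thread describes as the signature of a memorized benchmark.
    """
    if not deltas:
        raise ValueError("need at least one confidence delta")
    non_improving = sum(1 for d in deltas if d <= 0.0)
    return non_improving / len(deltas)


def looks_contaminated(deltas: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag the benchmark when the non-improving fraction reaches the threshold.

    The threshold value is a hypothetical default chosen for illustration.
    """
    return contamination_score(deltas) >= threshold


# Two of the four examples gained confidence, two did not.
example_deltas = [0.2, -0.1, 0.0, 0.3]
print(contamination_score(example_deltas))
print(looks_contaminated(example_deltas))
```

Scoring at the benchmark level rather than per example matches the claim above that the method is a reliable detector "at the benchmark level"; a single example's delta would likely be too noisy to interpret on its own.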

Besmira Nushi

@besmiranushi.bsky.social

In contrast, if the model has already seen the benchmark before (i.e., the benchmark might have been memorized), model confidence does not improve and might even drop. (3/N)

November 5, 2025 at 8:44 AM

Besmira Nushi

@besmiranushi.bsky.social

More precisely, suppose you want to solve problem X from a benchmark, and you include a few other example problems from the same benchmark in context. Models that have never seen the benchmark in training benefit from the in-context examples, which increase the model's confidence. (2/N)

November 5, 2025 at 8:44 AM
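
The comparison described in (2/N) and (3/N) can be sketched as follows. This is an illustration under stated assumptions, not the CoDeC implementation: "confidence" is taken here to be an average token log-probability, `avg_logprob` is a hypothetical callable standing in for a real model call, and `toy_scorer` is a made-up stand-in so the example runs without a model.

```python
from typing import Callable, Sequence

# Hypothetical scorer signature: (context, text) -> average log-probability
# the model assigns to `text` when it is preceded by `context`.
LogProbFn = Callable[[str, str], float]


def confidence_delta(
    target: str,
    in_context: Sequence[str],
    avg_logprob: LogProbFn,
) -> float:
    """Confidence with in-context examples minus confidence without them.

    Positive: the examples helped (expected for an unseen benchmark).
    Zero or negative: no gain (the pattern the thread links to memorization).
    """
    baseline = avg_logprob("", target)
    context = "\n\n".join(in_context)
    with_context = avg_logprob(context, target)
    return with_context - baseline


def toy_scorer(context: str, text: str) -> float:
    """Toy stand-in for a model: confidence rises with word overlap."""
    words = text.split()
    seen = set(context.split())
    overlap = sum(w in seen for w in words) / max(len(words), 1)
    return -2.0 + overlap


# The toy scorer behaves like a model that has not seen the benchmark:
# related in-context examples raise its confidence on the target.
print(confidence_delta("a b c d", ["a b x y"], toy_scorer))
```

With a real model, `avg_logprob` would wrap a forward pass that returns token log-probabilities for the target text; the rest of the comparison stays the same.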

Besmira Nushi

@besmiranushi.bsky.social

…the list continues, but the point is that a company that hires the best talent in the field definitely knows how to chart. The problem arises when marketing drives and dominates the science, and it is not a single-company problem today.

August 9, 2025 at 8:22 PM

Besmira Nushi

@besmiranushi.bsky.social

…coloring new model releases boldly while leaving the older models as blank/white so newer models artificially stand out even if they’re not better, not providing worst case results, not standardizing the max value across charts presented at the same level horizontally…

August 9, 2025 at 8:21 PM
