Besmira Nushi
@besmiranushi.bsky.social
AI/ML, Responsible AI @Nvidia
Openness for Nemotron includes open evaluation! Eval transparency fills in the last piece of the puzzle in the E2E repro lifecycle for LLMs. Check out our step-by-step blog on how to use the same pipeline we used to evaluate Nemotron 3 Nano through Nemo Evaluator: huggingface.co/blog/nvidia/...
December 17, 2025 at 9:38 PM
Our team is presenting Nemo Evaluator SDK and CoDeC (Data Contamination Detection) this week at #NeurIPS2025 @neuripsconf.bsky.social. Reach out to Meriem Boubdir and Michał Zawalski if you want to talk about any of these or about future research internship roles and full-time positions.
December 1, 2025 at 6:22 PM
CoDeC uses this consistent observation to detect whether a model might have been contaminated with common benchmarks. Through controlled experiments, we show that the method is a reliable detector at the benchmark level and can be used early in training to catch accidental contamination. (4/N)
November 5, 2025 at 8:44 AM
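To make the signal concrete, here is a minimal sketch of the confidence-delta idea, assuming a HuggingFace causal LM (gpt2 is only a placeholder, and the prompt format is illustrative); this is a simplified illustration of the intuition, not the actual CoDeC implementation:

```python
# Sketch: score a benchmark answer zero-shot vs. few-shot and compare
# confidence. A clean model should get more confident with in-context
# examples from the same benchmark; a contaminated model should not.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Mean log-probability of the answer tokens given the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift: logits at position t predict token t+1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    # Keep only the answer tokens (assumes a clean prompt/answer
    # tokenization boundary -- good enough for a sketch).
    answer_lp = log_probs[prompt_len - 1:].gather(
        1, targets[prompt_len - 1:, None])
    return answer_lp.mean().item()

def confidence_delta(question: str, answer: str, shots: list[str]) -> float:
    """Few-shot minus zero-shot confidence. Aggregated over a benchmark,
    deltas near zero (or negative) hint at memorization."""
    zero_shot = answer_logprob(question + "\n", answer)
    few_shot = answer_logprob("\n\n".join(shots) + "\n\n" + question + "\n",
                              answer)
    return few_shot - zero_shot
```

Aggregating the delta over all items is what makes the signal reliable at the benchmark level rather than per example.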
In contrast, if the model has already seen the benchmark before (i.e., the benchmark might have been memorized), model confidence does not improve and may even drop. (3/N)
November 5, 2025 at 8:44 AM
More precisely, if you want to solve problem X from a benchmark and you include a few other example problems from the same benchmark in context, models that have never seen the benchmark in training benefit from the in-context examples, which increase the model's confidence. (2/N)
November 5, 2025 at 8:44 AM
💡 New research on studying data contamination. Key insight: LLMs leverage in-context examples differently when they have seen a benchmark during training vs. when the benchmark has never been seen in training. (1/N)
November 5, 2025 at 8:44 AM
The problem with chart crimes is not just the distortion of the y-axis. It is also the erasure of all other competitors from charts (as if they don't exist), the lack of error bars, the lack of transparency about the tools and code used for evals…
August 9, 2025 at 8:21 PM
🎉The Phi-4 reasoning models have landed on HF and Azure AI Foundry. The new models are competitive and often outperform much larger frontier models. It is exciting to see the reasoning capabilities extend to more domains beyond math, including algorithmic reasoning, calendar planning, and coding.
May 1, 2025 at 12:50 AM
Come see us at any of the following sessions on model understanding and evaluation! 🔬 #ICLR2025 @msftresearch.bsky.social
April 24, 2025 at 1:38 AM
💡Eureka inference-time scaling insight (Day 8): On the most complex tasks, reasoning models improve more efficiently from feedback on their own solutions than conventional models do.
April 21, 2025 at 8:04 PM
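A rough sketch of what such a self-feedback loop looks like; `llm(prompt) -> str` is a stand-in for any chat-completion call, and the prompt wording is purely illustrative:

```python
# Self-feedback loop: the model critiques its own solution, then revises.
from typing import Callable

def solve_with_self_feedback(problem: str, llm: Callable[[str], str],
                             rounds: int = 2) -> str:
    answer = llm(f"Solve the following problem step by step:\n{problem}")
    for _ in range(rounds):
        critique = llm(
            f"Problem:\n{problem}\n\nProposed solution:\n{answer}\n\n"
            "Point out any errors or gaps in this solution."
        )
        answer = llm(
            f"Problem:\n{problem}\n\nPrevious solution:\n{answer}\n\n"
            f"Feedback:\n{critique}\n\nWrite a corrected solution."
        )
    return answer
```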
Complaint of the day is that we keep showing numbers on tiny datasets with no error bars. AIME 24 & 25 are ~30 examples each. In our experience, because of high non-determinism, accuracy numbers vary a lot across experiments, even with 5 repeats. We need to do better!
April 17, 2025 at 4:30 PM
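For a sense of scale, here is the 95% Wilson interval on a 30-example benchmark; the numbers below are a generic back-of-the-envelope illustration, not results from any specific model:

```python
# Why ~30-example benchmarks need error bars: the 95% binomial
# confidence interval on n=30 is huge.
from math import sqrt

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_ci(24, 30)   # 80% accuracy on an AIME-sized set
print(f"95% CI: [{lo:.1%}, {hi:.1%}]")  # roughly [62.7%, 90.5%]
```

A single run at 80% is statistically indistinguishable from one at 65% or 90% at this sample size, which is exactly why the repeats disagree.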
💡Eureka inference-time scaling insight (Day 7): There exists untapped potential for improving both conventional models and reasoning models. All models, *including reasoning models*, are able to find a much better inference path when required to sample 5 answers for the same question (best of 5).
April 17, 2025 at 7:50 AM
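A minimal sketch of the best-of-n measurement under the "any sample solves it" reading of the post; `generate` and `is_correct` are hypothetical stand-ins for a sampled model call and a task-specific answer checker:

```python
# Best-of-n: sample n answers per question; a question counts as solved
# if any of the n samples is correct.
from typing import Callable

def best_of_n_accuracy(
    questions: list[str],
    generate: Callable[[str], str],         # one sampled answer per call
    is_correct: Callable[[str, str], bool],
    n: int = 5,
) -> float:
    solved = sum(
        any(is_correct(q, generate(q)) for _ in range(n))
        for q in questions
    )
    return solved / len(questions)
```

Comparing n=5 against n=1 is one way to quantify the untapped potential described above.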
💡Eureka inference-time scaling insight (Day 6): It is still hard for developers to predict ahead of time how expensive a workload will be without previous telemetry. This is rooted in the inherent, high non-determinism in cost associated with inference-time scaling.
April 14, 2025 at 4:15 PM
💡Eureka inference-time scaling insight (Day 5): Higher token consumption does not always indicate higher accuracy across reasoning models. A reasoning model that spends more tokens is not necessarily the most accurate one on a given task.
April 11, 2025 at 8:52 PM
💡Eureka inference-time scaling insight (Day 4): We introduce two NP-hard tasks in Eureka: Traveling Salesman minimal paths, and 3SAT satisfiability for formulas with 3 literals per clause. These benchmarks can be extremely useful for studying how models solve very hard problems with controllable difficulty.
April 10, 2025 at 4:24 PM
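As an illustration of controllable difficulty, here is a toy random 3SAT generator keyed to the clause-to-variable ratio (random 3SAT is empirically hardest near a ratio of ~4.26); this mirrors the idea, not the benchmark's exact construction:

```python
# Random 3SAT with controllable difficulty via the clause/variable ratio.
import random

def random_3sat(n_vars: int, ratio: float = 4.26, seed: int = 0):
    rng = random.Random(seed)
    clauses = []
    for _ in range(round(ratio * n_vars)):
        trio = rng.sample(range(1, n_vars + 1), 3)  # 3 distinct variables
        # DIMACS convention: negative int = negated literal.
        clauses.append([v if rng.random() < 0.5 else -v for v in trio])
    return clauses

print(random_3sat(5))  # 21 clauses over 5 variables
```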
💡Eureka inference-time scaling insight (Day 3): Reasoning models show progress even on classical algorithmic problems such as Traveling Salesman (optimal paths) or calendar planning. However, the benefits diminish with higher complexity. On the hardest problems (TSP), models stop increasing the token length.
April 9, 2025 at 11:54 PM
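A toy version of such a TSP item, with a brute-force optimal tour as ground truth (only feasible at small sizes; illustrative, not the benchmark's actual generator):

```python
# Random Euclidean TSP instance plus its optimal tour (brute force,
# fine for n <= ~10; difficulty is controlled by n).
import itertools, math, random

def make_tsp(n: int, seed: int = 0):
    rng = random.Random(seed)
    cities = [(rng.random(), rng.random()) for _ in range(n)]
    def tour_len(order):
        return sum(math.dist(cities[order[i]], cities[order[(i + 1) % n]])
                   for i in range(n))
    # Fix city 0 as the start to avoid counting rotations twice.
    best = min(itertools.permutations(range(1, n)),
               key=lambda p: tour_len((0,) + p))
    return cities, (0,) + best, tour_len((0,) + best)

cities, tour, length = make_tsp(7)
```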
💡Eureka inference-time scaling insight (Day 2): Despite the major updates, reasoning does not benefit all domains equally. E.g., most players report numbers on GPQA to show generalization. However, improvements in GPQA are driven by Physics, with Chemistry and Biology still visibly lagging behind.
April 8, 2025 at 6:52 PM
💡Eureka inference-time scaling insight (Day 1): Reasoning models outperform conventional ones by a large margin, marking a major update to the state of the art. They generalize to solve simple variants of algorithmic & planning problems such as satisfiability, traveling salesman, and calendar planning.
April 7, 2025 at 3:25 PM
Together with an amazing group of folks at Microsoft Research AI Frontiers: @vidhishab.bsky.social, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Vibhav Vineet, Yue Wu, Safoora Yousefi

aka.ms/eureka-ml-in...

@msftresearch.bsky.social #AI #ML
April 3, 2025 at 6:59 PM
💡Check out our latest Eureka analysis on the benefits of inference-time scaling. It studies 9 state-of-the-art models (conventional & reasoning) on 8 challenging tasks for math and STEM reasoning, calendar planning, NP-hard problems, navigation & spatial reasoning aka.ms/eureka-ml-insights-reasoning
April 3, 2025 at 6:56 PM
I asked ChatGPT to generate images of software engineers and house cleaners 4-5 times in separate chat windows. Then I did the same thing from a 2nd account to see if I would get more diversity that way. The pictures speak for themselves, but I want to discuss why default representations matter ➡️
April 1, 2025 at 3:46 AM
Let's draw step by step! GPT-4o image generation cannot yet follow instructions for image generation. At the same time, several aspects have improved significantly compared to Dall-E, including spelling, fluent continued conversation, and an initial ability to take feedback.
March 30, 2025 at 3:30 PM
Bananas are also difficult fruits. #gpt4o #imagegeneration
March 29, 2025 at 5:48 PM
Wrong but cute #gpt4o #imagegeneration. Cats are humanity's last exam 🐈‍⬛
March 27, 2025 at 5:48 AM
These words are now not welcome in scientific proposals for government funding in the US. Worse, they may (or will) trigger ungrounded funding rejections. Now, how is one supposed to write a paper about the Great Barrier Reef, or Polarization of Light, or (god forbid) Women's Health & LGBTQ?
March 12, 2025 at 5:39 AM