Besmira Nushi
@besmiranushi.bsky.social
AI/ML, Responsible AI @Nvidia
Reposted by Besmira Nushi
I agree that emotional addiction to chatbots is the number one risk of AI today. Here is a gift link to an important OpEd in the NYTimes:
www.nytimes.com/2025/11/17/o...
Opinion | The Sad and Dangerous Reality Behind ‘Her’
www.nytimes.com
November 20, 2025 at 5:36 AM
Reposted by Besmira Nushi
🧠⚙️ Interested in decision theory+cogsci meets AI? Want to create methods for rigorously designing & evaluating human-AI workflows?

I'm recruiting PhDs to work on:
🎯 Stat foundations of multi-agent collaboration
🌫️ Model uncertainty & meta-cognition
🔎 Interpretability
💬 LLMs in behavioral science
November 5, 2025 at 4:40 PM
💡 New research on data contamination. Key insight: LLMs leverage in-context examples differently when they have seen a benchmark during training than when they have never seen it. (1/N)
November 5, 2025 at 8:44 AM
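As a rough, hypothetical illustration of the kind of signal this points at (not the paper's actual method): if a model has memorized a benchmark, adding in-context examples tends to shift its accuracy differently than on a genuinely unseen benchmark. The `eval_accuracy` callable and the `margin` below are placeholders, not anything from the paper.

```python
# Hypothetical sketch: compare how much k-shot prompting helps on a suspect
# benchmark vs. a benchmark known to be unseen. A markedly smaller (or even
# negative) gain on the suspect benchmark is one rough hint of contamination.

def few_shot_gain(eval_accuracy, benchmark, k=5):
    """eval_accuracy(benchmark, num_shots) -> float is a placeholder for
    whatever evaluation harness is actually used."""
    return eval_accuracy(benchmark, num_shots=k) - eval_accuracy(benchmark, num_shots=0)

def contamination_hint(eval_accuracy, suspect_benchmark, clean_benchmark, margin=0.05):
    gain_suspect = few_shot_gain(eval_accuracy, suspect_benchmark)
    gain_clean = few_shot_gain(eval_accuracy, clean_benchmark)
    # If in-context examples help far less on the suspect benchmark than on a
    # comparable unseen one, that asymmetry is worth a closer look.
    return (gain_clean - gain_suspect) > margin
```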
Reposted by Besmira Nushi
The blog post is available: blog.arxiv.org/2025/10/31/a...
November 1, 2025 at 5:06 PM
NeMo Evaluator SDK — the platform we use at NVIDIA to benchmark LLMs, multimodal models, and agents — is now open source. It’s built for reproducibility, scalability, and transparency, with 100+ benchmarks across 18 open-source harnesses and full containerized execution. github.com/NVIDIA-NeMo/...
GitHub - NVIDIA-NeMo/Evaluator: Open-source library for scalable, reproducible evaluation of AI models and benchmarks.
Open-source library for scalable, reproducible evaluation of AI models and benchmarks. - NVIDIA-NeMo/Evaluator
github.com
October 28, 2025 at 8:59 PM
When to call it quits in LLM reasoning? 🛑

Martina's internship project proposes trace-monitoring metrics and classifiers that can detect midway when an LLM reasoning trace is headed for failure. The approach saves up to 70% of token usage and even improves accuracy by 2-3%.
Can we predict which reasoning paths will succeed before seeing the answer? 🤔

Our new paper (arxiv.org/abs/2510.10494) proposes latent-trajectory signals from LLMs' hidden states to identify high-quality reasoning, cutting inference costs by up to 70% while maintaining accuracy
Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning
Reasoning models improve their problem-solving ability through inference-time scaling, allocating more compute via longer token budgets. Identifying which reasoning traces are likely to succeed remain...
arxiv.org
October 22, 2025 at 10:39 PM
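A minimal sketch of the general idea, assuming per-step hidden states of a reasoning trace are available as a (num_steps, hidden_dim) array; the feature choices, the logistic probe, and the stopping threshold are illustrative assumptions, not the method from arxiv.org/abs/2510.10494.

```python
# Minimal sketch of latent-trajectory monitoring (illustrative, not the paper's method).
import numpy as np
from sklearn.linear_model import LogisticRegression

def trajectory_features(hidden_states: np.ndarray) -> np.ndarray:
    """Summarize a partial trace: mean state, last state, and mean step-to-step drift."""
    drift = np.diff(hidden_states, axis=0)
    return np.concatenate([
        hidden_states.mean(axis=0),
        hidden_states[-1],
        drift.mean(axis=0) if len(drift) else np.zeros(hidden_states.shape[1]),
    ])

def train_probe(traces, labels):
    """Fit a simple probe on traces with known outcomes (1 = final answer was correct)."""
    X = np.stack([trajectory_features(t) for t in traces])
    return LogisticRegression(max_iter=1000).fit(X, labels)

def should_stop_early(probe, partial_trace, threshold=0.2):
    """Abort generation when the predicted success probability falls below the threshold."""
    p_success = probe.predict_proba(trajectory_features(partial_trace)[None, :])[0, 1]
    return p_success < threshold
```

Stopping low-probability traces early is where the token savings would come from; the threshold trades saved compute against the risk of cutting off traces that would have recovered.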
Federal research funding works. It’s not an expense–it’s an investment. It’s not overhead–it’s a down payment on the future. - Eric Horvitz, Margaret Martonosi, Moshe Y. Vardi, and James Larus in CACM cacm.acm.org/opinion/keep... @erichorvitz.bsky.social
Keeping the Dream Alive: The Power and Promise of Federally Funded Research – Communications of the ACM
cacm.acm.org
September 12, 2025 at 5:28 AM
Our team in Zurich and EMEA is hiring Deep Learning Engineers for LLM Accuracy Evaluation and Analysis. Ideal candidates have an inquisitive 🔬 approach to evaluation and follow best engineering practices for building reusable open-source tools. www.linkedin.com/jobs/view/42...
NVIDIA hiring Deep Learning Engineer, LLM Accuracy Evaluation in Switzerland | LinkedIn
Posted 5:30:48 AM. We are seeking senior engineers to pioneer new methodologies for accurately assessing the…See this and similar jobs on LinkedIn.
www.linkedin.com
September 5, 2025 at 10:59 AM
Reposted by Besmira Nushi
The Diary of Anne Frank is among the hundreds of books banned in Florida this year. When I was in school, it was required reading. (Guardian)
August 24, 2025 at 8:53 PM
The problem with chart crimes is not just the distortion of the y-axis. It is the erasure of all other competitors from the charts (as if they don’t exist), the lack of error bars, the lack of transparency about the tools and code used for evals…
August 9, 2025 at 8:21 PM
I have a single question. Why doesn’t OpenAI compare with competitors in their evals? No Gemini, no Claude, no open source models…
August 8, 2025 at 6:50 PM
Reposted by Besmira Nushi
hey wasn't this the same company that made a beautiful shiny "research" post about how AI evals should include error bars or something like that. or did they decide the CLT didn't apply here
August 6, 2025 at 3:20 AM
Reposted by Besmira Nushi
New work from my team! arxiv.org/abs/2507.12950
Intersecting mechanistic interpretability and health AI 😎

We trained and interpreted sparse autoencoders on MAIRA-2, our radiology MLLM. We found a range of human-interpretable radiology reporting concepts, but also many uninterpretable SAE features.
Insights into a radiology-specialised multimodal large language model with sparse autoencoders
Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic...
arxiv.org
July 18, 2025 at 9:30 AM
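For context on the technique mentioned above, here is a generic sparse-autoencoder skeleton of the kind usually trained on cached model activations; it is a hedged illustration, not the MAIRA-2 setup (the dimensions, ReLU encoder, and L1 coefficient are assumptions).

```python
# Generic sparse autoencoder over cached model activations (illustrative only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_step(sae, activations, optimizer, l1_coeff=1e-3):
    """One training step: reconstruction loss plus an L1 penalty that encourages
    each feature to fire rarely, which is what makes the features inspectable."""
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```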
Reposted by Besmira Nushi
If you're at @icmlconf.bsky.social this week, come check out our poster on "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" presented by the amazing @afedercooper.bsky.social from 11:30am–1:30pm PDT on Weds!!! icml.cc/virtual/2025...
ICML Poster: Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge | ICML 2025
icml.cc
July 15, 2025 at 6:35 PM
Reposted by Besmira Nushi
📢 Webinar - 6/18 at 9am PST!
Stop re-running complex recursive queries when your graph data changes. Feldera incrementally evaluates recursive graph computations. Learn to easily build these mechanisms with #SQL, without the hassle of constant recomputation.
tinyurl.com/rb5my7d8
June 11, 2025 at 10:13 PM
I only got to listen to this today. A lot of people in my network, including myself, have felt exactly this for years: the fear that, for some obscure reason, your paperwork and you may not be enough for this country, even in “normal” times.

youtube.com/shorts/IF3bz...
let me explain what being on a student visa is actually like
YouTube video by Representative Pramila Jayapal
youtube.com
June 11, 2025 at 3:53 PM
Reposted by Besmira Nushi
We’ll be at the #Databricks Data + AI Summit in SF next week (6/9–12).

If you’re around and want to chat about how incremental computing can make your #SparkSQL workloads go from hours to seconds — let’s connect.

Grab some time here: calendly.com/matt-feldera...

#DataAISummit #DataEngineering
June 5, 2025 at 8:47 PM
Reposted by Besmira Nushi
Tired: "BS"

Wired: "Vibe citing"

www.nytimes.com/2025/05/29/w...
White House Health Report Included Fake Citations
www.nytimes.com
May 30, 2025 at 8:15 PM
📌You can now find all the evaluation logs from our inference-time scaling report and the Phi-4 reasoning technical report at huggingface.co/datasets/mic.... The evaluation code for the reasoning benchmarks can also be found in the main branch of Eureka ML Insights at github.com/microsoft/eu....
microsoft/Eureka-Bench-Logs · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
May 27, 2025 at 4:02 PM
🎉The Phi-4 reasoning models have landed on HF and Azure AI Foundry. The new models are competitive and often outperform much larger frontier models. It is exciting to see the reasoning capabilities extend to more domains beyond math, including algorithmic reasoning, calendar planning, and coding.
May 1, 2025 at 12:50 AM
Reposted by Besmira Nushi
Re: The Chatbot Arena Illusion

Every eval chokes under hill climbing. If we're lucky, there’s an early phase where *real* learning (both model and community) can occur. I'd argue that a benchmark’s value lies entirely in that window. So the real question is what did we learn?
April 30, 2025 at 4:38 PM
All Eureka inference-time scaling insights are now available here: www.microsoft.com/en-us/resear... It was fun sharing these and more together with Vidhisha Balachandran @vidhishab.bsky.social and Vibhav Vineet at #ICLR2025.
Eureka Inference-Time Scaling Insights: Where We Stand and What Lies Ahead - Microsoft Research
Understanding and measuring the potential of inference-time scaling for reasoning. The new Eureka study tests nine state-of-the-art models on eight diverse reasoning tasks.
www.microsoft.com
April 29, 2025 at 3:36 PM
Come see us in any of the following sessions on model understanding and evaluation! 🔬 #ICLR2025 @msftresearch.bsky.social
April 24, 2025 at 1:38 AM
💡 Eureka inference-time scaling insight (Day 8): On the most complex tasks, reasoning models improve more efficiently from self-generated feedback on their solutions than conventional models do.
April 21, 2025 at 8:04 PM
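The insight above is an empirical finding rather than a recipe, but as a purely hypothetical sketch of the kind of self-feedback loop it refers to: the model critiques its own solution and then revises it. The `generate` callable below stands in for any LLM call and is not part of the Eureka setup.

```python
# Hypothetical self-feedback loop: the model critiques its own solution and revises it.
# `generate(prompt) -> str` is a placeholder for any LLM call.
def self_feedback_revision(generate, problem, rounds=2):
    solution = generate(f"Solve the following problem:\n{problem}")
    for _ in range(rounds):
        critique = generate(
            f"Problem:\n{problem}\n\nProposed solution:\n{solution}\n\n"
            "Point out any errors or gaps in this solution."
        )
        solution = generate(
            f"Problem:\n{problem}\n\nPrevious solution:\n{solution}\n\n"
            f"Feedback:\n{critique}\n\nWrite an improved solution."
        )
    return solution
```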