Besmira Nushi
@besmiranushi.bsky.social
AI/ML, Responsible AI @Nvidia
Reposted by Besmira Nushi
I agree that emotional addiction to chatbots is the number one risk of AI today. Here is a gift link to an important OpEd in the NYTimes:
www.nytimes.com/2025/11/17/o...
Opinion | The Sad and Dangerous Reality Behind ‘Her’
www.nytimes.com
November 20, 2025 at 5:36 AM
Reposted by Besmira Nushi
🧠⚙️ Interested in decision theory+cogsci meets AI? Want to create methods for rigorously designing & evaluating human-AI workflows?

I'm recruiting PhDs to work on:
🎯 Stat foundations of multi-agent collaboration
🌫️ Model uncertainty & meta-cognition
🔎 Interpretability
💬 LLMs in behavioral science
November 5, 2025 at 4:40 PM
💡 New research on data contamination. Key insight: LLMs leverage in-context examples differently when they have seen a benchmark during training than when they have never seen it. (1/N)
November 5, 2025 at 8:44 AM
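As a rough, hypothetical illustration of the kind of signal this points at (not the paper's actual method): if a model has memorized a benchmark, adding in-context examples tends to shift its accuracy differently than on a genuinely unseen benchmark. The `eval_accuracy` callable and the `margin` below are placeholders, not anything from the paper.

```python
# Hypothetical sketch: compare how much k-shot prompting helps on a suspect
# benchmark vs. a benchmark known to be unseen. A markedly smaller (or even
# negative) gain on the suspect benchmark is one rough hint of contamination.

def few_shot_gain(eval_accuracy, benchmark, k=5):
    """eval_accuracy(benchmark, num_shots) -> float is a placeholder for
    whatever evaluation harness is actually used."""
    return eval_accuracy(benchmark, num_shots=k) - eval_accuracy(benchmark, num_shots=0)

def contamination_hint(eval_accuracy, suspect_benchmark, clean_benchmark, margin=0.05):
    gain_suspect = few_shot_gain(eval_accuracy, suspect_benchmark)
    gain_clean = few_shot_gain(eval_accuracy, clean_benchmark)
    # If in-context examples help far less on the suspect benchmark than on a
    # comparable unseen one, that asymmetry is worth a closer look.
    return (gain_clean - gain_suspect) > margin
```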
Reposted by Besmira Nushi
The blog post is available: blog.arxiv.org/2025/10/31/a...
November 1, 2025 at 5:06 PM
NeMo Evaluator SDK — the platform we use at NVIDIA to benchmark LLMs, multimodal models, and agents — is now open source. It’s built for reproducibility, scalability, and transparency, with 100+ benchmarks across 18 open-source harnesses and full containerized execution. github.com/NVIDIA-NeMo/...
GitHub - NVIDIA-NeMo/Evaluator: Open-source library for scalable, reproducible evaluation of AI models and benchmarks.
Open-source library for scalable, reproducible evaluation of AI models and benchmarks. - NVIDIA-NeMo/Evaluator
github.com
October 28, 2025 at 8:59 PM
When to call it quits in LLM reasoning? 🛑

Martina's internship project proposes trace-monitoring metrics and classifiers that can detect midway when an LLM reasoning trace is headed for failure. The approach saves up to 70% of token usage and even improves accuracy by 2-3%.
Can we predict which reasoning paths will succeed before seeing the answer? 🤔

Our new paper (arxiv.org/abs/2510.10494) proposes latent-trajectory signals from LLMs' hidden states to identify high-quality reasoning, cutting inference costs by up to 70% while maintaining accuracy
Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning
Reasoning models improve their problem-solving ability through inference-time scaling, allocating more compute via longer token budgets. Identifying which reasoning traces are likely to succeed remain...
arxiv.org
October 22, 2025 at 10:39 PM
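A minimal sketch of the general idea, assuming per-step hidden states of a reasoning trace are available as a (num_steps, hidden_dim) array; the feature choices, the logistic probe, and the stopping threshold are illustrative assumptions, not the method from arxiv.org/abs/2510.10494.

```python
# Minimal sketch of latent-trajectory monitoring (illustrative, not the paper's method).
import numpy as np
from sklearn.linear_model import LogisticRegression

def trajectory_features(hidden_states: np.ndarray) -> np.ndarray:
    """Summarize a partial trace: mean state, last state, and mean step-to-step drift."""
    drift = np.diff(hidden_states, axis=0)
    return np.concatenate([
        hidden_states.mean(axis=0),
        hidden_states[-1],
        drift.mean(axis=0) if len(drift) else np.zeros(hidden_states.shape[1]),
    ])

def train_probe(traces, labels):
    """Fit a simple probe on traces with known outcomes (1 = final answer was correct)."""
    X = np.stack([trajectory_features(t) for t in traces])
    return LogisticRegression(max_iter=1000).fit(X, labels)

def should_stop_early(probe, partial_trace, threshold=0.2):
    """Abort generation when the predicted success probability falls below the threshold."""
    p_success = probe.predict_proba(trajectory_features(partial_trace)[None, :])[0, 1]
    return p_success < threshold
```

Stopping low-probability traces early is where the token savings would come from; the threshold trades saved compute against the risk of cutting off traces that would have recovered.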
Federal research funding works. It’s not an expense–it’s an investment. It’s not overhead–it’s a down payment on the future. - Eric Horvitz, Margaret Martonosi, Moshe Y. Vardi, and James Larus in CACM cacm.acm.org/opinion/keep... @erichorvitz.bsky.social
Keeping the Dream Alive: The Power and Promise of Federally Funded Research – Communications of the ACM
cacm.acm.org
September 12, 2025 at 5:28 AM
Our team in Zurich and EMEA is hiring Deep Learning Engineers for LLM Accuracy Evaluation and Analysis. Ideal candidates have an inquisitive 🔬 approach to evaluation and follow best engineering practices for building reusable open-source tools. www.linkedin.com/jobs/view/42...
NVIDIA hiring Deep Learning Engineer, LLM Accuracy Evaluation in Switzerland | LinkedIn
Posted 5:30:48 AM. We are seeking senior engineers to pioneer new methodologies for accurately assessing the…See this and similar jobs on LinkedIn.
www.linkedin.com
September 5, 2025 at 10:59 AM
Reposted by Besmira Nushi
The Diary of Anne Frank is among the hundreds of books banned in Florida this year. When I was in school, it was required reading. (Guardian)
August 24, 2025 at 8:53 PM
The problem with chart crimes is not just the distortion of the y-axis. It is the erasure of all other competitors from the charts (as if they don’t exist), the lack of error bars, the lack of transparency about the tools and code used for evals…
August 9, 2025 at 8:21 PM
I have a single question. Why doesn’t OpenAI compare with competitors in their evals? No Gemini, no Claude, no open source models…
August 8, 2025 at 6:50 PM
Reposted by Besmira Nushi
hey wasn't this the same company that made a beautiful shiny "research" post about how AI evals should include error bars or something like that. or did they decide the CLT didn't apply here
August 6, 2025 at 3:20 AM
Reposted by Besmira Nushi
New work from my team! arxiv.org/abs/2507.12950
Intersecting mechanistic interpretability and health AI 😎

We trained and interpreted sparse autoencoders on MAIRA-2, our radiology MLLM. We found a range of human-interpretable radiology reporting concepts, but also many uninterpretable SAE features.
Insights into a radiology-specialised multimodal large language model with sparse autoencoders
Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic...
arxiv.org
July 18, 2025 at 9:30 AM
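For context on the technique mentioned above, here is a generic sparse-autoencoder skeleton of the kind usually trained on cached model activations; it is a hedged illustration, not the MAIRA-2 setup (the dimensions, ReLU encoder, and L1 coefficient are assumptions).

```python
# Generic sparse autoencoder over cached model activations (illustrative only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_step(sae, activations, optimizer, l1_coeff=1e-3):
    """One training step: reconstruction loss plus an L1 penalty that encourages
    each feature to fire rarely, which is what makes the features inspectable."""
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```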
Reposted by Besmira Nushi
If you're at @icmlconf.bsky.social this week, come check out our poster on "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" presented by the amazing @afedercooper.bsky.social from 11:30am–1:30pm PDT on Weds!!! icml.cc/virtual/2025...
ICML Poster: Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge | ICML 2025
icml.cc
July 15, 2025 at 6:35 PM
Reposted by Besmira Nushi
📢 Webinar - 6/18 at 9am PST!
Stop re-running complex recursive queries when your graph data changes. Feldera incrementally evaluates recursive graph computations. Learn to easily build these mechanisms with #SQL, without the hassle of constant recomputation.
tinyurl.com/rb5my7d8
June 11, 2025 at 10:13 PM
I only got to listen to this today. A lot of people in my network, including myself, have felt exactly this for years: the fear that, for some obscure reason, your paperwork and you may not be enough for this country, even in “normal” times.

youtube.com/shorts/IF3bz...
let me explain what being on a student visa is actually like
YouTube video by Representative Pramila Jayapal
youtube.com
June 11, 2025 at 3:53 PM
Reposted by Besmira Nushi
We’ll be at the #Databricks Data + AI Summit in SF next week (6/9–12).

If you’re around and want to chat about how incremental computing can make your #SparkSQL workloads go from hours to seconds — let’s connect.

Grab some time here: calendly.com/matt-feldera...

#DataAISummit #DataEngineering
June 5, 2025 at 8:47 PM
Reposted by Besmira Nushi
Tired: "BS"

Wired: "Vibe citing"

www.nytimes.com/2025/05/29/w...
White House Health Report Included Fake Citations
www.nytimes.com
May 30, 2025 at 8:15 PM
📌You can now find all the evaluation logs from our inference-time scaling report and the Phi-4 reasoning technical report at huggingface.co/datasets/mic.... The evaluation code for the reasoning benchmarks can also be found in the main branch of Eureka ML Insights at github.com/microsoft/eu....
microsoft/Eureka-Bench-Logs · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
May 27, 2025 at 4:02 PM
🎉The Phi-4 reasoning models have landed on HF and Azure AI Foundry. The new models are competitive and often outperform much larger frontier models. It is exciting to see the reasoning capabilities extend to more domains beyond math, including algorithmic reasoning, calendar planning, and coding.
May 1, 2025 at 12:50 AM
Reposted by Besmira Nushi
Re: The Chatbot Arena Illusion

Every eval chokes under hill climbing. If we're lucky, there’s an early phase where *real* learning (both model and community) can occur. I'd argue that a benchmark’s value lies entirely in that window. So the real question is what did we learn?
April 30, 2025 at 4:38 PM
All Eureka inference-time scaling insights are now available here: www.microsoft.com/en-us/resear... It was fun sharing these and more together with Vidhisha Balachandran @vidhishab.bsky.social and Vibhav Vineet at #ICLR2025.
Eureka Inference-Time Scaling Insights: Where We Stand and What Lies Ahead - Microsoft Research
Understanding and measuring the potential of inference-time scaling for reasoning. The new Eureka study tests nine state-of-the-art models on eight diverse reasoning tasks.
www.microsoft.com
April 29, 2025 at 3:36 PM
Come see us in any of the following sessions on model understanding and evaluation! 🔬 #ICLR2025 @msftresearch.bsky.social
April 24, 2025 at 1:38 AM
💡 Eureka inference-time scaling insight (Day 8): On the most complex tasks, reasoning models improve more efficiently from self-generated feedback on their solutions than conventional models do.
April 21, 2025 at 8:04 PM
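The insight above is an empirical finding rather than a recipe, but as a purely hypothetical sketch of the kind of self-feedback loop it refers to: the model critiques its own solution and then revises it. The `generate` callable below stands in for any LLM call and is not part of the Eureka setup.

```python
# Hypothetical self-feedback loop: the model critiques its own solution and revises it.
# `generate(prompt) -> str` is a placeholder for any LLM call.
def self_feedback_revision(generate, problem, rounds=2):
    solution = generate(f"Solve the following problem:\n{problem}")
    for _ in range(rounds):
        critique = generate(
            f"Problem:\n{problem}\n\nProposed solution:\n{solution}\n\n"
            "Point out any errors or gaps in this solution."
        )
        solution = generate(
            f"Problem:\n{problem}\n\nPrevious solution:\n{solution}\n\n"
            f"Feedback:\n{critique}\n\nWrite an improved solution."
        )
    return solution
```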