Daniel Khashabi
Daniel Khashabi
@danielkhashabi.bsky.social
I play with intuitions and data.

Now: @jhuclsp @jhucompsci
Past: @allen_ai @uwnlp @Penn @cogcomp @Illinois_Alma @MSFTResearch
For years since the GPT-2 paper, emergent in-context learning (ICL) from 'next-token' training has been treated as something deeply tied to 𝐡𝐮𝐦𝐚𝐧 𝐥𝐚𝐧𝐠𝐮𝐚𝐠𝐞. But … is it?
November 18, 2025 at 5:27 PM
Imagine this: excited about the recent progress, you’ve built an agentic system that uses 🔧tools (API calls) to solve complex problems. What could go wrong?

We studied agentic tool recovery—when your LLM selects a set of tools to execute, but one turns out to be unavailable or incorrect.
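To make that concrete, here is a minimal sketch of one recovery pattern (the tool names and the fallback heuristic are illustrative, not the setup from the paper):

```python
# Sketch of agentic tool recovery: if a selected tool turns out to be
# unavailable, fall back to the closest-named available tool instead of
# failing the whole plan. Tool names and heuristic are illustrative only.
from difflib import get_close_matches

AVAILABLE_TOOLS = {
    "web_search": lambda q: f"search results for {q!r}",
    "read_file": lambda p: f"contents of {p!r}",
}

def call_with_recovery(tool_name, arg):
    """Execute the requested tool; if it is missing, retry a similar one."""
    if tool_name in AVAILABLE_TOOLS:
        return AVAILABLE_TOOLS[tool_name](arg)
    fallback = get_close_matches(tool_name, list(AVAILABLE_TOOLS), n=1)
    if fallback:
        print(f"'{tool_name}' unavailable; falling back to '{fallback[0]}'")
        return AVAILABLE_TOOLS[fallback[0]](arg)
    raise RuntimeError(f"no recovery path for tool '{tool_name}'")

print(call_with_recovery("web_searcher", "agentic tool recovery"))
```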
September 19, 2025 at 2:29 PM
But how do we measure benchmark effectiveness? A key premise is that the effectiveness of attack prompts on dev models predicts their effectiveness on unseen eval models. Jack verifies that this is indeed the case: his resulting benchmark, JBDistill-Bench, is effective on *unseen* models.
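A toy illustration of that premise (the numbers are placeholders, not results from the paper): if per-prompt attack success on dev models correlates with success on held-out models, a benchmark distilled on dev models should transfer.

```python
# Toy check of the dev-to-eval transfer premise: do prompts that succeed on
# development models also succeed on unseen evaluation models?
# Success rates below are fabricated placeholders.
from statistics import correlation  # Pearson r, Python 3.10+

dev_success = [0.9, 0.7, 0.4, 0.2, 0.8]    # per-prompt success on dev models
eval_success = [0.8, 0.6, 0.3, 0.1, 0.9]   # per-prompt success on unseen models

print(f"dev-to-eval correlation: {correlation(dev_success, eval_success):.2f}")
```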
August 26, 2025 at 9:15 PM
Highlighting a key result in the figure: when we inspect intermediate layers, we see that models often solve the task in the wrong (off-target) language; that is, off-target accuracy is high early on. Only in the later layers does the answer get translated into the intended language.
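For the flavor of this kind of probe, a rough logit-lens-style sketch (the model and readout here are placeholders, not the exact setup of the paper):

```python
# Decode each hidden layer through the output head and inspect which token
# (and hence which language) the model favors at that depth. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the study concerns multilingual models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

prompt = 'French: "book" -> "'
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

for layer, h in enumerate(out.hidden_states):
    # Project the last position's hidden state through the LM head.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, repr(tok.decode(logits.argmax(-1))))
```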
July 7, 2025 at 12:15 PM
🚨New LLM benchmark🚨 We're releasing BiomedSQL🔬 for tabular reasoning over large-scale biomedical databases. This includes questions based on implicit scientific conventions—like statistical thresholds, effect direction, and drug approval status.

📄 Preprint: arxiv.org/pdf/2505.20321
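To give a flavor of "implicit conventions": a question like "Which variants are significantly associated with LDL cholesterol?" never states a threshold, yet a correct query must encode the field's convention of genome-wide significance (p < 5e-8). The table and column names below are made up for illustration, not taken from the benchmark:

```python
# Hypothetical illustration: the question is silent about the cutoff, but the
# query has to apply the conventional genome-wide significance threshold.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gwas (variant TEXT, trait TEXT, p_value REAL)")
conn.executemany("INSERT INTO gwas VALUES (?, ?, ?)", [
    ("rs0001", "LDL cholesterol", 3e-12),
    ("rs0002", "LDL cholesterol", 1e-4),  # small p, but not genome-wide significant
])

query = """
SELECT variant FROM gwas
WHERE trait = 'LDL cholesterol' AND p_value < 5e-8  -- implicit convention
"""
print(conn.execute(query).fetchall())  # [('rs0001',)]
```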
May 29, 2025 at 12:10 PM
In our latest study, we look into how the size of these gold contexts impacts LLM performance in needle-in-a-haystack scenarios. The verdict? **Smaller gold contexts severely amplify positional bias.**
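A quick sketch of how such a probe can be constructed (the filler text, gold fact, and sizes are placeholders, not our benchmark data):

```python
# Build a needle-in-a-haystack prompt: place a "gold" context of varying size
# at a chosen relative position inside filler text.
def build_haystack(gold: str, position: float, total_sentences: int = 200) -> str:
    """Insert the gold context at a relative position (0.0 = start, 1.0 = end)."""
    filler = ["The sky was a uniform shade of gray that afternoon."] * total_sentences
    idx = int(position * total_sentences)
    return " ".join(filler[:idx] + [gold] + filler[idx:])

small_gold = "The secret code is 4721."
large_gold = ("The visiting engineer reviewed the logs, checked the badge records, "
              "and finally confirmed that the secret code is 4721, as noted in the memo.")

for gold in (small_gold, large_gold):
    prompt = build_haystack(gold, position=0.5) + "\n\nQuestion: What is the secret code?"
    print("gold length:", len(gold), "| prompt length:", len(prompt))
```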
May 28, 2025 at 1:16 AM
There have been various efforts on disentangling "task learning" vs. "task recall" in LLMs. We've recently explored a fresh angle by borrowing from cryptography: with substitution ciphers, we transform a given task into equivalent, but cryptic (no pun intended!!), forms.
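A tiny sketch of the cipher idea (a random letter substitution; the actual tasks and cipher construction in the paper may differ):

```python
# Re-encode a task with a fixed substitution cipher: surface-level recall of
# the original wording no longer helps, but the task stays equivalent.
import random
import string

rng = random.Random(0)
shuffled = list(string.ascii_lowercase)
rng.shuffle(shuffled)
ENCODE = str.maketrans(string.ascii_lowercase, "".join(shuffled))
DECODE = str.maketrans("".join(shuffled), string.ascii_lowercase)

task = "what is the capital of france?"
encoded = task.translate(ENCODE)
print(encoded)                     # cryptic, but information-preserving
print(encoded.translate(DECODE))   # round-trips back to the original task
```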
May 22, 2025 at 9:30 PM
What is a university without "freedom of speech"?

Apparently, ChatGPT has a better grasp than @nyuniversity.

x.com/nebedaay/st...
May 16, 2025 at 7:54 PM
Our approach: We propose BloomScrub 🧽, a framework that certifiably mitigates worst-case infringement risks while maintaining output utility.

* It's simple: Rewrite content by targeting and transforming the few longest quotes.
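A rough sketch of the "target the longest quotes" step (the matching heuristic below is illustrative, not BloomScrub's exact procedure):

```python
# Find the longest verbatim overlap between a generation and a protected
# source, so that only that span needs to be rewritten.
from difflib import SequenceMatcher

def longest_quote(generation: str, source: str) -> str:
    """Return the longest contiguous span shared verbatim with the source."""
    m = SequenceMatcher(None, generation, source, autojunk=False)
    match = m.find_longest_match(0, len(generation), 0, len(source))
    return generation[match.a: match.a + match.size]

source = "It was the best of times, it was the worst of times."
generation = "As the novel opens, it was the best of times, and hope ran high."
print(repr(longest_quote(generation, source)))  # the span a rewrite would target first
```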
May 12, 2025 at 8:52 PM
Can LLMs be co-pilots for peer review?

Answering this requires evaluating whether LLMs can provide critiques that are *grounded* in the context of science papers.

See @JiefuOu's dataset which has a collection of paper claims and their critiques: arxiv.org/pdf/2503.21717
April 30, 2025 at 4:00 PM
People rely on search engines/chatbots to access science.

But what if you want a bird’s-eye view of science, or to identify over- and under-explored areas?

We introduce 🔺Science Hierarchography🔺, the goal of organizing science papers into conceptual hierarchies.

arxiv.org/abs/2504.13834
April 23, 2025 at 12:30 PM
Research is important, but so is recharging!
Took the team out for some badminton fun today—amazing energy, lots of laughs, and a reminder of how lucky I am to work with this crew!
April 14, 2025 at 3:09 AM