Daniel Khashabi
Daniel Khashabi
@danielkhashabi.bsky.social
I play with intuitions and data.

Now: @jhuclsp @jhucompsci
Past: @allen_ai @uwnlp @Penn @cogcomp @Illinois_Alma @MSFTResearch
For years since the GPT-2 paper, emergent in-context learning (ICL) from 'next-token' training has been treated as something deeply tied to 𝐡𝐮𝐦𝐚𝐧 𝐥𝐚𝐧𝐠𝐮𝐚𝐠𝐞. But … is it?
November 18, 2025 at 5:27 PM
Imagine this: excited about the recent progress, you’ve built an agentic system that uses 🔧tools (API calls) to solve complex problems. What could go wrong?

We studied agentic tool recovery—when your LLM selects a set of tools to execute, but one turns out to be unavailable or incorrect.
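To make that concrete, here is a minimal sketch of one recovery pattern (the tool names and the fallback heuristic are illustrative, not the setup from the paper):

```python
# Sketch of agentic tool recovery: if a selected tool turns out to be
# unavailable, fall back to the closest-named available tool instead of
# failing the whole plan. Tool names and heuristic are illustrative only.
from difflib import get_close_matches

AVAILABLE_TOOLS = {
    "web_search": lambda q: f"search results for {q!r}",
    "read_file": lambda p: f"contents of {p!r}",
}

def call_with_recovery(tool_name, arg):
    """Execute the requested tool; if it is missing, retry a similar one."""
    if tool_name in AVAILABLE_TOOLS:
        return AVAILABLE_TOOLS[tool_name](arg)
    fallback = get_close_matches(tool_name, list(AVAILABLE_TOOLS), n=1)
    if fallback:
        print(f"'{tool_name}' unavailable; falling back to '{fallback[0]}'")
        return AVAILABLE_TOOLS[fallback[0]](arg)
    raise RuntimeError(f"no recovery path for tool '{tool_name}'")

print(call_with_recovery("web_searcher", "agentic tool recovery"))
```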
September 19, 2025 at 2:29 PM
But how do we measure benchmark effectiveness? A key premise is that the effectiveness of attack prompts on dev models predicts their effectiveness on unseen eval models. Jack verifies that this is indeed the case: his resulting benchmark, JBDistill-Bench, is effective on *unseen* models.
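A toy illustration of that premise (the numbers are placeholders, not results from the paper): if per-prompt attack success on dev models correlates with success on held-out models, a benchmark distilled on dev models should transfer.

```python
# Toy check of the dev-to-eval transfer premise: do prompts that succeed on
# development models also succeed on unseen evaluation models?
# Success rates below are fabricated placeholders.
from statistics import correlation  # Pearson r, Python 3.10+

dev_success = [0.9, 0.7, 0.4, 0.2, 0.8]    # per-prompt success on dev models
eval_success = [0.8, 0.6, 0.3, 0.1, 0.9]   # per-prompt success on unseen models

print(f"dev-to-eval correlation: {correlation(dev_success, eval_success):.2f}")
```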
August 26, 2025 at 9:15 PM
Highlighting a key result in the figure: when we inspect intermediate layers, we see that models often solve the task in the wrong (off-target) language; that is, off-target accuracy is high early on. Only in the later layers does the answer get translated into the intended language.
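For the flavor of this kind of probe, a rough logit-lens-style sketch (the model and readout here are placeholders, not the exact setup of the paper):

```python
# Decode each hidden layer through the output head and inspect which token
# (and hence which language) the model favors at that depth. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the study concerns multilingual models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

prompt = 'French: "book" -> "'
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

for layer, h in enumerate(out.hidden_states):
    # Project the last position's hidden state through the LM head.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, repr(tok.decode(logits.argmax(-1))))
```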
July 7, 2025 at 12:15 PM
🚨New LLM benchmark🚨 We're releasing BiomedSQL🔬 for tabular reasoning over large-scale biomedical databases. This includes questions based on implicit scientific conventions—like statistical thresholds, effect direction, and drug approval status.

📄 Preprint: arxiv.org/pdf/2505.20321
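To give a flavor of "implicit conventions": a question like "Which variants are significantly associated with LDL cholesterol?" never states a threshold, yet a correct query must encode the field's convention of genome-wide significance (p < 5e-8). The table and column names below are made up for illustration, not taken from the benchmark:

```python
# Hypothetical illustration: the question is silent about the cutoff, but the
# query has to apply the conventional genome-wide significance threshold.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gwas (variant TEXT, trait TEXT, p_value REAL)")
conn.executemany("INSERT INTO gwas VALUES (?, ?, ?)", [
    ("rs0001", "LDL cholesterol", 3e-12),
    ("rs0002", "LDL cholesterol", 1e-4),  # small p, but not genome-wide significant
])

query = """
SELECT variant FROM gwas
WHERE trait = 'LDL cholesterol' AND p_value < 5e-8  -- implicit convention
"""
print(conn.execute(query).fetchall())  # [('rs0001',)]
```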
May 29, 2025 at 12:10 PM
In our latest study, we look into how the size of these gold contexts impacts LLM performance in needle-in-a-haystack scenarios. The verdict? **Smaller gold contexts severely amplify positional bias.**
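A quick sketch of how such a probe can be constructed (the filler text, gold fact, and sizes are placeholders, not our benchmark data):

```python
# Build a needle-in-a-haystack prompt: place a "gold" context of varying size
# at a chosen relative position inside filler text.
def build_haystack(gold: str, position: float, total_sentences: int = 200) -> str:
    """Insert the gold context at a relative position (0.0 = start, 1.0 = end)."""
    filler = ["The sky was a uniform shade of gray that afternoon."] * total_sentences
    idx = int(position * total_sentences)
    return " ".join(filler[:idx] + [gold] + filler[idx:])

small_gold = "The secret code is 4721."
large_gold = ("The visiting engineer reviewed the logs, checked the badge records, "
              "and finally confirmed that the secret code is 4721, as noted in the memo.")

for gold in (small_gold, large_gold):
    prompt = build_haystack(gold, position=0.5) + "\n\nQuestion: What is the secret code?"
    print("gold length:", len(gold), "| prompt length:", len(prompt))
```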
May 28, 2025 at 1:16 AM
There have been various efforts on disentangling "task learning" vs. "task recall" in LLMs. We've recently explored a fresh angle by borrowing from cryptography: with substitution ciphers, we transform a given task into equivalent, but cryptic (no pun intended!!), forms.
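A tiny sketch of the cipher idea (a random letter substitution; the actual tasks and cipher construction in the paper may differ):

```python
# Re-encode a task with a fixed substitution cipher: surface-level recall of
# the original wording no longer helps, but the task stays equivalent.
import random
import string

rng = random.Random(0)
shuffled = list(string.ascii_lowercase)
rng.shuffle(shuffled)
ENCODE = str.maketrans(string.ascii_lowercase, "".join(shuffled))
DECODE = str.maketrans("".join(shuffled), string.ascii_lowercase)

task = "what is the capital of france?"
encoded = task.translate(ENCODE)
print(encoded)                     # cryptic, but information-preserving
print(encoded.translate(DECODE))   # round-trips back to the original task
```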
May 22, 2025 at 9:30 PM
What is a university without "freedom of speech"?

Apparently, ChatGPT has a better grasp than @nyuniversity.

x.com/nebedaay/st...
May 16, 2025 at 7:54 PM
Our approach: We propose BloomScrub 🧽, a framework that certifiably mitigates worst-case infringement risks while maintaining output utility.

* It's simple: Rewrite content by targeting and transforming the few longest quotes.
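A rough sketch of the "target the longest quotes" step (the matching heuristic below is illustrative, not BloomScrub's exact procedure):

```python
# Find the longest verbatim overlap between a generation and a protected
# source, so that only that span needs to be rewritten.
from difflib import SequenceMatcher

def longest_quote(generation: str, source: str) -> str:
    """Return the longest contiguous span shared verbatim with the source."""
    m = SequenceMatcher(None, generation, source, autojunk=False)
    match = m.find_longest_match(0, len(generation), 0, len(source))
    return generation[match.a: match.a + match.size]

source = "It was the best of times, it was the worst of times."
generation = "As the novel opens, it was the best of times, and hope ran high."
print(repr(longest_quote(generation, source)))  # the span a rewrite would target first
```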
May 12, 2025 at 8:52 PM
Can LLMs be co-pilots for peer review?

Answering this requires evaluating whether LLMs can provide critiques that are *grounded* in the context of science papers.

See @JiefuOu's dataset which has a collection of paper claims and their critiques: arxiv.org/pdf/2503.21717
April 30, 2025 at 4:00 PM
People rely on search engines/chatbots to access science.

But what if you want a bird’s-eye view of science, or to identify over- and under-explored areas?

We introduce 🔺Science Hierarchography🔺, the goal of organizing science papers into conceptual hierarchies.

arxiv.org/abs/2504.13834
April 23, 2025 at 12:30 PM
Research is important, but so is recharging!
Took the team out for some badminton fun today—amazing energy, lots of laughs, and a reminder of how lucky I am to work with this crew!
April 14, 2025 at 3:09 AM