Daniel Khashabi
@danielkhashabi.bsky.social
I play with intuitions and data.

Now: @jhuclsp @jhucompsci
Past: @allen_ai @uwnlp @Penn @cogcomp @Illinois_Alma @MSFTResearch
Ever since the GPT-2 paper, emergent in-context learning (ICL) from 'next-token' training has been treated as something deeply tied to 𝐡𝐮𝐦𝐚𝐧 𝐥𝐚𝐧𝐠𝐮𝐚𝐠𝐞. But … is it?
November 18, 2025 at 5:27 PM
Big congrats to @jackjingyuzhang for being named an Amazon AI PhD Fellow! 🎉 Grateful for @AmazonScience @RohitPrasadAI’s support as we work together to advance AI research at JHU.
x.com/jackjingyuz...
October 24, 2025 at 4:08 PM
ICL and SFT are the two most studied ways to adapt LMs. We understand each in isolation — but far less about how they might 𝗰𝗼𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗼𝗻𝗲 𝗮𝗻𝗼𝘁𝗵𝗲𝗿.
October 3, 2025 at 2:23 PM
Imagine this: excited about the recent progress, you’ve built an agentic system that uses 🔧tools (API calls) to solve complex problems. What could go wrong?

We studied agentic tool recovery—when your LLM selects a set of tools to execute, but one turns out to be unavailable or incorrect.
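A rough sketch of the failure mode and one naive recovery policy (my illustration, not the paper's method; `call_tool` and the fallback registry are hypothetical stand-ins for a real tool-dispatch layer):

```python
# A naive recovery loop: try the planned tool, then registered fallbacks.
def call_tool(name, args):
    raise NotImplementedError  # placeholder for a real API dispatcher

TOOL_FALLBACKS = {"flight_search_v2": ["flight_search_v1", "web_search"]}

def run_with_recovery(plan):
    """Execute (tool, args) steps, substituting fallbacks when a tool fails."""
    results = []
    for tool, args in plan:
        for candidate in [tool] + TOOL_FALLBACKS.get(tool, []):
            try:
                results.append(call_tool(candidate, args))
                break  # this step succeeded; move to the next one
            except Exception:  # tool unavailable, deprecated, or erroring
                continue
        else:
            raise RuntimeError(f"no working tool for step {tool!r}")
    return results
```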
September 19, 2025 at 2:29 PM
A core hurdle in AI safety evaluation is that benchmarks (e.g., those on jailbreak attacks) become outdated shortly after release: they saturate, get contaminated, or get patched.
August 26, 2025 at 9:15 PM
Excited to collaborate with LMArena, NIH, and DataTecnica to launch BiomedArena! Our goal is to advance the use of LLMs in biomedical discovery and incorporate community-driven insights
to help shape the future of biomedical AI.

⚔️ Check it out: biomedarena.ai
August 19, 2025 at 8:33 PM
What’s really going on inside LLMs when they handle non-English queries?

Niyati Bafna @niyatibafna.bsky.social's recent work introduces the **translation barrier hypothesis**, a framework for understanding multilingual model behavior.

Paper: huggingface.co/papers/2506...
Paper page - The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure
huggingface.co
July 7, 2025 at 12:15 PM
Reposted by Daniel Khashabi
🔈When LLMs solve tasks with a mid-to-low resource input or target language, their output quality is poor. We know that. But can we put our finger on what breaks inside the LLM? We introduce the 💥 translation barrier hypothesis 💥 for failed multilingual generation with LLMs. arxiv.org/abs/2506.22724
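One generic way to probe for an implicit English pivot is a logit-lens-style readout of intermediate layers. The sketch below is my illustration of that technique with a placeholder model, not the paper's actual analysis:

```python
# Minimal logit-lens readout: project each layer's last-position hidden
# state through the final layer norm + unembedding and inspect the top
# token. If mid-depth layers favor English tokens on a non-English task,
# that is consistent with an internal English pivot.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model, purely for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Translate to Swahili: The cat sleeps. ->"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax(-1))!r}")
```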
July 4, 2025 at 5:05 PM
🚨New LLM benchmark🚨 We're releasing BiomedSQL🔬 for tabular reasoning over large-scale biomedical databases. This includes questions based on implicit scientific conventions—like statistical thresholds, effect direction, and drug approval status.

📄 Preprint: arxiv.org/pdf/2505.20321
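To make "implicit conventions" concrete, here is a hypothetical item in the benchmark's style (not an actual BiomedSQL example): the question never states a p-value cutoff, yet answering it correctly requires the standard GWAS genome-wide-significance threshold.

```python
# Hypothetical BiomedSQL-style item (illustrative, not from the benchmark).
question = "Which SNPs are significantly associated with type 2 diabetes?"

# The question never states a cutoff; a correct answer must apply the GWAS
# genome-wide-significance convention p < 5e-8.
implied_sql = """
SELECT snp_id
FROM gwas_associations
WHERE trait = 'type 2 diabetes'
  AND p_value < 5e-8;  -- implicit convention, absent from the question
"""
```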
May 29, 2025 at 12:10 PM
Long-form inputs (e.g., needle-in-a-haystack setups) are a crucial aspect of high-impact LLM applications. While previous studies have flagged issues like positional bias and distracting documents, they have overlooked a key variable: the size of the gold/relevant context.
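A minimal sketch of how one might vary that element (my construction, not the paper's exact protocol): sweep the size of the gold passage while holding its position and the distractor pool fixed.

```python
# Sweep gold-context size while fixing position and distractor pool.
import random

def build_haystack(gold_sentences, distractors, position=0.5, seed=0):
    """Insert a gold passage of a chosen size at a chosen relative position."""
    rng = random.Random(seed)
    ctx = distractors[:]
    rng.shuffle(ctx)
    i = int(position * len(ctx))
    return " ".join(ctx[:i] + gold_sentences + ctx[i:])

gold_size = 8  # the variable under study: grow/shrink the gold passage
gold = [f"Vault detail {k}." for k in range(gold_size - 1)] + ["The vault code is 4821."]
filler = [f"Unrelated fact #{k}." for k in range(500)]
prompt = build_haystack(gold, filler) + "\nQ: What is the vault code?"
```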
May 28, 2025 at 1:16 AM
There have been various efforts to disentangle "task learning" vs. "task recall" in LLMs. We've recently explored a fresh angle by borrowing from cryptography: with substitution ciphers, we transform a given task into an equivalent but cryptic (no pun intended!) form.
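A toy version of the transformation (illustrative only):

```python
# A fixed substitution cipher preserves the task while making its surface
# form unfamiliar: a model that truly learns the task in context should
# cope, while one that merely recalls memorized surface forms should not.
import random, string

def make_cipher(seed=0):
    letters = string.ascii_lowercase
    shuffled = list(letters)
    random.Random(seed).shuffle(shuffled)
    return str.maketrans(letters, "".join(shuffled))

task = "translate english to french: good morning"
print(task.translate(make_cipher()))  # same task, cryptic surface form
```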
May 22, 2025 at 9:30 PM
What is a university without "freedom of speech"?

Apparently, ChatGPT has a better grasp than @nyuniversity.

x.com/nebedaay/st...
May 16, 2025 at 7:54 PM
**Certified Mitigation of Worst-Case LLM Copyright Infringement**

TL;DR: We propose BloomScrub, a framework to certifiably remove long verbatim quotes and reduce the risk of copyright violations.
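The name suggests membership testing over copyrighted n-grams; the sketch below is my guess at the general shape of quote detection (a plain set stands in for a Bloom filter), not the paper's certified mechanism:

```python
# Flag long verbatim overlaps between a generation and a protected corpus.
def ngrams(text, n=8):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

protected = "call me ishmael some years ago never mind how long precisely"  # toy corpus
index = ngrams(protected)

def flag_long_quotes(generation, n=8):
    """Return n-grams of the generation that appear verbatim in the corpus."""
    return ngrams(generation, n) & index

# Flagged spans would then be rewritten until none remain, which is what
# would make the final output's guarantee checkable.
```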
May 12, 2025 at 8:52 PM
Can LLMs be co-pilots for peer review?

Answering this requires evaluating whether LLMs can provide critiques that are *grounded* in the context of science papers.

See @JiefuOu's dataset which has a collection of paper claims and their critiques: arxiv.org/pdf/2503.21717
April 30, 2025 at 4:00 PM
📣📣📣 Tianjian @tli104 and I have refreshed our course material!

self-supervised.cs.jhu.edu/sp2025/

These resources may be helpful if you're:
(1) looking for slides to teach about LLMs, or
(2) interested in diving deeper into the field.
CSCI 601.771: Self-supervised Models
Discussing latest breakthroughs in self-supervised language models
self-supervised.cs.jhu.edu
April 29, 2025 at 8:30 PM
Reposted by Daniel Khashabi
I will be at #NAACL2025 to present our LLM creativity benchmark. Drop by if interested (Poster Session 8, Fri, May 2)!

I'd love to chat about RL and its interpretability, data influence for post-training, and CogSci for LLMs. Feel free to reach out and let's have some coffee together ☕ !
April 28, 2025 at 7:53 PM
Highlighting our #NAACL2025 papers 🧵🧵🧵
April 28, 2025 at 12:30 PM
People rely on search engines/chatbots to access science.

But what if you want a bird’s-eye view of science, or to identify over- and under-explored areas?

We introduce 🔺Science Hierarchography🔺, the task of organizing science papers into conceptual hierarchies.

arxiv.org/abs/2504.13834
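As a point of reference, here is one naive baseline for the task (my sketch; the paper's method differs): embed paper abstracts, then cluster them agglomeratively so the merge order induces a tree.

```python
# Embed abstracts, then agglomeratively cluster: the merge order induces a
# binary tree that can be labeled with concepts at each level.
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "We fine-tune transformers for low-resource translation.",
    "A benchmark for biomedical question answering.",
    "Scaling laws for next-token prediction.",
]
X = TfidfVectorizer().fit_transform(abstracts).toarray()
tree = AgglomerativeClustering(n_clusters=None, distance_threshold=0.0).fit(X)
# tree.children_ encodes the full merge tree: row i merges two nodes into
# new node n_samples + i, giving a crude hierarchy over the papers.
print(tree.children_)
```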
April 23, 2025 at 12:30 PM
Highlighting our #ICLR2025 papers 🧵🧵🧵

(1) "GenEx: Generating an Explorable World"
openreview.net/pdf?id=8NlU...

TLDR: Physical exploration can be expensive, or even impossible. Our proposed policy mitigates this by enabling agents to form an imaginative model of the 3D world.
April 21, 2025 at 12:35 PM
The flow of talent across institutions keeps research vibrant.
Excited that students from my lab are off to top PhD programs!

Muhan Gao @muhan_gao→ Texas A&M
Zhouxiang Feng @FocusV857→ Rice
Abe Hou @abe_hou→ Stanford
Taiming Lu @TaiMingLu→ Princeton
Dongwei Jiang @Dongwei__Jiang→ USC
April 16, 2025 at 10:05 PM
Research is important, but so is recharging!
Took the team out for some badminton fun today—amazing energy, lots of laughs, and a reminder of how lucky I am to work with this crew!
April 14, 2025 at 3:09 AM
Several collaborators have expressed frustration that reviewers are excessively harsh. Might this behavior be influenced by the anonymity of the review process, similar to the dynamic we often observe on platforms such as Reddit?
April 4, 2025 at 3:06 AM
Can a simulated society of AI agents be used to assess the effectiveness of social policies?

See Abe Hou @abe_hou's study in the context of "vaccine hesitancy", where we can use historical data for comparison and validation.

arxiv.org/abs/2503.09639
Can A Society of Generative Agents Simulate Human Behavior and...
Can we simulate a sandbox society with generative agents to model human behavior, thereby reducing the over-reliance on real human trials for assessing public policies? In this work, we...
arxiv.org
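A toy sketch of the general simulation shape (my illustration, not the paper's system): persona-conditioned agents exchange influence and update a vaccination stance, and the aggregate trajectory is the quantity one would validate against historical data.

```python
# Persona-conditioned agents update a hesitancy score under social
# influence and a policy nudge. (The social-averaging update is a
# stand-in for an LLM call conditioned on persona + messages.)
import random

class Agent:
    def __init__(self, persona, hesitancy):
        self.persona = persona      # demographic / attitude profile
        self.hesitancy = hesitancy  # in [0, 1]

    def step(self, neighbors, policy_nudge=0.0):
        social = sum(n.hesitancy for n in neighbors) / len(neighbors)
        new = 0.8 * self.hesitancy + 0.2 * social - policy_nudge
        self.hesitancy = max(0.0, min(1.0, new))

agents = [Agent(f"persona-{i}", random.random()) for i in range(100)]
for t in range(30):  # 30 simulated rounds of interaction
    for a in agents:
        a.step(random.sample(agents, 5), policy_nudge=0.01)
print(sum(a.hesitancy for a in agents) / len(agents))  # mean hesitancy
```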
April 3, 2025 at 8:51 PM
Hello world!
April 3, 2025 at 1:24 AM