Jessy Li
@jessyjli.bsky.social
https://jessyli.com Associate Professor, UT Austin Linguistics.
Part of UT Computational Linguistics https://sites.utexas.edu/compling/ and UT NLP https://www.nlp.utexas.edu/
Reposted by Jessy Li
New work to appear @ TACL!

Language models (LMs) are remarkably good at generating novel well-formed sentences, leading to claims that they have mastered grammar.

Yet they often assign higher probability to ungrammatical strings than to grammatical strings.

How can both things be true? 🧵👇
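
To see the tension concretely, here is a minimal sketch (my illustration, not the paper's method) that scores a grammatical/ungrammatical minimal pair with a small causal LM via Hugging Face; "gpt2" and the agreement pair below are placeholder choices.

# Minimal sketch: compare summed token log-probabilities for a
# grammatical vs. ungrammatical minimal pair. Assumes the Hugging Face
# transformers library; "gpt2" stands in for any causal LM.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability of the sentence under the LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token,
    # so multiply by the number of predicted tokens and negate.
    return -out.loss.item() * (ids.size(1) - 1)

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
for s in (grammatical, ungrammatical):
    print(f"{sentence_logprob(s):9.2f}  {s}")

If the paper's claim holds, pairs like this will often come out "backwards", with the ungrammatical string scoring higher, even though sampling from the same model yields mostly well-formed text.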
November 10, 2025 at 10:11 PM
Incredibly honored to serve as #EMNLP 2026 Program Chair along with @sunipadev.bsky.social and Hung-yi Lee, and General Chair @andre-t-martins.bsky.social. Looking forward to Budapest!!

(With thanks to Lisa Chuyuan Li who took this photo in Suzhou!)
November 8, 2025 at 2:39 AM
Reposted by Jessy Li
Delighted that Sasha's (first-year PhD!) work using mech interp to study complex syntactic constructions won an Outstanding Paper Award at EMNLP!

Also delighted the ACL community continues to recognize unabashedly linguistic topics like filler-gaps... and the huge potential for LMs to inform such topics!
November 7, 2025 at 6:22 PM
Think your LLMs “understand” words like although/but/therefore? Think again!

They perform at chance when making inferences from certain discourse connectives expressing concession.
"Although I hate leafy vegetables, I prefer daxes to blickets." Can you tell whether daxes are leafy vegetables? LMs can't seem to!

We investigate if LMs capture these inferences from connectives when they cannot rely on world knowledge.
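
As a toy illustration of the kind of probe involved (my sketch, not the paper's protocol; "gpt2" and the Yes/No framing are assumptions), one can compare the model's next-token preference after a concessive-inference question:

# Toy probe: does the model prefer "Yes" after a question whose answer
# follows from the concessive connective alone? (The concession implies
# daxes ARE leafy vegetables, despite the nonce words.)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = ('"Although I hate leafy vegetables, I prefer daxes to blickets."\n'
          "Are daxes leafy vegetables? Answer:")
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]  # next-token distribution

for ans in (" Yes", " No"):
    tid = tok(ans).input_ids[0]  # first subword of the candidate answer
    print(ans.strip(), f"{logits[tid].item():.2f}")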

New paper w/ Daniel, Will, @jessyjli.bsky.social
October 16, 2025 at 5:02 PM
🚨 Does your LLM really understand code -- or is it just really good at remembering it?
We built **PLSemanticsBench** to find out.
The results: a wild mix.

✅The Brilliant:
Top reasoning models can execute complex, fuzzer-generated programs -- even with 5+ levels of nested loops! 🤯

❌The Brittle: 🧵
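
To make the "Brilliant" case above concrete, here is the flavor of probe involved (an assumed illustration, not an item from PLSemanticsBench itself): a small fuzzer-style program with nested loops whose output the model must predict without executing it.

# A fuzzer-style probe program with three levels of nesting. The LM is
# asked to predict the printed value; an evaluator compares against the
# ground truth obtained by actually running the code.
def probe_program() -> int:
    acc = 0
    for i in range(3):
        for j in range(i, 4):
            k = 0
            while k < j:  # third nesting level
                acc += (i * j) ^ k
                k += 2
    return acc

print(probe_program())  # ground truth for scoring the model's answer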
October 14, 2025 at 2:33 AM
Reposted by Jessy Li
Find my students and collaborators at COLM this week!

Tuesday morning: @juand-r.bsky.social and @ramyanamuduri.bsky.social's papers (find them if you missed the talks!)

Wednesday pm: @manyawadhwa.bsky.social 's EvalAgent

Thursday am: @anirudhkhatry.bsky.social 's CRUST-Bench oral spotlight + poster
October 7, 2025 at 6:03 PM
We’re hiring faculty as well! Happy to talk about it at COLM!
UT Austin Linguistics is hiring in computational linguistics!

Assistant or Associate Professor.

We have a thriving group sites.utexas.edu/compling/ and a long proud history in the space. (For instance, fun fact, Jeff Elman was a UT Austin Linguistics Ph.D.)

faculty.utexas.edu/career/170793

🤘
October 8, 2025 at 1:17 AM
Reposted by Jessy Li
Can we quantify what makes some text read like AI "slop"? We tried 👇
"AI slop" seems to be everywhere, but what exactly makes text feel like "slop"?

In our new work (w/ @tuhinchakr.bsky.social, Diego Garcia-Olano, @byron.bsky.social ) we provide a systematic attempt at measuring AI "slop" in text!

arxiv.org/abs/2509.19163

🧵 (1/7)
September 24, 2025 at 1:28 PM
Reposted by Jessy Li
On my way to #COLM2025 🍁

Check out jessyli.com/colm2025

QUDsim: Discourse templates in LLM stories arxiv.org/abs/2504.09373

EvalAgent: retrieval-based eval targeting implicit criteria arxiv.org/abs/2504.15219

RoboInstruct: code generation for robotics with simulators arxiv.org/abs/2405.20179
October 6, 2025 at 3:50 PM
Reposted by Jessy Li
Traveling to my first @colmweb.org🍁

Not presenting anything but here are two posters you should visit:

1. @qyao.bsky.social on Controlled rearing for direct and indirect evidence for datives (w/ me, @weissweiler.bsky.social and @kmahowald.bsky.social), W morning

Paper: arxiv.org/abs/2503.20850
Both Direct and Indirect Evidence Contribute to Dative Alternation Preferences in Language Models
October 6, 2025 at 3:22 PM
All of us (@kanishka.bsky.social @kmahowald.bsky.social and me) are looking for PhD students this cycle! If computational linguistics/NLP is your passion, join us at UT Austin!

For my areas see jessyli.com
September 30, 2025 at 7:30 PM
Can AI aid scientists within their own workflows, when there is no step-by-step recipe and the kind of scientific utility a visualization would bring may not be known in advance?

Check out @sebajoe.bsky.social’s feature on ✨AstroVisBench:
Exciting news! Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy!

A new benchmark developed by researchers at the NSF-Simons AI Institute for Cosmic Origins is testing how well LLMs implement scientific workflows in astronomy and visualize results.
September 25, 2025 at 8:52 PM
Reposted by Jessy Li
📣 NEW HCTS course developed in collaboration with @tephi-tx.bsky.social: AI in Health Communication 📣

Explore responsible applications and best practices for maximizing impact and building trust with @utaustin.bsky.social experts @jessyjli.bsky.social & @mackert.bsky.social.

💻: rebrand.ly/HCTS_AI
September 4, 2025 at 5:02 PM
Reposted by Jessy Li
Long-range narrative understanding, even basic fact verification that humans easily score near-perfect on, has barely improved in LMs over the years: novelchallenge.github.io
August 15, 2025 at 3:55 PM
Reposted by Jessy Li
🤖 🧠 NEW PAPER ON COGSCI & AI 🧠 🤖

Recent neural networks capture properties long thought to require symbols: compositionality, productivity, rapid learning

So what role should symbols play in theories of the mind? For our answer...read on!

Paper: arxiv.org/abs/2508.05776

1/n
August 15, 2025 at 4:27 PM
Reposted by Jessy Li
I agree this thread's headline claim seems premature. Let me add our recent ACL Findings paper, with Dexter Ju and @hagenblix.bsky.social, which found syntactic simplification in at least some LMs, in a novel domain regeneration setting: aclanthology.org/2025.finding...
aclanthology.org
August 15, 2025 at 4:35 AM
The Echoes in AI paper showed quite the opposite, also with a story-continuation setup.
Additionally, we present evidence that both *syntactic* and *discourse* diversity measures show strong homogenization that the lexical and cosine measures used in this paper do not capture. (A toy illustration of that distinction follows below.)
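
Here is that toy sketch (my illustration, not the paper's actual metrics): distinct word trigrams as a lexical diversity measure vs. distinct POS trigrams as a crude syntactic proxy. Assumes nltk with the punkt and averaged_perceptron_tagger data packages installed.

import nltk  # assumes: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def distinct_trigrams(seqs) -> float:
    """Fraction of trigram tokens that are unique across all sequences."""
    grams = [tuple(s[i:i + 3]) for s in seqs for i in range(len(s) - 2)]
    return len(set(grams)) / max(len(grams), 1)

# Lexically very different stories that share one syntactic template.
stories = [
    "The knight rode to the castle and found the gate locked.",
    "The sailor walked to the harbor and found the ship gone.",
]
tokens = [nltk.word_tokenize(s) for s in stories]
pos_seqs = [[tag for _, tag in nltk.pos_tag(toks)] for toks in tokens]

print("lexical distinct-3:  ", distinct_trigrams(tokens))    # high
print("syntactic distinct-3:", distinct_trigrams(pos_seqs))  # much lower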
August 12, 2025 at 9:01 PM
Tuesday at #ACL2025: Jan will be presenting this from 4-5:30pm in x4/x5!
Turns out content selection in LLMs is highly consistent across models, but not so much with their own notion of importance or with humans'…
Do you want to know what information LLMs prioritize in text synthesis tasks? Here's a short 🧵 about our new paper, led by Jan Trienes: an interpretable framework for salience analysis in LLMs.

First of all, information salience is a fuzzy concept. So how can we even measure it? (1/6)
July 28, 2025 at 9:51 PM
Reposted by Jessy Li
Looking forward to attending #cogsci2025 (Jul 29 - Aug 3)! I’m especially excited to meet students who will be applying to PhD programs in Computational Ling/CogSci in the coming cycle.

Please reach out if you want to meet up and chat! Email is the best way, but DM also works if you must!

quick🧵:
July 28, 2025 at 9:20 PM
If you’re heading to ICML, check out Hongli’s work on context-specific alignment!
I'll be at #ICML to present SPRI next week! Come by our poster on Tuesday, July 15, 4:30pm, and let’s catch up on LLM alignment! 😃

🚀TL;DR: We introduce Situated-PRInciples (SPRI), a framework that automatically generates input-specific principles to align responses — with minimal human effort.

🧵
July 11, 2025 at 6:01 PM
Check out this new opinion piece from Sebastian and Lily! We have really powerful AI systems now, so what's the bottleneck preventing wider adoption of fact-checking systems in high-stakes scenarios like medicine? It's how we define the tasks 👇
Are we fact-checking medical claims the right way? 🩺🤔

Probably not. In our study, even experts struggled to verify Reddit health claims using end-to-end systems.

We show why—and argue fact-checking should be a dialogue, with patients in the loop

arxiv.org/abs/2506.20876

🧵1/
July 2, 2025 at 4:38 PM
We have very good frameworks for cooperative dialog… but how about the opposite? @asher-zheng.bsky.social’s new paper takes a game-theoretic view and develops new metrics to quantify non-cooperative language ♟️

Turns out LLMs don’t have the pragmatic capabilities to perceive these…
Language is often strategic, but LLMs tend to play nice. How strategic are they really? Probing into that is key for future safety alignment.

👉Introducing CoBRA🐍, a framework that assesses strategic language.

Work with my amazing advisors @jessyjli.bsky.social and @David I. Beaver!
June 3, 2025 at 8:47 PM
Reposted by Jessy Li
Congrats to Sebastian et al. This is our first fully within-CosmicAI research project, setting the foundation for our future work.
How good are LLMs at 🔭 scientific computing and visualization 🔭?

AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results.

SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵
June 2, 2025 at 10:03 PM
Is AI ready to play a real role in science? This work with @nsfsimonscosmicai.bsky.social evaluates LLMs on implementing scientific workflows and on the scientific utility of visualizations produced by LLM-generated code -- and the answer is not yet, even with the best SOTA models 👇
How good are LLMs at 🔭 scientific computing and visualization 🔭?

AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results.

SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵
June 2, 2025 at 7:45 PM