Lightnews — Scholar-powered news

Jessy Li

@jessyjli.bsky.social

2.5K followers 460 following 55 posts

https://jessyli.com Associate Professor, UT Austin Linguistics.
Part of UT Computational Linguistics https://sites.utexas.edu/compling/ and UT NLP https://www.nlp.utexas.edu/


Jessy Li

@jessyjli.bsky.social

Test your models and see if they just memorize or truly understand!

PLSemanticsBench - where formal meets informal!

arxiv.org/abs/2510.03415

Team: Aditya Thimmaiah, Jiyang Zhang, Jayanth Srinivasa, Milos Gligoric

PLSemanticsBench: Large Language Models As Programming Language Interpreters

As large language models (LLMs) excel at code reasoning, a natural question arises: can an LLM execute programs (i.e., act as an interpreter) purely based on a programming language's formal semantics?...

arxiv.org

October 14, 2025 at 2:33 AM

Jessy Li

@jessyjli.bsky.social

So what's really happening⁉️
LLMs aren't interpreting rules -- they're recalling patterns.
Their "understanding" is promising... but shallow.

💡It's time to test semantics, not just syntax.💡
To move from surface-level memorization → true symbolic reasoning.

October 14, 2025 at 2:33 AM

Jessy Li

@jessyjli.bsky.social

Change the rules -- swap operators (+ with -) or replace them with novel symbols -- and accuracy collapses.
Models that were "near-perfect" drop to single digits. 😬

October 14, 2025 at 2:33 AM
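
To make the operator-swap idea in the post above concrete, here is a minimal sketch of evaluating one expression under standard and under swapped semantics. It is an illustration only, not the benchmark's actual setup: the names STANDARD, SWAPPED, and evaluate are hypothetical, and the toy evaluator handles only flat expressions read left to right.

```python
import operator

# Standard semantics: each symbol keeps its usual meaning.
STANDARD = {"+": operator.add, "-": operator.sub}

# Hypothetical perturbed semantics: "+" and "-" trade meanings.
SWAPPED = {"+": operator.sub, "-": operator.add}


def evaluate(tokens, semantics):
    """Evaluate a flat expression such as [7, "+", 2, "-", 1], left to right."""
    result = tokens[0]
    # Tokens alternate operand, operator, operand, ...
    for i in range(1, len(tokens), 2):
        op, operand = tokens[i], tokens[i + 1]
        result = semantics[op](result, operand)
    return result


expr = [7, "+", 2, "-", 1]
print(evaluate(expr, STANDARD))  # 8
print(evaluate(expr, SWAPPED))   # 6
```

An interpreter that follows the stated rules returns 6 under the swapped table, while one that falls back on familiar patterns would still answer 8.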

Jessy Li

@jessyjli.bsky.social

Here is a genuine one :) CosmicAI’s AstroVisBench, to appear at #NeurIPS

bsky.app/profile/nsfs...

NSF-Simons AI Institute for Cosmic Origins (CosmicAI) @nsfsimonscosmicai.bsky.social · Sep 25

Exciting news! Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy!

A new benchmark developed by researchers at the NSF-Simons AI Institute for Cosmic Origins is testing how well LLMs implement scientific workflows in astronomy and visualize results.

October 2, 2025 at 2:03 PM

Jessy Li

@jessyjli.bsky.social

Would be great to chat at COLM!

August 16, 2025 at 5:11 AM

Reposted by Jessy Li

Kyle Lo

@kylelo.bsky.social

long range narrative understanding, even basic fact checking that humans easily get near perfect on, has barely improved in LMs over years novelchallenge.github.io

NoCha leaderboard

novelchallenge.github.io

August 15, 2025 at 3:55 PM

Jessy Li

@jessyjli.bsky.social

Yes, at the least it needs other data (like Echos in AI) and a quality measure (LitBench). Also, what we did in QUDsim was to make sure the stories are from pre-LLM posts, to prevent AI stories. Further, the way they measure style + semantic diversity doesn't align with how they define it (it only captures lexical diversity).

August 15, 2025 at 1:20 PM

Reposted by Jessy Li

Adina Williams

@adinawilliams.bsky.social

I agree this thread's headline claim seems premature. Let me add our recent ACL Findings paper, with Dexter Ju and @hagenblix.bsky.social, which found syntactic simplification in at least some LMs, in a novel domain regeneration setting: aclanthology.org/2025.finding...

aclanthology.org

August 15, 2025 at 4:35 AM

Jessy Li

@jessyjli.bsky.social

Nice, reading level, syntactic complexity, and sentence structures are great angles to study this!!

August 15, 2025 at 5:20 AM

Jessy Li

@jessyjli.bsky.social

Thanks :) Yes will be there, let's catch up!

August 12, 2025 at 9:03 PM
