Lightnews — Scholar-powered news

Jennifer Hu

@jennhu.bsky.social

2.5K followers 180 following 57 posts

Asst Prof at Johns Hopkins Cognitive Science • Director of the Group for Language and Intelligence (GLINT) ✨• Interested in all things language, cognition, and AI

jennhu.github.io

Jennifer Hu

@jennhu.bsky.social

This work was done with an amazing team: @wegotlieb.bsky.social, @siyuansong.bsky.social, @kmahowald.bsky.social, @rplevy.bsky.social

Preprint (pre-TACL version): arxiv.org/abs/2510.16227

10/10

Screenshot of paper title and list of authors. The title of the paper is: "What Can String Probability Tell Us About Grammaticality?" The authors are: Jennifer Hu, Ethan Gotlieb Wilcox, Siyuan Song, Kyle Mahowald, and Roger P. Levy.

November 10, 2025 at 10:11 PM

Jennifer Hu

@jennhu.bsky.social

As mentioned above, Prediction #3 shows that recent criticism about the overlap in probabilities across gram/ungram strings should NOT be interpreted as a failure of probability to tell us about grammaticality.

This overlap is to be expected if prob is influenced by factors other than gram. 7/10

Screenshot of a figure with two panels, labeled (a) and (b). The caption reads: "Figure 4: Evaluation of Prediction 3. (a) Distributions of scores are highly overlapping across grammatical and ungrammatical sentences (pooled across datasets). (b) Poor separability (area under receiver operating characteristic curve, or AUC) achieved by each model and probability transformation. Horizontal line at 0.5 indicates no separation. For dataset-specific results, see Section B, Figures 5 and 7."

November 10, 2025 at 10:11 PM
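The reasoning in the post above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's method: it assumes each string's log probability is the sum of a message-level term shared within a minimal pair, a fixed penalty for ungrammaticality, and small noise. The function names (`simulate`, `paired_accuracy`, `auc`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def simulate(n_pairs=2000, message_sd=10.0, gram_penalty=2.0, noise_sd=0.5, seed=0):
    """Generate toy log probabilities for minimal pairs (all values are assumptions)."""
    rng = random.Random(seed)
    gram, ungram = [], []
    for _ in range(n_pairs):
        # Message-level contribution shared by both members of the pair.
        message = rng.gauss(-50.0, message_sd)
        gram.append(message + rng.gauss(0.0, noise_sd))
        ungram.append(message - gram_penalty + rng.gauss(0.0, noise_sd))
    return gram, ungram


def paired_accuracy(gram, ungram):
    """Fraction of pairs where the grammatical string gets the higher score."""
    return sum(g > u for g, u in zip(gram, ungram)) / len(gram)


def auc(pos, neg):
    """Rank-based AUC (Mann-Whitney U statistic); assumes no tied scores."""
    labeled = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg])
    rank_sum = sum(rank for rank, (_, label) in enumerate(labeled, start=1) if label == 1)
    n_pos, n_neg = len(pos), len(neg)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    gram, ungram = simulate()
    print("within-pair accuracy:", round(paired_accuracy(gram, ungram), 3))
    print("unpaired AUC:", round(auc(gram, ungram), 3))
```

Under these assumptions, within-pair comparisons are nearly always correct, while the pooled grammatical and ungrammatical scores overlap heavily because the message-level variance is much larger than the penalty. Shrinking `message_sd` relative to `gram_penalty` moves the unpaired AUC toward 1.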

Jennifer Hu

@jennhu.bsky.social

We use our framework to derive 3 predictions, which we validate empirically:

1. Correlation btwn string probs within minimal pairs

2. Correlation btwn LMs’ and humans’ deltas within minimal pairs

3. Poor separation btwn prob of unpaired grammatical and ungrammatical strings

6/10

Screenshot of a figure with two panels, labeled (a) and (b). The caption reads: "Figure 2: (a) Prediction 1a: Logprobs of paired grammatical (x-axis) and ungrammatical (y-axis) sentences are correlated. Dashed line: x = y. (b) Prediction 1b: Correlation between grammatical and ungrammatical logprobs (y-axis) generally decreases as within-pair cosine distance (x-axis) increases."

November 10, 2025 at 10:11 PM

Jennifer Hu

@jennhu.bsky.social

New work to appear @ TACL!

Language models (LMs) are remarkably good at generating novel well-formed sentences, leading to claims that they have mastered grammar.

Yet they often assign higher probability to ungrammatical strings than to grammatical strings.

How can both things be true? 🧵👇

Screenshot of a figure with two panels, labeled (a) and (b). The caption reads: "Figure 1: (a) Illustration of messages (left) and strings (right) in toy domain. Blue = grammatical strings. Red = ungrammatical strings. (b) Surprisal (negative log probability) assigned to toy strings by GPT-2."

November 10, 2025 at 10:11 PM

Jennifer Hu

@jennhu.bsky.social

We then test whether measures of forward-pass dynamics (including competitor interference, & others) predict signatures of processing in humans.

We find that dynamic measures improve prediction of human measures above static (final-layer) measures -- across models, domains, & modalities.

(7/12)

Screenshot of Figure 3, which has two panels, labeled (a) and (b). The caption says the following. Figure 3: Experiment 2 results for text domains. (a) Top: R2 achieved by model processing measures (x-axis) across groups of human DVs (hue). Bottom: Log Bayes Factor comparing critical to baseline regression models. Horizontal line = log(3). (b) Mean R2 across bins of model sizes.

May 20, 2025 at 2:26 PM

Jennifer Hu

@jennhu.bsky.social

First, we use simple mech interp tools to measure competitor interference, such as evidence for “two-stage processing” and the “time to decision”.

We find that models indeed appear to initially favor a competing incorrect answer in the cases where we expect decision conflict in humans.

(6/12)

Screenshot of Figure 2, which has three panels, labeled (a), (b), and (c). The caption says the following. Figure 2: Experiment 1 results. (a) LMs generally show stronger signs of two-stage processing for the items with competing intuitive answers. Asterisks denote sig. t-tests comparing means across conditions within each domain. (b) ∆LogProb across layers for sample LMs in the capitals recall domain, illustrating different processing strategies. (c) Two-stage processing interacts with size.

May 20, 2025 at 2:26 PM

Jennifer Hu

@jennhu.bsky.social

Excited to share a new preprint w/ @michael-lepori.bsky.social & Michael Franke!

A dominant approach in AI/cogsci uses *outputs* from AI models (eg logprobs) to predict human behavior.

But how does model *processing* (across layers in a forward pass) relate to human real-time processing? 👇 (1/12)

Screenshot of Figure 1, which has two panels labeled (a) and (b). The caption states the following. Figure 1: Overview of our study. (a) Experiment 1: We explore whether forward passes show mechanistic signatures of competitor interference, first preferring a salient competing intuitive answer before preferring the correct answer. (b) Experiment 2: We systematically investigate the ability of dynamic measures derived from forward passes to predict indicators of processing load in humans.

May 20, 2025 at 2:26 PM

Jennifer Hu

@jennhu.bsky.social

(6/9) We put a suite of aligned models, and their instruction fine-tuned counterparts, to the test and found:
* no model reaches human-like diversity of thought.
* aligned models show LESS conceptual diversity than instruction fine-tuned counterparts

February 10, 2025 at 2:22 PM

Jennifer Hu

@jennhu.bsky.social

(5/9) Our experiments are inspired by human studies in two domains with rich behavioral data.

February 10, 2025 at 2:22 PM

Jennifer Hu

@jennhu.bsky.social

(4/9) We introduce a new way of measuring the conceptual diversity of synthetically-generated LLM "populations" by considering how the variability of a population's "individuals" relates to that of the population as a whole.

February 10, 2025 at 2:22 PM

Jennifer Hu

@jennhu.bsky.social

To researchers doing LLM evaluation: prompting is *not a substitute* for direct probability measurements. Check out the camera-ready version of our work, to appear at EMNLP 2023! (w/ @rplevy.bsky.social)

Paper: arxiv.org/abs/2305.13264

Original thread: twitter.com/_jennhu/stat...

October 24, 2023 at 3:03 PM
