Alexander Hoyle
@alexanderhoyle.bsky.social
Postdoctoral fellow at ETH AI Center, working on Computational Social Science + NLP. Previously a PhD in CS at UMD, advised by Philip Resnik. Internships at MSR, AI2. he/him.

On the job market this cycle!

alexanderhoyle.com
Happy to be at #EMNLP2025! Please say hello and come see our lovely work
November 5, 2025 at 2:23 AM
[corrected link]

LLMs are often used for text annotation in social science. In some cases, this involves placing text items on a scale: eg, 1 for liberal and 9 for conservative

There are a few ways to handle this task. Which work best? Our new EMNLP paper has some answers🧵
arxiv.org/abs/2509.03116
October 28, 2025 at 6:23 AM
Two takeaways:
- Directly prompting on a scale is surprisingly fine, but *only if* you take the token-probability weighted average over scale items, Σ⁹ₙ₌₁ n ⋅ p(n|x), i.e., the expected score under the token distribution; see the sketch below (cf @victorwang37.bsky.social)
- Finetuning w/ a smaller model can do really well! And with as few as 1,000 paired annotations
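In code terms, that weighted average is just the expected value of the scale under the token distribution. A minimal sketch, assuming you already have log-probabilities for the answer tokens "1" through "9" (the example distribution below is made up, not from the paper):

```python
import math

def expected_scale_score(token_logprobs: dict[str, float]) -> float:
    """Probability-weighted average over a 1-9 scale.

    token_logprobs maps each scale token ("1".."9") to its log-probability
    at the answer position; we renormalize over just those nine tokens.
    """
    scale = [str(n) for n in range(1, 10)]
    probs = {t: math.exp(token_logprobs[t]) for t in scale}
    total = sum(probs.values())
    return sum(int(t) * p / total for t, p in probs.items())

# a distribution heaped on 5 and 7 still yields a continuous score (~5.5)
toy = {str(n): math.log(p) for n, p in
       zip(range(1, 10), [.01, .02, .05, .10, .40, .10, .25, .05, .02])}
print(expected_scale_score(toy))
```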
October 27, 2025 at 2:59 PM
This collaboration began because some of us thought the more principled approach was instead to compare pairs of items, then induce a score with Bradley-Terry

After all, it is easier for *people* to compare items relatively than to score them directly
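For reference, the Bradley-Terry step can be fit with the standard MM updates. A rough sketch (the wins matrix is a toy stand-in for the LLM's pairwise judgments):

```python
import numpy as np

def bradley_terry(wins: np.ndarray, n_iter: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths via the MM algorithm.

    wins[i, j] = number of comparisons in which item i beat item j.
    Returns strengths normalized to sum to 1; their logs give a latent scale.
    """
    n_items = wins.shape[0]
    totals = wins + wins.T                  # comparisons between each pair
    win_counts = wins.sum(axis=1)           # total wins per item
    p = np.ones(n_items)
    for _ in range(n_iter):
        denom = (totals / (p[:, None] + p[None, :])).sum(axis=1)
        p = win_counts / denom
        p /= p.sum()
    return p

# toy example: item 2 usually beats item 1, which usually beats item 0
wins = np.array([[0, 2, 1],
                 [8, 0, 3],
                 [9, 7, 0]])
print(bradley_terry(wins))  # strengths increase from item 0 to item 2
```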
October 27, 2025 at 2:59 PM
The naive approach is to "just ask": instruct the LLM to output a score on the provided scale

However, this does not work very well---LLM outputs tend to cluster or "heap" around certain integers (and do so inconsistently between models)
October 27, 2025 at 2:59 PM
LLMs are often used for text annotation, especially in social science. In some cases, this involves placing text items on a scale: eg, 1 for liberal and 9 for conservative

There are a few ways to accomplish this task. Which work best? Our new EMNLP paper has some answers🧵
arxiv.org/pdf/2507.00828
October 27, 2025 at 2:59 PM
a friend bought these a size too large and I love them. Can wear to work or (as I just did) on a long haul flight
August 15, 2025 at 3:55 AM
Despite the better metrics, we thought that erasure might degrade embeddings in ways we weren't measuring.

We applied LEACE models trained on our target datasets to out-of-domain embeddings from MTEB data. Surprisingly, MTEB metrics did not change!
July 17, 2025 at 10:53 AM
Applying linear erasure to remove source/language information from text embeddings (say, from sentence transformers) produces dramatic improvements on document similarity & clustering tasks

We use LEACE (@norabelrose.bsky.social et al. 2023), which is also cheap to run (seconds on a laptop)
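A minimal sketch of what that looks like, assuming the concept-erasure package's LeaceEraser interface, with random stand-in embeddings and source labels:

```python
import torch
from concept_erasure import LeaceEraser

# stand-ins for sentence-transformer embeddings from two sources
emb = torch.randn(4096, 384)            # (n_docs, embedding_dim)
source = torch.randint(0, 2, (4096,))   # which corpus each doc came from

# fit a closed-form LEACE eraser and apply it; takes seconds on CPU
eraser = LeaceEraser.fit(emb, source)
emb_erased = eraser(emb)

# use emb_erased (not emb) for clustering / similarity downstream
```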
July 17, 2025 at 10:53 AM
Attributes like language/source are confounders that distort distance-based applications

Debiasing methods remove unwanted information from embeddings—linear concept erasure in particular makes it so a linear predictor cannot recover a concept (eg, lang) from the representation
July 17, 2025 at 10:53 AM
New preprint! Have you ever tried to cluster text embeddings from different sources, but the clusters just reproduce the sources? Or attempted to retrieve similar documents across multiple languages, and even multilingual embeddings return items in the same language?

Turns out there's an easy fix🧵
July 17, 2025 at 10:53 AM
The protocol is also easily adapted to LLM judges: We call ours ProxAnn. While LLMs aren't perfect substitutes, they are about as good as an arbitrary human annotator
July 8, 2025 at 12:40 PM
Models are then evaluated by measuring whether annotations agree with model outputs: that is, do annotator scores correlate with document-topic probabilities (or distance to centroid)?

A human study finds that, in line with other work, classic LDA (Mallet) continues to work well
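Concretely, the agreement check is a rank correlation. A toy version with made-up annotator scores and document-topic probabilities:

```python
from scipy.stats import spearmanr

# hypothetical: annotator relevance ratings for six documents under one topic,
# and the model's document-topic probabilities for those same documents
annotator_scores = [3, 1, 4, 2, 5, 1]
doc_topic_probs = [0.62, 0.08, 0.71, 0.33, 0.90, 0.15]

rho, pval = spearmanr(annotator_scores, doc_topic_probs)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```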
July 8, 2025 at 12:40 PM
The setup approximates real world qualitative content analysis. An annotator

1. Reviews a small collection of documents (& top words) for a topic, and writes down a category label
2. Determines whether new documents fit that label
3. Ranks the documents by relevance to the label
July 8, 2025 at 12:40 PM
In addition, standard evaluations don't really correspond to any real-world use case, and also don't align well with human judgments of topic coherence (per our previous work)

In this paper, we design a new evaluation protocol and LLM-as-judge proxy, ProxAnn
July 8, 2025 at 12:40 PM
How do standard metrics work? Automated coherence computes how often the top n words in a topic appear together in some reference text (eg, Wikipedia)

This fails to consider which *documents* are associated with each topic, and so doesn't transfer well to text clustering methods
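For concreteness, a rough sketch of that kind of score: average NPMI over pairs of a topic's top words, with document-level co-occurrence over a toy reference corpus (real implementations typically use sliding windows and smoothing):

```python
import math
from itertools import combinations

def npmi_coherence(top_words: list[str], ref_docs: list[set[str]]) -> float:
    """Average NPMI over word pairs, using document-level co-occurrence."""
    n = len(ref_docs)

    def p(*words):
        return sum(all(w in doc for w in words) for doc in ref_docs) / n

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)              # never co-occur: worst score
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# toy reference "corpus" of tokenized documents
docs = [set(d.split()) for d in [
    "the senate passed the budget bill",
    "voters backed the budget in the senate",
    "the team won the championship game",
]]
print(npmi_coherence(["senate", "budget", "bill"], docs))
```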
July 8, 2025 at 12:40 PM
Evaluating topic models (and document clustering methods) is hard. In fact, since our paper critiquing standard evaluation practices four years ago, there hasn't been a good replacement metric

That ends today (we hope)! Our new ACL paper introduces an LLM-based evaluation protocol 🧵
July 8, 2025 at 12:40 PM
They added multi-file search!
May 11, 2025 at 9:55 PM
You'd think so, but no
March 20, 2025 at 10:18 PM
a little, yes
February 2, 2025 at 4:12 PM
this holiday season I am thankful that, rather than fixing the literally decade-old problem of multi-file search, Overleaf instead implemented the world's worst writing assistance tool
December 12, 2024 at 10:37 AM
Once again thinking about this description of a George Wallace campaign rally (from Garry Wills' "Nixon Agonistes")
November 7, 2024 at 5:36 PM
The recent New Yorker piece where he features heavily gave an interesting perspective

www.newyorker.com/magazine/202...
February 23, 2024 at 6:05 PM
the last time i had fried chicken in a bucket it was actually ice cream coated in corn flakes. fakery!!!
February 7, 2024 at 10:09 PM
But as I mentioned in response to @tedunderwood.me on Twitter, the picture Krippendorff paints is a bit more nuanced---that is, the variation in interpretation is desirable.

I suppose that the issue for me is that a wordlist is too information-sparse to ground a close reading on its own
November 3, 2023 at 8:37 PM