Alexander Hoyle
@alexanderhoyle.bsky.social
Postdoctoral fellow at ETH AI Center, working on Computational Social Science + NLP. Previously a PhD in CS at UMD, advised by Philip Resnik. Internships at MSR, AI2. he/him.
On the job market this cycle!
alexanderhoyle.com
What. Where can I read more about this. I had no idea
November 10, 2025 at 7:35 PM
Are you here??
November 6, 2025 at 7:41 AM
Realizing my point is perhaps a bit undercut because of my typo haha
Oddly I have seen much more LinkedIn use for Zürich AI things
November 3, 2025 at 5:41 PM
yea Pleias is a "household name" among AI people largely because of your Twitter presence, I'd expect
November 2, 2025 at 3:46 PM
The naive approach is to "just ask": instruct the LLM to output a score on the provided scale
However, this does not work very well---LLM outputs tend to cluster or "heap" around certain integers (and do so inconsistently between models)
October 28, 2025 at 6:23 AM
Paper: arxiv.org/abs/2509.03116
Code: github.com/haukelicht/s...
With:
@haukelicht.bsky.social *
@rupak-s.bsky.social *
@patrickwu.bsky.social
@pranavgoel.bsky.social
@niklasstoehr.bsky.social
@elliottash.bsky.social
October 28, 2025 at 6:20 AM
Thanks for the catch!!
October 28, 2025 at 5:25 AM
We cover many more models in the paper and have more insights and analysis there! This paper was really a team effort over a long period, and I think it is dense with interesting results
October 27, 2025 at 2:59 PM
Two takeaways:
- Directly prompting on a scale is surprisingly fine, but *only if* you take the token-probability weighted average over scale items, Σ⁹ₙ₌₁ int(n) ⋅ p(n|x) (cf @victorwang37.bsky.social)
- Finetuning w/ a smaller model can do really well! And with as few as 1,000 paired annotations
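The first takeaway can be sketched in a few lines. This is a minimal illustration with made-up probabilities; in practice p(n|x) would come from the model's logprobs over the scale-label tokens:

```python
# Sketch: instead of taking the single label the LLM outputs, convert the
# model's probability distribution over the scale labels "1".."9" into a
# continuous score via the probability-weighted average Σ int(n) · p(n|x).

def expected_score(label_probs):
    """label_probs: dict mapping scale-label strings ("1".."9") to their
    probabilities. Renormalizes in case some mass fell on other tokens."""
    total = sum(label_probs.values())
    return sum(int(label) * p / total for label, p in label_probs.items())

# Toy distribution concentrated near the middle of a 1-9 scale
probs = {"4": 0.2, "5": 0.5, "6": 0.3}
print(round(expected_score(probs), 2))  # → 5.1
```

Note how the weighted average yields a value between the integer labels, which is what avoids the "heaping" you get from taking the argmax label alone.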
October 27, 2025 at 2:59 PM
So we evaluate finetuning, pairwise prompting, and direct (pointwise) prompting
As ground truth, we use human-annotated pairwise ranks on 3 constructs in social science from prior work (ad negativity, grandstanding, and fear about immigration), inducing scores via Bradley-Terry
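Inducing scores from pairwise outcomes via Bradley-Terry can be sketched with the classic MM/Zermelo updates. This is a minimal illustration of the general technique, not the paper's implementation, and it assumes every item wins at least one comparison:

```python
import math
from collections import defaultdict

def bradley_terry(comparisons, n_iters=200):
    """comparisons: list of (winner, loser) pairs.
    Returns zero-mean log-strength scores per item, fit with the
    standard MM (Zermelo) updates: p_i <- W_i / Σ_j n_ij / (p_i + p_j)."""
    items = sorted({x for pair in comparisons for x in pair})
    wins = defaultdict(int)          # total wins per item
    pair_counts = defaultdict(int)   # times each unordered pair was compared
    for w, l in comparisons:
        wins[w] += 1
        pair_counts[frozenset((w, l))] += 1
    p = {i: 1.0 for i in items}
    for _ in range(n_iters):
        new_p = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i != j and pair_counts[frozenset((i, j))]:
                    denom += pair_counts[frozenset((i, j))] / (p[i] + p[j])
            new_p[i] = wins[i] / denom if denom else p[i]
        p = new_p
    logs = {i: math.log(v) for i, v in p.items()}
    mean = sum(logs.values()) / len(logs)
    return {i: logs[i] - mean for i in items}

# Toy data: A beats B twice, B beats C twice, A and C split
pairs = [("A", "B"), ("A", "B"), ("B", "C"), ("B", "C"), ("A", "C"), ("C", "A")]
print(bradley_terry(pairs))  # A highest, C lowest
```

The recovered latent scores then serve as the continuous ground-truth scale that the LLM-based scoring methods are evaluated against.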
October 27, 2025 at 2:59 PM
This collaboration began because some of us thought the more principled approach is to instead compare pairs of items, then induce a score with Bradley-Terry
After all, it is easier for *people* to compare items relatively than to score them directly
October 27, 2025 at 2:59 PM
The naive approach is to "just ask": instruct the LLM to output a score on the provided scale
However, this does not work very well---LLM outputs tend to cluster or "heap" around certain integers (and do so inconsistently between models)
October 27, 2025 at 2:59 PM
As someone who's sat behind people playing games on a laptop in class, I have found it disturbing. Other research bears this out. You are in fact impacting others
psycnet.apa.org/record/2013-...
www.sciencedirect.com/science/arti...
overview: 3starlearningexperiences.wordpress.com/2018/01/09/l...
October 22, 2025 at 5:43 AM
Here's a nice recent paper showing that models post-2020 (ie LLMs) are more robust to various types of input noise
arxiv.org/pdf/2403.03923
For more on resource use, I found this blog post very informative: andymasley.substack.com/p/individual...
October 18, 2025 at 7:33 PM
I mean, feel free to look at performance on the WMT benchmarks yourself. The initial improvements in MT are a key part of the reason transformers have become so dominant. Regardless, as I said, the LSTM-based approach was less efficient than transformers anyway
October 18, 2025 at 7:27 PM
Your original claim that transformer-based LLMs didn't noticeably improve MT is incorrect, though. MT was the testbed for Attention Is All You Need
October 18, 2025 at 3:41 AM
Google Translate incorporated transformers in 2020. My recollection is that quality before then was passable for high resource languages but couldn’t reliably do full articles
but those RNNs were also *less* efficient; transformers were lauded precisely because they were so much more efficient
October 18, 2025 at 3:40 AM
Was basing my estimate off coding the graph in matplotlib not Excel, but it ultimately was a pretty simple visualization
October 17, 2025 at 11:12 AM
Yeah, maybe 15-20 minutes, fair. Table was formatted in the paper and I didn't have easy access to the original input data, so I'd have needed to manually copy numbers or convert latex to something machine-readable first (5-10 min?). Then another 5-10 for formatting the barchart
October 17, 2025 at 11:11 AM
I think there are many efficiencies (both at the hardware and software level) still left on the table that I expect to change the calculus relative to something like Uber, which was predicated on full self-driving coming online all at once
October 17, 2025 at 10:03 AM
I was making some slides this week, and used Claude to convert a table in one of my papers to a barchart in ~3 minutes (including spot checks) the other evening. It would have taken a half hour *minimum* otherwise, and it freed me up to watch a sitcom with my wife. Pretty great if you ask me!
October 17, 2025 at 9:58 AM
The MT you're referring to was still, by most technical definitions, LLM-based
October 17, 2025 at 9:52 AM