Alexander Hoyle
alexanderhoyle.bsky.social
Postdoctoral fellow at ETH AI Center, working on Computational Social Science + NLP. Previously a PhD in CS at UMD, advised by Philip Resnik. Internships at MSR, AI2. he/him.

On the job market this cycle!

alexanderhoyle.com
What. Where can I read more about this. I had no idea
November 10, 2025 at 7:35 PM
Are you here??
November 6, 2025 at 7:41 AM
Realizing my point is perhaps a bit undercut because of my typo haha

Oddly I have seen much more LinkedIn use for Zürich AI things
November 3, 2025 at 5:41 PM
yea pleais is a "household name" among AI people largely because of your Twitter presence, I'd expect
November 2, 2025 at 3:46 PM
Thanks for the catch!!
October 28, 2025 at 5:25 AM
We cover many more models in the paper and have more insights and analysis there! This paper was really a team effort over a long period, and I think it is dense with interesting results
October 27, 2025 at 2:59 PM
Two takeaways:
- Directly prompting on a scale is surprisingly fine, but *only if* you take the token-probability weighted average over scale items, Σ⁹ₙ₌₁ int(n) ⋅ p(n|x) (cf @victorwang37.bsky.social)
- Finetuning w/ a smaller model can do really well! And with as few as 1,000 paired annotations
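A minimal sketch of that probability-weighted average over a 1-9 scale. The distribution below is invented for illustration; in practice, p(n|x) would come from the LLM's next-token probabilities over the scale labels:

```python
# Probability-weighted score: sum over n=1..9 of n * p(n | x).
# These probabilities are made up for illustration; in practice they
# come from the model's (renormalized) next-token distribution over "1".."9".
p = {1: 0.02, 2: 0.03, 3: 0.05, 4: 0.10, 5: 0.30,
     6: 0.25, 7: 0.15, 8: 0.07, 9: 0.03}

assert abs(sum(p.values()) - 1.0) < 1e-9  # should be (re)normalized

score = sum(n * p_n for n, p_n in p.items())
print(round(score, 2))  # → 5.51
```

Note the result is continuous, which avoids the "heaping" you get from taking the single most likely integer.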
October 27, 2025 at 2:59 PM
So we evaluate finetuning, pairwise prompting, and direct (pointwise) prompting

As ground truth, we use human-annotated pairwise ranks on 3 constructs in social science from prior work (ad negativity, grandstanding, and fear about immigration), inducing scores via Bradley-Terry
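A toy sketch of how Bradley-Terry turns pairwise outcomes into per-item scores, using the classic MM (Zermelo) update. The win counts here are invented, not the paper's data:

```python
# Minimal Bradley-Terry fit via the MM (Zermelo) update.
# wins[i][j] = number of times item i was preferred over item j.
def bradley_terry(wins, n_items, iters=200):
    s = [1.0] * n_items  # strength parameters, initialized uniformly
    for _ in range(iters):
        new = []
        for i in range(n_items):
            w_i = sum(wins[i][j] for j in range(n_items))  # total wins of i
            denom = sum((wins[i][j] + wins[j][i]) / (s[i] + s[j])
                        for j in range(n_items) if j != i)
            new.append(w_i / denom if denom > 0 else s[i])
        z = sum(new)  # normalize so strengths sum to n_items (fixes the scale)
        s = [v * n_items / z for v in new]
    return s

# Toy data: item 2 mostly beats item 1, which mostly beats item 0
wins = [[0, 1, 0],
        [4, 0, 1],
        [5, 4, 0]]
scores = bradley_terry(wins, 3)
print(scores)  # strengths increase: item 0 < item 1 < item 2
```

The recovered strengths give a continuous score per item, which is what we compare the LLM-derived scores against.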
October 27, 2025 at 2:59 PM
This collaboration began because some of us thought the more principled approach was instead to compare pairs of items, then induce a score with Bradley-Terry

After all, it is easier for *people* to compare items relatively than to score them directly
October 27, 2025 at 2:59 PM
The naive approach is to "just ask": instruct the LLM to output a score on the provided scale

However, this does not work very well: LLM outputs tend to cluster or "heap" around certain integers (and do so inconsistently between models)
October 27, 2025 at 2:59 PM
As someone who's sat behind people playing games on a laptop in class, I have found it disturbing. Other research bears this out. You are in fact impacting others

psycnet.apa.org/record/2013-...
www.sciencedirect.com/science/arti...
overview: 3starlearningexperiences.wordpress.com/2018/01/09/l...
October 22, 2025 at 5:43 AM
Here's a nice recent paper showing that models post-2020 (ie LLMs) are more robust to various types of input noise

arxiv.org/pdf/2403.03923

For more on resource use, I found this blog post very informative: andymasley.substack.com/p/individual...
October 18, 2025 at 7:33 PM
I mean, feel free to look at performance on the WMT benchmarks yourself. The initial improvements in MT are a key part of the reason transformers have become so dominant. Regardless, as I said, the LSTM-based approach was less efficient than transformers anyway
October 18, 2025 at 7:27 PM
Your original claim that transformer-based LLMs didn't noticeably improve MT is incorrect though. MT was the testbed for "Attention Is All You Need"
October 18, 2025 at 3:41 AM
Google Translate incorporated transformers in 2020. My recollection is that quality before then was passable for high resource languages but couldn’t reliably do full articles

but those RNNs were also *less* efficient; transformers were lauded precisely because they were so much more efficient
October 18, 2025 at 3:40 AM
Was basing my estimate off coding the graph in matplotlib not Excel, but it ultimately was a pretty simple visualization
October 17, 2025 at 11:12 AM
Yeah, maybe 15-20 minutes, fair. Table was formatted in the paper and I didn't have easy access to the original input data, so I'd have needed to manually copy numbers or convert latex to something machine-readable first (5-10 min?). Then another 5-10 for formatting the barchart
October 17, 2025 at 11:11 AM
I think there are many efficiencies (both at the hardware and software level) still left on the table that I expect to change the calculus relative to something like Uber, which was predicated on full self-driving coming online all at once
October 17, 2025 at 10:03 AM
I was making some slides this week, and used Claude to convert a table in one of my papers to a barchart in ~3 minutes (including spot checks) the other evening. It would have taken a half hour *minimum* otherwise, and it freed me up to watch a sitcom with my wife. Pretty great if you ask me!
October 17, 2025 at 9:58 AM
The MT you're referring to was still, by most technical definitions, LLM-based
October 17, 2025 at 9:52 AM