Alexander Hoyle
@alexanderhoyle.bsky.social
Postdoctoral fellow at ETH AI Center, working on Computational Social Science + NLP. Previously a PhD in CS at UMD, advised by Philip Resnik. Internships at MSR, AI2. he/him.
On the job market this cycle!
alexanderhoyle.com
What. Where can I read more about this. I had no idea
November 10, 2025 at 7:35 PM
Are you here??
November 6, 2025 at 7:41 AM
Realizing my point is perhaps a bit undercut because of my typo haha
Oddly I have seen much more LinkedIn use for Zürich AI things
November 3, 2025 at 5:41 PM
yea Pleias is a "household name" among AI people largely because of your Twitter presence, I'd expect
November 2, 2025 at 3:46 PM
The naive approach is to "just ask": instruct the LLM to output a score on the provided scale
However, this does not work very well---LLM outputs tend to cluster or "heap" around certain integers (and do so inconsistently between models)
October 28, 2025 at 6:23 AM
Paper: arxiv.org/abs/2509.03116
Code: github.com/haukelicht/s...
With:
@haukelicht.bsky.social *
@rupak-s.bsky.social *
@patrickwu.bsky.social
@pranavgoel.bsky.social
@niklasstoehr.bsky.social
@elliottash.bsky.social
October 28, 2025 at 6:20 AM
Thanks for the catch!!
October 28, 2025 at 5:25 AM
We cover many more models in the paper and have more insights and analysis there! This paper was really a team effort over a long period, and I think it is dense with interesting results
October 27, 2025 at 2:59 PM
Two takeaways:
- Directly prompting on a scale is surprisingly fine, but *only if* you take the token-probability weighted average over scale items, Σ⁹ₙ₌₁ int(n) ⋅ p(n|x) (cf @victorwang37.bsky.social)
- Finetuning w/ a smaller model can do really well! And with as few as 1,000 paired annotations
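The first takeaway can be sketched in a few lines. This is a minimal illustration with made-up probabilities; in practice p(n|x) would come from the model's logprobs over the scale-label tokens:

```python
# Sketch: instead of taking the single label the LLM outputs, convert the
# model's probability distribution over the scale labels "1".."9" into a
# continuous score via the probability-weighted average Σ int(n) · p(n|x).

def expected_score(label_probs):
    """label_probs: dict mapping scale-label strings ("1".."9") to their
    probabilities. Renormalizes in case some mass fell on other tokens."""
    total = sum(label_probs.values())
    return sum(int(label) * p / total for label, p in label_probs.items())

# Toy distribution concentrated near the middle of a 1-9 scale
probs = {"4": 0.2, "5": 0.5, "6": 0.3}
print(round(expected_score(probs), 2))  # → 5.1
```

Note how the weighted average yields a value between the integer labels, which is what avoids the "heaping" you get from taking the argmax label alone.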
October 27, 2025 at 2:59 PM
So we evaluate finetuning, pairwise prompting, and direct (pointwise) prompting
As ground truth, we use human-annotated pairwise ranks on 3 constructs in social science from prior work (ad negativity, grandstanding, and fear about immigration), inducing scores via Bradley-Terry
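Inducing scores from pairwise outcomes via Bradley-Terry can be sketched with the classic MM/Zermelo updates. This is a minimal illustration of the general technique, not the paper's implementation, and it assumes every item wins at least one comparison:

```python
import math
from collections import defaultdict

def bradley_terry(comparisons, n_iters=200):
    """comparisons: list of (winner, loser) pairs.
    Returns zero-mean log-strength scores per item, fit with the
    standard MM (Zermelo) updates: p_i <- W_i / Σ_j n_ij / (p_i + p_j)."""
    items = sorted({x for pair in comparisons for x in pair})
    wins = defaultdict(int)          # total wins per item
    pair_counts = defaultdict(int)   # times each unordered pair was compared
    for w, l in comparisons:
        wins[w] += 1
        pair_counts[frozenset((w, l))] += 1
    p = {i: 1.0 for i in items}
    for _ in range(n_iters):
        new_p = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i != j and pair_counts[frozenset((i, j))]:
                    denom += pair_counts[frozenset((i, j))] / (p[i] + p[j])
            new_p[i] = wins[i] / denom if denom else p[i]
        p = new_p
    logs = {i: math.log(v) for i, v in p.items()}
    mean = sum(logs.values()) / len(logs)
    return {i: logs[i] - mean for i in items}

# Toy data: A beats B twice, B beats C twice, A and C split
pairs = [("A", "B"), ("A", "B"), ("B", "C"), ("B", "C"), ("A", "C"), ("C", "A")]
print(bradley_terry(pairs))  # A highest, C lowest
```

The recovered latent scores then serve as the continuous ground-truth scale that the LLM-based scoring methods are evaluated against.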
October 27, 2025 at 2:59 PM
This collaboration began because some of us thought the more principled approach is to instead compare pairs of items, then induce a score with Bradley-Terry
After all, it is easier for *people* to compare items relatively than to score them directly
October 27, 2025 at 2:59 PM
The naive approach is to "just ask": instruct the LLM to output a score on the provided scale
However, this does not work very well---LLM outputs tend to cluster or "heap" around certain integers (and do so inconsistently between models)
October 27, 2025 at 2:59 PM
As someone who's sat behind people playing games on a laptop in class, I have found it disturbing. Other research bears this out. You are in fact impacting others
psycnet.apa.org/record/2013-...
www.sciencedirect.com/science/arti...
overview: 3starlearningexperiences.wordpress.com/2018/01/09/l...
October 22, 2025 at 5:43 AM
Here's a nice recent paper showing that models post-2020 (ie LLMs) are more robust to various types of input noise
arxiv.org/pdf/2403.03923
For more on resource use, I found this blog post very informative: andymasley.substack.com/p/individual...
October 18, 2025 at 7:33 PM
I mean, feel free to look at performance on the WMT benchmarks yourself. The initial improvements in MT are a key part of the reason transformers have become so dominant. Regardless, as I said, the LSTM-based approach was less efficient than transformers anyway
October 18, 2025 at 7:27 PM
Your original claim that transformer-based LLMs didn't noticeably improve MT is incorrect, though. MT was the testbed for Attention Is All You Need
October 18, 2025 at 3:41 AM
Google Translate incorporated transformers in 2020. My recollection is that quality before then was passable for high resource languages but couldn’t reliably do full articles
but those RNNs were also *less* efficient; transformers were lauded precisely because they were so much more efficient
October 18, 2025 at 3:40 AM
Was basing my estimate off coding the graph in matplotlib not Excel, but it ultimately was a pretty simple visualization
October 17, 2025 at 11:12 AM
Yeah, maybe 15-20 minutes, fair. Table was formatted in the paper and I didn't have easy access to the original input data, so I'd have needed to manually copy numbers or convert latex to something machine-readable first (5-10 min?). Then another 5-10 for formatting the barchart
October 17, 2025 at 11:11 AM
I think there are many efficiencies (both at the hardware and software level) still left on the table that I expect to change the calculus relative to something like Uber, which was predicated on full self-driving coming online all at once
October 17, 2025 at 10:03 AM
I was making some slides this week, and used Claude to convert a table in one of my papers to a barchart in ~3 minutes (including spot checks) the other evening. It would have taken a half hour *minimum* otherwise, and it freed me up to watch a sitcom with my wife. Pretty great if you ask me!
October 17, 2025 at 9:58 AM
The MT you're referring to was still, by most technical definitions, LLM-based
October 17, 2025 at 9:52 AM