Taylor Sorensen
@taylor-sorensen.bsky.social
NLP PhD Candidate at UW
In new work, we introduce a simple post-training method and large-scale resource for maximizing diversity and coverage! We call it Spectrum Tuning.
More on this in the coming days - but I'm really excited about this work, and am so happy that it's now public
October 8, 2025 at 2:25 PM
This may seem like a silly toy example - shouldn’t we just use np.random.randint()?
Fair - but this simple case is illustrative of a broader weakness. What about creative writing? Or hypothesis generation? Or diverse data generation?
We need models that SPAN the entire output space.
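One simple way to see whether a model "spans" an output space is to compare the coverage and entropy of its samples against a sampler that truly covers the support. A minimal sketch with toy stand-in distributions (the collapsed "model" here is made up for illustration, not real model output):

```python
import math
import random
from collections import Counter

def coverage_and_entropy(samples, support_size):
    """Fraction of the support actually produced, and Shannon entropy
    (bits) of the empirical sample distribution."""
    counts = Counter(samples)
    n = len(samples)
    coverage = len(counts) / support_size
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return coverage, entropy

# Toy stand-in distributions: a mode-collapsed "model" that answers 7
# almost every time, vs. a sampler that truly spans 1..10.
collapsed = [7] * 95 + [3] * 5
random.seed(0)
uniform = [random.randint(1, 10) for _ in range(100)]

print(coverage_and_entropy(collapsed, 10))  # low coverage, entropy well under 1 bit
print(coverage_and_entropy(uniform, 10))    # near-full coverage, entropy near log2(10)
```

A model that always picks 7 scores near zero on both metrics, however high its per-sample reward.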
October 8, 2025 at 2:25 PM
Current post-training teaches a model to output the highest reward answer, even if there are other good answers. E.g. when picking random numbers, 7 seems like the most “random” number to annotators - so models ALWAYS pick 7!
arxiv.org/pdf/2505.00047
arxiv.org/pdf/2203.02155
arxiv.org/pdf/2510.01171
October 8, 2025 at 2:24 PM
Did you know that LLMs suffer from serious mode collapse?

For example, if you ask models to tell you a joke, they almost always tell you the same joke! This is true across samples and even across model families!

Why does this happen? Can we improve it?
October 8, 2025 at 2:22 PM
Apologies if this wasn't clear! They're provided textually in an in-context prompt, which the model then tries to steer towards

And yes you are absolutely right, that's one of the risks of personalization in general (see great paper here: arxiv.org/pdf/2303.05453)
March 21, 2025 at 9:36 PM
Read the full paper here!
arxiv.org/abs/2503.15484
March 20, 2025 at 3:57 AM
As a last experiment, we simulate an annotator population ("jury learning" @mitchellgordon) using our trained models and value profiles.

We find that the instance-level interannotator agreement (IAA) predicted by our simulated population correlates with the observed IAA.
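The correlation check above is straightforward to sketch: pair each instance's observed IAA with the IAA of the simulated jury and compute a correlation coefficient. The numbers below are hypothetical, purely to show the computation:

```python
import numpy as np

# Hypothetical per-instance agreement scores: observed IAA from real
# annotators vs. IAA predicted by a simulated, profile-conditioned jury.
observed_iaa  = np.array([0.92, 0.55, 0.78, 0.40, 0.85, 0.60])
predicted_iaa = np.array([0.88, 0.60, 0.70, 0.45, 0.90, 0.52])

# A high Pearson r means the simulated population tracks where
# real raters agree and where they disagree.
r = np.corrcoef(observed_iaa, predicted_iaa)[0, 1]
print(f"r = {r:.2f}")
```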
(14/?)
March 20, 2025 at 3:53 AM
We also find that our value profile system is very well-calibrated.

This calibration is important for trusting the model's confidence and for disentangling value-related epistemic uncertainty from aleatoric uncertainty in rater variation.
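Calibration of this kind is commonly measured with expected calibration error (ECE); the sketch below is a generic ECE implementation, not the paper's exact metric, with made-up toy inputs:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the size-weighted mean gap
    between each bin's average confidence and its empirical accuracy."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

# Toy inputs: two high-confidence predictions (both right), two
# near-chance predictions (one right, one wrong) -> small ECE.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
print(ece)  # -> 0.05 (up to float rounding)
```

A well-calibrated system keeps this gap small: when it says 70%, it is right about 70% of the time.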
(13/?)
March 20, 2025 at 3:52 AM
Value profiles are written in natural language. But are they actually semantically interpretable? Does the system change its judgments with wording changes in common-sense ways?

Yes, we find that semantic changes in a value profile lead to expected changes in the output.
(12/?)
March 20, 2025 at 3:52 AM
Clustering with value profiles also enables dataset-level qualitative analysis.

For example, for OQA/DIC, even restricting to just 2 clusters explains the majority of rater variation, suggesting a bimodal distribution.
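One way to read "2 clusters explain the majority of rater variation" is as the between-cluster share of variance, i.e. 1 minus the within-cluster variance over the total. A toy numpy sketch with made-up ratings (the computation, not the paper's data):

```python
import numpy as np

# Hypothetical ratings on one item from six raters with a bimodal split:
# cluster 0 rates high, cluster 1 rates low.
ratings  = np.array([5, 5, 4, 1, 2, 1], dtype=float)
clusters = np.array([0, 0, 0, 1, 1, 1])

total_var = ratings.var()
# Size-weighted within-cluster variance
within_var = sum(
    (clusters == k).sum() * ratings[clusters == k].var()
    for k in np.unique(clusters)
) / len(ratings)

explained = 1 - within_var / total_var
print(f"fraction of variance explained: {explained:.2f}")
```

When a 2-way split already drives this fraction past one half, the disagreement distribution is effectively bimodal.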

Additionally, the profile descriptions suggest why people may disagree.
(11/?)
March 20, 2025 at 3:52 AM
Our algorithm is effective at uncovering useful rater groupings, with the resulting value profile clusters outperforming the most performant demographic grouping!

Additionally, on the dataset where demographics helped most, the clusters partially recover ideological trends.
(10/?)
March 20, 2025 at 3:51 AM
To characterize common modes of (dis)agreement, we introduce a value-based clustering algorithm.

Unlike traditional methods, ours: 1) does not require that raters label overlapping instances, 2) leverages semantic instance information, and 3) returns cluster descriptions.
(9/?)
March 20, 2025 at 3:51 AM
Since the value profiles are inferred (or compressed) from in-context examples, information can only be lost. But how much is retained?

We find that value profiles generated by Gemini preserve the majority (>70%) of the useful predictive information!
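One simple way to read ">70% retained" is as the share of the examples' information gain (over the no-information baseline) that survives compression into a profile. A hedged sketch with hypothetical numbers, just to show the arithmetic:

```python
def retained_fraction(info_baseline, info_profile, info_examples):
    """Share of the in-context examples' information gain over the
    baseline that survives compression into a value profile
    (all quantities in any common unit, e.g. bits)."""
    return (info_profile - info_baseline) / (info_examples - info_baseline)

# Hypothetical usable-information estimates: baseline 0.10, value
# profile 0.45, full in-context examples 0.58.
print(retained_fraction(0.10, 0.45, 0.58))  # ~0.73, i.e. >70% retained
```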
(8/?)
March 20, 2025 at 3:51 AM
Okay, onto results - we find that:

- Using a rater's in-context examples improves predictions most
- Value profiles significantly improve predictions (but not as much as examples)
- Demographics _do not_ offer a significant performance boost (except for OpinionQA)
(7/?)
March 20, 2025 at 3:50 AM
We carry out experiments on six datasets relevant to model alignment, content moderation, and computational social science!

OpinionQA (Santurkar et al.)
Hatespeech (Kumar et al.)
DICES (Aroyo et al.)
Habermas (Tessler/@mbakker.bsky.social et al.)
Prism @hannahrosekirk.bsky.social
ValuePrism
(6/?)
March 20, 2025 at 3:50 AM
Given the different representations, how can we tell which is most useful for modeling variation?

We apply an information-theoretic methodology to measure the amount of model-usable information in a rater representation for predicting an individual's ratings.
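Methodologies of this flavor typically work pointwise: for each instance, compare the log-probability a model assigns the rater's true label with and without the rater representation, then average. A hedged sketch in the style of pointwise V-information, with hypothetical probabilities:

```python
import math

def pointwise_v_information(p_y_given_rep, p_y_baseline):
    """Extra log-probability (in bits) the rater representation gives
    the model on the true rating, relative to no rater information."""
    return math.log2(p_y_given_rep) - math.log2(p_y_baseline)

# Hypothetical per-instance probabilities of the rater's true label
# under a model with vs. without the rater representation.
pairs = [(0.8, 0.5), (0.6, 0.45), (0.9, 0.5), (0.4, 0.5)]
usable_bits = sum(pointwise_v_information(p, q) for p, q in pairs) / len(pairs)
print(f"usable information ~ {usable_bits:.2f} bits")
```

Averaging these pointwise gains gives a single number per representation, which makes the four representations directly comparable.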
(5/?)
March 20, 2025 at 3:47 AM
In the absence of a person's value self-description, we infer rater values from data via an autoencoder setup.

An encoder proposes values that could explain a person's ratings, and a decoder generalizes to held-out examples based on the value description.
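Structurally, the loop is: propose candidate profiles from a rater's labeled examples, then keep the candidate whose decoder best predicts held-out labels. A hedged sketch with trivial stand-ins where the real system uses LLMs; every name and rule here is hypothetical:

```python
# Hedged sketch of the encoder/decoder loop. Stubs stand in for LLMs.

def propose_profiles(train_pairs):
    """Encoder: propose candidate value profiles that could explain a
    rater's labels (an LLM in the real system; canned strings here)."""
    return ["lenient about borderline content", "strict about borderline content"]

def score_profile(profile, heldout_pairs):
    """Decoder: fraction of held-out labels predicted correctly when
    conditioning on the profile (a trained model in the real system)."""
    predict = lambda item: "keep" if "lenient" in profile else "remove"
    return sum(predict(x) == y for x, y in heldout_pairs) / len(heldout_pairs)

def infer_value_profile(train_pairs, heldout_pairs):
    """Keep the candidate profile that generalizes best to held-out data."""
    return max(propose_profiles(train_pairs),
               key=lambda p: score_profile(p, heldout_pairs))

heldout = [("edgy joke", "keep"), ("mild insult", "keep"), ("slur", "remove")]
print(infer_value_profile([], heldout))  # -> "lenient about borderline content"
```

The compression view follows directly: the profile is a textual bottleneck between a rater's examples and predictions on new items.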
(4/?)
March 20, 2025 at 3:46 AM
How can we represent an individual?

We compare four different representations:
- No information about an individual (baseline)
- Demographics
- In-context rater demonstrations
- "Value profile": natural language description of a rater's values relevant to the task
(3/?)
March 20, 2025 at 3:46 AM
First, we characterize current approaches to modeling variation. Modeling a group (e.g., demographic) can lead to stereotyping, while modeling a population distribution doesn't tell you who disagrees or why.

Instead, we focus on modeling at the individual level.
(2/?)
March 20, 2025 at 3:45 AM
🤔🤖Most AI systems assume there’s just one right answer—but many tasks have reasonable disagreement. How can we better model human variation? 🌍✨

We propose modeling at the individual-level using open-ended, textual value profiles! 🗣️📝

arxiv.org/abs/2503.15484
March 20, 2025 at 3:45 AM