Caleb Ziems
@calebziems.com
PhD student at Stanford NLP. Working on Social NLP and CSS. Previously at GaTech, Meta AI, Emory.

📍Palo Alto, CA
🔗 calebziems.com
Our implementation of Culture Cartography is based on Farsight (Wang et al., 2024).

This was an interdisciplinary effort across computer science (@diyiyang.bsky.social, @williamheld.com, Jane Yu) and sociology (David Grusky and Amir Goldberg), and the research process taught me so much!
November 4, 2025 at 5:38 PM
Finally, Culture Cartography is aligned with prior notions of cultural evaluation in our field.

We observe positive transfer performance from Cartography to two leading benchmarks: BLEnD (Myung et al., 2024) and CulturalBench (Chiu et al., 2024).
November 4, 2025 at 5:35 PM
Compared to knowledge extraction, Culture Cartography is less prone to test-set contamination.

We evaluate GPT-4o with and without search and find no significant difference in recall on Cartography data.

Culture Cartography is "Google-proof" since search doesn't help.
November 4, 2025 at 5:34 PM
Compared to traditional annotation, Culture Cartography more often elicits knowledge that is unknown to LLMs.

Qwen2-72B recalls 21% less Cartography data than traditional data (p < .0001).

Even a strong reasoning model (R1) is challenged more by our data.
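
For intuition, here is a minimal sketch of how a recall gap like this could be tested for significance with a two-proportion z-test (the same logic applies to the search vs. no-search comparison above). The counts and function name below are hypothetical illustrations, not numbers or code from the paper.

```python
import math

def recall_gap_z_test(hits_a: int, n_a: int, hits_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test: is recall on question set A significantly
    different from recall on question set B?"""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    # Pooled proportion under the null hypothesis of equal recall.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 720/1000 traditional questions recalled vs.
# 510/1000 Cartography questions (a 21-point gap).
z, p = recall_gap_z_test(720, 1000, 510, 1000)
print(f"z = {z:.2f}, p = {p:.2e}")
```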
November 4, 2025 at 5:33 PM
We propose a mixed-initiative method called Culture Cartography.

To find culturally representative knowledge, we let the human steer towards what they find most salient.

And to find challenging questions, we let the LLM steer towards topics where it has low confidence.
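
For intuition, a minimal sketch of the LLM-steering step, assuming "low confidence" is operationalized as the model's mean token log-probability when answering probe questions about a topic. Every name and score below is a hypothetical illustration, not the paper's actual implementation.

```python
def pick_low_confidence_topics(topic_confidence: dict[str, float], k: int = 3) -> list[str]:
    """Return the k topics where the model is least confident, i.e. the
    topics with the lowest average answer log-probability."""
    return sorted(topic_confidence, key=topic_confidence.get)[:k]

# Hypothetical per-topic confidence scores (mean token log-prob of the
# model's answers to probe questions about each topic).
scores = {
    "national holidays": -0.2,
    "wedding customs": -0.4,
    "street food etiquette": -1.9,
    "regional slang": -2.3,
}
print(pick_low_confidence_topics(scores, k=2))
# -> ['regional slang', 'street food etiquette']
```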
November 4, 2025 at 5:33 PM
Other benchmarks use knowledge extracted from the rich cultural artifacts that humans actively produce on the web.

Still, this is a single-initiative process.

Researchers can’t steer the distribution towards questions of interest (i.e., those that challenge LLMs).
November 4, 2025 at 5:32 PM
How are prior benchmarks constructed?

In traditional annotation, the researcher picks some questions and the annotator passively provides ground truth answers.

This is single-initiative.

Annotators don't steer the process, so their interests and culture may not be represented.
November 4, 2025 at 5:32 PM
👋
November 19, 2024 at 8:48 PM