Caleb Ziems
@calebziems.com
PhD student at Stanford NLP. Working on Social NLP and CSS. Previously at GaTech, Meta AI, Emory.
📍Palo Alto, CA
🔗 calebziems.com
Thanks to many @stanfordnlp.bsky.social members for feedback! @juliakruk.bsky.social @yanzhe.bsky.social @myra.bsky.social @jaredlcm.bsky.social
May be of interest to @paul-rottger.bsky.social @monadiab77.bsky.social @vinodkpg.bsky.social @dbamman.bsky.social @davidjurgens.bsky.social and you
November 4, 2025 at 6:04 PM
Our implementation of Culture Cartography is based on Farsight (Wang et al., 2024).
This was an interdisciplinary effort across computer science (@diyiyang.bsky.social, @williamheld.com, Jane Yu) and sociology (David Grusky and Amir Goldberg), and the research process taught me so much!
November 4, 2025 at 5:38 PM
Finally, Culture Cartography is aligned with prior notions of culture evals in our field.
We observe positive transfer performance from Cartography to two leading benchmarks: BLEnD (Myung et al., 2024) and CulturalBench (Chiu et al., 2024).
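As a minimal sketch of what a transfer check like this could look like in code (all helpers and model handles below are hypothetical placeholders, not the paper's actual setup):

```python
# Minimal sketch of the transfer check: compare a model trained on
# Cartography data against its base model on external benchmarks.
# `load_benchmark`, `answer`, and both model handles are hypothetical
# placeholders for the paper's actual setup.

def load_benchmark(name: str):
    # Placeholder: a real loader would read the benchmark's QA items.
    return [("toy question", "A"), ("another toy question", "B")]

def answer(model, question: str) -> str:
    # Placeholder: a real call would run model inference.
    return "A"

def accuracy(model, items) -> float:
    return sum(answer(model, q) == gold for q, gold in items) / len(items)

cartography_model, base_model = "tuned", "base"  # placeholder handles
for bench in ["BLEnD", "CulturalBench"]:
    items = load_benchmark(bench)
    gain = accuracy(cartography_model, items) - accuracy(base_model, items)
    print(bench, gain)  # positive gain = positive transfer
```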
November 4, 2025 at 5:35 PM
Compared to knowledge extraction, Culture Cartography is less prone to test-set contamination.
We evaluate GPT-4o with and without search and find no significant difference in their recall on Cartography data.
Culture Cartography is "Google proof" since search doesn't help.
November 4, 2025 at 5:34 PM
Compared to traditional annotation, Culture Cartography more often elicits knowledge that is unknown to LLMs.
Qwen-2 72B recalls 21% less Cartography data than it recalls traditional data (p < .0001).
Even a strong reasoning model (R1) is challenged more by our data.
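For the significance claim, a two-proportion z-test is one standard way to check a recall gap like this; the counts below are illustrative only, not the paper's numbers:

```python
# Sketch: test whether recall on Cartography data is significantly lower
# than recall on traditionally-annotated data. Counts are illustrative.
from statsmodels.stats.proportion import proportions_ztest

hits = [310, 415]    # items recalled correctly: [cartography, traditional]
totals = [500, 500]  # items evaluated in each condition
stat, p = proportions_ztest(hits, totals, alternative="smaller")
print(f"z = {stat:.2f}, p = {p:.4g}")  # p < .0001 would mirror the claim
```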
November 4, 2025 at 5:33 PM
We propose a mixed-initiative method called Culture Cartography.
To find culturally-representative knowledge, we let the human steer towards what they find most salient.
And to find challenging questions, we let the LLM steer towards topics it has low confidence in.
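One common way to operationalize the LLM side is a self-consistency proxy: sample several answers per topic and treat low agreement as low confidence. A minimal sketch (the `ask_llm` wrapper is a hypothetical placeholder, and this may differ from the paper's exact estimator):

```python
# Sketch of the LLM side of mixed initiative: estimate per-topic
# confidence by answer agreement, then steer toward low-confidence topics.
from collections import Counter

def ask_llm(question: str) -> str:
    # Placeholder: a real call would sample one answer (temperature > 0).
    return "sampled answer"

def confidence(question: str, n: int = 8) -> float:
    answers = [ask_llm(question) for _ in range(n)]
    # Agreement rate of the modal answer: low agreement ~ low confidence.
    return Counter(answers).most_common(1)[0][1] / n

def pick_low_confidence(topics, k=10):
    # Let the LLM steer: keep the k topics it is least confident about.
    return sorted(topics, key=confidence)[:k]
```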
November 4, 2025 at 5:33 PM
Other benchmarks use knowledge extracted from the rich cultural artifacts that humans actively produce on the web.
Still, this is a single-initiative process.
Researchers can’t steer the distribution towards questions of interest (i.e., those that challenge LLMs).
November 4, 2025 at 5:32 PM
How are prior benchmarks constructed?
In traditional annotation, the researcher picks some questions and the annotator passively provides ground truth answers.
This is single-initiative.
Annotators don't steer the process, so their interests and culture may not be represented.
November 4, 2025 at 5:32 PM
@butanium.bsky.social I nominate @aryaman.io
November 19, 2024 at 4:57 PM