Caleb Ziems
@calebziems.com
PhD student at Stanford NLP. Working on Social NLP and CSS. Previously at GaTech, Meta AI, Emory.
📍Palo Alto, CA
🔗 calebziems.com
Thanks to many @stanfordnlp.bsky.social members for feedback! @juliakruk.bsky.social @yanzhe.bsky.social @myra.bsky.social @jaredlcm.bsky.social
May be of interest to @paul-rottger.bsky.social @monadiab77.bsky.social @vinodkpg.bsky.social @dbamman.bsky.social @davidjurgens.bsky.social and you
November 4, 2025 at 6:04 PM
Our implementation of Culture Cartography is based on Farsight (Wang et al., 2024).
This was an interdisciplinary effort across computer science (@diyiyang.bsky.social, @williamheld.com, Jane Yu) and sociology (David Grusky and Amir Goldberg), and the research process taught me so much!
November 4, 2025 at 5:38 PM
Finally, Culture Cartography is aligned with prior notions of culture evals in our field.
We observe positive transfer performance from Cartography to two leading benchmarks: BLEnD (Myung et al., 2024) and CulturalBench (Chiu et al., 2024).
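As a minimal sketch of what a transfer check like this could look like in code (all helpers and model handles below are hypothetical placeholders, not the paper's actual setup):

```python
# Minimal sketch of the transfer check: compare a model trained on
# Cartography data against its base model on external benchmarks.
# `load_benchmark`, `answer`, and both model handles are hypothetical
# placeholders for the paper's actual setup.

def load_benchmark(name: str):
    # Placeholder: a real loader would read the benchmark's QA items.
    return [("toy question", "A"), ("another toy question", "B")]

def answer(model, question: str) -> str:
    # Placeholder: a real call would run model inference.
    return "A"

def accuracy(model, items) -> float:
    return sum(answer(model, q) == gold for q, gold in items) / len(items)

cartography_model, base_model = "tuned", "base"  # placeholder handles
for bench in ["BLEnD", "CulturalBench"]:
    items = load_benchmark(bench)
    gain = accuracy(cartography_model, items) - accuracy(base_model, items)
    print(bench, gain)  # positive gain = positive transfer
```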
November 4, 2025 at 5:35 PM
Compared to knowledge extraction, Culture Cartography is less prone to test-set contamination.
We evaluate GPT-4o with and without search and find no significant difference in their recall on Cartography data.
Culture Cartography is "Google proof" since search doesn't help.
November 4, 2025 at 5:34 PM
Compared to traditional annotation, Culture Cartography more often elicits knowledge that is unknown to LLMs.
Qwen-2 72B recalls 21% less Cartography data than it recalls traditional data (p < .0001).
Even a strong reasoning model (R1) is challenged more by our data.
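For the significance claim, a two-proportion z-test is one standard way to check a recall gap like this; the counts below are illustrative only, not the paper's numbers:

```python
# Sketch: test whether recall on Cartography data is significantly lower
# than recall on traditionally-annotated data. Counts are illustrative.
from statsmodels.stats.proportion import proportions_ztest

hits = [310, 415]    # items recalled correctly: [cartography, traditional]
totals = [500, 500]  # items evaluated in each condition
stat, p = proportions_ztest(hits, totals, alternative="smaller")
print(f"z = {stat:.2f}, p = {p:.4g}")  # p < .0001 would mirror the claim
```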
November 4, 2025 at 5:33 PM
We propose a mixed-initiative method called Culture Cartography.
To find culturally-representative knowledge, we let the human steer towards what they find most salient.
And to find challenging questions, we let the LLM steer towards topics it has low confidence in.
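One common way to operationalize the LLM side is a self-consistency proxy: sample several answers per topic and treat low agreement as low confidence. A minimal sketch (the `ask_llm` wrapper is a hypothetical placeholder, and this may differ from the paper's exact estimator):

```python
# Sketch of the LLM side of mixed initiative: estimate per-topic
# confidence by answer agreement, then steer toward low-confidence topics.
from collections import Counter

def ask_llm(question: str) -> str:
    # Placeholder: a real call would sample one answer (temperature > 0).
    return "sampled answer"

def confidence(question: str, n: int = 8) -> float:
    answers = [ask_llm(question) for _ in range(n)]
    # Agreement rate of the modal answer: low agreement ~ low confidence.
    return Counter(answers).most_common(1)[0][1] / n

def pick_low_confidence(topics, k=10):
    # Let the LLM steer: keep the k topics it is least confident about.
    return sorted(topics, key=confidence)[:k]
```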
November 4, 2025 at 5:33 PM
Other benchmarks use knowledge extracted from the rich cultural artifacts that humans actively produce on the web.
Still, this is a single-initiative process.
Researchers can’t steer the distribution towards questions of interest (i.e., those that challenge LLMs).
November 4, 2025 at 5:32 PM
How are prior benchmarks constructed?
In traditional annotation, the researcher picks some questions and the annotator passively provides ground truth answers.
This is single-initiative.
Annotators don't steer the process, so their interests and culture may not be represented.
November 4, 2025 at 5:32 PM
@butanium.bsky.social I nominate @aryaman.io
November 19, 2024 at 4:57 PM