Eddie Yang
@eddieyang.bsky.social
New paper: LLMs are increasingly used to label data in political science. But how reliable are these annotations, and what are the consequences for scientific findings? What are best practices? Some new findings from a large empirical evaluation.
Paper: eddieyang.net/research/llm_annotation.pdf
October 20, 2025 at 1:57 PM
Based on these findings (and more in the paper), we offer recommendations for best practices. We also summarize the recommendations in a checklist to facilitate a more principled annotation procedure.
October 20, 2025 at 1:53 PM
Finding 4: Bias-correction methods like DSL can reduce bias, but they introduce a trade-off: corrected estimates often have larger standard errors, so a fairly large ground-truth sample (600-1000+) is needed for the correction to be beneficial without losing too much precision.
October 20, 2025 at 1:53 PM
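The general idea behind this kind of correction can be illustrated with a much simpler design-based estimator than DSL itself: label everything with the LLM, hand-code a random subsample, and shift the LLM-based estimate by the average discrepancy seen in that subsample. The sketch below is a simplified stand-in (not the DSL estimator from the paper), with simulated data, just to show why the corrected estimate's standard error shrinks as the ground-truth sample grows.

    # Minimal sketch: correcting an LLM-based prevalence estimate with a hand-coded subsample.
    # Simplified difference estimator, NOT the actual DSL estimator; all data are simulated.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 20_000

    human = rng.integers(0, 2, size=n)                    # simulated "true" labels
    flip_up = (human == 0) & (rng.random(n) < 0.30)       # LLM over-predicts the positive class
    flip_down = (human == 1) & (rng.random(n) < 0.05)
    llm = np.where(flip_up, 1, np.where(flip_down, 0, human))

    for n_gold in [100, 600, 1000]:
        gold_idx = rng.choice(n, size=n_gold, replace=False)  # random hand-coded subsample
        discrepancy = human[gold_idx] - llm[gold_idx]

        corrected = llm.mean() + discrepancy.mean()            # shift the naive estimate
        se = discrepancy.std(ddof=1) / np.sqrt(n_gold)         # SE driven by the correction term

        print(f"n_gold={n_gold:5d}: naive={llm.mean():.3f}, corrected={corrected:.3f} "
              f"(SE of correction ~{se:.3f}), simulated truth={human.mean():.3f}")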
Finding 3: In-context learning (providing a few annotated examples in the prompt) offers only marginal improvements in reliability, with benefits plateauing quickly. Changes to prompt format have a small effect (smaller models and reasoning models are more sensitive).
October 20, 2025 at 1:53 PM
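As an illustration of what "in-context learning" means here, the sketch below assembles a few annotated examples into an annotation prompt. The example texts, labels, and prompt wording are hypothetical placeholders, and the call to an actual LLM API is left out.

    # Minimal sketch: building a few-shot (in-context learning) annotation prompt.
    # Example documents, labels, and wording are hypothetical placeholders.

    few_shot_examples = [
        ("The senator praised the new infrastructure bill.", "positive"),
        ("The governor's handling of the crisis was widely criticized.", "negative"),
        ("The committee will meet again next Tuesday.", "neutral"),
    ]

    def build_prompt(text: str, examples: list[tuple[str, str]]) -> str:
        """Return an annotation prompt with the labeled examples prepended."""
        lines = ["Label the tone of each text as positive, negative, or neutral.", ""]
        for example_text, label in examples:
            lines.append(f"Text: {example_text}")
            lines.append(f"Label: {label}")
            lines.append("")
        lines.append(f"Text: {text}")
        lines.append("Label:")
        return "\n".join(lines)

    prompt = build_prompt("The mayor announced a surprise tax cut.", few_shot_examples)
    print(prompt)  # send this to the LLM of your choice; the API call is omitted here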
Finding 2: This disagreement has significant downstream consequences. Re-running the original analyses with LLM annotations produced highly variable coefficient estimates, often altering the conclusions of the original studies.
October 20, 2025 at 1:53 PM
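To make the downstream-consequences point concrete, here is a small simulated sketch (not the paper's analyses): the same bivariate regression is re-run with labels from several hypothetical "LLMs" that disagree with the human coding at different rates, and the coefficient on the annotated variable moves around accordingly.

    # Minimal sketch: how annotation disagreement propagates into regression coefficients.
    # All data are simulated; this is not the paper's analysis.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 2000

    x_true = rng.integers(0, 2, size=n)                   # human-coded binary feature
    y = 1.0 + 2.0 * x_true + rng.normal(0, 1, size=n)     # outcome with true coefficient 2.0

    def ols_slope(x, y):
        """Slope from a simple bivariate OLS of y on x."""
        X = np.column_stack([np.ones_like(x, dtype=float), x.astype(float)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta[1]

    print(f"human labels:      slope = {ols_slope(x_true, y):.2f}")

    # Hypothetical LLM annotators that flip the human label at different error rates.
    for name, error_rate in [("llm_a", 0.10), ("llm_b", 0.25), ("llm_c", 0.40)]:
        flip = rng.random(n) < error_rate
        x_llm = np.where(flip, 1 - x_true, x_true)
        print(f"{name} (err={error_rate:.0%}): slope = {ols_slope(x_llm, y):.2f}")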
There is also an interesting linear relationship between LLM-human and LLM-LLM annotation agreement: when LLMs agree more with each other, they also tend to agree more with humans and supervised models! We give some suggestions on which annotation tasks are a good fit for LLMs.
October 20, 2025 at 1:53 PM
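A rough sketch of how one might check this relationship on their own tasks: for each task, compute the average pairwise agreement among several LLMs and the average LLM-human agreement, then correlate the two across tasks. Everything below is simulated and hypothetical, not the paper's data or exact procedure.

    # Minimal sketch: relating inter-LLM agreement to LLM-human agreement across tasks.
    # Simulated labels; not the paper's data or exact procedure.
    import itertools
    import numpy as np

    rng = np.random.default_rng(2)
    n_docs, n_llms, n_tasks = 300, 3, 12

    inter_llm, llm_human = [], []
    for _ in range(n_tasks):
        noise = rng.uniform(0.05, 0.45)                    # per-task difficulty
        human = rng.integers(0, 2, size=n_docs)
        llms = [np.where(rng.random(n_docs) < noise, 1 - human, human) for _ in range(n_llms)]

        pairwise = [(a == b).mean() for a, b in itertools.combinations(llms, 2)]
        inter_llm.append(np.mean(pairwise))                                # avg LLM-LLM agreement
        llm_human.append(np.mean([(l == human).mean() for l in llms]))     # avg LLM-human agreement

    r = np.corrcoef(inter_llm, llm_human)[0, 1]
    print(f"Correlation across tasks: {r:.2f}")            # positive in this simulation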
Finding 1: LLM annotations show pretty low intercoder reliability with the original annotations (coded by humans or supervised models). Perhaps surprisingly, reliability among the different LLMs themselves is only moderate (larger models do better).
October 20, 2025 at 1:53 PM
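For readers who want to run this kind of reliability check on their own annotations, here is a minimal sketch using Cohen's kappa from scikit-learn on hypothetical label vectors; the paper may use a different reliability statistic, and the data below are simulated.

    # Minimal sketch: intercoder reliability between LLM and original annotations.
    # Hypothetical labels; the paper's data and reliability statistic may differ.
    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)

    # Hypothetical binary annotations for 500 documents.
    human = rng.integers(0, 2, size=500)                        # human / supervised-model labels
    llm = np.where(rng.random(500) < 0.75, human, 1 - human)    # LLM agrees ~75% of the time

    percent_agreement = (human == llm).mean()
    kappa = cohen_kappa_score(human, llm)                       # chance-corrected agreement

    print(f"Percent agreement: {percent_agreement:.2f}")
    print(f"Cohen's kappa:     {kappa:.2f}")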