Eddie Yang
@eddieyang.bsky.social
We also developed a new R package, localLLM (cran.r-project.org/package=loca...), that enables reproducible annotation using LLMs directly in R. More functionality to follow!
localLLM: Running Local LLMs with 'llama.cpp' Backend
The 'localLLM' package provides R bindings to the 'llama.cpp' library for running large language models. The package uses a lightweight architecture where the C++ backend library is downloaded at runtime.
CRAN.R-project.org
October 20, 2025 at 1:57 PM
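(A minimal usage sketch, not from the thread: the function names install_localLLM(), model_load(), and generate() below are assumptions for illustration and may not match the actual localLLM API; see the CRAN documentation for the real interface. Greedy decoding with a fixed seed is what makes the annotation reproducible.)

# Sketch only: install_localLLM(), model_load(), and generate() are assumed
# names for illustration and may differ from the real localLLM interface.
library(localLLM)

# install_localLLM()   # hypothetical helper: fetch the llama.cpp backend at runtime

# Load a quantized GGUF model from a local path (assumed interface)
model <- model_load("Llama-3.2-3B-Instruct-Q4_K_M.gguf")

prompt <- paste(
  "Classify the following sentence as POLITICAL or NOT_POLITICAL.",
  "Sentence: 'The senate votes on the budget tomorrow.'",
  "Answer with a single label.",
  sep = "\n"
)

# Temperature 0 and a fixed seed keep the annotation deterministic and reproducible
label <- generate(model, prompt, temperature = 0, seed = 42L, max_tokens = 5L)
print(label)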
Based on these findings (and more in the paper), we offer recommendations for best practices. We also summarized the recommendations in a checklist to facilitate a more principled procedure.
October 20, 2025 at 1:57 PM
Finding 4: Bias-correction methods like DSL can reduce bias, but they introduce a trade-off: corrected estimates often have larger standard errors, so a large ground-truth sample (600-1000+) is needed for the correction to pay off without losing too much precision.
October 20, 2025 at 1:57 PM
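(A simplified sketch of the idea behind DSL-style corrections, in my own notation rather than the paper's: expert-code a random subsample and combine the expert and LLM labels into a bias-corrected pseudo-outcome.)

$$\tilde{Y}_i = \hat{Y}_i + \frac{R_i}{\pi_i}\left(Y_i - \hat{Y}_i\right)$$

Here $\hat{Y}_i$ is the LLM annotation, $Y_i$ the expert label, $R_i \in \{0,1\}$ indicates whether document $i$ was randomly sampled for expert coding, and $\pi_i$ is its known sampling probability. Since $E[R_i/\pi_i] = 1$, the pseudo-outcome is unbiased for the true label on average even when the LLM errs, and downstream regressions use $\tilde{Y}_i$ in place of $\hat{Y}_i$. The $1/\pi_i$ weighting is also why standard errors grow when the expert-coded sample is small.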
Finding 3: In-context learning (providing a few annotated examples in the prompt) offers only marginal improvements in reliability, with benefits plateauing quickly. Changes to prompt format have a small effect (smaller models and reasoning models are more sensitive).
October 20, 2025 at 1:57 PM
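(To make the in-context-learning setup concrete, here is a hypothetical few-shot annotation prompt assembled in R; the task, texts, and labels are invented for illustration and are not from the paper.)

# Hypothetical few-shot prompt for a stance-annotation task (illustration only)
examples <- data.frame(
  text  = c("Great job by the governor on the new bill.",
            "This policy will ruin small businesses."),
  label = c("SUPPORT", "OPPOSE")
)

# Turn the labeled examples into "shots" prepended to the classification request
shots <- paste0("Text: ", examples$text, "\nLabel: ", examples$label, collapse = "\n\n")

target <- "The mayor's plan is a step in the right direction."

prompt <- paste0(
  "Label each text as SUPPORT or OPPOSE.\n\n",
  shots, "\n\n",
  "Text: ", target, "\nLabel:"
)

cat(prompt)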
Finding 2: This disagreement has significant downstream consequences. Re-running the original analyses with LLM annotations produced highly variable coefficient estimates, often altering the conclusions of the original studies.
October 20, 2025 at 1:57 PM
There is also an interesting linear relationship between LLM-human and LLM-LLM annotation agreement: when LLMs agree more with each other, they also tend to agree more with humans and supervised models! We also give some suggestions on which annotation tasks are well suited to LLMs.
October 20, 2025 at 1:57 PM
Finding 1: LLM annotations show fairly low intercoder reliability with the original annotations (coded by humans or supervised models). Perhaps surprisingly, reliability among the different LLMs themselves is only moderate (larger models do better).
October 20, 2025 at 1:57 PM
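(For context, intercoder reliability here means chance-corrected agreement such as Krippendorff's alpha. A minimal sketch with toy labels, assuming the irr package; this is not the paper's code or data.)

# Toy example: chance-corrected agreement between original labels and one LLM's labels
library(irr)

original <- c("pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu")
llm      <- c("pos", "neg", "pos", "pos", "neu", "neg", "neg", "neu")

# kripp.alpha() expects a raters-by-units matrix; recode categories as integers
lv <- c("neg", "neu", "pos")
ratings <- rbind(original = as.integer(factor(original, levels = lv)),
                 llm      = as.integer(factor(llm,      levels = lv)))

kripp.alpha(ratings, method = "nominal")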
The LLM annotations also allowed us to present results on:
1. effectiveness of in-context learning
2. model sensitivity to changes in prompt format
3. bias-correction methods
October 20, 2025 at 1:57 PM
We re-annotated data from 14 published papers in political science with 15 different LLMs (300 million annotations!). We compared them with the original annotations. We then re-ran the original analyses to see how much variation in coefficient estimates these LLMs give us.
October 20, 2025 at 1:57 PM
I’ll just add that political science also has much to contribute to understanding AI itself: uncovering training data politics, designing mechanisms for info/preference aggregation, measurement (i.e., model evaluation), etc.
June 19, 2025 at 8:39 PM