Eddie Yang
@eddieyang.bsky.social
We also developed a new R package, localLLM (cran.r-project.org/package=loca...), that enables reproducible annotation using LLMs directly in R. More functionality to follow!
localLLM: Running Local LLMs with 'llama.cpp' Backend
The 'localLLM' package provides R bindings to the 'llama.cpp' library for running large language models. The package uses a lightweight architecture where the C++ backend library is downloaded at runtime.
CRAN.R-project.org
October 20, 2025 at 1:57 PM
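(A minimal usage sketch, not from the thread: the function names install_localLLM(), model_load(), and generate() below are assumptions for illustration and may not match the actual localLLM API; see the CRAN documentation for the real interface. Greedy decoding with a fixed seed is what makes the annotation reproducible.)

# Sketch only: install_localLLM(), model_load(), and generate() are assumed
# names for illustration and may differ from the real localLLM interface.
library(localLLM)

# install_localLLM()   # hypothetical helper: fetch the llama.cpp backend at runtime

# Load a quantized GGUF model from a local path (assumed interface)
model <- model_load("Llama-3.2-3B-Instruct-Q4_K_M.gguf")

prompt <- paste(
  "Classify the following sentence as POLITICAL or NOT_POLITICAL.",
  "Sentence: 'The senate votes on the budget tomorrow.'",
  "Answer with a single label.",
  sep = "\n"
)

# Temperature 0 and a fixed seed keep the annotation deterministic and reproducible
label <- generate(model, prompt, temperature = 0, seed = 42L, max_tokens = 5L)
print(label)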
Based on these findings (and more in the paper), we offer recommendations for best practices. We also summarized the recommendations in a checklist to facilitate a more principled procedure.
October 20, 2025 at 1:57 PM
Finding 4: Bias-correction methods like DSL can reduce bias, but they introduce a trade-off: corrected estimates often have larger standard errors, so a large ground-truth sample (600-1000+) is needed for the correction to pay off without losing too much precision.
October 20, 2025 at 1:57 PM
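(A simplified sketch of the idea behind DSL-style corrections, in my own notation rather than the paper's: expert-code a random subsample and combine the expert and LLM labels into a bias-corrected pseudo-outcome.)

$$\tilde{Y}_i = \hat{Y}_i + \frac{R_i}{\pi_i}\left(Y_i - \hat{Y}_i\right)$$

Here $\hat{Y}_i$ is the LLM annotation, $Y_i$ the expert label, $R_i \in \{0,1\}$ indicates whether document $i$ was randomly sampled for expert coding, and $\pi_i$ is its known sampling probability. Since $E[R_i/\pi_i] = 1$, the pseudo-outcome is unbiased for the true label on average even when the LLM errs, and downstream regressions use $\tilde{Y}_i$ in place of $\hat{Y}_i$. The $1/\pi_i$ weighting is also why standard errors grow when the expert-coded sample is small.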
Finding 3: In-context learning (providing a few annotated examples in the prompt) offers only marginal improvements in reliability, with benefits plateauing quickly. Changes to prompt format have a small effect (smaller models and reasoning models are more sensitive).
October 20, 2025 at 1:57 PM
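(To make the in-context-learning setup concrete, here is a hypothetical few-shot annotation prompt assembled in R; the task, texts, and labels are invented for illustration and are not from the paper.)

# Hypothetical few-shot prompt for a stance-annotation task (illustration only)
examples <- data.frame(
  text  = c("Great job by the governor on the new bill.",
            "This policy will ruin small businesses."),
  label = c("SUPPORT", "OPPOSE")
)

# Turn the labeled examples into "shots" prepended to the classification request
shots <- paste0("Text: ", examples$text, "\nLabel: ", examples$label, collapse = "\n\n")

target <- "The mayor's plan is a step in the right direction."

prompt <- paste0(
  "Label each text as SUPPORT or OPPOSE.\n\n",
  shots, "\n\n",
  "Text: ", target, "\nLabel:"
)

cat(prompt)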
Finding 2: This disagreement has significant downstream consequences. Re-running the original analyses with LLM annotations produced highly variable coefficient estimates, often altering the conclusions of the original studies.
October 20, 2025 at 1:57 PM
There is also an interesting linear relationship between LLM-human and LLM-LLM annotation agreement: when LLMs agree more with each other, they also tend to agree more with humans and supervised models! We also give some suggestions on which annotation tasks are well suited to LLMs.
October 20, 2025 at 1:57 PM
Finding 1: LLM annotations show fairly low intercoder reliability with the original annotations (coded by humans or supervised models). Perhaps surprisingly, reliability among the different LLMs themselves is only moderate (larger models do better).
October 20, 2025 at 1:57 PM
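(For context, intercoder reliability here means chance-corrected agreement such as Krippendorff's alpha. A minimal sketch with toy labels, assuming the irr package; this is not the paper's code or data.)

# Toy example: chance-corrected agreement between original labels and one LLM's labels
library(irr)

original <- c("pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu")
llm      <- c("pos", "neg", "pos", "pos", "neu", "neg", "neg", "neu")

# kripp.alpha() expects a raters-by-units matrix; recode categories as integers
lv <- c("neg", "neu", "pos")
ratings <- rbind(original = as.integer(factor(original, levels = lv)),
                 llm      = as.integer(factor(llm,      levels = lv)))

kripp.alpha(ratings, method = "nominal")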
The LLM annotations also allowed us to present results on:
1. effectiveness of in-context learning
2. model sensitivity to changes in prompt format
3. bias-correction methods
October 20, 2025 at 1:57 PM
We re-annotated data from 14 published papers in political science with 15 different LLMs (300 million annotations!). We compared them with the original annotations. We then re-ran the original analyses to see how much variation in coefficient estimates these LLMs give us.
October 20, 2025 at 1:57 PM
I’ll just add that political science also has much to contribute to understanding AI itself: uncovering training data politics, designing mechanisms for info/preference aggregation, measurement (i.e., model evaluation), etc.
June 19, 2025 at 8:39 PM