Manuel Gomez Rodriguez
@autreche.bsky.social
Human-Centric Machine Learning at the Max Planck Institute for Software Systems
Reposted by Manuel Gomez Rodriguez
Heading to Rio de Janeiro 🇧🇷 for UAI 2025 (@auai.org) to present our tutorial with @tobigerstenberg.bsky.social and @autreche.bsky.social on "Counterfactuals in Minds and Machines" on Monday. Looking forward to this! If you are in Rio, let's meet!
July 18, 2025 at 12:38 PM
Check out an implementation of our model on several LLMs from the Llama family at github.com/Networks-Lea....

This has been a joint effort with multiple members of my group: Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco & Suhas Thejaswi.
GitHub - Networks-Learning/coupled-llm-evaluation: Code for "Evaluation of Large Language Models via Coupled Token Generation", Arxiv 2025.
Code for "Evaluation of Large Language Models via Coupled Token Generation", Arxiv 2025. - Networks-Learning/coupled-llm-evaluation
github.com
February 5, 2025 at 8:33 AM
This suggests that the apparent advantage of an LLM over others under existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. Our model is easy to implement and does not require any fine-tuning or prompt engineering 5/
February 5, 2025 at 8:33 AM
In evaluations based on (human) pairwise comparisons, we show that coupled and standard autoregressive generation can, surprisingly, lead to different rankings when comparing more than two LLMs, even with an infinite number of samples 4/
February 5, 2025 at 8:33 AM
In evaluations on benchmark datasets, we show that coupled autoregressive generation leads to the same conclusions as standard autoregressive generation while provably using fewer samples. For example, on MMLU, coupled autoregressive generation requires up to 40% fewer samples 3/
February 5, 2025 at 8:33 AM
Our key idea is to couple the autoregressive generation processes of the LLMs under comparison, particularly their samplers, by having them share the same source of randomness. Loosely speaking, coupled autoregressive generation ensures that no LLM has better luck than the others 2/
February 5, 2025 at 8:33 AM
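
A minimal sketch of what coupled sampling could look like, assuming inverse-CDF sampling with one shared uniform draw per token step (one standard way to couple samplers). The interface below, with models exposed as callables returning next-token probability vectors, is illustrative and not the actual API of the repository above:

import numpy as np

def sample_coupled(prob_dists, u):
    """Sample one token from each model's next-token distribution using
    the SAME uniform draw u, via inverse-CDF sampling."""
    tokens = []
    for probs in prob_dists:
        cdf = np.cumsum(probs)
        tokens.append(int(np.searchsorted(cdf, u)))
    return tokens

def coupled_generation(models, prompt, max_new_tokens, seed=0):
    """Generate continuations for several models that share one source of randomness.
    `models` is a list of callables mapping a token sequence to a next-token
    probability vector (hypothetical interface for illustration only)."""
    rng = np.random.default_rng(seed)
    sequences = [list(prompt) for _ in models]
    for _ in range(max_new_tokens):
        u = rng.uniform()  # the shared source of randomness at this step
        dists = [m(seq) for m, seq in zip(models, sequences)]
        next_tokens = sample_coupled(dists, u)
        for seq, t in zip(sequences, next_tokens):
            seq.append(t)
    return sequences

Because every model consumes the same u at every step, differences between the generated outputs reflect differences between the models' distributions rather than differences in sampling luck.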