Arduin Findeis
@arduin.io
Working on evaluation of AI models (via human and AI feedback) | PhD candidate @cst.cam.ac.uk

Web: https://arduin.io
Github: https://github.com/rdnfn
Latest project: https://app.feedbackforensics.com
Reposted by Arduin Findeis
Can AI simulate human behavior? 🧠
The promise is revolutionary for science & policy. But there’s a huge "IF": Do these simulations actually reflect reality?
To find out, we introduce SimBench: The first large-scale benchmark for group-level social simulation. (1/9)
October 28, 2025 at 4:54 PM
👋 I'll be at #ACL2025 presenting research from my Apple internship! Our poster is titled: "Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?"

☞ Let's meet: come by our poster on Tuesday (29/7), 10:30 - 12:00, Hall 4/5, or DM me to set up a meeting!

✍︎ Paper link below ↓
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two…
machinelearning.apple.com
July 27, 2025 at 3:22 PM
Excited to be in Singapore for ICLR! Keen to chat about interpreting feedback data and detecting model characteristics ⚖️

Reach out or come by our poster on Inverse Constitutional AI on Friday 25 April from 10am-12.30pm (#520 in Hall 2B) - @timokauf.bsky.social and I will be there!
April 24, 2025 at 3:47 PM
How exactly was the initial Chatbot Arena version of Llama 4 Maverick different from the public HuggingFace version?🕵️

I used our Feedback Forensics app to quantitatively analyse how exactly these two models differ. An overview…👇🧵
April 17, 2025 at 1:55 PM
🕵🏻💬 Introducing Feedback Forensics: a new tool to investigate pairwise preference data.

Feedback data is notoriously difficult to interpret and has many known issues – our app aims to help!

Try it at app.feedbackforensics.com

Three example use-cases 👇🧵
March 17, 2025 at 6:12 PM