Werner Geyer
@wernergeyer.bsky.social
Chief Scientist Human-Centered Trustworthy AI @ IBM Research. Interested in Human+AI Interaction & AI-Assisted Productivity. Opinions are my own! https://wernergeyer.com
Reposted by Werner Geyer
❣️ Shout out to my amazing co-authors:
Rachel Ostrand, @wernergeyer.bsky.social, @keerthi166.bsky.social, Dennis Wei, and Justin Weisz!

If you'll be at AIES, I would love to connect and chat more about our work! 🙌
October 16, 2025 at 11:01 AM
6/ Try it out & explore more:
👉 GitHub: github.com/IBM/eval-ass...
👉 Demo: evalassist-evalassist.hf.space
👉 Project page: ibm.github.io/eval-assist/
September 25, 2025 at 5:56 PM
5/ And we’re planning to bring several backend capabilities into the UI soon. Stay tuned 👀
September 25, 2025 at 5:56 PM
4/ ⚙️ Backend updates
• Independent Judges module (no UI; see github.com/IBM/eval-ass...)
• Unified Judge API
• Extensible: supports Unitxt, M-Prometheus & more
• Self-consistency: run judges multiple times
• In-context examples
• Multi-criteria evals w/ roll-ups (rough sketch below)
• Custom prompts supported
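To make self-consistency and multi-criteria roll-ups concrete, here is a minimal Python sketch. All names in it are hypothetical and illustrative only; this is not EvalAssist's actual judges API (see the GitHub link above for the real module).

# Illustrative sketch of self-consistency + multi-criteria roll-up for
# LLM-as-a-Judge. All names are hypothetical, not EvalAssist's real API.
import random
from collections import Counter

def call_judge_llm(output: str, criterion: str) -> str:
    # Stand-in for a real judge-model call: prompt a judge LLM with the
    # criterion and the output to evaluate, then parse its verdict.
    return random.choice(["pass", "pass", "fail"])  # dummy verdicts

def self_consistent_verdict(output: str, criterion: str, runs: int = 5) -> str:
    # Self-consistency: run the judge several times and keep the majority
    # verdict, smoothing over sampling noise in the judge model.
    votes = [call_judge_llm(output, criterion) for _ in range(runs)]
    return Counter(votes).most_common(1)[0][0]

def multi_criteria_rollup(output: str, criteria: list[str]) -> dict:
    # Multi-criteria eval with a roll-up: judge each criterion separately,
    # then aggregate the per-criterion verdicts into one overall result.
    results = {c: self_consistent_verdict(output, c) for c in criteria}
    results["overall"] = "pass" if all(v == "pass" for v in results.values()) else "fail"
    return results

print(multi_criteria_rollup("The sky is blue.", ["faithfulness", "conciseness"]))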
September 25, 2025 at 5:56 PM
3/ 🖥️ UI updates
• Export & import test data (CSV)
• More benchmarks: JudgeBench & BigGen, grouped by capabilities
• 50+ criteria via Unitxt (www.unitxt.ai) catalog integration
• Export/import test cases in JSON (illustrative shape below)
• Model provider connections can be tested before evals
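For a feel of what an exported test case bundles together (the input, the model output under evaluation, and the criterion to judge it by), here is a hypothetical example. The field names are purely illustrative, not EvalAssist's actual export schema.

# Hypothetical shape of an exported test case; field names are
# illustrative only, not EvalAssist's actual JSON export schema.
import json

test_case = {
    "criterion": "conciseness",                       # what the judge checks
    "input": "Summarize the article in one sentence.",
    "output": "The article argues that ...",          # response under test
    "expected_verdict": "pass",                       # optional gold label
}
print(json.dumps(test_case, indent=2))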
September 25, 2025 at 5:56 PM
2/ 📄 Paper @acmuist.bsky.social: EvalAssist: Insights on Task-Specific Evaluations and AI-Assisted Judgment Strategy Preferences
By @dohyojin.bsky.social, presenting Wed 9:00–10:30 in the “Managing Tasks” session
👉 arxiv.org/pdf/2410.00873
September 25, 2025 at 5:56 PM
1/ EvalAssist makes it easier to test, refine & share evaluation criteria for LLMs. ibm.github.io/eval-assist/
We’ve added powerful new features on both the UI and backend, plus we’ll be at UIST next week presenting our paper on task-specific evaluations & AI-assisted judgment strategies.
September 25, 2025 at 5:56 PM