Werner Geyer
@wernergeyer.bsky.social
Chief Scientist Human-Centered Trustworthy AI @ IBM Research. Interested in Human+AI Interaction & AI-Assisted Productivity. Opinions are my own! https://wernergeyer.com
Reposted by Werner Geyer
❣️ Shout out to my amazing co-authors:
Rachel Ostrand, @wernergeyer.bsky.social, @keerthi166.bsky.social, Dennis Wei, and Justin Weisz!

If you'll be at AIES, I would love to connect and chat more about our work! 🙌
October 16, 2025 at 11:01 AM
6/ Try it out & explore more:
👉 GitHub: github.com/IBM/eval-ass...
👉 Demo: evalassist-evalassist.hf.space
👉 Project page: ibm.github.io/eval-assist/
September 25, 2025 at 5:56 PM
5/ And we’re planning to bring several backend capabilities into the UI soon. Stay tuned 👀
September 25, 2025 at 5:56 PM
4/ ⚙️ Backend updates
• Independent Judges module (no UI; see github.com/IBM/eval-ass...)
• Unified Judge API
• Extensible: supports Unitxt, M-Prometheus & more
• Self-consistency: run judges multiple times
• In-context examples
• Multi-criteria evals w/ roll-ups (rough sketch below)
• Custom prompts supported
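To make self-consistency and multi-criteria roll-ups concrete, here is a minimal Python sketch. All names in it are hypothetical and illustrative only; this is not EvalAssist's actual judges API (see the GitHub link above for the real module).

# Illustrative sketch of self-consistency + multi-criteria roll-up for
# LLM-as-a-Judge. All names are hypothetical, not EvalAssist's real API.
import random
from collections import Counter

def call_judge_llm(output: str, criterion: str) -> str:
    # Stand-in for a real judge-model call: prompt a judge LLM with the
    # criterion and the output to evaluate, then parse its verdict.
    return random.choice(["pass", "pass", "fail"])  # dummy verdicts

def self_consistent_verdict(output: str, criterion: str, runs: int = 5) -> str:
    # Self-consistency: run the judge several times and keep the majority
    # verdict, smoothing over sampling noise in the judge model.
    votes = [call_judge_llm(output, criterion) for _ in range(runs)]
    return Counter(votes).most_common(1)[0][0]

def multi_criteria_rollup(output: str, criteria: list[str]) -> dict:
    # Multi-criteria eval with a roll-up: judge each criterion separately,
    # then aggregate the per-criterion verdicts into one overall result.
    results = {c: self_consistent_verdict(output, c) for c in criteria}
    results["overall"] = "pass" if all(v == "pass" for v in results.values()) else "fail"
    return results

print(multi_criteria_rollup("The sky is blue.", ["faithfulness", "conciseness"]))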
September 25, 2025 at 5:56 PM
3/ 🖥️ UI updates
• Export & import test data (CSV)
• More benchmarks: JudgeBench & BigGen, grouped by capabilities
• 50+ criteria via Unitxt (www.unitxt.ai) catalog integration
• Export/import test cases in JSON (illustrative shape below)
• Model provider connections can be tested before evals
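For a feel of what an exported test case bundles together (the input, the model output under evaluation, and the criterion to judge it by), here is a hypothetical example. The field names are purely illustrative, not EvalAssist's actual export schema.

# Hypothetical shape of an exported test case; field names are
# illustrative only, not EvalAssist's actual JSON export schema.
import json

test_case = {
    "criterion": "conciseness",                       # what the judge checks
    "input": "Summarize the article in one sentence.",
    "output": "The article argues that ...",          # response under test
    "expected_verdict": "pass",                       # optional gold label
}
print(json.dumps(test_case, indent=2))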
September 25, 2025 at 5:56 PM
2/ 📄 Paper @acmuist.bsky.social: EvalAssist: Insights on Task-Specific Evaluations and AI-Assisted Judgment Strategy Preferences
By @dohyojin.bsky.social, presenting Wed 9:00–10:30 in the “Managing Tasks” session
👉 arxiv.org/pdf/2410.00873
September 25, 2025 at 5:56 PM
1/ EvalAssist makes it easier to test, refine & share evaluation criteria for LLMs. ibm.github.io/eval-assist/
We’ve added powerful new features on both the UI and backend, plus we’ll be at UIST next week presenting our paper on task-specific evaluations & AI-assisted judgment strategies.
September 25, 2025 at 5:56 PM