Rachel Ostrand, @wernergeyer.bsky.social , @keerthi166.bsky.social, Dennis Wei, and Justin Weisz!
If you'll be at AIES, I would love to connect and chat more about our work! 🙌
Rachel Ostrand, @wernergeyer.bsky.social , @keerthi166.bsky.social, Dennis Wei, and Justin Weisz!
If you'll be at AIES, I would love to connect and chat more about our work! 🙌
👉 GitHub: github.com/IBM/eval-ass...
👉 Demo: evalassist-evalassist.hf.space
👉 Project page: ibm.github.io/eval-assist/
👉 GitHub: github.com/IBM/eval-ass...
👉 Demo: evalassist-evalassist.hf.space
👉 Project page: ibm.github.io/eval-assist/
• Independent Judges module (no UI - see: github.com/IBM/eval-ass...)
• Unified Judge API
• Extensible: supports Unitxt, M-Prometheus & more
• Self-consistency: run judges multiple times
• In-context examples
• Multi-criteria evals w/roll-ups
• Custom prompts supported
• Independent Judges module (no UI - see: github.com/IBM/eval-ass...)
• Unified Judge API
• Extensible: supports Unitxt, M-Prometheus & more
• Self-consistency: run judges multiple times
• In-context examples
• Multi-criteria evals w/roll-ups
• Custom prompts supported
• Export & import test data (CSV)
• More benchmarks: JudgeBench & BigGen, grouped by capabilities
• 50+ Unitxt () criteria via Unixt (www.unitxt.ai) catalog integration
• Export/import test cases in JSON
• Model provider connections can be tested before evals
• Export & import test data (CSV)
• More benchmarks: JudgeBench & BigGen, grouped by capabilities
• 50+ Unitxt () criteria via Unixt (www.unitxt.ai) catalog integration
• Export/import test cases in JSON
• Model provider connections can be tested before evals
By @dohyojin.bsky.social - presenting Wed 9:00–10:30 in “Managing Tasks.” session
👉 arxiv.org/pdf/2410.00873
By @dohyojin.bsky.social - presenting Wed 9:00–10:30 in “Managing Tasks.” session
👉 arxiv.org/pdf/2410.00873
We’ve added powerful new features on both the UI and backend, plus we’ll be at UIST next week presenting our paper on task-specific evaluations & AI-assisted judgment strategies.
We’ve added powerful new features on both the UI and backend, plus we’ll be at UIST next week presenting our paper on task-specific evaluations & AI-assisted judgment strategies.