Manya Wadhwa
@manyawadhwa.bsky.social
Work done with amazing collaborators: @zaynesprague.bsky.social, @cmalaviya.bsky.social, Philippe Laban, @jessyjli.bsky.social and @gregdnlp.bsky.social 🙌
April 22, 2025 at 3:04 PM
📝 Read the full paper: arxiv.org/pdf/2504.15219
💻 You can also use our system to generate criteria: github.com/ManyaWadhwa/...
Also check out our 🎛️ UI to explore generated criteria + source URLs!
April 22, 2025 at 3:04 PM
Why do we need this? If you’ve used an LLM to draft a paper intro, research talk, or blog post, you’ve likely noticed that while the facts are correct, something feels off. What might be missing are the subtle cues and unspoken expectations. EvalAgent helps uncover and address those hidden layers! 🔮
April 22, 2025 at 3:04 PM
EvalAgent (EA-Web) criteria are often non-obvious to humans and not easily met by LLMs out of the box, making them valuable for evaluation. We also show that the criteria generated by EvalAgent are highly actionable (results in paper)!
April 22, 2025 at 3:04 PM
We test criteria generated by EvalAgent across 9 datasets, ranging from creative writing to technical reports, and compare against criteria generated by 2 other systems!

Results? We show that the criteria generated by EvalAgent (EA-Web) are 🎯 highly specific and 💭 implicit.
April 22, 2025 at 3:04 PM
For example, EvalAgent generates the following criteria for the academic talk prompt:

The response should have:
🪄 A compelling opening/motivation
🧠 A clear research question that it answers
🏁 A strong conclusion that restates findings
April 22, 2025 at 3:04 PM
EvalAgent emulates how a human would seek advice: 🔍 searching for things like “how to write a compelling talk”, reading expert tips from blogs and academic websites, and aggregating them into specific, useful evaluation criteria.
April 22, 2025 at 3:04 PM
Take a prompt like "Help me draft an academic talk on coffee intake vs. research productivity." We know the output should be factual. But how do we identify less obvious features that are not in the prompt, like the structure of the talk? That’s where EvalAgent steps in!
April 22, 2025 at 3:04 PM
EvalAgent uncovers criteria for evaluating LLM responses on open-ended tasks by:

📌 Decomposing the user prompt into key conceptual queries
🌐 Searching the web for expert advice and summarizing it
📋 Aggregating web-retrieved information into specific and actionable evaluation criteria (rough sketch below)
April 22, 2025 at 3:04 PM
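For intuition, the pipeline could look roughly like this Python sketch. The function names, model choice, and search backend are illustrative assumptions, not the released repo's actual API; see the GitHub link above for the real system.

```python
# Minimal sketch of an EvalAgent-style pipeline (hypothetical names; the
# released repo's actual API may differ). Assumes an OpenAI-compatible client
# and a caller-supplied web-search backend.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def llm(prompt: str) -> str:
    """Single-turn call to a chat model (model choice is an assumption)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def decompose(user_prompt: str) -> list[str]:
    """Step 1: turn the task prompt into advice-seeking search queries."""
    out = llm(
        "List 3 short web search queries a person would use to find expert "
        f"advice for this writing task, one per line:\n{user_prompt}"
    )
    return [q.strip("- ").strip() for q in out.splitlines() if q.strip()]


def summarize_advice(query: str, search_web) -> str:
    """Step 2: retrieve pages for a query and summarize the expert tips."""
    pages = search_web(query)  # caller-provided; any search API works here
    return llm(
        f"Summarize actionable writing advice for '{query}':\n" + "\n".join(pages)
    )


def eval_agent(user_prompt: str, search_web) -> str:
    """Step 3: aggregate the summaries into specific, checkable criteria."""
    queries = decompose(user_prompt)
    summaries = [summarize_advice(q, search_web) for q in queries]
    return llm(
        f"Given this task:\n{user_prompt}\n\nAnd this expert advice:\n"
        + "\n\n".join(summaries)
        + "\n\nWrite a numbered list of specific evaluation criteria for the response."
    )
```

Here search_web stands in for whatever retrieval backend you have; the key idea is that the criteria come from aggregated expert advice on the web rather than the LLM's own priors.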