Manya Wadhwa
@manyawadhwa.bsky.social
EvalAgent (EA-Web) criteria are often non-obvious to humans and not easily met by LLMs out of the box, making them valuable for evaluation. We also show that the criteria generated by EvalAgent are highly actionable (results in paper)!
April 22, 2025 at 3:04 PM
We test criteria generated by EvalAgent across 9 datasets, from creative writing to technical reports, and compare against criteria generated by 2 other systems!

Results? We show that the criteria generated by EvalAgent (EA-Web) are 🎯 highly specific and 💭 implicit.
April 22, 2025 at 3:04 PM
Take a prompt "Help me draft an academic talk on coffee intake vs research productivity." We know the output should be factual. But how do we identify less obvious features that are not in the prompt, like the structure of the talk? That’s where EvalAgent steps in!
April 22, 2025 at 3:04 PM
EvalAgent uncovers criteria for evaluating LLM responses on open-ended tasks by:

📌 Decomposing the user prompt into key conceptual queries
🌐 Searching the web for expert advice and summarizing it
📋 Aggregating web-retrieved information into specific and actionable evaluation criteria
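
For intuition, here is a minimal Python sketch of that three-step loop. It is not the authors' implementation: the prompts, the model choice, and the `search_web` placeholder are all assumptions for illustration.

```python
# Sketch of an EvalAgent-style pipeline (illustrative, not the authors' code).
# Assumes an OpenAI-compatible client; search_web is a placeholder you would
# back with a real search API.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """One-shot LLM call."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical backbone model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def search_web(query: str) -> list[str]:
    """Placeholder: return text snippets from a search API of your choice."""
    return [f"(snippet retrieved for: {query})"]

def eval_agent(user_prompt: str) -> str:
    # 1) Decompose the user prompt into key conceptual queries.
    queries = ask(
        "Decompose this request into 3 short web search queries about how "
        f"experts approach the task, one per line:\n{user_prompt}"
    ).splitlines()

    # 2) Search the web for expert advice and summarize it.
    advice = [
        ask("Summarize the expert advice in these snippets:\n"
            + "\n".join(search_web(q)))
        for q in queries if q.strip()
    ]

    # 3) Aggregate the advice into specific, actionable evaluation criteria.
    return ask(
        "Turn the advice below into a numbered list of specific, actionable "
        f"criteria for evaluating responses to: {user_prompt}\n\n"
        + "\n\n".join(advice)
    )

print(eval_agent("Help me draft an academic talk on coffee intake vs research productivity."))
```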
April 22, 2025 at 3:04 PM
Evaluating language model responses on open-ended tasks is hard! 🤔

We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️.

EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇
April 22, 2025 at 3:04 PM
We validate our task by checking if expert annotators agree on how NLEs from our dataset should be rescaled. We then show that GPT-4 can discern subtleties in NLEs as well as humans do! We also show that using a scoring rubric is very important for this rescaling!
November 16, 2023 at 2:55 AM
Let's consider the evaluation of LLM-generated answers for document-grounded, non-factoid question answering. LLMs like GPT-4 can answer these questions really well! To evaluate their responses, annotators gave Likert ratings and explained the intricacies of their judgments using natural language explanations (NLEs).
November 16, 2023 at 2:54 AM
Excited to share our updated preprint (w/ Jifan Chen, @jessyjli.bsky.social, @gregdnlp.bsky.social)

📜 arxiv.org/pdf/2305.147...

We show that LLMs can help us understand the nuances of annotation: they can convert the expressiveness of natural language explanations into a numerical form.

🧵
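
As a rough illustration of what rubric-guided rescaling can look like (the rubric and prompt here are invented for the example, not the paper's):

```python
# Sketch: map an annotator's natural language explanation (NLE) onto a
# numeric scale using a scoring rubric. Rubric and prompt are hypothetical.
from openai import OpenAI

client = OpenAI()

RUBRIC = """4 = fully correct and complete answer
3 = correct but missing minor details
2 = partially correct, with notable gaps
1 = mostly incorrect or unsupported"""

def rescale_nle(nle: str) -> int:
    """Ask the model to read the NLE and return a single rubric score."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Scoring rubric:\n{RUBRIC}\n\n"
                f"Annotator explanation: {nle}\n"
                "Reply with the single digit score that best matches."
            ),
        }],
    )
    return int(resp.choices[0].message.content.strip())

print(rescale_nle("Covers the main point but omits the caveat about sample size."))
```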
November 16, 2023 at 2:54 AM