Shiwali Mohan
shiwali.bsky.social
Brewing something exciting! | AI Scientist | Intelligent Agents & Multi-Agent Systems | Agent Frameworks & Architectures | Human-Agent Collaboration | Cognitive Science
It's a preliminary study, but it shows how we can make #AI #ML evaluations more informative and move beyond benchmarks curated with minimal insight into what a useful question is and what an appropriate answer looks like. (8/8)

📖 Paper: arxiv.org/abs/2402.00234
🎥 Talk: drive.google.com/file/d/1m79W...
Can Generative AI Support Patients' & Caregivers' Informational Needs? Towards Task-Centric Evaluation Of AI Systems
Generative AI systems such as ChatGPT and Claude are built upon language models that are typically evaluated for accuracy on curated benchmark datasets. Such evaluation paradigms measure predictive an...
arxiv.org
March 24, 2025 at 7:28 PM
🤖 Measured how #GenAI systems performed, not only in terms of correctness but also in how similar their responses were to those of an expert answering the same question. (7/8)
March 24, 2025 at 7:28 PM
📥 Curated an evaluation question set from observed interactions. The set contains real questions asked by participants as they were attempting to do a specific task. Such datasets are critical to measuring if an #AI system is producing responses that are useful. (6/8)
March 24, 2025 at 7:28 PM
🤕👩‍⚕️ Studied how people interact with an expert when one is available. This uncovered the specific needs people have as they make sense of data, and how an expert addresses those needs. (5/8)
March 24, 2025 at 7:28 PM
🏥 Identified a specific use case in which people need support from an expert but the expert is not easily accessible: understanding medical scans and reports in order to make good decisions about their treatment. (4/8)
March 24, 2025 at 7:28 PM
In our most recent paper, we explore an evaluation approach for #GenAI #GenerativeAI systems. Here are the steps we followed: (3/8)
March 24, 2025 at 7:28 PM
As a science, we have to adopt rigorous evaluations that identify what #IntelligentSystem #AIAgent #Agent behavior should be & measure whether it works as intended. Move beyond a 𝘱𝘳𝘰𝘣𝘭𝘦𝘮-𝘢𝘨𝘯𝘰𝘴𝘵𝘪𝘤 metric (accuracy) on a 𝘵𝘢𝘴𝘬-𝘢𝘨𝘯𝘰𝘴𝘵𝘪𝘤 benchmark. Adopt practices from #HCI, #psychology, #economics. (2/8)
March 24, 2025 at 7:28 PM