Mrinal Verghese
@mrinal-verghese.bsky.social
PhD student at the Carnegie Mellon Robotics Institute.
I work on task learning for household robots.
He/Him.
http://mrinal.verghese.org
8/ A huge thank you goes out to my co-authors, Brian Chen, @heghbalz.bsky.social, Tushar Nagarajan, and Ruta Desai.
February 23, 2025 at 10:07 PM
7/ Finally, even though the overall success rate was low, in 50% of successful trials with our best model, the model guided a participant to complete an activity they had never done before. This highlights the potential of these systems to provide household assistance, particularly to elderly folks.
February 23, 2025 at 10:07 PM
6/ 3) Metrics from related offline benchmarks, like action anticipation, can be misleading and are not indicative of real-world performance. Check out our paper to see some of the errors we found with these metrics and how to conduct your own study!
February 23, 2025 at 10:07 PM
5/ 2) Grounding errors, where the LLM fails to recognize previously completed actions or suggests actions for a different variation of the task, are the dominant error modes. We can make progress in this domain by better enabling LLMs to attend to long visual histories.
February 23, 2025 at 10:07 PM
4/ 1) Encoding the visual task history using the Socratic approach is more effective than representing this info implicitly using VCLMs. Implicit representations capture “low-level” info, which is less useful for planning than the “high-level” info in explicit text representations.
February 23, 2025 at 10:07 PM
3/ We set up a user study in which participants completed the first half of a task themselves while the LLM monitored their progress, then relied on the LLM to guide them through the rest of the task.
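A minimal sketch of that interaction loop, assuming a simple two-phase split; the helper names (capture_clip, narrate, assistant_llm) and the midpoint_step parameter are illustrative placeholders, not the actual study code:

```python
# Hypothetical sketch of the user-in-the-loop protocol: the participant does
# the first half unaided while the assistant watches, then the assistant
# proposes each remaining step. All names below are placeholders.

def run_session(task, assistant_llm, narrate, capture_clip, midpoint_step):
    history = []  # text descriptions of everything observed so far

    # Phase 1: the participant acts on their own; the assistant only monitors.
    for _ in range(midpoint_step):
        clip = capture_clip()          # egocentric video of the user's action
        history.append(narrate(clip))  # convert the observation to text

    # Phase 2: the assistant suggests each remaining step until the task is done.
    while not task.complete():
        prompt = (
            f"Task: {task.name}\n"
            "Observed so far:\n" + "\n".join(history) + "\n"
            "What should the user do next?"
        )
        suggestion = assistant_llm(prompt)  # next-step guidance for the participant
        print("Assistant:", suggestion)
        clip = capture_clip()               # observe the user's attempt
        history.append(narrate(clip))
```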
We came away with three important findings:
February 23, 2025 at 10:07 PM
2/ We tested two approaches:
Socratic Models convert vision to text using pretrained models, such as narration models, and pass that text to an off-the-shelf LLM.
Vision-Conditioned Language Models (VCLMs) encode vision with pretrained encoders and pass the embeddings to a fine-tuned LLM.
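In rough pseudocode, the contrast looks like the sketch below; narration_model, video_encoder, llm, and vclm are placeholder callables standing in for the actual pretrained components, not specific models from the paper.

```python
# Illustrative contrast between the two pipelines described above.

def socratic_next_step(video_clips, narration_model, llm, task_prompt):
    # Socratic: vision -> text -> off-the-shelf LLM (text in, text out).
    narrations = [narration_model(clip) for clip in video_clips]
    prompt = task_prompt + "\nObserved actions:\n" + "\n".join(narrations)
    return llm(prompt)

def vclm_next_step(video_clips, video_encoder, vclm, task_prompt):
    # VCLM: vision -> embeddings -> fine-tuned LLM conditioned on those embeddings.
    embeddings = [video_encoder(clip) for clip in video_clips]
    return vclm(text=task_prompt, visual_tokens=embeddings)
```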
February 23, 2025 at 10:07 PM
1/ Quick Info:
This work, User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance, is being presented next weekend at #WACV2025.

Paper: www.arxiv.org/abs/2408.03160
Poster: Saturday, March 1, Poster Session 2
Oral: Sunday, March 2, Oral Session 5.4 (Generative Models V), 2:00 PM
February 23, 2025 at 10:07 PM
Hello!
November 11, 2024 at 6:55 PM