Mrinal Verghese
@mrinal-verghese.bsky.social
PhD student at the Carnegie Mellon Robotics Institute.
I work on task learning for household robots.
He/Him.
http://mrinal.verghese.org
8/ A huge thank you goes out to my co-authors, Brian Chen, @heghbalz.bsky.social, Tushar Nagarajan, and Ruta Desai.
February 23, 2025 at 10:07 PM
7/ Finally, even though the overall success rate was low, in 50% of successful trials with our best model, the model guided a participant to complete an activity they had never done before. This highlights the potential of these systems to provide household assistance, particularly to elderly folks.
February 23, 2025 at 10:07 PM
6/ 3) Metrics from related offline benchmarks, like action anticipation, can be misleading and are not indicative of real-world performance. Check out our paper to see some of the errors we found with these metrics and how to conduct your own study!
February 23, 2025 at 10:07 PM
5/ 2) Grounding errors, where the LLM fails to recognize previously completed actions or suggests actions for a different variation of the task, are the dominant error modes. We can make progress in this domain by better enabling LLMs to attend to long visual histories.
February 23, 2025 at 10:07 PM
4/ 1) Encoding the visual task history using the Socratic approach is more effective than representing this info implicitly using VCLMs. Implicit representations capture “low-level” info, which is less useful for planning than the “high-level” info in explicit text representations.
February 23, 2025 at 10:07 PM
3/ We set up a user study in which participants completed the first half of a task themselves while the LLM monitored their progress, then relied on the LLM to guide them through the rest of the task.
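A minimal sketch of that interaction loop, assuming a simple two-phase split; the helper names (capture_clip, narrate, assistant_llm) and the midpoint_step parameter are illustrative placeholders, not the actual study code:

```python
# Hypothetical sketch of the user-in-the-loop protocol: the participant does
# the first half unaided while the assistant watches, then the assistant
# proposes each remaining step. All names below are placeholders.

def run_session(task, assistant_llm, narrate, capture_clip, midpoint_step):
    history = []  # text descriptions of everything observed so far

    # Phase 1: the participant acts on their own; the assistant only monitors.
    for _ in range(midpoint_step):
        clip = capture_clip()          # egocentric video of the user's action
        history.append(narrate(clip))  # convert the observation to text

    # Phase 2: the assistant suggests each remaining step until the task is done.
    while not task.complete():
        prompt = (
            f"Task: {task.name}\n"
            "Observed so far:\n" + "\n".join(history) + "\n"
            "What should the user do next?"
        )
        suggestion = assistant_llm(prompt)  # next-step guidance for the participant
        print("Assistant:", suggestion)
        clip = capture_clip()               # observe the user's attempt
        history.append(narrate(clip))
```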
We came away with three important findings:
February 23, 2025 at 10:07 PM
2/ We tested two approaches:
Socratic Models convert vision to text using pretrained models, such as narration models, and pass that text to an off-the-shelf LLM.
Vision-Conditioned Language Models (VCLMs) encode vision with pretrained encoders and pass the embeddings to a fine-tuned LLM.
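In rough pseudocode, the contrast looks like the sketch below; narration_model, video_encoder, llm, and vclm are placeholder callables standing in for the actual pretrained components, not specific models from the paper.

```python
# Illustrative contrast between the two pipelines described above.

def socratic_next_step(video_clips, narration_model, llm, task_prompt):
    # Socratic: vision -> text -> off-the-shelf LLM (text in, text out).
    narrations = [narration_model(clip) for clip in video_clips]
    prompt = task_prompt + "\nObserved actions:\n" + "\n".join(narrations)
    return llm(prompt)

def vclm_next_step(video_clips, video_encoder, vclm, task_prompt):
    # VCLM: vision -> embeddings -> fine-tuned LLM conditioned on those embeddings.
    embeddings = [video_encoder(clip) for clip in video_clips]
    return vclm(text=task_prompt, visual_tokens=embeddings)
```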
February 23, 2025 at 10:07 PM
1/ Quick Info:
This work, User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance, is being presented next weekend at #WACV2025.

Paper: www.arxiv.org/abs/2408.03160
Poster: Saturday, March 1, Poster Session 2
Oral: Sunday, March 2, Oral Session 5.4 (Generative Models V), 2:00 PM
February 23, 2025 at 10:07 PM
Hello!
November 11, 2024 at 6:55 PM