Lightnews — Scholar-powered news

Mrinal Verghese

@mrinal-verghese.bsky.social

1.4K followers 680 following 10 posts

PhD student at Carnegie Mellon Robotics Institute.
I work on task learning for household robots.
He/Him.
http://mrinal.verghese.org

Mrinal Verghese

@mrinal-verghese.bsky.social

5/ 2) Grounding errors, where the LLM fails to recognize previously completed actions or suggests actions for a different variation of the task, are the dominant error modes. We can make progress in this domain by better enabling LLMs to attend to long visual histories.

A diagram showing the flow of a latte-making activity. The majority of errors made by the system are classified as "grounding errors".

February 23, 2025 at 10:07 PM

Mrinal Verghese

@mrinal-verghese.bsky.social

4/ 1) Encoding the visual task history using the Socratic approach is more effective than representing this info implicitly using VCLMs. Implicit representations capture “low-level” info, which is less useful for planning than the “high-level” info in explicit text representations.

A table comparing two methods, Socratic 13B and VCLM 13B. Socratic 13B has a success rate of 27.8 and a mean intersection over union of 30.4. VCLM 13B has a success rate of 16.7 and a mean intersection over union of 23.0.

February 23, 2025 at 10:07 PM

Mrinal Verghese

@mrinal-verghese.bsky.social

3/ We set up a user study in which participants completed the first half of a task themselves while the LLM monitored their progress, then relied on the LLM to guide them through the rest of the task.
We came away with three important findings:

A diagram showing the system setup for evaluating Multimodal LLMs for activity assistance. A video stream from a user is fed to a Multimodal LLM, which generates a plan to complete an activity.

February 23, 2025 at 10:07 PM

Mrinal Verghese

@mrinal-verghese.bsky.social

How well do Multimodal LLMs consider visual information when creating plans to complete household activities? To answer this, we put a few multimodal LLMs on a pair of smart glasses and had participants try to solve cooking tasks while taking instructions from them.

February 23, 2025 at 10:07 PM
