I work on task learning for household robots.
He/Him.
http://mrinal.verghese.org
We came away with three important findings:
Socratic Models convert vision to text using pretrained models such as narration models and pass it to an off-the-shelf LLM.
Vision-Conditioned Language Models (VCLMs) encode vision with pretrained encoders and pass the embeddings to a fine-tuned LLM.
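The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not real model code: every function here (`narrate_frame`, `encode_frame`, `llm_generate`, `vclm_generate`) is a hypothetical stub standing in for a pretrained narration model, a vision encoder, and the two kinds of language models.

```python
# Hypothetical sketch contrasting the two multimodal pipelines.
# All model functions are stand-in stubs, not real APIs.

def narrate_frame(frame):
    """Stub for a pretrained narration model: vision -> text."""
    return "a person pours water into a kettle"

def encode_frame(frame):
    """Stub for a pretrained vision encoder: vision -> embedding."""
    return [0.0] * 512

def llm_generate(prompt):
    """Stub for an off-the-shelf, text-only LLM."""
    return f"answer based on: {prompt}"

def vclm_generate(embedding, question):
    """Stub for an LLM fine-tuned to accept visual embeddings."""
    return f"answer from {len(embedding)}-d visual features and: {question}"

def socratic_model(frame, question):
    """Socratic Models: vision -> text -> off-the-shelf LLM."""
    caption = narrate_frame(frame)
    prompt = f"Scene: {caption}\nQuestion: {question}"
    return llm_generate(prompt)

def vclm(frame, question):
    """VCLM: vision -> embedding -> fine-tuned LLM."""
    embedding = encode_frame(frame)
    return vclm_generate(embedding, question)
```

The key contrast: the Socratic pipeline bottlenecks vision through text, so any off-the-shelf LLM can be used, while the VCLM passes dense visual features directly but requires fine-tuning the LLM to consume them.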
This work, User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance, is being presented next weekend at #WACV2025.
Paper: www.arxiv.org/abs/2408.03160
Poster: Saturday March 1, Poster Session 2
Oral: Sunday March 2, Oral Session 5.4 Generative Models V 2:00 PM