Shoubin Yu

@shoubin.bsky.social

52 followers 450 following 10 posts

Ph.D. student at UNC CS. Interested in multimodal video understanding and generation.

https://yui010206.github.io/

Shoubin Yu

@shoubin.bsky.social

Thanks to all of my amazing co-authors from Adobe Research, UMich, and UNC:

Difan Liu (co-lead), @marstin.bsky.social (co-lead)
Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, @mohitbansal.bsky.social

Check out more details on our homepage/paper.
Website: veggie-gen.github.io

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

March 19, 2025 at 6:56 PM

Shoubin Yu

@shoubin.bsky.social

We further find that VEGGIE shows emergent zero-shot multimodal instruction-following and in-context video editing abilities, which may facilitate a broader range of future applications.

March 19, 2025 at 6:56 PM

Shoubin Yu

@shoubin.bsky.social

We project the grounded queries into 2D space with PCA and t-SNE. We find that Reasoning and Grounding cluster together, while Color, Env, and Change are closely grouped. Addition aligns with Reasoning and Grounding, suggesting that addition involves semantic processes, while Removal is a more independent task.

March 19, 2025 at 6:56 PM
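
For readers unfamiliar with this kind of analysis, the following is a minimal sketch of the PCA half of such a projection, written in pure Python. It is not the authors' code: the function names `pca_2d` and `toy_queries`, the 4-dimensional synthetic vectors, and the two cluster labels are all assumptions made for illustration, and the t-SNE step mentioned in the post is omitted.

```python
import math
import random


def pca_2d(vectors, iters=200, seed=0):
    """Project row vectors onto their top two principal components.

    Hypothetical helper: uses power iteration with deflation on the
    sample covariance matrix instead of a full eigendecomposition.
    """
    n, d = len(vectors), len(vectors[0])
    # Center every dimension at zero mean.
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - means[j] for j in range(d)] for v in vectors]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        w = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            # Power iteration step: multiply by the covariance matrix.
            w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
            # Deflation: remove directions that were already extracted.
            for c in components:
                dot = sum(x * y for x, y in zip(w, c))
                w = [x - dot * y for x, y in zip(w, c)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break  # data has no remaining variance in this subspace
            w = [x / norm for x in w]
        components.append(w)
    # Coordinates of each centered vector along the two components.
    return [[sum(x * y for x, y in zip(r, c)) for c in components]
            for r in centered]


def toy_queries(seed=1):
    """Synthetic stand-ins for task query embeddings (NOT real VEGGIE data)."""
    rng = random.Random(seed)
    centers = {
        "grounding": [2.0, 2.0, 0.0, 0.0],
        "removal": [-2.0, -2.0, 0.0, 0.0],
    }
    labels, vectors = [], []
    for name, center in centers.items():
        for _ in range(5):
            labels.append(name)
            vectors.append([x + rng.gauss(0, 0.1) for x in center])
    return labels, vectors


labels, vectors = toy_queries()
points = pca_2d(vectors)
for label, (x, y) in zip(labels, points):
    print(f"{label:>10s}  {x:+.2f}  {y:+.2f}")
```

In this toy setup the two synthetic groups land on opposite sides of the first component, which is the kind of separation a 2D scatter plot of real query embeddings would be inspected for. The sign of each component is arbitrary, so only relative positions are meaningful.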

Shoubin Yu

@shoubin.bsky.social

We evaluate 7 different models on VEG-Bench across 8 distinct editing skills. Overall, VEGGIE demonstrates the best performance among instructional video editing models.

March 19, 2025 at 6:56 PM

Shoubin Yu

@shoubin.bsky.social

To further support our training, we also introduce a novel automatic instructional video data generation pipeline that lifts high-quality instructional image editing data into the video domain using image-to-video and video evaluation tools.

March 19, 2025 at 6:56 PM

Shoubin Yu

@shoubin.bsky.social

VEGGIE first uses an MLLM to interpret complex instructions and generate frame-wise conditions; a video diffusion model then renders these conditions in pixel space. The continuous, learnable task query embeddings enable end-to-end training and capture task representations.

March 19, 2025 at 6:56 PM

Shoubin Yu

@shoubin.bsky.social

Existing video editing methods fall short of the goal of a simple, versatile video editor, requiring multiple models, complex pipelines, or extra caption/layout/human guidance. We introduce VEGGIE, which formulates diverse editing tasks as end-to-end grounded generation in pixel space.

March 19, 2025 at 6:56 PM
