Shoubin Yu
@shoubin.bsky.social
Ph.D. Student at UNC CS. Interested in multimodal video understanding & generation.
https://yui010206.github.io/
We further find that VEGGIE shows emergent zero-shot multimodal instruction following and in-context video editing ability, which may facilitate a broader range of future applications.
March 19, 2025 at 6:56 PM
We project the grounded queries into 2D space with PCA & t-SNE. We find that Reasoning and Grounding cluster together, while Color, Env, and Change are closely grouped. Addition aligns with Reasoning and Grounding, suggesting that addition involves semantic processes, while Removal is a more independent task.
March 19, 2025 at 6:56 PM
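A minimal sketch of this kind of 2D projection, assuming the grounded task-query embeddings have been dumped to disk with per-skill labels; the file names, label handling, and t-SNE settings below are illustrative, not the paper's exact setup:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical dumps: (N, D) grounded query embeddings and (N,) skill labels.
queries = np.load("grounded_queries.npy")
labels = np.load("query_skills.npy")

# Reduce the query embeddings to 2D with PCA and with t-SNE.
pca_2d = PCA(n_components=2).fit_transform(queries)
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(queries)

# Scatter each skill cluster in both projections.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, pts, title in [(axes[0], pca_2d, "PCA"), (axes[1], tsne_2d, "t-SNE")]:
    for skill in np.unique(labels):
        mask = labels == skill
        ax.scatter(pts[mask, 0], pts[mask, 1], s=8, label=str(skill))
    ax.set_title(title)
axes[1].legend(fontsize=7)
plt.tight_layout()
plt.show()
```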
We evaluate 7 different models on VEG-Bench across 8 distinct editing skills. Overall, VEGGIE demonstrates the best performance among instructional video editing models.
March 19, 2025 at 6:56 PM
To further support our training, we also introduce a novel automatic instructional video data generation pipeline that lifts high-quality instructional image editing data into the video domain using image-to-video and video evaluation tools.
March 19, 2025 at 6:56 PM
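A rough sketch of the lifting idea, with hypothetical helpers image_to_video and score_video_quality standing in for the image-to-video generation and video evaluation tools; the names, structure, and threshold are assumptions for illustration, not the paper's exact pipeline:

```python
def lift_image_edits_to_video(image_edit_pairs, quality_threshold=0.8):
    """Turn (source image, instruction, edited image) triplets into video training pairs."""
    video_samples = []
    for src_img, instruction, edited_img in image_edit_pairs:
        # Animate both stills with an image-to-video model (hypothetical helper).
        src_clip = image_to_video(src_img)
        edited_clip = image_to_video(edited_img)
        # Keep only samples whose clips pass automatic video-quality evaluation.
        if min(score_video_quality(src_clip),
               score_video_quality(edited_clip)) >= quality_threshold:
            video_samples.append((src_clip, instruction, edited_clip))
    return video_samples
```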
VEGGIE first leverages an MLLM to interpret complex instructions and generate frame-wise conditions, then applies a video diffusion model to reflect these conditions in pixel space. These continuous, learnable task query embeddings enable end-to-end training & capture task representations.
March 19, 2025 at 6:56 PM
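A minimal PyTorch sketch of this two-stage design; the module interfaces, dimensions, and names (mllm, diffusion, task_queries) are assumptions for illustration, not the released VEGGIE implementation:

```python
import torch
import torch.nn as nn

class InstructionalVideoEditor(nn.Module):
    def __init__(self, mllm: nn.Module, diffusion: nn.Module,
                 num_queries: int = 8, dim: int = 1024):
        super().__init__()
        self.mllm = mllm              # interprets the instruction and input frames
        self.diffusion = diffusion    # video diffusion model conditioned per frame
        # Continuous, learnable task query embeddings, trained end to end.
        self.task_queries = nn.Parameter(torch.randn(num_queries, dim))

    def forward(self, video_frames, instruction_tokens):
        # The MLLM attends over the instruction, the frames, and the task
        # queries, and returns one condition embedding per frame: (T, dim).
        frame_conditions = self.mllm(instruction_tokens, video_frames,
                                     self.task_queries)
        # The diffusion model produces the edited video in pixel space,
        # guided by the frame-wise conditions.
        return self.diffusion(video_frames, frame_conditions)
```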
Introducing VEGGIE 🥦—a unified, end-to-end, and versatile instructional video generative model.
VEGGIE supports 8 skills, from object addition/removal/changing and stylization to concept grounding/reasoning. It exceeds SoTA and shows zero-shot multimodal instruction following & in-context video editing.
March 19, 2025 at 6:56 PM