Thaddäus Wiedemer
@thwiedemer.bsky.social
Intern at Google DeepMind Toronto | PhD student in ML at the Max Planck Institute Tübingen and the University of Tübingen.
I'm truly honored to have worked on this at Google DeepMind with my amazing collaborators!

With 2 months left in my internship, I'm excited about our next steps in this direction!
September 25, 2025 at 5:02 PM
And as with other 'zero-shot' works, it's clear that Veo has been exposed to samples of many of our tasks in the training data. The promise lies in its ability to be quickly adapted to general tasks with just a prompt, no fine-tuning required!
September 25, 2025 at 5:02 PM
Of course, performance is not perfect yet and lags behind SotA. Video models are also expensive to train and run, so they won't replace all vision models just yet. But the rapid progress from Veo 2 to Veo 3 illustrates their potential to become vision foundation models.
September 25, 2025 at 5:02 PM
Intuitively, some tasks are easier to solve directly in the vision domain, and we observe this in maze-solving tasks. This makes me super excited about a future where generalist vision and language models could be integrated for reasoning in the real world by 'imagining' possible outcomes.
September 25, 2025 at 5:02 PM
On the reasoning side, videos as 'chain-of-frames' parallel chain-of-thought in LLMs. Complex visual tasks that an image editing model like Nano Banana would have to solve in one go can be broken down into smaller steps.
September 25, 2025 at 5:02 PM
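A minimal sketch of how such a 'chain-of-frames' readout could look in code, assuming a hypothetical generate_video(prompt, first_frame) helper rather than any real Veo API: the model is prompted once, its intermediate frames play the role of chain-of-thought steps, and the final frame is read as the answer.

```python
# Hypothetical chain-of-frames readout: one prompt, the generated frames act as
# intermediate reasoning steps, and the last frame carries the result.
# `generate_video` is an assumed placeholder, not a real Veo/Google API call.
from PIL import Image


def generate_video(prompt: str, first_frame: Image.Image) -> list[Image.Image]:
    """Placeholder for a text+image-conditioned video model; returns frames."""
    raise NotImplementedError("plug in your video-generation backend here")


def solve_visually(task_prompt: str, input_image: Image.Image) -> Image.Image:
    frames = generate_video(prompt=task_prompt, first_frame=input_image)
    # Frames 0..N-1 are the model's visible 'working'; the last frame is the answer.
    return frames[-1]


# Example (maze solving): ask for a step-by-step trace instead of a one-shot edit.
# answer = solve_visually(
#     "Trace a red line from the maze entrance to the exit, one segment per frame.",
#     Image.open("maze.png"),
# )
```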
Specifically, Veo 3 can perceive (segment, localize, detect edges, ...), model (physics, abstract relations, memory), manipulate (edit images, simulate robotics), and reason about the visual world.

Video models might well become vision foundation models.
September 25, 2025 at 5:02 PM
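To make the 'perceive' bucket concrete, here is a hedged sketch of zero-shot edge detection and segmentation by prompting alone; generate_video is again a hypothetical stand-in for a text+image-conditioned video model, not the actual Veo interface.

```python
# Hedged sketch: zero-shot perception by prompting a video model.
# `generate_video` is an assumed placeholder, not the real Veo interface.
from PIL import Image


def generate_video(prompt: str, first_frame: Image.Image) -> list[Image.Image]:
    """Placeholder for a text+image-conditioned video model; returns frames."""
    raise NotImplementedError("plug in your video-generation backend here")


def zero_shot_edges(image: Image.Image) -> Image.Image:
    """Ask the model to render an edge map; no task-specific fine-tuning."""
    frames = generate_video(
        prompt="Redraw this image as a black-and-white edge map, keeping the camera static.",
        first_frame=image,
    )
    return frames[-1]  # read the result off the final generated frame


def zero_shot_segmentation(image: Image.Image, target: str) -> Image.Image:
    """Same recipe, different instruction: highlight one object class."""
    frames = generate_video(
        prompt=f"Overlay a solid green mask on every {target} in the scene; change nothing else.",
        first_frame=image,
    )
    return frames[-1]
```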