Thaddäus Wiedemer
thwiedemer.bsky.social
Intern at Google DeepMind Toronto | PhD student in ML at Max Planck Institute Tübingen and University of Tübingen.
Intuitively, some tasks are easier to solve directly in the vision domain, and we also observe this in maze-solving tasks. This makes me super excited about a future where generalist vision and language models could be integrated for reasoning in the real world by 'imagining' possible outcomes.
September 25, 2025 at 5:02 PM
On the reasoning side, videos as 'chain-of-frames' parallel chain-of-thought in LLMs. Complex visual tasks that an image editing model like Nano Banana would have to solve in one go can be broken down into smaller steps.
September 25, 2025 at 5:02 PM
Specifically, Veo 3 can perceive (segment, localize, detect edges, ...), model (physics, abstract relations, memory), manipulate (edit images, simulate robotics), and reason about the visual world.

Video models might well become vision foundation models.
September 25, 2025 at 5:02 PM
Are we experiencing a 'GPT moment' in vision?

In our new preprint, we show that generative video models can solve a wide range of tasks across the entire vision stack without being explicitly trained for it.

🌐 video-zero-shot.github.io

1/n
September 25, 2025 at 5:02 PM