Michael Tschannen
@mtschannen.bsky.social
Research Scientist @GoogleDeepMind. Representation learning for multimodal understanding and generation.

mitscha.github.io
HF model collection for transformers:
huggingface.co/collections/...

HF model collection for OpenCLIP and timm:
huggingface.co/collections/...

And of course big_vision checkpoints:
github.com/google-resea...
SigLIP2 - a google Collection
huggingface.co
February 22, 2025 at 3:34 PM
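A minimal zero-shot classification sketch with the transformers checkpoints (the checkpoint id below is one entry from the collection and the image path is a placeholder; other checkpoints should work the same way):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # one checkpoint from the collection
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("cat.jpg")  # placeholder path, any RGB image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, num_texts)

# SigLIP scores each image-text pair independently with a sigmoid,
# so the per-text probabilities need not sum to 1.
probs = torch.sigmoid(logits)
```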
Paper:
arxiv.org/abs/2502.14786

HF blog post from @arig23498.bsky.social et al. with a gentle intro to the training recipe and a demo:
huggingface.co/blog/siglip2

Thread with results overview from Xiaohua (only on X, sorry - these are all in the paper):
x.com/XiaohuaZhai/...
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training obje...
arxiv.org
February 22, 2025 at 3:34 PM
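For reference, a sketch of the pairwise sigmoid loss from the original SigLIP, which SigLIP 2 builds on (following the paper's pseudocode; variable names are mine):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (N, D), L2-normalized; t, b: learned scalars."""
    logits = t * img_emb @ txt_emb.T + b  # (N, N) scores for all pairs
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1
    # Each image-text pair is an independent binary classification;
    # sum over all N^2 pairs, averaged over the batch size.
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]
```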
It’s not, good catch.
December 3, 2024 at 9:51 PM
Very nice! I knew of some soft-token TTS papers, but none so far using AR + normalizing flows. Thanks for sharing!
December 3, 2024 at 9:44 AM
The noise curriculum guides the (image generation) learning process to first learn high-level, global structure and later low-level structure/texture. Maximum likelihood “tends to focus” mostly on the latter.
December 3, 2024 at 8:02 AM
To our knowledge, JetFormer is the first model capable of generating high-fidelity images and producing strong log-likelihood bounds.

So far we have explored a simple setup (image/text pairs, no post-training), and we hope JetFormer inspires more (visual) tokenizer-free models!

7/
December 2, 2024 at 4:41 PM
Finally, why get rid of visual tokenizers/VQ-VAEs?
- They can induce information loss (e.g. small text)
- Removing specialized components was a key driver of recent progress (bitter lesson)
- Raw likelihoods are comparable across models (for hill climbing, scaling laws)

6/
December 2, 2024 at 4:41 PM
Importantly, this is simple additive Gaussian noise on the training images (i.e., a data augmentation). JetFormer neither depends on it (or its parameters) nor is trained for denoising like diffusion models.

5/
December 2, 2024 at 4:41 PM
Learning to generate high-fidelity images with maximum likelihood is tricky. To bias the model towards nicer-looking images, we introduce a noise curriculum: Gaussian noise is added to the input image and annealed to 0 during training, so that high-level structure is learned first.

4/
December 2, 2024 at 4:41 PM
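A sketch of what such a curriculum can look like as a pure data augmentation (the schedule shape and noise scale here are illustrative, not the paper's exact values):

```python
import torch

def noisy_images(images, step, total_steps, max_sigma=1.0):
    """images: (B, C, H, W); returns the augmented batch for this step."""
    # Linearly anneal the noise scale to 0 over training; the model is
    # simply trained for maximum likelihood on the noisy image, never
    # for denoising.
    sigma = max_sigma * max(0.0, 1.0 - step / total_steps)
    return images + sigma * torch.randn_like(images)
```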
Conceptually, the normalizing flow serves both as an image encoder for perception tasks and as an image decoder for image generation at inference time.

We train JetFormer to maximize the likelihood of the multimodal data, without auxiliary losses (perceptual or similar).

3/
December 2, 2024 at 4:41 PM
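A toy sketch of the both-directions idea (conceptual only, not the JetFormer architecture): one set of coupling-layer parameters encodes x -> z and decodes z -> x exactly, and the log-det term makes the image likelihood exact rather than a bound:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Maps the first half of the channels to (scale, shift) for the second.
        self.net = nn.Sequential(nn.Linear(dim // 2, dim), nn.GELU(),
                                 nn.Linear(dim, dim))

    def forward(self, x):
        # Encode: x -> z, also returning log|det J| for the exact likelihood.
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=-1), s.sum(dim=-1)

    def inverse(self, z):
        # Decode: z -> x, exactly, with the same parameters.
        z1, z2 = z.chunk(2, dim=-1)
        s, t = self.net(z1).chunk(2, dim=-1)
        return torch.cat([z1, (z2 - t) * torch.exp(-s)], dim=-1)

# log p(x) = log p_prior(z) + log|det J|, where the prior over soft tokens
# is the autoregressive transformer; no auxiliary losses needed.
```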
We leverage a normalizing flow (“jet”) to obtain a soft-token image representation that is end-to-end trained with a multimodal transformer for next-token prediction. The soft token distribution is modeled with a GMM à la GIVT.

arxiv.org/abs/2312.02116

2/
GIVT: Generative Infinite-Vocabulary Transformers
We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose t...
arxiv.org
December 2, 2024 at 4:41 PM
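A sketch of what a GIVT-style GMM head can look like (names and sizes are illustrative): the transformer predicts mixture parameters for the next real-valued soft token instead of a softmax over a finite vocabulary:

```python
import torch
import torch.nn as nn

class GMMHead(nn.Module):
    def __init__(self, d_model, token_dim, n_mix=16):
        super().__init__()
        self.n_mix, self.token_dim = n_mix, token_dim
        # Per position: mixture logits + per-component means and log-scales.
        self.proj = nn.Linear(d_model, n_mix * (1 + 2 * token_dim))

    def dist(self, h):
        p = self.proj(h)
        logits = p[..., :self.n_mix]
        mu, log_s = p[..., self.n_mix:].chunk(2, dim=-1)
        shape = h.shape[:-1] + (self.n_mix, self.token_dim)
        comp = torch.distributions.Independent(
            torch.distributions.Normal(mu.reshape(shape),
                                       log_s.reshape(shape).exp()), 1)
        return torch.distributions.MixtureSameFamily(
            torch.distributions.Categorical(logits=logits), comp)

# Training maximizes dist(h).log_prob(next_soft_token); generation draws
# the next soft token with dist(h).sample().
```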
Thank you!
November 25, 2024 at 6:25 PM
🙋‍♂️
November 24, 2024 at 8:14 PM