https://alexiglad.github.io/
Paper: arxiv.org/abs/2507.02092
HF Daily Paper Page: huggingface.co/papers/2507....
We’re just getting started with EBMs: we see them as a generalizing framework and anticipate a surge in their popularity!
Following this wisdom, we believe each step in a chain of thought should receive enough computation to avoid weak “links” that lead to bad reasoning, and EBTs enable exactly this kind of per-step compute allocation.
EBTs also learn better representations of images than diffusion models, achieving a ~10x higher ImageNet accuracy.
We think this performance improvement occurs because verification is often easier than generation and because EBTs can learn to express uncertainty in continuous spaces.
We find that EBTs can out-generalize the Transformer++ on out-of-distribution data by thinking longer, and that thinking also improves with scale.
EBMs have struggled to scale due to issues with stability and parallelization. Therefore, we design Transformers specifically to address these issues, which we call Energy-Based Transformers (EBTs).
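A minimal sketch of the idea, assuming a PyTorch-style setup (module names, sizes, and the pooling choice are illustrative, not the paper's architecture): a small Transformer reads the context together with a candidate prediction and outputs a single scalar energy.

```python
import torch
import torch.nn as nn

class ToyEnergyTransformer(nn.Module):
    """Hypothetical sketch: map (context, candidate prediction) to one scalar
    energy per example, where lower energy means more compatible."""
    def __init__(self, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.to_energy = nn.Linear(dim, 1)  # scalar compatibility score

    def forward(self, context: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
        # context: (batch, ctx_len, dim); candidate: (batch, 1, dim)
        hidden = self.encoder(torch.cat([context, candidate], dim=1))
        return self.to_energy(hidden[:, -1]).squeeze(-1)  # (batch,)
```

The point is only that the model scores compatibility instead of emitting a prediction directly; the prediction itself is then optimized against this score.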
EBMs learn to assign a scalar energy value denoting the compatibility of inputs.
Then, EBMs learn to optimize predictions to minimize this energy.
This allows EBMs to know when a problem is difficult (high energy) and to keep allocating computation until a good solution is found.
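As a rough illustration of that loop (a sketch under assumed names and hyperparameters, not the paper's training or inference code), “thinking” can be written as gradient descent on the prediction itself, stopping early when the energy is already low:

```python
import torch

def think(energy_fn, context, y_init, max_steps=32, step_size=0.1, good_enough=0.05):
    # Hypothetical sketch: refine a candidate prediction by descending its energy.
    # energy_fn(context, y) should return one scalar energy per example.
    y = y_init.clone().requires_grad_(True)
    for _ in range(max_steps):
        energy = energy_fn(context, y).mean()
        if energy.item() < good_enough:      # low energy = easy/solved: stop early
            break
        (grad,) = torch.autograd.grad(energy, y)
        with torch.no_grad():
            y = y - step_size * grad         # step toward lower energy
        y.requires_grad_(True)
    return y.detach()
```

Hard problems keep the energy high and therefore get more refinement steps; easy ones exit the loop almost immediately.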
It turns out that there’s an elegant solution:💡
- Learn to verify predictions
- Optimize predictions with respect to this verifier
This is exactly what Energy-Based Models (EBMs) are! EBMs enable thinking longer and self-verifying.
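Self-verification can then be sketched as scoring several candidate predictions with the same learned energy and keeping the lowest-energy one (a hypothetical helper, assuming a batch of one; not the paper's API):

```python
import torch

def self_verify(energy_fn, context, candidates):
    # candidates: a list of alternative predictions for one context.
    with torch.no_grad():
        energies = [energy_fn(context, c).item() for c in candidates]
    # Lower energy = judged more compatible, so keep the minimum.
    return candidates[energies.index(min(energies))]
```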
Current approaches rely on verifiable rewards, but humans are able to think about any problem.
To achieve such general thinking, we argue that models should learn to think directly from unsupervised learning.
TLDR:
- EBTs are the first model to out-scale the Transformer++ during pretraining across modalities and with respect to data, parameters, FLOPs, depth, etc.
- EBTs achieve a +29% improvement over the Transformer++ via thinking longer
- EBTs exhibit better generalization than existing models