Alexi Gladstone
@alexiglad.bsky.social
PhD @ UIUC advised by Heng Ji. RS Intern @ Meta, previously @ Palantir, UVA. Working on SSL, world models, multimodal learning.

https://alexiglad.github.io/
[12/N] Website: energy-based-transformers.github.io
Paper: arxiv.org/abs/2507.02092
HF Daily Paper Page: huggingface.co/papers/2507....

We’re just getting started with EBMs. We see EBMs as a generalizing framework and anticipate a surge in their popularity!
Energy-Based Transformers: Outscaling Transformers and Generalizable Reasoning
July 7, 2025 at 8:33 PM
[11/N] ⛓️‍💥It’s common wisdom that “a chain is only as strong as its weakest link.”

Following this wisdom, we believe each step in a chain of thought should receive sufficient computation to avoid failure “links” that result in bad reasoning, and EBTs enable exactly this.
July 7, 2025 at 8:33 PM
[10/N] We also compare EBTs to diffusion models on relatively toy image denoising tasks, where we observe that EBTs outperform diffusion models while using 99% fewer forward passes.

EBTs also learn better representations of images than diffusion models, achieving a ~10x higher ImageNet accuracy.
July 7, 2025 at 8:33 PM
[9/N] The finding that EBTs outscale the Transformer++ also holds across modalities! We test this on video.📹

We think this performance improvement occurs because verification is often easier than generation and because EBTs can learn to express uncertainty in continuous spaces.
July 7, 2025 at 8:33 PM
[8/N] In line with these results, we also find that even with the same or worse pretraining performance, EBTs usually perform better on downstream tasks than the feed-forward Transformer++, further suggesting improved generalization.🎯
July 7, 2025 at 8:33 PM
[7/N] 🧠We can also investigate the thinking capabilities of EBTs compared to the Transformer++ by increasing the amount of compute at inference time.

We find that EBTs can out-generalize the Transformer++ on out-of-distribution data by thinking longer, and that thinking also improves with scale.
July 7, 2025 at 8:33 PM
[6/N] Of particular note is the data scaling, where we consistently observe EBTs being more data-efficient than the Transformer++ by > 30%. This is especially important because frontier labs are saying we are now data-constrained, which makes more data-efficient algorithms essential.
July 7, 2025 at 8:33 PM
[5/N] We compared autoregressive EBTs against the SOTA recipe (Transformer++) in language modeling. We observe that EBTs consistently scale at a higher rate than the Transformer++ with respect to data, batch size, depth, FLOPs, and parameters.📈
July 7, 2025 at 8:33 PM
[4/N] So if EBMs are so promising, why are they uncommon, and why haven’t they been used at scale?

EBMs have struggled to scale due to issues with stability and parallelization. Therefore, we design Transformer architectures specifically to solve these issues, which we call Energy-Based Transformers (EBTs).
July 7, 2025 at 8:33 PM
[3/N] So what are EBMs?💭

EBMs learn to assign a scalar energy value denoting the compatibility of inputs.

Then, EBMs learn to optimize predictions to minimize this energy.

This allows EBMs to know when a problem is difficult (high energy) and to allocate more compute until a good solution is found.
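
To make this concrete, here is a minimal PyTorch sketch of the idea (not the paper’s code; the energy network, step size, and number of refinement steps are illustrative placeholders):

```python
import torch

# Hypothetical energy network: maps (context, candidate prediction) -> scalar energy.
# Lower energy means the candidate is more compatible with the context.
energy_fn = torch.nn.Sequential(
    torch.nn.Linear(2 * 64, 128),
    torch.nn.SiLU(),
    torch.nn.Linear(128, 1),
)

def predict(context, num_steps=10, step_size=0.1):
    """Refine a candidate prediction by gradient descent on the learned energy."""
    y = torch.randn(context.shape[0], 64, requires_grad=True)  # start from a random guess
    for _ in range(num_steps):
        energy = energy_fn(torch.cat([context, y], dim=-1)).sum()
        (grad,) = torch.autograd.grad(energy, y)               # d(energy) / d(prediction)
        y = (y - step_size * grad).detach().requires_grad_()   # one "thinking" step
    final_energy = energy_fn(torch.cat([context, y], dim=-1))  # how good is the answer?
    return y.detach(), final_energy.detach()

context = torch.randn(4, 64)  # e.g. an embedding of everything seen so far
prediction, energy = predict(context)
```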
July 7, 2025 at 8:33 PM
[2/N] 🤔So how can models learn to think from unsupervised learning?

It turns out that there’s an elegant solution:💡
1. Learn to verify predictions
2. Optimize predictions with respect to this verifier

This is exactly what Energy-Based Models (EBMs) are! EBMs enable thinking longer and self-verifying.
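
In code, using the hypothetical predict() sketch above: the learned energy plays the role of the verifier, running more refinement steps is “thinking longer”, and comparing candidates by their final energy is the “self-verifying” part (again, just an illustrative sketch, not the paper’s implementation):

```python
def think(context, num_candidates=4, num_steps=32):
    """Spend more inference compute: refine several candidates, keep the best-verified one."""
    best, best_energy = None, float("inf")
    for _ in range(num_candidates):
        pred, energy = predict(context, num_steps=num_steps)  # more steps = longer thinking
        score = energy.sum().item()                           # lower energy = better verified
        if score < best_energy:
            best, best_energy = pred, score
    return best
```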
July 7, 2025 at 8:33 PM
[1/N] First, how can we generalize reasoning/System 2 Thinking to any problem/modality?🧐

Current approaches rely on verifiable rewards, but humans are able to think about any problem.

To achieve such general thinking, we argue that models should learn to think directly from unsupervised learning.
July 7, 2025 at 8:33 PM
[0/N]
TLDR:
- EBTs are the first models to outscale the Transformer++ during pretraining across modalities and with respect to data, parameters, FLOPs, depth, etc.
- EBTs achieve a +29% improvement over the Transformer++ via thinking longer
- EBTs exhibit better generalization than existing models
July 7, 2025 at 8:33 PM