Mason Kamb
@masonkamb.bsky.social
I am curious whether you have ever tried compiling all of your disparate observations about the impacts of changing various hyperparameters in your models. Having followed your work for a bit, it seems like you have a wealth of knowledge about this that would be interesting to a lot of people.
July 5, 2025 at 6:32 AM
Also, see this explainer thread for more details:
bsky.app/profile/maso...
Excited to finally share this work w/ @suryaganguli.bsky.social. Tl;dr: we find the first closed-form analytical theory that replicates the outputs of the very simplest diffusion models, with median pixel-wise r^2 values of 90%+. arxiv.org/abs/2412.20292
June 30, 2025 at 4:20 PM
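For concreteness, here is a minimal NumPy sketch of one standard pixel-wise r^2 convention between theory predictions and trained-model samples; the function name and the per-sample median are illustrative assumptions, and the paper's exact definition may differ:

```python
import numpy as np

def median_pixelwise_r2(theory_imgs, model_imgs):
    """Median pixel-wise r^2 between theory predictions and trained-model
    samples generated from the same noise seeds.

    theory_imgs, model_imgs: arrays of shape (num_samples, num_pixels).
    Uses the standard 1 - SS_res/SS_tot convention, computed per sample.
    """
    r2 = []
    for y_hat, y in zip(theory_imgs, model_imgs):
        ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
        ss_tot = np.sum((y - y.mean()) ** 2)     # total variance of the sample
        r2.append(1.0 - ss_res / ss_tot)
    return np.median(r2)
```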
If you're interested, you can also:
- read our paper (now with faces!): arxiv.org/pdf/2412.202...
- use our code + weights:
github.com/Kambm/convol...
June 30, 2025 at 4:20 PM
Came for the political ripostes and stayed for the diffusion models
May 4, 2025 at 10:08 PM
Wow, thank you for this very charitable review! Happy to answer any questions/discussion points if you have them.

Code should be out soonish; I'm working to bring the repo into a fit state for public consumption (currently it's a bit spaghettified). A Colab isn't yet in the works, but perhaps it should be…
January 11, 2025 at 6:52 PM
*replicate for MNIST that is. Different datasets have different characteristics in this regard.
January 1, 2025 at 5:00 PM
Interesting question. On a patch level I don't have a specific answer. Formally, at the largest scales the answer is probably "all of them." On a whole-image level, I've found that you can approximately replicate the generated images you get from the whole dataset using only a few hundred examples.
January 1, 2025 at 4:53 PM
You're also never precisely at t=0 due to discretization, which mitigates the blowup issue as well.
January 1, 2025 at 4:50 PM
The NN-generated outputs will not obey this consistency condition because they don't blow up. In practice this doesn't affect the output a whole lot. The intuition is that if you have a lot of patches in the dataset, the aforementioned consistency condition becomes very mild.
January 1, 2025 at 4:50 PM
Good question. The effect of this explosion for the ELS machine ends up being that it enforces the consistency condition in Theorem 4.1 (each pixel should match the center pixel of the L2-nearest patch). The intuition here is that these are the only points where the score fails to explode.
January 1, 2025 at 4:50 PM
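To make the condition concrete, here is a minimal NumPy sketch of checking it, assuming flattened patches and a tolerance; this is an illustration of the statement in Theorem 4.1, not the paper's implementation:

```python
import numpy as np

def consistency_violations(image_pixels, image_patches, train_patches, tol=1e-6):
    """Flag pixels that violate the condition above: each generated pixel
    should equal the center pixel of its L2-nearest training patch.

    image_pixels:  generated pixels, shape (M,)
    image_patches: flattened patch around each generated pixel, shape (M, P)
    train_patches: flattened patches from the training set, shape (N, P)
    """
    center = train_patches.shape[1] // 2
    # squared L2 distance from every image patch to every training patch
    d = ((image_patches[:, None, :] - train_patches[None, :, :]) ** 2).sum(-1)
    nearest = d.argmin(axis=1)                   # index of nearest patch
    return np.abs(image_pixels - train_patches[nearest, center]) > tol
```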
We’re excited to push the envelope of deep learning theory to encompass minimal examples of realistic diffusion models in this paper. We hope that this work will lay a foundation for detailed investigations into more sophisticated models, including those with self-attention.
December 31, 2024 at 4:00 PM
The images from the attention-enabled model bear a strong qualitative resemblance to the ELS machine's outputs, but exhibit *just enough* nonlocal coordination to be semantically meaningful.
December 31, 2024 at 4:00 PM
Our theory is tailored to models that have strong locality biases, such as CNNs. However, we find that our theory (bottom rows) is still moderately predictive for a simple diffusion model *with* self-attention layers (top rows), which explicitly break equivariance/locality.
December 31, 2024 at 4:00 PM
Diffusion models are notorious for getting the wrong number of fingers, legs, etc. Our theory recapitulates this behavior and provides, for the first time, a clear mechanistic explanation of these failures as a consequence of excessive locality.
December 31, 2024 at 4:00 PM
This simple model of diffusion-model creativity is remarkably predictive: we find that, after calibrating a single time-dependent hyperparameter (the locality scale), we can replicate the behavior of trained fully-convolutional diffusion models on a case-by-case basis.
December 31, 2024 at 4:00 PM
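A hedged sketch of what this calibration could look like: for each diffusion time t, pick the locality scale that minimizes the mismatch with the trained network's output. The helpers `theory_output` and `model_output` are hypothetical placeholders, not functions from the released code:

```python
import numpy as np

def calibrate_locality_scale(ts, candidate_scales, theory_output, model_output):
    """For each diffusion time t, choose the locality scale whose theory
    prediction best matches the trained network's output.

    theory_output(t, scale) -> theory-predicted image at time t  (placeholder)
    model_output(t)         -> trained CNN's output at time t    (placeholder)
    """
    calibrated = {}
    for t in ts:
        target = model_output(t)
        errors = [np.mean((theory_output(t, s) - target) ** 2)
                  for s in candidate_scales]
        calibrated[t] = candidate_scales[int(np.argmin(errors))]
    return calibrated  # one calibrated scale per time step
```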
Under optimal *equivariant+local* denoising, each pixel can be drawn towards *any* training patch from *anywhere* in the training set, rather than only patches drawn from the same pixel location. We call this model the Equivariant Local Score (ELS) Machine.
December 31, 2024 at 4:00 PM
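A minimal NumPy sketch of an ELS-style estimate for one pixel, assuming flattened patches and a Gaussian likelihood at noise level sigma_t; the names and interface are illustrative, not the paper's code:

```python
import numpy as np

def els_pixel_estimate(noisy_patch, train_patches, sigma_t):
    """ELS-style denoised estimate for a single pixel.

    Locality + translation equivariance mean the candidate set is every
    patch cropped from every location of every training image, not just
    patches taken from this pixel's own location.

    noisy_patch:   receptive field around the pixel, shape (P,)
    train_patches: patches from all locations/images, shape (N, P)
    sigma_t:       noise level at diffusion time t
    """
    center = train_patches.shape[1] // 2
    sq_dists = np.sum((train_patches - noisy_patch) ** 2, axis=1)
    log_w = -sq_dists / (2.0 * sigma_t ** 2)     # Gaussian log-likelihoods
    w = np.exp(log_w - log_w.max())              # stabilized softmax weights
    w /= w.sum()
    return float(np.sum(w * train_patches[:, center]))
```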
Under optimal *local* denoising, each *pixel* forms an independent Bayesian estimate for the probability of each training example, based on the information visible in the receptive field, rather than the entire image.
December 31, 2024 at 4:00 PM
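A minimal NumPy sketch of this per-pixel Bayesian weighting, assuming flattened receptive-field patches; the function name and Gaussian-likelihood form are my own illustration rather than the released implementation:

```python
import numpy as np

def local_patch_posterior(noisy_patch, train_patches, sigma_t):
    """Per-pixel Bayesian posterior over training patches, computed only
    from the information inside the pixel's receptive field.

    noisy_patch:   what this pixel can see, flattened to shape (P,)
    train_patches: candidate training patches of the same shape, (N, P)
    sigma_t:       noise level at diffusion time t
    """
    sq_dists = np.sum((train_patches - noisy_patch) ** 2, axis=1)
    log_w = -sq_dists / (2.0 * sigma_t ** 2)     # Gaussian log-likelihoods
    log_w -= log_w.max()                         # numerical stability
    w = np.exp(log_w)
    return w / w.sum()                           # posterior weights
```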