Adeel Razi
adeelrazi.bsky.social
Computational Neuroscientist, NeuroAI, Causality. Monash, UCL, CIFAR. Lab: https://comp-neuro.github.io/
Congratulations and looking forward to seeing what you do there!
September 24, 2025 at 6:48 AM
That's really interesting and relevant, will read closely and cite it in the related work. We currently cite this one for binary NNs: arxiv.org/abs/2002.10778
Training Binary Neural Networks using the Bayesian Learning Rule
Neural networks with binary weights are computation-efficient and hardware-friendly, but their training is challenging because it involves a discrete optimization problem. Surprisingly, ignoring the d...
arxiv.org
May 27, 2025 at 10:40 AM
Re batchnorm: it's effective in many settings, but can be brittle in others, like when used with small batch sizes, non-i.i.d. data, or models with stochasticity in the forward pass. In these cases, the running estimates of mean/variance can drift or misalign with test-time behaviour.

2/2
May 27, 2025 at 7:49 AM
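A tiny numerical sketch of the small-batch failure mode described above (framework-agnostic, not tied to any particular batch-norm implementation): a batch-norm layer tracks an exponential moving average of per-batch means, and with tiny batches each per-batch mean is itself very noisy, so the running estimate used at test time can sit far from the true statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "activation" population with true mean 2.0 and std 1.0.
population = rng.normal(loc=2.0, scale=1.0, size=100_000)

def running_mean(batch_size, momentum=0.1, steps=200):
    """EMA of per-batch means, as a batch-norm layer would maintain it."""
    m = 0.0
    for _ in range(steps):
        batch = rng.choice(population, size=batch_size)
        m = (1 - momentum) * m + momentum * batch.mean()
    return m

# With batch_size=2 the per-batch means have high variance, so the EMA
# wanders; with batch_size=256 it tracks the true mean (2.0) tightly.
print(running_mean(batch_size=2), running_mean(batch_size=256))
```

The `momentum`, step count, and population here are illustrative choices, not values from any particular library.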
Yes, absolutely, "noisy" was shorthand & it does depend on the surrogate. What I meant is that common surrogates can have high gradient variance, especially when their outputs saturate. That variance can hurt learning, particularly in deeper networks or those with binary/stochastic activations.
1/2
May 27, 2025 at 7:47 AM
Of course, whenever you can!
May 26, 2025 at 7:43 AM
Why does KL divergence show up everywhere in machine learning?

Because it's not just a distance, it's the cost of believing your own model too much.

Minimizing KL = reducing surprise = optimizing variational free energy.

A silent principle behind robust inference.

5/6
May 26, 2025 at 4:04 AM
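The "cost of believing your own model" in the post above can be made concrete with toy Bernoulli distributions: KL(q ∥ p) is the extra surprise (in nats) paid for modelling data from q with p, i.e. cross-entropy minus the irreducible entropy. The decomposition is general; the Bernoulli case is just the smallest example.

```python
import numpy as np

def kl_bernoulli(q, p):
    """KL(q || p) for Bernoulli(q) vs Bernoulli(p), in nats."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def cross_entropy(q, p):
    """Expected surprise under model p when data come from q."""
    return -(q * np.log(p) + (1 - q) * np.log(1 - p))

def entropy(q):
    """Irreducible surprise: cross-entropy of q with itself."""
    return cross_entropy(q, q)

q, p = 0.9, 0.5
# KL = cross-entropy - entropy: the excess surprise from trusting p.
assert np.isclose(kl_bernoulli(q, p), cross_entropy(q, p) - entropy(q))
print(kl_bernoulli(q, p))
```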
Our key innovation:

- A family of importance-weighted straight-through estimators (IW-ST), which unify and generalize previous methods.
- No need for backprop-through-noise tricks.
- No batch norm.

Just clean, effective training.

4/6
May 26, 2025 at 4:04 AM
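For readers unfamiliar with the baseline being generalized: below is the textbook straight-through (ST) estimator, not the importance-weighted IW-ST family from the paper itself. Forward, the activation is a hard threshold; backward, the threshold is treated as the identity, optionally clipped where |x| is large.

```python
import numpy as np

def binarize_forward(x):
    """Forward pass: hard threshold to {-1, +1} (non-differentiable)."""
    return np.where(x >= 0, 1.0, -1.0)

def binarize_backward(x, grad_out, clip=1.0):
    """Backward pass: pretend the threshold was the identity, zeroing
    the gradient where |x| exceeds `clip` (the clipped ST variant)."""
    pass_through = (np.abs(x) <= clip).astype(x.dtype)
    return grad_out * pass_through

x = np.array([-2.0, -0.3, 0.4, 1.7])
y = binarize_forward(x)                     # [-1., -1.,  1.,  1.]
g = binarize_backward(x, np.ones_like(x))   # [ 0.,  1.,  1.,  0.]
```

The `clip` value of 1.0 is a common convention, not a parameter of the IW-ST method.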
We view training as Bayesian inference, minimizing KL divergence between a posterior and an amortized prior.

This lets us derive a principled loss from first principles—grounded in variational free energy, not heuristics.

3/6
May 26, 2025 at 4:04 AM
Binary/spiking neural networks are efficient and brain-inspired—but notoriously difficult to train.

Why? Discrete activations → non-differentiable.

Most current methods either approximate gradients or add noisy surrogates.

We do something different.

2/6
May 26, 2025 at 4:04 AM
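A minimal sketch of the "noisy surrogate" approach mentioned above, assuming a sigmoid-derivative surrogate for a hard spiking threshold (one common choice, not the only one): the forward pass uses a non-differentiable step, while backprop substitutes a smooth stand-in. The steepness `k` is a free hyperparameter; large `k` saturates quickly, which is one source of the gradient variance discussed earlier in the thread.

```python
import numpy as np

def heaviside(v):
    """Hard spiking nonlinearity: spike (1) if v >= 0, else no spike (0)."""
    return (v >= 0).astype(float)

def surrogate_grad(v, k=5.0):
    """Derivative of a steep sigmoid, used in place of the step's
    (almost-everywhere-zero) true derivative during backprop."""
    s = 1.0 / (1.0 + np.exp(-k * v))
    return k * s * (1.0 - s)

v = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
spikes = heaviside(v)       # hard, non-differentiable forward pass
grads = surrogate_grad(v)   # smooth stand-in used in the backward pass
# Far from threshold the surrogate saturates toward zero gradient.
```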
If brains infer control by predicting their own actions,
should future AI do the same?

Instead of optimizing over actions,
let’s build agents that explain their sensations.

Intelligence may not be about control—but coherence.

#AgencyByInference
May 25, 2025 at 11:01 AM
Maybe intelligence isn’t about maximizing reward…
but minimizing surprise in a world we predictively model.

What if agency is not learned—but inferred?
May 25, 2025 at 11:01 AM