Adeel Razi
@adeelrazi.bsky.social
Computational Neuroscientist, NeuroAI, Causality. Monash, UCL, CIFAR. Lab: https://comp-neuro.github.io/
Congratulations and looking forward to seeing what you do there!
September 24, 2025 at 6:48 AM
This is for our collaboration with @wellcomeleap.bsky.social on Untangling Addiction.
wellcomeleap.org/ua/program/
Untangling Addiction Program Details | Wellcome Leap (wellcomeleap.org)
July 30, 2025 at 10:46 PM
That's really interesting and relevant, will read it closely and cite it in the related work. We currently cite this one for binary NNs: arxiv.org/abs/2002.10778
Training Binary Neural Networks using the Bayesian Learning Rule (arxiv.org)
May 27, 2025 at 10:40 AM
Re batch norm: it's effective in many settings, but can be brittle in others, like when used with small batch sizes, non-i.i.d. data, or models with stochasticity in the forward pass. In these cases, the running estimates of mean/variance can drift or misalign with test-time behaviour.
2/2
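A minimal PyTorch sketch of the mismatch, using dropout as a stand-in for stochasticity in the forward pass (illustrative only, not from the paper):

```python
# Illustrative only: BatchNorm's running stats are fit to the *stochastic* training-time
# activations, so they misalign with the deterministic eval-time distribution.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4096, 8)

drop = nn.Dropout(p=0.5)          # stand-in for a stochastic forward pass
bn = nn.BatchNorm1d(8)

drop.train(); bn.train()
for i in range(0, 4096, 32):
    _ = bn(drop(x[i:i + 32]))     # running mean/var estimated on noisy activations (var ~2)

drop.eval(); bn.eval()
out = bn(x)                       # at test time the noise is gone (var ~1) ...
print(out.var(dim=0))             # ... so normalised outputs have variance ~0.5, not ~1
```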
May 27, 2025 at 7:49 AM
Yes, absolutely, "noisy" was shorthand & it does depend on the surrogate. What I meant is that common surrogates can have high gradient variance, especially when their outputs saturate. That variance can hurt learning, particularly in deeper networks or those with binary/stochastic activations.
1/2
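A toy illustration of the variance point, using a fast-sigmoid surrogate with a made-up steepness (just a sketch, not any particular paper's setup):

```python
# Toy example: Heaviside forward pass, fast-sigmoid surrogate on the backward pass.
# When most units sit far from threshold (saturated), only a few samples carry gradient,
# so minibatch gradient estimates fluctuate from batch to batch.
import torch

class FastSigmoidSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, u):
        ctx.save_for_backward(u)
        return (u > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (u,) = ctx.saved_tensors
        return grad_out / (1.0 + 10.0 * u.abs()) ** 2   # surrogate slope (steepness 10, arbitrary)

torch.manual_seed(0)
w = torch.tensor(2.0, requires_grad=True)
grads = []
for _ in range(200):                                    # 200 independent minibatches
    x = torch.randn(256)
    loss = FastSigmoidSpike.apply(w * x - 1.0).mean()   # most units end up far from threshold
    w.grad = None
    loss.backward()
    grads.append(w.grad.item())

g = torch.tensor(grads)
print("gradient mean:", g.mean().item(), " std:", g.std().item())
```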
May 27, 2025 at 7:47 AM
of course, whenever you could!
May 26, 2025 at 7:43 AM
Paper: arxiv.org/abs/2505.17962
We’d love feedback, extensions, or critiques.
@neuralreckoning.bsky.social @fzenke.bsky.social @wellingmax.bsky.social
#NeuroAI
6/6
A Principled Bayesian Framework for Training Binary and Spiking Neural Networks (arxiv.org)
May 26, 2025 at 4:04 AM
Why does KL divergence show up everywhere in machine learning?
Because it's not a distance at all, it's the cost of believing your own model too much.
Minimizing KL = reducing surprise = optimizing variational free energy.
A silent principle behind robust inference.
5/6
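In symbols, the standard identity (generic notation, not the paper's):

```latex
% Variational free energy F[q] upper-bounds surprise, -log p(o):
\begin{aligned}
F[q] &= \mathbb{E}_{q(z)}\big[\log q(z) - \log p(o, z)\big] \\
     &= \underbrace{\mathrm{KL}\big(q(z)\,\|\,p(z \mid o)\big)}_{\ge 0} \;-\; \log p(o) \;\;\ge\;\; -\log p(o)
\end{aligned}
% Minimising the KL to the true posterior is the same as tightening a bound on surprise.
```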
May 26, 2025 at 4:04 AM
Our key innovation:
- A family of importance-weighted straight-through estimators (IW-ST), which unify and generalize previous methods.
- No need for backprop-through-noise tricks.
- No batch norm.
Just clean, effective training.
4/6
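For context, the plain straight-through estimator that these importance-weighted variants generalise looks roughly like this (a generic sketch, not the IW-ST from the paper):

```python
# Plain straight-through estimator (STE): binary forward pass, identity backward pass.
# The IW-ST family generalises this idea; this snippet is only the baseline.
import torch

class StraightThroughSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, u):
        return (u > 0).float()      # hard 0/1 output

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out             # pretend the threshold was the identity

x = torch.randn(5, requires_grad=True)
y = StraightThroughSign.apply(x)
y.sum().backward()
print(y)        # binary values
print(x.grad)   # all ones: gradient passed straight through
```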
May 26, 2025 at 4:04 AM
We view training as Bayesian inference, minimizing KL divergence between a posterior and an amortized prior.
This lets us derive a principled loss from first principles—grounded in variational free energy, not heuristics.
3/6
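In generic notation (the paper's exact posterior/prior parameterisation differs in its details), the objective is the familiar free-energy trade-off between data fit and divergence from the prior:

```latex
% Generic variational free energy over weights w and data D (not the paper's exact notation):
F[q] \;=\; \underbrace{\mathrm{KL}\big(q(w)\,\|\,p(w)\big)}_{\text{complexity}}
\;-\; \underbrace{\mathbb{E}_{q(w)}\big[\log p(\mathcal{D}\mid w)\big]}_{\text{accuracy}}
```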
May 26, 2025 at 4:04 AM
Binary/spiking neural networks are efficient and brain-inspired—but notoriously difficult to train.
Why? Discrete activations → non-differentiable.
Most current methods either approximate gradients or add noisy surrogates.
We do something different.
2/6
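A toy PyTorch illustration of the obstacle:

```python
# Toy: a hard sign nonlinearity has zero derivative almost everywhere,
# so plain backprop sends no learning signal through it.
import torch

u = torch.linspace(-2.0, 2.0, 8, requires_grad=True)
out = torch.sign(u)          # discrete (+/-1) activation, as in binary nets
out.sum().backward()
print(u.grad)                # tensor of zeros: the gradient dies at the threshold
```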
May 26, 2025 at 4:04 AM
If brains infer control by predicting their own actions,
should future AI do the same?
Instead of optimizing over actions,
let’s build agents that explain their sensations.
Intelligence may not be about control—but coherence.
#AgencyByInference
May 25, 2025 at 11:01 AM
Maybe intelligence isn’t about maximizing reward…
but minimizing surprise in a world we predictively model.
What if agency is not learned—but inferred?
May 25, 2025 at 11:01 AM