AOL preconditioning (fused + re-tuned) -> 1 NS iteration saved
Better convergence: singular values end up closer to 1
Kernel tweak removes an extra memory load
Together that's a ~1.6x speedup, and ~3x vs plain torch. 🚀
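To make the summary concrete, here is a minimal PyTorch sketch of the idea (illustrative function names and a random test matrix; it uses the untuned Taylor quintic coefficients, not the re-tuned ones or the fused kernel described in this thread). It applies AOL's diagonal rescaling as a preconditioner, then runs Newton-Schulz on both the preconditioned and the plainly normalized matrix, printing how far the singular values are from 1 after each step:

```python
import torch

def aol_precondition(X: torch.Tensor) -> torch.Tensor:
    """AOL rescaling (Prach & Lampert): X <- X @ diag(d) with
    d_j = (sum_k |X^T X|_{jk})^(-1/2).  Guarantees spectral norm <= 1."""
    A = X.T @ X
    d = A.abs().sum(dim=1).clamp_min(1e-12).rsqrt()
    return X * d                      # broadcasting scales column j by d[j]

def ns_step(X: torch.Tensor) -> torch.Tensor:
    """One quintic Newton-Schulz step with the untuned Taylor coefficients
    (15/8, -10/8, 3/8); the thread re-tunes these with a GA instead."""
    A = X @ X.T
    return 1.875 * X - 1.25 * (A @ X) + 0.375 * (A @ (A @ X))

def max_sv_err(X: torch.Tensor) -> float:
    """Worst-case distance of the singular values from 1 (0 = orthogonal)."""
    return (torch.linalg.svdvals(X) - 1).abs().max().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    G = torch.randn(256, 1024)        # stand-in for a gradient matrix (rows <= cols)

    plain = G / G.norm()              # baseline: Frobenius normalization (usual plain-NS setup)
    pre = aol_precondition(G)         # AOL already bounds sigma_max <= 1, so no extra rescaling

    for step in range(1, 9):
        plain, pre = ns_step(plain), ns_step(pre)
        print(f"step {step}:  plain max|s-1| = {max_sv_err(plain):.4f}"
              f"   AOL max|s-1| = {max_sv_err(pre):.4f}")
```

On random matrices like this one, the preconditioned run typically reaches a given error level a couple of steps earlier, which is the effect the fused + re-tuned version builds on.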
Wrote a small kernel to optimize memory bandwidth usage -> more free speed.
Problem 1: AOL's rescaling is an extra matrix operation on top of NS.
Fix: fuse AOL's operation with an existing NS step -> essentially free.
Problem 2: NS isn't tuned for "almost orthogonal" inputs.
Fix: re-tune the NS coefficients with a genetic algorithm that is aware of the preconditioning.
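For the second fix, here is a toy sketch of what preconditioning-aware re-tuning could look like (an illustration, not the actual GA setup or search space from this thread): evolve the three quintic NS coefficients against synthetic matrices whose singular values already sit in the kind of band a preconditioner like AOL produces, scoring candidates by how close the singular values get to 1 after a small, fixed number of steps.

```python
import torch

def synthetic_preconditioned(m: int, n: int, s_min: float = 0.15) -> torch.Tensor:
    """Random matrix whose singular values mimic an AOL-preconditioned input:
    spread over [s_min, 1] rather than over many orders of magnitude."""
    U, _ = torch.linalg.qr(torch.randn(m, m))
    V, _ = torch.linalg.qr(torch.randn(n, n))
    r = min(m, n)
    s = torch.empty(r).uniform_(s_min, 1.0)
    return (U[:, :r] * s) @ V[:r, :]

def ns(X: torch.Tensor, coeffs: torch.Tensor, steps: int) -> torch.Tensor:
    """Quintic NS iteration X <- a*X + b*(X X^T)X + c*(X X^T)^2 X."""
    a, b, c = coeffs.tolist()
    for _ in range(steps):
        A = X @ X.T
        X = a * X + b * (A @ X) + c * (A @ (A @ X))
    return X

def fitness(coeffs: torch.Tensor, batch: list, steps: int = 3) -> float:
    """Lower is better: worst |sigma - 1| over the batch after `steps` NS iterations."""
    return max((torch.linalg.svdvals(ns(X, coeffs, steps)) - 1).abs().max().item()
               for X in batch)

if __name__ == "__main__":
    torch.manual_seed(0)
    batch = [synthetic_preconditioned(64, 128) for _ in range(4)]

    # (mu + lambda) evolution with Gaussian mutation, seeded at the untuned Taylor quintic.
    pop = [torch.tensor([1.875, -1.25, 0.375])]
    for gen in range(31):
        children = [p + 0.05 * torch.randn(3) for p in pop for _ in range(8)]
        pop = sorted(pop + children, key=lambda c: fitness(c, batch))[:4]
        if gen % 10 == 0:
            best = pop[0]
            print(f"gen {gen:2d}: coeffs = {[round(v, 3) for v in best.tolist()]}"
                  f"  err = {fitness(best, batch):.4f}")
```

A fuller version might evaluate on actual AOL-preconditioned gradients and also search over the number of iterations, but the selection loop stays the same.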
Enter Bernd Prach's Almost Orthogonal Layer (AOL).
It gives a cheap way to make a matrix "almost orthogonal."
Not great for full orthogonalization, but much better than a plain norm rescaling -> perfect as a preconditioner for NS.
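To see "much better than rescaling" concretely, a tiny sketch (test matrix is illustrative) comparing the singular-value spread after plain Frobenius-norm rescaling with the spread after AOL's diagonal rescaling W -> W @ diag(d), d_j = (sum_k |W^T W|_{jk})^(-1/2):

```python
import torch

def sv_range(X: torch.Tensor) -> str:
    s = torch.linalg.svdvals(X)
    return f"sigma in [{s.min().item():.3f}, {s.max().item():.3f}]"

torch.manual_seed(0)
W = torch.randn(256, 1024)                      # stand-in for a gradient matrix

# Naive rescaling: divide by the Frobenius norm so that sigma_max <= 1.
frob = W / W.norm()

# AOL rescaling: W @ diag(d), d_j = (sum_k |W^T W|_{jk})^(-1/2).
# Same guarantee sigma_max <= 1, but the singular values end up much closer to 1.
d = (W.T @ W).abs().sum(dim=1).clamp_min(1e-12).rsqrt()
aol = W * d

print("Frobenius rescaled:", sv_range(frob))    # wide spread, everything pushed far below 1
print("AOL rescaled:      ", sv_range(aol))     # "almost orthogonal": noticeably closer to 1
```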
How? By preconditioning the input matrix.
This makes the NS iteration converge faster without losing precision.
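Some intuition for why that works: if X = U diag(s) V^T, an NS step of the form a*X + b*(X X^T)X + c*(X X^T)^2 X maps every singular value s through the same scalar polynomial p(s) = a*s + b*s^3 + c*s^5, leaving U and V untouched. So a value that already sits near 1 is done in a step or two, while a tiny one can only grow by a factor of at most ~1.875 per step (with the untuned coefficients). A purely illustrative scalar-level sketch:

```python
# One NS step maps each singular value s through the same scalar polynomial
# p(s) = 1.875*s - 1.25*s**3 + 0.375*s**5 (untuned quintic), so count how many
# steps a value needs to land within 1 +/- 0.01, depending on where it starts.
def steps_to_converge(s: float, tol: float = 0.01) -> int:
    steps = 0
    while abs(s - 1.0) > tol:
        s = 1.875 * s - 1.25 * s**3 + 0.375 * s**5
        steps += 1
    return steps

for s0 in (0.01, 0.05, 0.2, 0.5, 0.9):
    print(f"sigma_0 = {s0:<4}: {steps_to_converge(s0):2d} steps")
```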
And here’s how I squeezed out the extra gain
I'll open a PR into the Dion repo when it's ready!
If your model converges, you have robust, well-conditioned weights and gradients.
The model will likely be more resistant to input noise.
- Larger vector sizes improve activation utilization.
- Lower-precision floating-point math itself adds beneficial non-linearities.