PSGD: MSE( Q.T Q H , I ) = 5.2e-3
Zero-Power NS 100 iterations: MSE( NS(G) , I ) = 8.2e-1
True Inverse: MSE( H^(-1/2) H H^(-1/2), I ) = 6.1e-3
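PSGD whitens information significantly better than the Newton-Schulz iterations found in Muon: its error is on par with the true inverse, while 100 NS iterations stay about two orders of magnitude higher.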
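To make the metric above concrete, here is a minimal NumPy sketch of "MSE to the identity" on a toy problem. Everything in it is an assumption for illustration: the matrix `H` is a small, well-conditioned random SPD matrix, the helper names are hypothetical, and the Newton-Schulz step is the classic cubic iteration rather than Muon's tuned variant. It does not reproduce the numbers above, which come from the original experiment, and PSGD itself is not implemented here.

```python
import numpy as np

def mse_to_identity(M):
    """Mean squared error between a square matrix M and the identity."""
    n = M.shape[0]
    return float(np.mean((M - np.eye(n)) ** 2))

def inv_sqrt_spd(H):
    """H^(-1/2) for a symmetric positive definite H, via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V / np.sqrt(w)) @ V.T  # V diag(w^-1/2) V^T

def newton_schulz_cubic(G, iters):
    """Classic cubic Newton-Schulz iteration toward the orthogonal polar factor.

    Assumption: this is the textbook X <- 1.5 X - 0.5 X X^T X form, not the
    exact coefficients used in Muon.
    """
    X = G / np.linalg.norm(G)  # Frobenius scaling keeps singular values <= 1
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Toy SPD matrix standing in for H (hypothetical, well-conditioned).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
H = A @ A.T + 8.0 * np.eye(8)

P = inv_sqrt_spd(H)
err_true = mse_to_identity(P @ H @ P)                   # true-inverse baseline
err_ns = mse_to_identity(newton_schulz_cubic(H, 100))   # polar factor of SPD H is I

print(f"True inverse: {err_true:.2e}")
print(f"Newton-Schulz (cubic, 100 iters): {err_ns:.2e}")
```

For an SPD input the orthogonal polar factor is the identity, which is why comparing the Newton-Schulz output against I is meaningful in this toy setup. On noisy or ill-conditioned inputs the iteration can converge far more slowly.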
MARS is an exciting new variance reduction technique from @quanquangu.bsky.social's group that can help stabilize and accelerate your deep learning pipeline. All it needs is a gradient buffer. Here MARS speeds up the convergence of PSGD, ultimately leading to a better solution.
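The "gradient buffer" idea can be sketched roughly as follows. This is a hedged NumPy illustration of the approximate MARS correction as commonly described: the previous step's gradient is kept in a buffer, the difference to the current gradient is added with a scaling factor, and the result is norm-clipped before being handed to the base optimizer. The function name `mars_correct`, the default values of `beta1` and `gamma`, and the use of plain SGD as a stand-in for PSGD are all assumptions for illustration, not the exact setup used here.

```python
import numpy as np

def mars_correct(grad, prev_grad, beta1=0.95, gamma=0.025, max_norm=1.0):
    """Variance-reduced gradient from the current gradient and a buffered one.

    Hypothetical helper; beta1, gamma and max_norm are illustrative defaults.
    """
    # Add a scaled correction based on how the gradient changed since last step.
    c = grad + gamma * (beta1 / (1.0 - beta1)) * (grad - prev_grad)
    # Clip the corrected gradient to a maximum norm for stability.
    norm = np.linalg.norm(c)
    if norm > max_norm:
        c = c * (max_norm / norm)
    return c

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2, with plain SGD
# standing in for the base optimizer (assumption; PSGD is not shown).
rng = np.random.default_rng(0)
x = np.array([2.0, -1.5])
prev_grad = np.zeros_like(x)  # the gradient buffer
for _ in range(200):
    grad = x + 0.1 * rng.standard_normal(x.shape)  # noisy gradient
    c = mars_correct(grad, prev_grad)
    x = x - 0.1 * c            # base optimizer step on the corrected gradient
    prev_grad = grad           # update the buffer
```

The only extra state is `prev_grad`, one buffer the size of the parameters, which is what makes the technique cheap to bolt onto an existing optimizer.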