Atlas Wang
@atlaswang.bsky.social
https://www.vita-group.space/ 👨🏫 UT Austin ML Professor (on leave)
https://www.xtxmarkets.com/ 🏦 XTX Markets Research Director (NYC AI Lab)
Superpower is trying everything 🪅
Newest focus: training next-generation super intelligence - Preview above 👶
(1/n) My favorite "optimizer" work of 2024:
📢 Introducing APOLLO! 🚀: SGD-like memory cost, yet AdamW-level performance (or better!).
❓ How much memory do we need for optimizer states in LLM training? 🧐
Almost zero.
📜 Paper: arxiv.org/abs/2412.05270
🔗 GitHub: github.com/zhuhanqing/A...
December 10, 2024 at 12:53 PM
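
As a rough illustration of the memory claim (not the official APOLLO code — this is my own minimal sketch; the class name ApolloLikeSGD, the rank/beta/eps hyperparameters, and the exact scaling rule are assumptions for illustration, the real algorithm is in the linked paper/repo): keep Adam-style moments only for a low-rank random projection of each gradient, and use them to derive a per-channel scaling applied to an otherwise plain SGD update.

import torch

class ApolloLikeSGD:
    # Hypothetical, simplified optimizer sketch. Instead of two full-size AdamW moment
    # tensors per weight matrix, it stores moments of shape (rows x rank) for a fixed
    # random projection of the gradient, then scales each channel of a raw-gradient update.
    def __init__(self, params, lr=1e-3, rank=4, betas=(0.9, 0.999), eps=1e-8):
        self.params = [p for p in params if p.requires_grad]
        self.lr, self.rank, self.betas, self.eps = lr, rank, betas, eps
        self.state = {}
        self.t = 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        b1, b2 = self.betas
        for p in self.params:
            if p.grad is None:
                continue
            g = p.grad
            if g.ndim < 2:
                # biases / 1-D params: plain SGD, no optimizer state at all
                p.add_(g, alpha=-self.lr)
                continue
            if p not in self.state:
                self.state[p] = {
                    # fixed random projection (cols x rank), tiny vs. full AdamW moments
                    "P": torch.randn(g.shape[1], self.rank, device=g.device) / self.rank ** 0.5,
                    "m": torch.zeros(g.shape[0], self.rank, device=g.device),
                    "v": torch.zeros(g.shape[0], self.rank, device=g.device),
                }
            st = self.state[p]
            r = g @ st["P"]                         # project gradient into the low-rank space
            st["m"].mul_(b1).add_(r, alpha=1 - b1)  # Adam-style moments, but only (rows x rank)
            st["v"].mul_(b2).addcmul_(r, r, value=1 - b2)
            m_hat = st["m"] / (1 - b1 ** self.t)
            v_hat = st["v"] / (1 - b2 ** self.t)
            # per-row (channel) factor: how much an Adam-like rule would rescale this channel
            scale = (m_hat / (v_hat.sqrt() + self.eps)).norm(dim=1) / (r.norm(dim=1) + self.eps)
            p.add_(g * scale.unsqueeze(1), alpha=-self.lr)  # channel-scaled SGD-style update

# usage sketch: opt = ApolloLikeSGD(model.parameters()); loss.backward(); opt.step()

The point of the sketch: for a weight matrix of shape (rows, cols), the stored state is roughly rows x rank instead of two full rows x cols moment tensors, which is where an SGD-like optimizer-state footprint with Adam-like per-channel scaling can come from.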
The gradient oscillates rapidly during training of the next-generation superintelligence model (preview version).
Wanna call it Edge of Stability?
November 21, 2024 at 7:31 PM