So our preprint, driven by @lukasbillera.bsky.social with assists from @hedwignordlinder.bsky.social, formalizes this and extends it a little, in directions that are trickier to reason about heuristically:
arxiv.org/abs/2511.16599
Since the dawn of time, people have been messing with (or dropping entirely) these pesky time-dependent loss scaling terms, mostly because the models train better without them.
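To make that concrete (my gloss of the standard DDPM-style setup from Ho et al. 2020, not notation from the preprint): the variational bound weights each timestep's noise-prediction error roughly as

L_t \propto \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t(1-\bar\alpha_t)}\,\mathbb{E}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big],

while the "simple" loss people actually train with just sets that time-dependent weight to 1 for every t.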
The manuscript should be up by tomorrow and I'll drop a link.