Fern
@fernbear.bsky.social
Neural network speedrunner and community-funded open source researcher. Set the CIFAR-10 record several times. Send me consulting/contracting work! she/they❤️
Having cross-entropy as a default is great, and is really nice for unlabeled data since it all tends to fall out pretty nicely w.r.t. the learning process, but it is inherently (and necessarily) a much more expensive way to learn a target distribution of values.
February 3, 2025 at 6:14 PM
Having (a good set of) crowdsourced values for a KL divergence would reduce this variance a bit, and also would give a better value to measure against, due to not being as noisy (in both bias _and_ variance -- a bit of a messy combo to deal with).
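A minimal sketch of the contrast, with made-up numbers (the soft target here is a hypothetical crowdsourced label distribution, not real data):

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-class example with a single batch element.
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0]])

# Default route: cross-entropy against a single hard label.
hard_label = torch.tensor([0])
ce_loss = F.cross_entropy(logits, hard_label)

# Alternative: KL divergence against a (made-up) crowdsourced soft-label
# distribution, which carries more information per example than a one-hot target.
soft_target = torch.tensor([[0.70, 0.20, 0.08, 0.02]])
kl_loss = F.kl_div(F.log_softmax(logits, dim=-1), soft_target, reduction="batchmean")

print(f"cross-entropy vs hard label: {ce_loss.item():.4f}")
print(f"KL divergence vs soft target: {kl_loss.item():.4f}")
```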
February 3, 2025 at 6:12 PM
When aiming for a 94% accuracy (~6% error rate), this means that roughly 9% of that remaining error budget is "bad" labels, from a cross-entropy perspective.

This is quite a lot! And partly what made testing speedrun results more difficult.
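Back-of-the-envelope version of that figure, using the ~0.54% label-error estimate from the paper in the post below:

```python
# Rough arithmetic behind the ~9% figure.
label_error_rate = 0.0054  # ~0.54% estimated bad labels in the CIFAR-10 test set
target_error_rate = 0.06   # 94% accuracy target -> ~6% error budget

print(f"~{label_error_rate / target_error_rate:.0%} of the error budget is label noise")
# -> ~9%
```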
February 3, 2025 at 6:11 PM
Variance can be a problem when testing models, which lengthens the iterative research cycle due to needing to run more experiments.

One paper that covered this, arxiv.org/abs/2103.14749, estimated the CIFAR-10 test-set label error rate to be about 0.54%.
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
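One rough way to put a number on that run-to-run noise (a sketch assuming simple binomial variation over CIFAR-10's 10,000 test images, ignoring any correlation between runs):

```python
# Binomial standard error of a test-accuracy measurement on CIFAR-10.
n_test = 10_000  # CIFAR-10 test set size
p = 0.94         # accuracy in the neighborhood of the 94% target

std_err = (p * (1 - p) / n_test) ** 0.5
print(f"standard error of measured accuracy: ~{std_err:.2%}")  # -> ~0.24%
```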
February 3, 2025 at 6:10 PM
i love science
November 30, 2024 at 10:33 PM
Okay, that is definitely way too aggressive. Hopefully it's not like that long-term -- my hope is that, with pushback against overzealous moderation, they change their stance on things like this. Should not be an auto-ban.
November 30, 2024 at 6:04 PM
Yeah that's why he checks it twice, gotta be something like an approximate Radix sort followed by an insertion sort, I'd guess. P efficient maybe?
November 28, 2024 at 4:59 PM
Having flexattention w/ DDP is really nice

Also the foreach norm bug is apparently a bother to a few people
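For the curious, a minimal sketch of what that pairing can look like (the toy module and sizes are made up and this is not the speedrun code itself; launch with torchrun on a multi-GPU box):

```python
# Toy FlexAttention module wrapped in DDP. Launch with:
#   torchrun --nproc_per_node=<ngpus> this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.attention.flex_attention import flex_attention, create_block_mask
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyAttention(torch.nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x, block_mask=None):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # flex_attention expects (batch, heads, seq, head_dim)
        q, k, v = (t.view(b, s, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        out = flex_attention(q, k, v, block_mask=block_mask)
        return self.proj(out.transpose(1, 2).reshape(b, s, d))

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(ToyAttention().cuda(), device_ids=[local_rank])
    causal = create_block_mask(lambda b, h, q, kv: kv <= q, B=None, H=None, Q_LEN=256, KV_LEN=256)

    x = torch.randn(2, 256, 64, device="cuda")
    model(x, block_mask=causal).sum().backward()  # DDP all-reduces gradients as usual
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```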
November 27, 2024 at 6:26 AM
story of my life
November 27, 2024 at 2:09 AM
(And one more thing -- if this was the other site, I'd place a note saying to come and follow me on here. But we're already here, woo! Congratulations us. ❤️

Feel free to drop me a message and say hi, I'd love to chat! ❤️👍)
November 25, 2024 at 2:07 AM
My time is funded by a combination of personal consulting/contracting work I take, as well as the financial support of others to enable this kind of open source work.
If you'd like to help sponsor my time, check it out at github.com/sponsors/tys... (or feel free to drop me a DM, would love to chat!)
Sponsor @tysam-code on GitHub Sponsors
Support open-source research and human-accessible neural network speedrunning benchmarks
November 25, 2024 at 2:05 AM
Finally, if you'd like to help see more open-source research work like this, consider sponsoring my time!
November 25, 2024 at 2:04 AM
Funding for my time, as well as Colab compute for this work, was generously provided by my supporters on Patreon (@jamorton_, @go2carter.bsky.social, @sroeker, @baberb.bsky.social, @chhillee.bsky.social, and @haitchai), along with the generous support of @algomancer.bsky.social.
November 25, 2024 at 2:03 AM
And this is not all of the research from the compute they provided -- I've got more in the pipeline to come! Keep an eye out. ;)
November 25, 2024 at 2:00 AM
First, my sincerest thanks to @leonardoai.bsky.social, with the help of @ethansmith2000.com, for generously providing H100s to support this research and enable this release. Y'all rock, thanks so much! <3
November 25, 2024 at 1:59 AM
Thanks to FlexAttention (thanks @chhillee.bsky.social and folks), this was very straightforward to implement via attention masking.

Great to be able to port some of that work to this speedrun and see it fly! <3 :)
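Roughly, that masking looks something like the sketch below (a hedged sketch: the window sizes and schedule are made up, and the real speedrun may structure the mask differently). The idea is a causal mask whose attention window grows over training while tensor shapes stay fixed, which keeps torch.compile happy.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def warmup_window_mask(window, seq_len, device="cuda"):
    """Causal mask restricted to a sliding window of size `window`.

    Growing `window` over training gives a sequence-length-style warmup while the
    tensor shapes (and thus the compiled kernels) stay fixed.
    """
    def mask_mod(b, h, q_idx, kv_idx):
        return (kv_idx <= q_idx) & (q_idx - kv_idx < window)
    return create_block_mask(mask_mod, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, device=device)

seq_len = 1024
q = k = v = torch.randn(1, 4, seq_len, 64, device="cuda")

# Made-up warmup schedule: widen the attention window from 64 tokens to the full context.
for window in (64, 256, 512, 1024):
    block_mask = warmup_window_mask(window, seq_len)
    out = flex_attention(q, k, v, block_mask=block_mask)
```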
November 25, 2024 at 1:58 AM
Fun fact! This is actually a spiritual port of the sequence_length warmup originally implemented in hlb_gpt earlier last year. However, this was extremely hard to do until now due to the nature of how torch.compile worked.
November 25, 2024 at 1:57 AM
Lowering the Adam betas 0.9 -> 0.8 to be a bit more nimble, shortening the momentum warmup to match, and increasing the number of cooldown steps for the network.
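In optimizer terms, roughly something like the sketch below. Everything beyond the quoted 0.9 -> 0.8 change (the lr, beta2, warmup/cooldown lengths, and schedule shapes) is an illustrative guess, not the record's actual settings, and plain Adam stands in for whichever Adam-family optimizer the run uses.

```python
import torch

model = torch.nn.Linear(256, 256)  # stand-in for the real network
opt = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.8, 0.95))  # beta1 lowered 0.9 -> 0.8

# 1750 steps per the current schedule; warmup/cooldown lengths here are guesses.
total_steps, warmup_steps, cooldown_steps = 1750, 100, 500

def beta1_at(step):
    # Short linear momentum warmup up to the target beta1 of 0.8 (shape/start are guesses).
    return 0.6 + (0.8 - 0.6) * min(1.0, step / warmup_steps)

def lr_scale_at(step):
    # Flat learning rate with a linear cooldown over the final `cooldown_steps` steps.
    return min(1.0, (total_steps - step) / cooldown_steps)

for step in range(total_steps):
    for group in opt.param_groups:
        group["betas"] = (beta1_at(step), group["betas"][1])
        group["lr"] = 3e-4 * lr_scale_at(step)
    # ... forward pass, loss.backward(), opt.step(), opt.zero_grad() ...
```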
November 25, 2024 at 1:57 AM
Some of the other changes are hyperparameter tweaks to accommodate the increasingly short learning schedules (1750 now vs 3000 two records ago!).
November 25, 2024 at 1:57 AM
...This means we need an extra step for traditional comparisons, but it also shortens the dev loop and biases us toward longer-context learning)
November 25, 2024 at 1:56 AM