Fern
@fernbear.bsky.social
Neural network speedrunner and community-funded open source researcher. Set the CIFAR-10 record several times. Send me consulting/contracting work! she/they❤️
Having cross-entropy as a default is great, and is really nice for unlabeled data since it all tends to fall out pretty nicely w.r.t. the learning process, but it is inherently (and necessarily) a much more expensive way to learn a target distribution of values.
February 3, 2025 at 6:14 PM
Having (a good set of) crowdsourced values for a KL divergence would reduce this variance a bit, and also would give a better value to measure against, due to not being as noisy (in both bias _and_ variance -- a bit of a messy combo to deal with).
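A minimal sketch of the contrast, with made-up numbers (the soft target here is a hypothetical crowdsourced label distribution, not real data):

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-class example with a single batch element.
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0]])

# Default route: cross-entropy against a single hard label.
hard_label = torch.tensor([0])
ce_loss = F.cross_entropy(logits, hard_label)

# Alternative: KL divergence against a (made-up) crowdsourced soft-label
# distribution, which carries more information per example than a one-hot target.
soft_target = torch.tensor([[0.70, 0.20, 0.08, 0.02]])
kl_loss = F.kl_div(F.log_softmax(logits, dim=-1), soft_target, reduction="batchmean")

print(f"cross-entropy vs hard label: {ce_loss.item():.4f}")
print(f"KL divergence vs soft target: {kl_loss.item():.4f}")
```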
February 3, 2025 at 6:12 PM
When aiming for a 94% accuracy (~6% error rate), this means that roughly 9% of that remaining error budget is "bad" labels, from a cross-entropy perspective.

This is quite a lot! And partly what made testing speedrun results more difficult.
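Back-of-the-envelope version of that figure, using the ~0.54% label-error estimate from the paper in the post below:

```python
# Rough arithmetic behind the ~9% figure.
label_error_rate = 0.0054  # ~0.54% estimated bad labels in the CIFAR-10 test set
target_error_rate = 0.06   # 94% accuracy target -> ~6% error budget

print(f"~{label_error_rate / target_error_rate:.0%} of the error budget is label noise")
# -> ~9%
```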
February 3, 2025 at 6:11 PM
Variance can be a problem when testing models, which lengthens the iterative research cycle due to needing to run more experiments.

One paper that covered this, arxiv.org/abs/2103.14749, estimated the CIFAR-10 test-set label error rate to be about 0.54%.
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
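One rough way to put a number on that run-to-run noise (a sketch assuming simple binomial variation over CIFAR-10's 10,000 test images, ignoring any correlation between runs):

```python
# Binomial standard error of a test-accuracy measurement on CIFAR-10.
n_test = 10_000  # CIFAR-10 test set size
p = 0.94         # accuracy in the neighborhood of the 94% target

std_err = (p * (1 - p) / n_test) ** 0.5
print(f"standard error of measured accuracy: ~{std_err:.2%}")  # -> ~0.24%
```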
February 3, 2025 at 6:10 PM
i love science
November 30, 2024 at 10:33 PM
Okay, that is definitely way too aggressive. Hopefully it's not like that long-term -- my hope is that, with pushback against overzealous moderation, they change their stance on things like this. Should not be an auto-ban.
November 30, 2024 at 6:04 PM
Yeah that's why he checks it twice, gotta be something like an approximate Radix sort followed by an insertion sort, I'd guess. P efficient maybe?
November 28, 2024 at 4:59 PM
Having flexattention w/ DDP is really nice

Also the foreach norm bug is apparently a bother to a few people
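For the curious, a minimal sketch of what that pairing can look like (the toy module and sizes are made up and this is not the speedrun code itself; launch with torchrun on a multi-GPU box):

```python
# Toy FlexAttention module wrapped in DDP. Launch with:
#   torchrun --nproc_per_node=<ngpus> this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.attention.flex_attention import flex_attention, create_block_mask
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyAttention(torch.nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x, block_mask=None):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # flex_attention expects (batch, heads, seq, head_dim)
        q, k, v = (t.view(b, s, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        out = flex_attention(q, k, v, block_mask=block_mask)
        return self.proj(out.transpose(1, 2).reshape(b, s, d))

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(ToyAttention().cuda(), device_ids=[local_rank])
    causal = create_block_mask(lambda b, h, q, kv: kv <= q, B=None, H=None, Q_LEN=256, KV_LEN=256)

    x = torch.randn(2, 256, 64, device="cuda")
    model(x, block_mask=causal).sum().backward()  # DDP all-reduces gradients as usual
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```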
November 27, 2024 at 6:26 AM
story of my life
November 27, 2024 at 2:09 AM
(And one more thing -- if this was the other site, I'd place a note saying to come and follow me on here. But we're already here, woo! Congratulations us. ❤️

Feel free to drop me a message and say hi, I'd love to chat! ❤️👍)
November 25, 2024 at 2:07 AM
My time is funded by a combination of personal consulting/contracting work I take, as well as the financial support of others to enable this kind of open source work.
If you'd like to help sponsor my time, check it out at github.com/sponsors/tys... (or feel free to drop me a DM, would love to chat!)
Sponsor @tysam-code on GitHub Sponsors
Support open-source research and human-accessible neural network speedrunning benchmarks
November 25, 2024 at 2:05 AM
Finally, if you'd like to help see more open-source research work like this, consider sponsoring my time!
November 25, 2024 at 2:04 AM
Funding for my time, as well as Colab compute for this work, was generously provided by my supporters on Patreon (@jamorton_, @go2carter.bsky.social, @sroeker, @baberb.bsky.social, @chhillee.bsky.social, and @haitchai), along with the generous support of @algomancer.bsky.social.
November 25, 2024 at 2:03 AM
And this is not all of the research from the compute they provided -- I've got more in the pipeline to come! Keep an eye out. ;)
November 25, 2024 at 2:00 AM
First, my sincerest thanks to @leonardoai.bsky.social, with the help of @ethansmith2000.com, for generously providing H100s to support this research and enable this release. Y'all rock, thanks so much! <3
November 25, 2024 at 1:59 AM
Thanks to FlexAttention (thanks @chhillee.bsky.social and folks), this was very straightforward to implement via attention masking.

Great to be able to port some of that work to this speedrun and see it fly! <3 :)
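Roughly, that masking looks something like the sketch below (a hedged sketch: the window sizes and schedule are made up, and the real speedrun may structure the mask differently). The idea is a causal mask whose attention window grows over training while tensor shapes stay fixed, which keeps torch.compile happy.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def warmup_window_mask(window, seq_len, device="cuda"):
    """Causal mask restricted to a sliding window of size `window`.

    Growing `window` over training gives a sequence-length-style warmup while the
    tensor shapes (and thus the compiled kernels) stay fixed.
    """
    def mask_mod(b, h, q_idx, kv_idx):
        return (kv_idx <= q_idx) & (q_idx - kv_idx < window)
    return create_block_mask(mask_mod, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, device=device)

seq_len = 1024
q = k = v = torch.randn(1, 4, seq_len, 64, device="cuda")

# Made-up warmup schedule: widen the attention window from 64 tokens to the full context.
for window in (64, 256, 512, 1024):
    block_mask = warmup_window_mask(window, seq_len)
    out = flex_attention(q, k, v, block_mask=block_mask)
```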
November 25, 2024 at 1:58 AM
Fun fact! This is actually a spiritual port of the sequence_length warmup originally implemented in hlb_gpt earlier last year. However, this was extremely hard to do until now due to the nature of how torch.compile worked.
November 25, 2024 at 1:57 AM
Lowering the Adam betas 0.9 -> 0.8 to be a bit more nimble, shortening the momentum warmup to match, and increasing the number of cooldown steps for the network.
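In optimizer terms, roughly something like the sketch below. Everything beyond the quoted 0.9 -> 0.8 change (the lr, beta2, warmup/cooldown lengths, and schedule shapes) is an illustrative guess, not the record's actual settings, and plain Adam stands in for whichever Adam-family optimizer the run uses.

```python
import torch

model = torch.nn.Linear(256, 256)  # stand-in for the real network
opt = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.8, 0.95))  # beta1 lowered 0.9 -> 0.8

# 1750 steps per the current schedule; warmup/cooldown lengths here are guesses.
total_steps, warmup_steps, cooldown_steps = 1750, 100, 500

def beta1_at(step):
    # Short linear momentum warmup up to the target beta1 of 0.8 (shape/start are guesses).
    return 0.6 + (0.8 - 0.6) * min(1.0, step / warmup_steps)

def lr_scale_at(step):
    # Flat learning rate with a linear cooldown over the final `cooldown_steps` steps.
    return min(1.0, (total_steps - step) / cooldown_steps)

for step in range(total_steps):
    for group in opt.param_groups:
        group["betas"] = (beta1_at(step), group["betas"][1])
        group["lr"] = 3e-4 * lr_scale_at(step)
    # ... forward pass, loss.backward(), opt.step(), opt.zero_grad() ...
```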
November 25, 2024 at 1:57 AM
Some of the other changes are hyperparameter tweaks to accommodate the increasingly short learning schedules (1750 now vs 3000 two records ago!).
November 25, 2024 at 1:57 AM
...This means we need an extra step for traditional comparisons, but it also shortens the dev loop and biases us toward longer-context learning)
November 25, 2024 at 1:56 AM