Fern
fernbear.bsky.social
Neural network speedrunner and community-funded open source researcher. Set the CIFAR-10 record several times. Send me consulting/contracting work! she/they❤️
This is a classic example of _why_ choose-one-of-n datasets need large-scale, crowd-sourced label statistics, and why training should use the KL divergence against that distribution instead of cross-entropy against a single label.

Reviewers are more biased than a crowd; a single reviewer's label is a high-variance, high-bias estimator, and that can harm research.
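A minimal sketch of the point (the class names and vote shares are hypothetical, invented for illustration):

```python
import numpy as np

def kl(p, q):
    # KL(p || q): extra loss incurred by predicting q when the target
    # label distribution is p; zero iff the model matches the target.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

# Hypothetical crowd-sourced label statistics for one ambiguous image:
crowd = [0.55, 0.40, 0.05]   # e.g. "cat" / "dog" / "deer" vote shares
one_hot = [1.0, 0.0, 0.0]    # what a single reviewer's choice collapses to

model = [0.55, 0.40, 0.05]   # a model that matches the crowd exactly
print(kl(crowd, model))      # 0.0: the crowd target is achievable
print(kl(one_hot, model))    # ~0.598: penalized despite agreeing with the crowd
```

With a one-hot target, KL and cross-entropy coincide; the gain comes from having a crowd-sourced distribution to target at all.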
February 3, 2025 at 6:03 PM
Reposted by Fern
Did you know that attention across the whole input span was inspired by the time-negating alien language in Arrival? Crazy anecdote from the latest Hard Fork podcast (by @kevinroose.com and @caseynewton.bsky.social). HT nwbrownboi on Threads for the lead.
December 1, 2024 at 2:50 PM
Reposted by Fern
it's crazy to me that RoPE's issue with BF16 wasn't noticed earlier.
For a reasonable N of 2048, these are the computed frequencies prior to cos(x) & sin(x), fp32 above and bf16 below.
Given how short the period of simple trig functions is, this difference is catastrophic for large values.
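A rough NumPy reconstruction of the problem (assumed setup, not the original plot's code: bf16 is emulated by rounding float32 to its 7 mantissa bits, and the frequency base 10000 is the usual RoPE default):

```python
import numpy as np

def to_bf16(x):
    # Emulate bfloat16: keep sign, 8 exponent bits, 7 mantissa bits
    # (round-to-nearest by adding half an ulp, then truncating the low 16 bits).
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bits = (bits + np.uint32(0x8000)) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

pos = np.arange(2048, dtype=np.float32)                        # N = 2048
inv_freq = (10000.0 ** (-np.arange(0, 64, 2) / 64)).astype(np.float32)
angles_fp32 = np.outer(pos, inv_freq)                          # inputs to cos/sin
angles_bf16 = to_bf16(np.outer(to_bf16(pos), to_bf16(inv_freq)))

# In bf16 the gap between representable values near 2048 is 8, so the
# largest angles land several radians off -- and the period is only 2*pi.
print(np.abs(angles_fp32 - angles_bf16).max())
```

Once the angle error is a sizeable fraction of the 2π period, the resulting cos/sin values are essentially unrelated to the fp32 ones.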
November 28, 2024 at 12:09 PM
Reposted by Fern
Just added FSDP2 support for MARS and Muon!
November 25, 2024 at 10:39 PM
Thanks for 100 followers, y'all! Happened so fast and can't wait to put out more research on here! 😊❤️
November 25, 2024 at 7:16 PM
New NanoGPT training speed record: 3.28 FineWeb val loss in 4.66 minutes

Previous record: 5.03 minutes
Changelog:
- FlexAttention blocksize warmup
- hyperparameter tweaks
November 25, 2024 at 1:53 AM
Reposted by Fern
NATTEN just added fused support for self-cross attention!
So you can attend to a local neighbourhood plus registers or a text condition.
It also lets you reduce partial attention results (e.g. the logsumexp provided by xformers APIs) into its LSE.
github.com/SHI-Labs/NAT...
Support for fused cross-NA by alihassanijr · Pull Request #182 · SHI-Labs/NATTEN
Adds experimental support for additional context tokens to Fused NA. Any number of partial attention results can be reduced into a final one as if their contexts were merged, which is just the same...
github.com
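The LSE reduction the PR describes can be sketched in plain NumPy (a toy re-derivation of the math with assumed shapes, not NATTEN's actual API):

```python
import numpy as np

def attention_with_lse(q, k, v):
    # Naive softmax attention that also returns the per-query
    # log-sum-exp (LSE) of the attention logits.
    logits = q @ k.T / np.sqrt(q.shape[-1])
    lse = np.logaddexp.reduce(logits, axis=-1)
    return np.exp(logits - lse[:, None]) @ v, lse

def merge_partial(o1, lse1, o2, lse2):
    # Combine two partial attention results as if their key/value
    # contexts had been concatenated, using only the outputs and LSEs.
    lse = np.logaddexp(lse1, lse2)
    return np.exp(lse1 - lse)[:, None] * o1 + np.exp(lse2 - lse)[:, None] * o2, lse

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal(s) for s in [(4, 8), (16, 8), (16, 8)])

o_full, lse_full = attention_with_lse(q, k, v)       # one big context
o1, l1 = attention_with_lse(q, k[:10], v[:10])       # e.g. local neighbourhood
o2, l2 = attention_with_lse(q, k[10:], v[10:])       # e.g. registers / text tokens
o_merged, lse_merged = merge_partial(o1, l1, o2, l2)

assert np.allclose(o_full, o_merged) and np.allclose(lse_full, lse_merged)
```

This is the same identity that lets fused kernels process key/value blocks independently and stitch the results together afterwards.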
November 19, 2024 at 4:05 PM
Reposted by Fern
❤️ my MNIST socks
November 24, 2024 at 1:27 AM
Reposted by Fern
Radon Transform (RT) was formulated in 1917 but remained useless in practice until CT scanners were invented in the 60s

But RT isn't just for CTs. It's a sort of generalization of marginals in probability

RT g(p,θ): shoot rays at angle θ+90° with offset p, and measure the line integral of f(x,y) along each ray

1/n
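The marginals connection is easiest to see at axis-aligned angles; here's a small NumPy sketch with an assumed Gaussian density (my example, not from the thread):

```python
import numpy as np

# f(x, y): a 2D Gaussian density on a grid, centred at (0, 1).
xs = np.linspace(-3, 3, 61)
X, Y = np.meshgrid(xs, xs, indexing="ij")
f = np.exp(-(X**2 + (Y - 1) ** 2) / 2)
f /= f.sum()

# Radon projections at theta = 0 and theta = 90 degrees: each ray's
# line integral collapses one axis, which is exactly a marginal.
g_theta0 = f.sum(axis=1)    # rays along y, offset p = x  ->  marginal of x
g_theta90 = f.sum(axis=0)   # rays along x, offset p = y  ->  marginal of y

print(xs[np.argmax(g_theta0)], xs[np.argmax(g_theta90)])  # peaks near x=0, y=1
```

Every projection g(·,θ) carries the same total mass, the same way all marginals of one joint density normalize identically.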
November 24, 2024 at 12:33 AM
Reposted by Fern
Here, have PSGD-Kron and SOAP with FSDP2 support. Please go wild with it; let's see something finally replace Adam.
github.com/ethansmith20...
November 23, 2024 at 4:02 PM
Reposted by Fern
probably the best in-depth explanation i've seen on FSDP at the most granular levels, props to the authors
dev-discuss.pytorch.org/t/fsdp-cudac...
November 23, 2024 at 5:02 AM