jimrandomh.bsky.social
@jimrandomh.bsky.social
The last time I checked in, the most promising technique we had was Sparse Autoencoders (www.lesswrong.com/tag/sparse-a...). This is very much on the "kinda-sorta working" side, not the actually-working side.
Sparse Autoencoders (SAEs) - LessWrong
Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sum of interpretable components (often referred to as features). Sparse Autoencoders...
www.lesswrong.com
January 21, 2025 at 1:55 AM
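For concreteness, a minimal sketch of the SAE idea from the linked tag page: activations are encoded into an overcomplete dictionary of sparse, non-negative features and then reconstructed as a sum of feature directions. This assumes PyTorch; the dimensions and the L1 coefficient are illustrative stand-ins, not taken from the post.

```python
# Minimal Sparse Autoencoder sketch over model activations (assumed PyTorch;
# d_model, d_dict, and the L1 coefficient are illustrative choices).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Encoder maps activations into an overcomplete feature dictionary.
        self.encoder = nn.Linear(d_model, d_dict)
        # Decoder reconstructs the activation as a sum of feature directions.
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse, non-negative codes
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_model=512, d_dict=4096)
acts = torch.randn(64, 512)  # stand-in for real network activations
recon, feats = sae(acts)
# Reconstruction loss plus an L1 penalty that pushes most features to zero,
# which is what makes the learned components sparse (and hopefully interpretable).
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().sum(dim=-1).mean()
loss.backward()
```

The L1 term is the whole trick: without it, the overcomplete dictionary would just memorize, and the learned directions would not decompose into interpretable features.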
In theory, if we had neural-net interpretability that fully worked, as opposed to kinda-sorta working, this would resolve many of the hard parts of AI alignment, and it would then be safe to go ahead and build God.
January 21, 2025 at 1:55 AM
You can convert a neural network to a smaller neural network (or a program), but not losslessly. This is a pretty active area of research within Mechanistic Interpretability, because ideally the simplified network will be more amenable to reverse-engineering.
January 21, 2025 at 1:55 AM
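One standard instance of such a lossy conversion is knowledge distillation, where a smaller "student" network is trained to match the outputs of the original. The post doesn't name distillation specifically, so this is a sketch of one such technique, assuming PyTorch; the architectures and temperature are illustrative stand-ins.

```python
# Minimal knowledge-distillation sketch: compress a network into a smaller one
# by matching its output distribution (assumed PyTorch; sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 10))

x = torch.randn(256, 128)  # stand-in for training inputs
with torch.no_grad():
    teacher_logits = teacher(x)  # the behavior we are trying to preserve

T = 2.0  # softening temperature
# KL divergence between softened output distributions. Whatever gap remains
# after training is exactly the "not losslessly" part of the conversion.
loss = F.kl_div(
    F.log_softmax(student(x) / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
loss.backward()
```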
That's not an American you were talking to, that's a Belgian. Or possibly a Russian troll pretending to be a Belgian; it's hard to tell, but a keyword search of his posts for "Ukraine" turns up results that are not inconsistent with that hypothesis.
December 16, 2024 at 6:55 AM