Karen Ullrich (s/h) ✈️ COLM
@karen-ullrich.bsky.social
Research scientist at FAIR NY ❤️ LLMs + Information Theory. Previously, PhD at UoAmsterdam, intern at DeepMind + MSRC.
www.arxiv.org
July 8, 2025 at 1:49 PM
Plus, we generate importance maps showing where in the transformer the concept is encoded — providing interpretable insights into model internals.
July 8, 2025 at 1:49 PM
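Not from the paper, just a rough illustration of what such a map could look like: given per-head concept scores from a SAMD-style search (explained further down this thread), an importance map can be rendered as a layers-by-heads heatmap. All names and shapes here are assumptions for the sketch.

```python
# Sketch: render per-head concept scores as a layers-by-heads heatmap,
# an "importance map" in the spirit of the thread (names are illustrative).
import numpy as np
import matplotlib.pyplot as plt

def plot_importance_map(scores, num_layers, num_heads):
    """scores: dict {(layer, head): float} from a SAMD-style search."""
    grid = np.zeros((num_layers, num_heads))
    for (layer, head), s in scores.items():
        grid[layer, head] = s
    plt.imshow(grid, aspect="auto", cmap="viridis")
    plt.xlabel("attention head")
    plt.ylabel("layer")
    plt.colorbar(label="concept score")
    plt.show()
```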
SAMI: Diminishes or amplifies these modules to control the concept's influence

With SAMI, we can scale the importance of these modules — either amplifying or suppressing specific concepts.
July 8, 2025 at 1:49 PM
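A minimal sketch of what a SAMI-style intervention could look like in PyTorch, assuming the relevant heads have already been identified. This is my illustration, not the paper's code, and the module path in the usage comment is hypothetical.

```python
# Sketch of a SAMI-style intervention (illustrative, not the paper's code):
# multiply the output of previously identified attention heads by a scalar s.
# s > 1 amplifies the concept's influence, 0 <= s < 1 suppresses it.
import torch

def scale_heads_hook(selected_heads, s, num_heads):
    """Forward hook for an attention module whose output has shape
    (batch, seq, num_heads * head_dim). Real attention modules often
    return tuples, so adapt the unpacking accordingly."""
    def hook(module, inputs, output):
        batch, seq, d_model = output.shape
        head_dim = d_model // num_heads
        out = output.view(batch, seq, num_heads, head_dim).clone()
        for h in selected_heads:
            out[:, :, h, :] *= s
        return out.view(batch, seq, d_model)
    return hook

# Hypothetical usage on one layer of a GPT-2-style model:
# model.transformer.h[layer].attn.register_forward_hook(
#     scale_heads_hook({3, 7}, s=0.0, num_heads=12))
```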
SAMD: Finds the attention heads most correlated with a concept

Using SAMD, we find that only a few attention heads are crucial for a wide range of concepts—confirming the sparse, modular nature of knowledge in transformers.
July 8, 2025 at 1:49 PM
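To make the idea concrete, here is a minimal sketch of a SAMD-style search, assuming you already have a concept vector and cached per-head outputs. The scoring used here (mean cosine similarity, top-k selection) is an illustration, not necessarily the paper's exact criterion.

```python
# Sketch of a SAMD-style search (illustrative): score every attention head by
# how well its output aligns with a concept vector, keep the top-k heads.
import torch

def find_concept_heads(head_outputs, concept_vec, top_k=5):
    """head_outputs: dict {(layer, head): tensor (num_prompts, d_model)}
    concept_vec: tensor (d_model,) representing the concept."""
    concept = concept_vec / concept_vec.norm()
    scores = {}
    for (layer, head), out in head_outputs.items():
        out = out / out.norm(dim=-1, keepdim=True)
        scores[(layer, head)] = (out @ concept).mean().item()
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```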
next one on the list is Yury Polyanskiy's "Information Theory: From Coding to Learning", which will hopefully hit the shelves in February... cannot wait
November 28, 2024 at 3:49 PM
🫠
November 20, 2024 at 8:57 PM
Me
November 18, 2024 at 6:10 PM
Me
November 18, 2024 at 11:21 AM
What do you think: do we need to sharpen our understanding of tokenization? Or will we soon be rid of it by developing models such as "MegaByte" by Yu et al.?
And add more papers to the thread!
October 30, 2024 at 6:29 PM
Phan et al. found a method to mitigate some of the tokenization problems Karpathy mentioned by projecting tokens into byte space. The key to their method is to develop a map between statistically equivalent token-level and byte-level models.
October 30, 2024 at 6:29 PM
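Very roughly, and not Phan et al.'s actual construction: a token-level next-token distribution can be pushed into byte space by summing the probability of every token whose byte encoding starts with a given byte. Toy sketch with made-up names:

```python
# Toy sketch of the token-to-byte projection idea (illustrative only): the
# probability of the next byte is the total probability of all tokens whose
# byte encoding begins with that byte. Partially consumed tokens are ignored.
from collections import defaultdict

def next_byte_distribution(token_probs, vocab):
    """token_probs: dict {token_id: prob} from a token-level LM.
    vocab: dict {token_id: bytes} mapping tokens to their byte strings."""
    byte_probs = defaultdict(float)
    for tok, p in token_probs.items():
        b = vocab[tok]
        if b:                      # skip empty / special tokens
            byte_probs[b[0]] += p  # first byte of the token
    return dict(byte_probs)

# Example: two tokens "the" and "to" both start with byte ord("t") == 116.
print(next_byte_distribution({0: 0.6, 1: 0.4}, {0: b"the", 1: b"to"}))
# -> {116: 1.0}
```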
In "The Foundations of Tokenization:
Statistical and Computational Concerns", Gastaldi et al. try to make first steps towards defining what a tokenizer should be and define properties it ought to have.
October 30, 2024 at 6:27 PM
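One property in that spirit (my paraphrase, not their formal definition) is that encoding followed by decoding should reproduce the text exactly. A tiny check could look like this; the Hugging Face usage in the comment is an assumption:

```python
# Sketch of a round-trip consistency check for a tokenizer (illustrative).
def is_consistent(tokenizer, texts):
    return all(tokenizer.decode(tokenizer.encode(t)) == t for t in texts)

# Assumed usage with a Hugging Face tokenizer:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")
# print(is_consistent(tok, ["hello world", "naïve café"]))
```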
In "Toward a Theory of Tokenization in LLMs" Rajaraman et al., the authors discuss why we can think of tokenization to cause lower perplexity/ a better entropy bound.
October 30, 2024 at 6:27 PM
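A toy illustration of that intuition (my example, not theirs): on a string with a frequent pair, a unigram model over merged tokens spends fewer bits per character than a unigram model over raw characters.

```python
# Compare bits per character of a character-level unigram model vs. a unigram
# model over BPE-like merged tokens on a toy string (illustrative example).
import math
from collections import Counter

def bits_per_char(text, tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    bits = -sum(c * math.log2(c / total) for c in counts.values())
    return bits / len(text)

text = "ababababcd" * 100
char_tokens = list(text)              # character-level "tokenizer"
merged_tokens = list(text.replace("ab", "X"))  # one merge: "ab" -> one token

print(bits_per_char(text, char_tokens))    # ~1.72 bits/char
print(bits_per_char(text, merged_tokens))  # ~0.75 bits/char after the merge
```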
A must-watch entry point is @karpathy.bsky.social's "Let's build the GPT Tokenizer" video, where he discusses some tokenization problems.
October 30, 2024 at 6:27 PM