Leshem (Legend) Choshen @EMNLP
@lchoshen.bsky.social
🥇 LLMs together (co-created model merging, BabyLM, textArena.ai)
🥈 Spreading science over hype in #ML & #NLP
Proud shareLM💬 Donor

@IBMResearch & @MIT_CSAIL
How do humans recognize themselves? They have memory and they interact with the world. They move in front of the mirror and anticipate what will happen. If you don't allow that, it won't be easy for humans either. Why would we expect it of models, or at all?
October 27, 2025 at 9:30 PM
Well that's a bit of an oversell.
If you are treated with immunotherapy, i.e., your immune system is encouraged to fight some cancer-related enzyme, then this treatment is better. It doesn't help "cancer" in general; it improves one specific treatment (a treatment that seems to be getting more common, judging from a quick search).
October 26, 2025 at 12:15 PM
Reposted by Leshem (Legend) Choshen @EMNLP
With a fantastic team of international collaborators we have developed a pipeline for creating LM training data from resources that children are exposed to.

We release this pipeline and welcome new contributions!

Website: babylm.github.io/babybabellm/
Paper: arxiv.org/pdf/2510.10159
October 15, 2025 at 10:53 AM
Indeed, look at how it is encoded; you decode it in just the same way. (They did have a lossy version for video or audio, I believe, but I didn't look into the details.)
October 6, 2025 at 7:11 PM
One thing I couldn't find is speed. Compression ratio is usually a speed vs. performance trade-off. This seems like a massively slow process, so I wonder (even with SLMs) when it is justified, and how well other methods would do given that much more compute.
October 6, 2025 at 4:47 PM
They did it for images, video, text and it all compresses really, really well.
October 6, 2025 at 4:47 PM
So, on average, we get short numbers to represent sentences. And to decode them, we run the model again, get the probabilities, and use those to decide which next word to give back to the model.
October 6, 2025 at 4:47 PM
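A minimal sketch of that decode loop, assuming the interval-coding scheme from the posts below; `toy_lm` is a hypothetical hard-coded stand-in for the LLM's next-token distribution, not anything from the paper.

```python
# Decode sketch: rerun the same model, see which token's probability slice
# contains the code number, emit that token, and rescale the interval.
def toy_lm(prefix):
    # Hypothetical next-token distributions, just to make the sketch runnable.
    if prefix == ():
        return [("a", 0.30), ("the", 0.20), ("I", 0.05), ("<other>", 0.45)]
    if prefix == ("I",):
        return [("am", 0.40), ("was", 0.30), ("<other>", 0.30)]
    return [("<eos>", 1.0)]

def decode(code, n_tokens):
    tokens, prefix = [], ()
    lo, hi = 0.0, 1.0
    for _ in range(n_tokens):
        width = hi - lo
        cum = 0.0
        for t, p in toy_lm(prefix):
            t_lo, t_hi = lo + width * cum, lo + width * (cum + p)
            if t_lo <= code < t_hi:          # the code falls in this token's slice
                tokens.append(t)
                lo, hi, prefix = t_lo, t_hi, prefix + (t,)
                break
            cum += p
    return tokens

print(decode(0.51, 2))   # ['I', 'am'] under the toy distribution
```

The decoder must use the exact same model as the encoder; otherwise the probability slices, and therefore the tokens, come out differently.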
Then different probabilities split 0.5-0.55 further, perhaps mapping every sentence starting with "I am..." to 0.512-0.5124.
The probabilities come from your favorite LLM.
The result is a single number for your sentence. And if the model was good, this number will be short.
October 6, 2025 at 4:47 PM
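In code, each step is just a rescaling of the current interval. The numbers for "am" below are made up, chosen only so the result matches the 0.512-0.5124 illustration above.

```python
# One arithmetic-coding step: rescale the current interval by the token's
# cumulative-probability range. Values for "am" are hypothetical.
lo, hi = 0.50, 0.55          # interval after encoding "I"
cum_before, p = 0.24, 0.008  # assumed mass before "am", and P("am" | "I")

width = hi - lo
lo, hi = lo + width * cum_before, lo + width * (cum_before + p)
print(lo, hi)                # ≈ 0.512 0.5124
```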
Paper: alphaxiv.org/pdf/2407.07723
Arithmetic coding works by sequentially cutting the number space between 0 and 1.
Consider compressing "I am".
The model says the probability of starting a sentence with the token "a" is 30%, "the" 20%, "I" 5%, ...
So any sentence "I" + ... falls in 0.50-0.55.
Lossless data compression by large models | alphaXiv
What about speed? What uses do people imagine this can really have? (Maybe something with cold storage?)
alphaxiv.org
October 6, 2025 at 4:47 PM
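A minimal end-to-end sketch of the encode side, assuming a toy hard-coded next-token model in place of a real LLM; the `toy_lm` name and its numbers are illustrative (roughly following the probabilities in the post above), not the paper's setup.

```python
# Encode sketch: at each step, narrow the interval to the slice that the
# model assigns to the actual next token. A better model gives likely tokens
# wider slices, so the final interval is wider and needs fewer digits to pin down.
def toy_lm(prefix):
    # Hypothetical next-token distributions, just to make the sketch runnable.
    if prefix == ():
        return [("a", 0.30), ("the", 0.20), ("I", 0.05), ("<other>", 0.45)]
    if prefix == ("I",):
        return [("am", 0.40), ("was", 0.30), ("<other>", 0.30)]
    return [("<eos>", 1.0)]

def encode(tokens):
    lo, hi = 0.0, 1.0
    prefix = ()
    for tok in tokens:
        cum = 0.0
        for t, p in toy_lm(prefix):
            if t == tok:
                width = hi - lo
                lo, hi = lo + width * cum, lo + width * (cum + p)
                break
            cum += p
        prefix = prefix + (tok,)
    # Any number inside [lo, hi) identifies the sequence.
    return lo, hi

print(encode(("I", "am")))   # ≈ (0.5, 0.52) under the toy distribution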
They also make it hard to improve and get feedback
October 5, 2025 at 5:32 PM
I wish I had this guy's chats to research and compare his claims to alternative uses. (E.g., through shareLM.)
September 29, 2025 at 9:09 PM
The paper's authors are also tagged in this thread so maybe they know more
September 26, 2025 at 3:48 PM
One paper also finds that cross-linguality is hard across scripts (replication is always good: bsky.app/profile/lcho...),
and models tend to become more cross-lingual with training.
September 26, 2025 at 3:27 PM
Thus, a "feature" is defined by the sparse activations we find.
And these are shifting quite rapidly at a certain part in training
September 26, 2025 at 3:27 PM
How can we do it?
Crosscoders map activations into a sparse representation and decode it back into the activations (classic compress-decompress).
A single crosscoder is then trained to map the activations of all pretraining checkpoints, creating a shared space.
September 26, 2025 at 3:27 PM
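A minimal sketch of what such a crosscoder could look like, assuming a generic L1-sparse autoencoder with per-checkpoint weights and one shared latent space; dimensions, the sparsity penalty, and initialization are placeholders, not the paper's exact setup.

```python
# Crosscoder sketch: each pretraining checkpoint gets its own encoder/decoder
# weights, but all map into ONE shared sparse latent, so "features" (the
# latents that activate) are comparable across training time.
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, d_model=512, d_latent=4096, n_checkpoints=8):
        super().__init__()
        self.enc = nn.Parameter(torch.randn(n_checkpoints, d_model, d_latent) * 0.01)
        self.dec = nn.Parameter(torch.randn(n_checkpoints, d_latent, d_model) * 0.01)
        self.bias = nn.Parameter(torch.zeros(d_latent))

    def forward(self, acts):                    # acts: (batch, n_checkpoints, d_model)
        # Sum encoder contributions from every checkpoint, then sparsify with ReLU.
        z = torch.relu(torch.einsum("bcd,cdl->bl", acts, self.enc) + self.bias)
        recon = torch.einsum("bl,cld->bcd", z, self.dec)  # reconstruct each checkpoint
        return z, recon

def loss_fn(acts, z, recon, l1=1e-3):
    # Reconstruction + L1 sparsity: the few active latents define the "features".
    return ((recon - acts) ** 2).mean() + l1 * z.abs().mean()

# Usage on random stand-in activations:
model = Crosscoder()
acts = torch.randn(32, 8, 512)                  # (batch, checkpoints, d_model)
z, recon = model(acts)
print(loss_fn(acts, z, recon).item())
```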