Leshem (Legend) Choshen @EMNLP
@lchoshen.bsky.social
🥇 LLMs together (co-created model merging, BabyLM, textArena.ai)
🥈 Spreading science over hype in #ML & #NLP
Proud shareLM💬 Donor

@IBMResearch & @MIT_CSAIL
How do humans recognize themselves? They have memory and they interact with the world. They move in front of the mirror and anticipate what will happen. If you don't allow that, it won't be easy for humans either. Why would we expect it of models, or at all?
October 27, 2025 at 9:30 PM
Well that's a bit of an oversell.
If you are treated with immunotherapy, i.e., your immune system is encouraged to fight some cancer-related enzyme, then this treatment is better. It doesn't help "cancer" in general; it improves one specific treatment (a treatment that seems to be getting more common, judging from a quick search).
October 26, 2025 at 12:15 PM
Reposted by Leshem (Legend) Choshen @EMNLP
With a fantastic team of international collaborators we have developed a pipeline for creating LM training data from resources that children are exposed to.

We release this pipeline and welcome new contributions!

Website: babylm.github.io/babybabellm/
Paper: arxiv.org/pdf/2510.10159
October 15, 2025 at 10:53 AM
Indeed, look at how it is encoded; you decode it in just the same way. (They did have a lossy version for video or audio, I believe, but I didn't look into the details.)
October 6, 2025 at 7:11 PM
One thing I couldn't find is speed. Compression ratio is usually a speed vs. performance trade-off. This seems like a massively slow process, so I wonder (even with SLMs) when it is justified, and how well other methods would do given that much more compute.
October 6, 2025 at 4:47 PM
They did it for images, video, text and it all compresses really, really well.
October 6, 2025 at 4:47 PM
So, on average, we get short numbers to represent sentences. And to decode them, we run the model again, get the probabilities, and use those to decide which next word to give back to the model.
October 6, 2025 at 4:47 PM
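A minimal sketch of that decode loop, assuming the interval-coding scheme from the posts below; `toy_lm` is a hypothetical hard-coded stand-in for the LLM's next-token distribution, not anything from the paper.

```python
# Decode sketch: rerun the same model, see which token's probability slice
# contains the code number, emit that token, and rescale the interval.
def toy_lm(prefix):
    # Hypothetical next-token distributions, just to make the sketch runnable.
    if prefix == ():
        return [("a", 0.30), ("the", 0.20), ("I", 0.05), ("<other>", 0.45)]
    if prefix == ("I",):
        return [("am", 0.40), ("was", 0.30), ("<other>", 0.30)]
    return [("<eos>", 1.0)]

def decode(code, n_tokens):
    tokens, prefix = [], ()
    lo, hi = 0.0, 1.0
    for _ in range(n_tokens):
        width = hi - lo
        cum = 0.0
        for t, p in toy_lm(prefix):
            t_lo, t_hi = lo + width * cum, lo + width * (cum + p)
            if t_lo <= code < t_hi:          # the code falls in this token's slice
                tokens.append(t)
                lo, hi, prefix = t_lo, t_hi, prefix + (t,)
                break
            cum += p
    return tokens

print(decode(0.51, 2))   # ['I', 'am'] under the toy distribution
```

The decoder must use the exact same model as the encoder; otherwise the probability slices, and therefore the tokens, come out differently.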
Then different probabilities split 0.5-0.55 further, perhaps mapping every sentence starting with "I am..." to 0.512-0.5124.
The probabilities come from your favorite LLM.
The result is a single number for your sentence. And if the model was good, this number will be short.
October 6, 2025 at 4:47 PM
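In code, each step is just a rescaling of the current interval. The numbers for "am" below are made up, chosen only so the result matches the 0.512-0.5124 illustration above.

```python
# One arithmetic-coding step: rescale the current interval by the token's
# cumulative-probability range. Values for "am" are hypothetical.
lo, hi = 0.50, 0.55          # interval after encoding "I"
cum_before, p = 0.24, 0.008  # assumed mass before "am", and P("am" | "I")

width = hi - lo
lo, hi = lo + width * cum_before, lo + width * (cum_before + p)
print(lo, hi)                # ≈ 0.512 0.5124
```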
Paper: alphaxiv.org/pdf/2407.07723
Arithmetic coding works by sequentially cutting the number space between 0 and 1.
Consider compressing "I am".
The model says the probability of starting a sentence with the token "a" is 30%, "the" 20%, "I" 5%, ...
So any sentence "I" + ... falls in 0.50-0.55.
Lossless data compression by large models | alphaXiv
What about speed? What uses do people imagine this can really have? (Maybe something with cold storage?)
alphaxiv.org
October 6, 2025 at 4:47 PM
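A minimal end-to-end sketch of the encode side, assuming a toy hard-coded next-token model in place of a real LLM; the `toy_lm` name and its numbers are illustrative (roughly following the probabilities in the post above), not the paper's setup.

```python
# Encode sketch: at each step, narrow the interval to the slice that the
# model assigns to the actual next token. A better model gives likely tokens
# wider slices, so the final interval is wider and needs fewer digits to pin down.
def toy_lm(prefix):
    # Hypothetical next-token distributions, just to make the sketch runnable.
    if prefix == ():
        return [("a", 0.30), ("the", 0.20), ("I", 0.05), ("<other>", 0.45)]
    if prefix == ("I",):
        return [("am", 0.40), ("was", 0.30), ("<other>", 0.30)]
    return [("<eos>", 1.0)]

def encode(tokens):
    lo, hi = 0.0, 1.0
    prefix = ()
    for tok in tokens:
        cum = 0.0
        for t, p in toy_lm(prefix):
            if t == tok:
                width = hi - lo
                lo, hi = lo + width * cum, lo + width * (cum + p)
                break
            cum += p
        prefix = prefix + (tok,)
    # Any number inside [lo, hi) identifies the sequence.
    return lo, hi

print(encode(("I", "am")))   # ≈ (0.5, 0.52) under the toy distribution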
They also make it hard to improve and get feedback
October 5, 2025 at 5:32 PM
I wish I had this guy's chats to research and compare his claims to alternative uses. (E.g., through shareLM.)
September 29, 2025 at 9:09 PM
The paper's authors are also tagged in this thread so maybe they know more
September 26, 2025 at 3:48 PM
One paper also finds that cross-linguality is hard across scripts (replication is always good: bsky.app/profile/lcho...),
and models tend to become more cross-lingual with training.
September 26, 2025 at 3:27 PM
Thus, a "feature" is defined by the sparse activations we find.
And these are shifting quite rapidly at a certain part in training
September 26, 2025 at 3:27 PM
How can we do it?
Crosscoders map activations into a sparse representation and decode it back into the activations (classic compress-decompress).
A single crosscoder is then trained to map the activations of all pretraining checkpoints, creating a shared space.
September 26, 2025 at 3:27 PM
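A minimal sketch of what such a crosscoder could look like, assuming a generic L1-sparse autoencoder with per-checkpoint weights and one shared latent space; dimensions, the sparsity penalty, and initialization are placeholders, not the paper's exact setup.

```python
# Crosscoder sketch: each pretraining checkpoint gets its own encoder/decoder
# weights, but all map into ONE shared sparse latent, so "features" (the
# latents that activate) are comparable across training time.
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, d_model=512, d_latent=4096, n_checkpoints=8):
        super().__init__()
        self.enc = nn.Parameter(torch.randn(n_checkpoints, d_model, d_latent) * 0.01)
        self.dec = nn.Parameter(torch.randn(n_checkpoints, d_latent, d_model) * 0.01)
        self.bias = nn.Parameter(torch.zeros(d_latent))

    def forward(self, acts):                    # acts: (batch, n_checkpoints, d_model)
        # Sum encoder contributions from every checkpoint, then sparsify with ReLU.
        z = torch.relu(torch.einsum("bcd,cdl->bl", acts, self.enc) + self.bias)
        recon = torch.einsum("bl,cld->bcd", z, self.dec)  # reconstruct each checkpoint
        return z, recon

def loss_fn(acts, z, recon, l1=1e-3):
    # Reconstruction + L1 sparsity: the few active latents define the "features".
    return ((recon - acts) ** 2).mean() + l1 * z.abs().mean()

# Usage on random stand-in activations:
model = Crosscoder()
acts = torch.randn(32, 8, 512)                  # (batch, checkpoints, d_model)
z, recon = model(acts)
print(loss_fn(acts, z, recon).item())
```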