If you are treated with immunotherapy, i.e., your immune system is encouraged to fight some cancer-related enzyme, then this treatment is better. It doesn't help "cancer" in general, but it improves a treatment (one that seems to be getting more common, judging from googling it).
We release this pipeline and welcome new contributions!
Website: babylm.github.io/babybabellm/
Paper: arxiv.org/pdf/2510.10159
The probabilities you get from your favorite LLM.
The result is a single number for your sentence. And if the model was good, this number needs only a few bits to write down.
Arithmetic coding works by sequentially cutting the number space between 0 and 1.
Consider compressing "I am".
The model says the probability of starting a sentence with the token "a" is 30%, "the" 20%, "I" 5% ...
So any sentence "I" + ... falls in 0.50-0.55
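To make the interval-cutting concrete, here's a minimal Python sketch. The toy_probs function is a hypothetical stand-in for your LLM's next-token distribution, the vocabulary ordering is an assumption (encoder and decoder just need to agree on it), and a real coder would handle numerical precision much more carefully.

```python
def encode_interval(tokens, token_probs):
    """Narrow the [0, 1) interval once per token; return the final sub-interval."""
    low, high = 0.0, 1.0
    for i, tok in enumerate(tokens):
        probs = token_probs(tokens[:i])   # P(next token | prefix), dict: token -> prob
        cum = 0.0
        # Iterate the vocabulary in a fixed order so encoder and decoder agree on the cuts.
        for cand, p in probs.items():
            if cand == tok:
                width = high - low
                low, high = low + cum * width, low + (cum + p) * width
                break
            cum += p
    return low, high   # any number in [low, high) encodes the whole token sequence


def toy_probs(prefix):
    # Hypothetical distributions echoing the example above.
    if not prefix:
        return {"a": 0.30, "the": 0.20, "I": 0.05, "<other>": 0.45}
    return {"am": 0.60, "<other>": 0.40}   # confident model -> wide interval -> short code


print(encode_interval(["I", "am"], toy_probs))   # roughly (0.50, 0.53)
```

After the first token, "I" pins the interval to 0.50-0.55; a confident prediction for "am" shrinks it only a little, so the final interval stays relatively wide and any number inside it takes few bits to write down.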
www.semanticscholar.org/paper/Evolut...
www.semanticscholar.org/reader/5a527...
There's Guy Hacohen's work on the order of learning (fewer features, but there are principal components, example hardness, and so on).
scholar.google.co.il/citations?hl...
and they tend to become more cross-lingual with training.
And these shift quite rapidly at a certain point in training.
So crosscoders map activations into a sparse representation and decode it back into the activations (classic compress-decompress).
A single crosscoder is then trained on the activations of all pretraining checkpoints, creating a shared space.
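Here's a rough PyTorch sketch of that idea, just to make the shapes concrete. The dimensions, the summed per-checkpoint encoders/decoders, and the L1 sparsity penalty are my assumptions for illustration, not the exact setup from the paper.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, d_model: int, d_sparse: int, n_checkpoints: int):
        super().__init__()
        # One encoder/decoder matrix per checkpoint, all reading/writing the same
        # d_sparse feature space -- that shared space is what lets you compare
        # features across training time.
        self.W_enc = nn.Parameter(torch.randn(n_checkpoints, d_model, d_sparse) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_checkpoints, d_sparse, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sparse))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_checkpoints, d_model) -- the same inputs run through
        # every checkpoint. Sum encoder contributions, ReLU for non-negative codes.
        z = torch.relu(torch.einsum("bcd,cds->bs", acts, self.W_enc) + self.b_enc)
        recon = torch.einsum("bs,csd->bcd", z, self.W_dec)
        return recon, z

# Training-step sketch: reconstruct activations at every checkpoint from the
# shared sparse code, with an L1 penalty to keep the code sparse.
model = Crosscoder(d_model=512, d_sparse=4096, n_checkpoints=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
acts = torch.randn(32, 8, 512)          # stand-in for real checkpoint activations
opt.zero_grad()
recon, z = model(acts)
loss = (recon - acts).pow(2).mean() + 1e-3 * z.abs().mean()
loss.backward()
opt.step()
```

Because every checkpoint decodes from the same sparse code z, you can ask when a feature appears or shifts during training, e.g. by comparing each feature's per-checkpoint decoder weights.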