www: jemoka.com
ac: nlp.stanford.edu/~houjun/
✅ Learn yourself a reasoning model with normal pretraining
✅ Better perplexity compared to fixed thinking tokens
No fancy loss, no chain of thought labels 🚀
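(For reference, the perplexity being compared is the usual exponentiated per-token negative log-likelihood; the formula below is the standard definition, not anything specific to this work:)

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
$$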
🤯 Why does flow matching generalize? Did you know that the flow matching target you're trying to learn *can only generate training points*?
w @quentinbertrand.bsky.social @annegnx.bsky.social @remiemonet.bsky.social 👇👇👇
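For readers who want the underlying fact: with a finite training set, the exact flow-matching target is a weighted average of per-example conditional velocities, and the flow it induces transports noise onto the empirical distribution, i.e. onto the training points. (The notation below is standard conditional flow matching in the style of Lipman et al.; the symbols are mine, not the thread's.)

$$
u_t^\star(x)=\sum_{i=1}^{n} w_i(x,t)\,u_t\!\left(x \mid x^{(i)}\right),
\qquad
w_i(x,t)=\frac{p_t\!\left(x \mid x^{(i)}\right)}{\sum_{j=1}^{n} p_t\!\left(x \mid x^{(j)}\right)}
$$

So integrating $\dot{x}_t = u_t^\star(x_t)$ from noise can only land on the training points $x^{(i)}$ as $t \to 1$; whatever generalization we observe has to come from the learned network *not* matching this target exactly.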
You should **drop dropout** when you are training your LMs AND MLMs!
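The post doesn't fix a framework; as a minimal sketch of what "dropping dropout" looks like in practice (assuming the Hugging Face `transformers` configs, which are my choice of example, not the authors'):

```python
# Minimal sketch: pretrain an LM and an MLM with all dropout probabilities set to 0.
# (Hugging Face `transformers` is assumed here purely for illustration.)
from transformers import GPT2Config, GPT2LMHeadModel, BertConfig, BertForMaskedLM

# Causal LM: zero out the residual, embedding, and attention dropout (defaults are 0.1).
lm_config = GPT2Config(resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
lm = GPT2LMHeadModel(lm_config)

# Masked LM: same idea for the BERT-style dropout knobs.
mlm_config = BertConfig(hidden_dropout_prob=0.0, attention_probs_dropout_prob=0.0)
mlm = BertForMaskedLM(mlm_config)
```

The same effect can be had by zeroing those probabilities on an existing config before starting pretraining.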
people.csail.mit.edu/rrw/time-vs-...
It's still hard for me to believe it myself, but I seem to have shown that TIME[t] is contained in SPACE[√(t log t)].
To appear in STOC. Comments are very welcome!
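In symbols (writing t for the time bound; the machine model and the exact hypotheses on t are spelled out in the linked paper):

$$
\mathsf{TIME}[t] \subseteq \mathsf{SPACE}\!\left[\sqrt{t\log t}\right]
$$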