We trained 3 models (1.5B, 8B, and 24B parameters) from scratch on 2-4T tokens of custom data
(TLDR: we cheat and get good scores)
@wissamantoun.bsky.social @rachelbawden.bsky.social @bensagot.bsky.social @zehavoc.bsky.social
It still became a polarization machine.
Then we tried six interventions to fix social media.
The results were… not what we expected.
arxiv.org/abs/2508.03385
What's driving performance: architecture or data?
To find out, we pretrained ModernBERT on the same dataset as CamemBERTaV2 (a DeBERTaV3 model) to isolate architecture effects.
Here are our findings:
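For anyone curious what such a controlled comparison looks like in practice, here is a minimal sketch (not our training code): fine-tune both encoders with identical settings on one downstream task, so any score gap reflects architecture rather than pretraining data. The checkpoint names and the task below are illustrative placeholders, not our released models.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Hypothetical checkpoint names; for an architecture-only comparison both
# models must be pretrained on the same corpus, as in the experiment above.
CHECKPOINTS = {
    "debertav3-style": "some-org/debertav3-french-base",    # placeholder
    "modernbert-style": "some-org/modernbert-french-base",  # placeholder
}

# Stand-in downstream classification task; swap in the actual benchmark.
dataset = load_dataset("imdb")

results = {}
for name, ckpt in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
        batched=True,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=f"out-{name}",
            per_device_train_batch_size=16,
            learning_rate=2e-5,
            num_train_epochs=1,
        ),
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    trainer.train()
    # Identical hyperparameters for both runs, so any remaining score gap
    # can be attributed to the architecture rather than the data.
    results[name] = trainer.evaluate()

print(results)
```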
👩🎓👩🎓🎉
@inriaparisnlp.bsky.social
@sorbonne-universite.fr
1/
htr-united.github.io