and even talks about ModernDeBERTa x.com/bclavie/stat....
@bclavie.bsky.social
@nohtow.bsky.social
So choose based on your priorities!
Huge thanks to my advisors:
@zehavoc.bsky.social @bensagot.bsky.social and @inriaparisnlp.bsky.social
Here's the full paper with the details: arxiv.org/abs/2504.08716
ModernBERT exhibits instabilities in downstream fine-tuning, while DeBERTaV3 offers more stable training dynamics.
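One way to make that (in)stability concrete, as a minimal sketch rather than the paper's protocol (model id, dataset, and hyperparameters below are placeholders): fine-tune the same checkpoint under several seeds and look at the spread of the dev metric.

```python
# Hypothetical multi-seed fine-tuning loop; assumes train_ds / eval_ds are
# already tokenized Hugging Face datasets with a "labels" column.
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments, set_seed)

def finetune_once(seed: int, model_name: str, train_ds, eval_ds) -> float:
    set_seed(seed)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    args = TrainingArguments(output_dir=f"run-seed{seed}", seed=seed,
                             num_train_epochs=3, learning_rate=2e-5,
                             per_device_train_batch_size=32, report_to=[])
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer.evaluate()["eval_loss"]

# losses = [finetune_once(s, "answerdotai/ModernBERT-base", train_ds, eval_ds) for s in range(5)]
# A large spread (or the occasional diverged run) across seeds is the instability in question.
```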
High-quality pretraining data accelerates convergence but offers minimal gains in final performance.
We suggest that current benchmarks may be saturated, limiting their ability to distinguish model improvements.
When trained on identical data, DeBERTaV3 outperforms ModernBERT in benchmark tasks.
ModernBERT's strength is faster training and inference, but it doesn't surpass DeBERTaV3 in accuracy on NLU tasks.
Bon appétit!
[8/8]
Access to compute resources was granted by Stéphane Requena and GENCI on the Jean Zay supercomputer.
For more details, please check our new paper (arxiv.org/abs/2411.08868)
[7/8]
Model Link: huggingface.co/almanach?sea...
[6/8]
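A loading sketch for the hungry; the exact repo ids under the almanach org are my assumption from the truncated link above.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

name = "almanach/camembertav2-base"  # assumed id; the CamemBERT-v2 repo swaps in the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Or as a quick fill-mask demo:
fill = pipeline("fill-mask", model=name)
print(fill(f"Le camembert est {fill.tokenizer.mask_token} !")[:3])
```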
The new models vastly outperform their predecessors and even match domain-specific fine-tuned models 🧑‍⚕️.
[5/8]
RTD’s (replaced token detection) efficiency allowed us to train for 1 epoch vs. 3 for MLM.
Pre-training had 2 phases: sequence length 512, then 1024 on long documents.
(a minimal RTD sketch follows below)
[4/8]
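For the curious, here is a minimal sketch of what an RTD (ELECTRA/DeBERTaV3-style) training step looks like. The generator/discriminator interfaces (Hugging Face-style outputs with .logits), the masking rate, and the loss weight are illustrative assumptions, and DeBERTaV3's gradient-disentangled embedding sharing is omitted.

```python
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, attention_mask,
             mask_token_id, mask_prob=0.15):
    # 1) Randomly mask a fraction of the real tokens.
    mask = (torch.rand_like(input_ids, dtype=torch.float) < mask_prob) & attention_mask.bool()
    masked_ids = input_ids.masked_fill(mask, mask_token_id)

    # 2) A small generator fills the masked positions (standard MLM).
    gen_logits = generator(input_ids=masked_ids, attention_mask=attention_mask).logits
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, sampled, input_ids)

    # 3) The discriminator labels every token as original (0) or replaced (1).
    labels = (corrupted != input_ids).float()
    disc_logits = discriminator(input_ids=corrupted, attention_mask=attention_mask).logits.squeeze(-1)
    rtd_loss = F.binary_cross_entropy_with_logits(disc_logits, labels,
                                                  weight=attention_mask.float())

    # 4) The generator keeps its own MLM loss on the masked positions.
    mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])
    return mlm_loss + 50.0 * rtd_loss  # ELECTRA-style weight on the RTD term (assumed here)
```

Every position gets a learning signal (replaced vs. original) instead of only the ~15% masked positions in MLM, which is where the sample efficiency comes from.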
- 32,768-token vocabulary
- adds newline and tab characters
- supports emoji built with zero-width joiners
- numbers are split into two-digit tokens
- supports French elisions
(quick tokenizer demo below)
[3/8]
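A quick way to see those tokenizer changes in action (repo id is an assumption, see the model link above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almanach/camembertav2-base")  # assumed id
print(tok.vocab_size)                          # expected: 32768
print(tok.tokenize("L'an 2024 vaut 100 %\n"))  # elision, two-digit number pieces, newline
print(tok.tokenize("🧑‍⚕️"))                    # emoji composed with a zero-width joiner
```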
- Much larger pretraining dataset: 275B tokens (previously ~32B) from French CulturaX, scientific articles from HAL, and Wikipedia.
Only 1 epoch was needed for CamemBERTa-v2, while CamemBERT-v2 was trained for 3 epochs (825B tokens).
(data-streaming sketch below)
[2/8]
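If you want to poke at the same kind of data, the French CulturaX split can be streamed from the Hub. The dataset id below is my assumption for where that corpus lives, and access may require accepting the dataset's terms first.

```python
from datasets import load_dataset

# Streaming avoids downloading the whole split up front.
fr = load_dataset("uonlp/CulturaX", "fr", split="train", streaming=True)
print(next(iter(fr))["text"][:200])
```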