and even talks about ModernDeBERTa x.com/bclavie/stat....
@bclavie.bsky.social
@nohtow.bsky.social
So choose based on your priorities!
Huge thanks to my advisors:
@zehavoc.bsky.social @bensagot.bsky.social and @inriaparisnlp.bsky.social
Here's the full paper with the details: arxiv.org/abs/2504.08716
ModernBERT exhibits instabilities in downstream fine-tuning, while DeBERTaV3 offers more stable training dynamics.
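One way to make that (in)stability concrete, as a minimal sketch rather than the paper's protocol (model id, dataset, and hyperparameters below are placeholders): fine-tune the same checkpoint under several seeds and look at the spread of the dev metric.

```python
# Hypothetical multi-seed fine-tuning loop; assumes train_ds / eval_ds are
# already tokenized Hugging Face datasets with a "labels" column.
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments, set_seed)

def finetune_once(seed: int, model_name: str, train_ds, eval_ds) -> float:
    set_seed(seed)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    args = TrainingArguments(output_dir=f"run-seed{seed}", seed=seed,
                             num_train_epochs=3, learning_rate=2e-5,
                             per_device_train_batch_size=32, report_to=[])
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer.evaluate()["eval_loss"]

# losses = [finetune_once(s, "answerdotai/ModernBERT-base", train_ds, eval_ds) for s in range(5)]
# A large spread (or the occasional diverged run) across seeds is the instability in question.
```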
High-quality pretraining data accelerates convergence but offers minimal gains in final performance.
We suggest that current benchmarks may be saturated, limiting their ability to distinguish model improvements.
When trained on identical data, DeBERTaV3 outperforms ModernBERT in benchmark tasks.
ModernBERT's strength is faster training and inference, but it doesn't surpass DeBERTaV3 in accuracy on NLU tasks.
Bon appétit!
[8/8]
Access to compute resources was granted by Stéphane Requena and GENCI on the Jean Zay supercomputer.
For more details, please check our new paper (arxiv.org/abs/2411.08868)
[7/8]
Model Link: huggingface.co/almanach?sea...
[6/8]
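A loading sketch for the hungry; the exact repo ids under the almanach org are my assumption from the truncated link above.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

name = "almanach/camembertav2-base"  # assumed id; the CamemBERT-v2 repo swaps in the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Or as a quick fill-mask demo:
fill = pipeline("fill-mask", model=name)
print(fill(f"Le camembert est {fill.tokenizer.mask_token} !")[:3])
```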
The new models vastly outperform their predecessors and even match domain-specific fine-tuned models 🧑‍⚕️.
[5/8]
RTD’s (replaced token detection) efficiency allowed us to train for 1 epoch vs. 3 for MLM.
Pre-training had 2 phases: sequence length 512, then 1024 on long documents.
(a minimal RTD sketch follows below)
[4/8]
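For the curious, here is a minimal sketch of what an RTD (ELECTRA/DeBERTaV3-style) training step looks like. The generator/discriminator interfaces (Hugging Face-style outputs with .logits), the masking rate, and the loss weight are illustrative assumptions, and DeBERTaV3's gradient-disentangled embedding sharing is omitted.

```python
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, attention_mask,
             mask_token_id, mask_prob=0.15):
    # 1) Randomly mask a fraction of the real tokens.
    mask = (torch.rand_like(input_ids, dtype=torch.float) < mask_prob) & attention_mask.bool()
    masked_ids = input_ids.masked_fill(mask, mask_token_id)

    # 2) A small generator fills the masked positions (standard MLM).
    gen_logits = generator(input_ids=masked_ids, attention_mask=attention_mask).logits
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, sampled, input_ids)

    # 3) The discriminator labels every token as original (0) or replaced (1).
    labels = (corrupted != input_ids).float()
    disc_logits = discriminator(input_ids=corrupted, attention_mask=attention_mask).logits.squeeze(-1)
    rtd_loss = F.binary_cross_entropy_with_logits(disc_logits, labels,
                                                  weight=attention_mask.float())

    # 4) The generator keeps its own MLM loss on the masked positions.
    mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])
    return mlm_loss + 50.0 * rtd_loss  # ELECTRA-style weight on the RTD term (assumed here)
```

Every position gets a learning signal (replaced vs. original) instead of only the ~15% masked positions in MLM, which is where the sample efficiency comes from.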
- 32,768-token vocabulary
- adds newline and tab characters
- supports emoji built with zero-width joiners
- numbers are split into two-digit tokens
- supports French elisions
(quick tokenizer demo below)
[3/8]
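A quick way to see those tokenizer changes in action (repo id is an assumption, see the model link above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almanach/camembertav2-base")  # assumed id
print(tok.vocab_size)                          # expected: 32768
print(tok.tokenize("L'an 2024 vaut 100 %\n"))  # elision, two-digit number pieces, newline
print(tok.tokenize("🧑‍⚕️"))                    # emoji composed with a zero-width joiner
```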
- Much larger pretraining dataset: 275B tokens (previously ~32B) from French CulturaX, scientific articles from HAL, and Wikipedia.
Only 1 epoch was needed for CamemBERTa-v2, while CamemBERT-v2 was trained for 3 epochs (825B tokens).
(data-streaming sketch below)
[2/8]
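If you want to poke at the same kind of data, the French CulturaX split can be streamed from the Hub. The dataset id below is my assumption for where that corpus lives, and access may require accepting the dataset's terms first.

```python
from datasets import load_dataset

# Streaming avoids downloading the whole split up front.
fr = load_dataset("uonlp/CulturaX", "fr", split="train", streaming=True)
print(next(iter(fr))["text"][:200])
```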