Marianne de Heer Kloots
@mdhk.net
Linguist in AI & CogSci 🧠👩‍💻🤖 PhD student @ ILLC, University of Amsterdam

🌐 https://mdhk.net/
🐘 https://scholar.social/@mdhk
🐦 https://twitter.com/mariannedhk
Thanks to all co-authors in the Dutch SSL training team @hmohebbi.bsky.social @cpouw.bsky.social @gaofeishen.com @wzuidema.bsky.social + Martijn Bentum

And to @itcooperativesurf.bsky.social (EINF-8324) for granting me the resources that enabled this project 👩‍💻✨
August 27, 2025 at 2:31 PM
Check out the paper for more details:
📄 arxiv.org/abs/2506.00981

Or the model, dataset and code released alongside it:
🤗 huggingface.co/amsterdamNLP...
🗃️ zenodo.org/records/1554...
🔍 github.com/mdhk/SSL-NL-...

We hope these resources help further research on language-specificity in speech models!
August 27, 2025 at 2:31 PM
Finally, downstream performance on Dutch speech-to-text transcription reflects the language-specific advantage for Dutch linguistic feature encoding in model-internal representations: on average, Wav2Vec2-NL has a 27% lower word error rate than the multilingual model.
August 27, 2025 at 2:31 PM
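For concreteness: word error rate is the word-level edit distance between a reference transcript and the model's output, normalised by reference length, so a 27% lower WER means scoring roughly 0.22 where the multilingual model scores 0.30. A minimal sketch in plain Python (an illustration, not the paper's evaluation code):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One deletion and one substitution over five reference words -> WER = 0.4
print(word_error_rate("ik heb het boek gelezen", "ik het broek gelezen"))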
Furthermore, Wav2Vec2-NL shows a stronger advantage on dialogue (IFADV) than on audiobook (MLS) data.
➡️ Training on conversational speech is important not only for enhancing the representation of conversation-level structures, but also for the encoding of smaller linguistic units (phones & words).
August 27, 2025 at 2:31 PM
But there are also interesting differences between methods: for example, trained probes reveal stronger language-specific advantages for phonetic encoding than zero-shot metrics do.

➡️ Language-specific phonetic information may only take up a relatively small subspace of model-internal representations.
August 27, 2025 at 2:31 PM
We find that language-specific advantages are well detected by trained clustering or classification probes, and partially observable using zero-shot metrics. That is, the encoding of Dutch linguistic features is enhanced in the Dutch model compared to models trained on English and multilingual data.
August 27, 2025 at 2:31 PM
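For readers unfamiliar with the method: a trained classification probe fits a lightweight classifier on frozen frame-level representations and compares held-out accuracy across models. A minimal sketch of the idea (not the paper's exact setup; the feature matrix and phone labels below are random placeholders):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder inputs: X holds frame-level hidden states from one model layer
# (n_frames x hidden_dim), y holds the aligned phone label for each frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 768))
y = rng.integers(0, 40, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# A linear probe: higher held-out accuracy on one model than another suggests
# the probed feature is more linearly decodable from that model's layer.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("phone probe accuracy:", probe.score(X_test, y_test))

Comparing such held-out accuracies across models, layer by layer, is the general logic behind probing results like the ones described above.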
But those studies also used different analysis techniques.

We designed the SSL-NL dataset to test the encoding of Dutch phonetic and lexical features in SSL speech representations, while allowing for comparisons across different analysis methods.

We compare both trained probes (*) and zero-shot metrics:
August 27, 2025 at 2:31 PM
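Zero-shot metrics, by contrast, read structure directly off the representations without fitting any parameters. As an illustration only (not necessarily one of the metrics used in the paper), here is a simple cosine-similarity-based same/different phone discrimination score:

import numpy as np

def same_different_score(frames, labels, n_trials=5000, seed=0):
    """Fraction of sampled triples where a same-phone pair of frames is more
    cosine-similar than a different-phone pair. 0.5 = chance; higher values
    mean phone identity is better reflected in the representation geometry."""
    rng = np.random.default_rng(seed)
    # Unit-normalise so dot products are cosine similarities.
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    wins, trials = 0, 0
    for _ in range(n_trials):
        a = rng.integers(len(frames))
        same_pool = np.flatnonzero(labels == labels[a])
        same_pool = same_pool[same_pool != a]
        if len(same_pool) == 0:  # phone occurs only once; skip this trial
            continue
        same = rng.choice(same_pool)
        diff = rng.choice(np.flatnonzero(labels != labels[a]))
        wins += frames[a] @ frames[same] > frames[a] @ frames[diff]
        trials += 1
    return wins / trials

# Placeholder inputs with the same shape conventions as the probe sketch above.
rng = np.random.default_rng(1)
frames = rng.normal(size=(2000, 768))
labels = rng.integers(0, 40, size=2000)
print("same/different score:", same_different_score(frames, labels))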
Wav2Vec2-NL is trained from scratch, exclusively on 831 hours of Dutch speech recordings. Does this help the model encode Dutch-specific phonetic and lexical information?

Previous studies analyzing language-specific representations in speech SSL models have reported mixed results.
August 27, 2025 at 2:31 PM
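For anyone who wants to inspect such representations themselves: frame-level hidden states can be extracted from any Wav2Vec2-style checkpoint with Hugging Face transformers. A minimal sketch; the model identifier below is a generic placeholder, not the released Wav2Vec2-NL checkpoint (see the amsterdamNLP Hugging Face link shared above for that):

import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

# Placeholder model id: substitute the released Wav2Vec2-NL checkpoint
# from the amsterdamNLP Hugging Face page linked earlier in the thread.
MODEL_ID = "facebook/wav2vec2-base"

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2Model.from_pretrained(MODEL_ID)
model.eval()

# A one-second dummy waveform at 16 kHz stands in for a real Dutch recording.
waveform = torch.zeros(16000)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One (1, n_frames, hidden_dim) tensor per layer; these per-layer frame
# representations are what probes and zero-shot metrics read from.
for i, layer in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(layer.shape)}")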
We also share a working bibliography of recent publications reporting speech model interpretability analyses, which we've compiled while surveying the literature. It is incomplete and we would love your input! github.com/mdhk/awesome...
August 20, 2025 at 5:09 AM
The materials include slides and notebooks by @grzegorz.chrupala.me, Martijn Bentum, @cpouw.bsky.social, @hmohebbi.bsky.social, @gaofeishen.com, @wzuidema.bsky.social & me.
Find an overview here: interpretingdl.github.io/speech-inter...
August 19, 2025 at 9:23 PM
Last but not least, I personally can’t wait for the social event on Thursday night that we’ve been planning for the past year ✨
It features a *live brain-controlled music act* by the AIAR collective 🧠🎶 2025.ccneuro.org/social-event/ Get one of the last remaining tickets at the registration desk now!
August 12, 2025 at 2:19 PM
Raquel Fernández will present our joint project with @annabavaresco.bsky.social and Sandro Pezzelle: Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models (poster B32)
🔗 2025.ccneuro.org/poster/?id=D...
August 12, 2025 at 2:19 PM