Marvin Lavechin
@marvinlavechin.bsky.social
Machine learning, speech processing, language acquisition and cognition.
Soon at @cnrs.fr / @univ-amu.fr; currently a postdoc at MIT, Cambridge, USA.
Perhaps you want to look at the Hungarian algorithm, which is used to compute the diarization error rate.
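(If you want the optimal one-to-one pairing rather than a greedy one, here's a minimal sketch using scipy's Hungarian algorithm implementation; the overlap matrix below is made up, and this is not meant to reproduce how pyannote.metrics is organized internally.)

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical matrix of overlap durations (in seconds):
# rows = manual segments, columns = automatic segments.
overlap_matrix = np.array([
    [1.6, 0.0, 0.0],
    [0.0, 0.5, 0.1],
])

# The Hungarian algorithm minimizes total cost, so negate to maximize total overlap.
manual_idx, auto_idx = linear_sum_assignment(-overlap_matrix)
for m, a in zip(manual_idx, auto_idx):
    print(f"manual {m} <-> automatic {a}")  # manual 0 <-> automatic 0, manual 1 <-> automatic 1
```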
November 12, 2025 at 11:31 PM
Or mark it with "false alarm" if the overlap percentage is too low.

This has the advantage of being easy to explain/implement and the disadvantage of being quite inefficient.

You could also add a speaker criterion if you want to match segments based on both onset/offset and speaker identity.
November 12, 2025 at 11:31 PM
For each automatic segment, compute the percentage of overlap with your individual manual segments (say automatic 3 overlaps 80% with manual 4 and 15% with manual 3).

Assign this automatic segment to the manual segment with the greatest overlap.
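
(A minimal sketch of this in Python; representing segments as plain (onset, offset) tuples and the 0.5 threshold are just illustrative choices.)

```python
def overlap(a, b):
    """Duration of the temporal intersection between two (onset, offset) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def match_segments(automatic, manual, min_overlap=0.5):
    """Assign each automatic segment to the manual segment it overlaps most,
    or flag it as a false alarm when the overlap ratio is too low."""
    assignments = {}
    for i, auto in enumerate(automatic):
        ratios = [overlap(auto, man) / (auto[1] - auto[0]) for man in manual]
        best = max(range(len(manual)), key=lambda j: ratios[j])
        assignments[i] = best if ratios[best] >= min_overlap else "false alarm"
    return assignments

automatic = [(0.0, 2.0), (5.0, 6.0)]
manual = [(0.2, 1.8), (5.5, 7.0)]
print(match_segments(automatic, manual))  # {0: 0, 1: 1}
```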
November 12, 2025 at 11:31 PM
I don't think there's a standard way to do this. I'd go with something like this, assuming manual and automatic segments are labeled as SPEECH (regardless of who speaks).

Assign all of your manual segments a number: 1, 2, 3, 4,...

Assign all of your automatic segments a number: 1, 2, 3, 4,...
November 12, 2025 at 11:31 PM
For a beginner-friendly intro to some of these metrics, that's what I tried to do in osf.io/preprints/ps..., or see Alex Cristia's in link.springer.com/article/10.3...
November 12, 2025 at 4:26 PM
Including precision, recall, f-score and diarization error rate.
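
(For reference, the usual definitions, with TP/FP/FN the true positive, false positive, and false negative durations:)

```latex
\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F_1 = 2\,\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}

\mathrm{DER} = \frac{\text{false alarm} + \text{missed detection} + \text{speaker confusion}}{\text{total duration of reference speech}}
```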
November 12, 2025 at 4:26 PM
@fusaroli.bsky.social if you're familiar with Python, pyannote.metrics is the perfect library. It comes with multiple performance metrics to evaluate speaker segmentation/diarization algorithms.

pyannote.github.io/pyannote-met...

@hbredin.bsky.social 🙏
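(A toy example of what that looks like; the segments and labels below are made up.)

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Manual (reference) annotation for a toy two-speaker file.
reference = Annotation()
reference[Segment(0.0, 5.0)] = "spk1"
reference[Segment(6.0, 10.0)] = "spk2"

# Automatic (hypothesis) annotation produced by the system.
hypothesis = Annotation()
hypothesis[Segment(0.5, 5.0)] = "A"
hypothesis[Segment(6.0, 9.0)] = "B"

metric = DiarizationErrorRate()
print(metric(reference, hypothesis))  # diarization error rate for this file
```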
November 12, 2025 at 4:26 PM
Right, or whether tokens (as in discrete categories) per se are the right way to represent things.
November 10, 2025 at 11:21 PM
We'll finally get to see if models are on par with children 🍿
I bet no 🙃
November 10, 2025 at 11:20 PM
Yes! What's exciting to me about the First1KDays approach is that, for the first time, we have dense naturalistic data across the full first 3 years AND the computational scale to model it. Can't hide behind the "sparse dataset" or "non-plausible data" argument anymore.
November 10, 2025 at 11:20 PM
5) Can't skip Jim Glass from MIT, who has been working on automatic spoken language understanding for >30 years (including audiovisual models and recent SSL architectures).

The hype around LMs is very new, but there's a long tradition of using them to model lang. acq. in children.
November 10, 2025 at 10:11 PM
4) Afra Alishahi & @grzegorz.chrupala.me for models that learn English from Peppa Pig 🐷 direct.mit.edu/tacl/article...

and many more that I missed!
Learning English with Peppa Pig
November 10, 2025 at 9:43 PM
3) Really cool work from Okko Räsänen on the acquisition of phonemes, words, and word meanings from audiovisual input escholarship.org/uc/item/79t0...
Computational Insights to Acquisition of Phonemes, Words, and Word Meanings in Early Language: Sequential or Parallel Acquisition?
November 10, 2025 at 9:43 PM
2) hal.science/hal-04876433...
That model is directly trained on child-centered long-form recordings (those collected by @bergelsonlab.bsky.social); spoiler: learning from spontaneous & noisy speech makes the problem even more difficult, but also more interesting in my opinion!
November 10, 2025 at 9:43 PM
The community is smaller, but there's work out there. A few examples:

1) onlinelibrary.wiley.com/doi/10.1111/... on early sound and word (form) acquisition in SSL models; many analyses about what the learned tokens look like. Carried out with @maureendeseyssel.bsky.social
November 10, 2025 at 9:43 PM
Reposted by Marvin Lavechin
Next we jump from analyzing text models to predictive speech models! Phoneticians have claimed for decades that humans rely more on contextual cues when processing vowels compared to consonants. Turns out so do speech models!
June 12, 2025 at 6:56 PM
@mpoli.fr check this out if you haven't read it yet! Really cool work!
May 22, 2025 at 9:21 PM