Marvin Lavechin
@marvinlavechin.bsky.social
Machine learning, speech processing, language acquisition and cognition.
Soon at @cnrs.fr / @univ-amu.fr; currently a postdoc at MIT, Cambridge, USA.
Perhaps you want to look at the Hungarian algorithm, which is used to compute the diarization error rate.
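(If you want the optimal one-to-one pairing rather than a greedy one, here's a minimal sketch using scipy's Hungarian algorithm implementation; the overlap matrix below is made up, and this is not meant to reproduce how pyannote.metrics is organized internally.)

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical matrix of overlap durations (in seconds):
# rows = manual segments, columns = automatic segments.
overlap_matrix = np.array([
    [1.6, 0.0, 0.0],
    [0.0, 0.5, 0.1],
])

# The Hungarian algorithm minimizes total cost, so negate to maximize total overlap.
manual_idx, auto_idx = linear_sum_assignment(-overlap_matrix)
for m, a in zip(manual_idx, auto_idx):
    print(f"manual {m} <-> automatic {a}")  # manual 0 <-> automatic 0, manual 1 <-> automatic 1
```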
November 12, 2025 at 11:31 PM
Or mark it with "false alarm" if the overlap percentage is too low.

This has the advantage of being easy to explain/implement and the disadvantage of being quite inefficient.

You could also add a speaker criterion if you want to match segments based on both onset/offset and speaker identity.
November 12, 2025 at 11:31 PM
For each automatic segment, compute the percentage of overlap with your individual manual segments (say automatic 3 overlaps 80% with manual 4 and 15% with manual 3).

Assign this automatic segment to the manual segment with the greatest overlap.
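
(A minimal sketch of this in Python; representing segments as plain (onset, offset) tuples and the 0.5 threshold are just illustrative choices.)

```python
def overlap(a, b):
    """Duration of the temporal intersection between two (onset, offset) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def match_segments(automatic, manual, min_overlap=0.5):
    """Assign each automatic segment to the manual segment it overlaps most,
    or flag it as a false alarm when the overlap ratio is too low."""
    assignments = {}
    for i, auto in enumerate(automatic):
        ratios = [overlap(auto, man) / (auto[1] - auto[0]) for man in manual]
        best = max(range(len(manual)), key=lambda j: ratios[j])
        assignments[i] = best if ratios[best] >= min_overlap else "false alarm"
    return assignments

automatic = [(0.0, 2.0), (5.0, 6.0)]
manual = [(0.2, 1.8), (5.5, 7.0)]
print(match_segments(automatic, manual))  # {0: 0, 1: 1}
```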
November 12, 2025 at 11:31 PM
I don't think there's a standard way to do this. I'd go with something like this, assuming manual and automatic segments are labeled as SPEECH (regardless of who speaks).

Assign all of your manual segments a number: 1, 2, 3, 4,...

Assign all of your automatic segments a number: 1, 2, 3, 4,...
November 12, 2025 at 11:31 PM
For a beginner-friendly intro to some of these metrics, that's what I tried to do in osf.io/preprints/ps..., or see Alex Cristia's in link.springer.com/article/10.3...
November 12, 2025 at 4:26 PM
Including precision, recall, f-score and diarization error rate.
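
(For reference, the usual definitions, with TP/FP/FN the true positive, false positive, and false negative durations:)

```latex
\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F_1 = 2\,\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}

\mathrm{DER} = \frac{\text{false alarm} + \text{missed detection} + \text{speaker confusion}}{\text{total duration of reference speech}}
```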
November 12, 2025 at 4:26 PM
@fusaroli.bsky.social if you're familiar with Python, pyannote.metrics is the perfect library. It comes with multiple performance metrics to evaluate speaker segmentation/diarization algorithms.

pyannote.github.io/pyannote-met...

@hbredin.bsky.social 🙏
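(A toy example of what that looks like; the segments and labels below are made up.)

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Manual (reference) annotation for a toy two-speaker file.
reference = Annotation()
reference[Segment(0.0, 5.0)] = "spk1"
reference[Segment(6.0, 10.0)] = "spk2"

# Automatic (hypothesis) annotation produced by the system.
hypothesis = Annotation()
hypothesis[Segment(0.5, 5.0)] = "A"
hypothesis[Segment(6.0, 9.0)] = "B"

metric = DiarizationErrorRate()
print(metric(reference, hypothesis))  # diarization error rate for this file
```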
November 12, 2025 at 4:26 PM
Right, or whether tokens (as in discrete categories) per se are the right way to represent things.
November 10, 2025 at 11:21 PM
We'll finally get to see if models are on par with children 🍿
I bet no 🙃
November 10, 2025 at 11:20 PM
Yes! What's exciting to me about the First1KDays approach is that, for the first time, we have dense naturalistic data across the full first 3 years AND the computational scale to model it. Can't hide behind the "sparse dataset" or "non-plausible data" argument anymore.
November 10, 2025 at 11:20 PM
5) Can't skip Jim Glass from MIT, who has been working on automatic spoken language understanding for >30 years (including audiovisual models and recent SSL architectures).

The hype around LMs is very new, but there's a long tradition of using them to model lang. acq. in children.
November 10, 2025 at 10:11 PM
4) Afra Alishahi & @grzegorz.chrupala.me for models that learn English from Peppa Pig 🐷 direct.mit.edu/tacl/article...

and many more that I missed!
Learning English with Peppa Pig
November 10, 2025 at 9:43 PM
3) Really cool work from Okko Räsänen on the acquisition of phonemes, words, and word meanings from audiovisual input escholarship.org/uc/item/79t0...
Computational Insights to Acquisition of Phonemes, Words, and Word Meanings in Early Language: Sequential or Parallel Acquisition?
November 10, 2025 at 9:43 PM
2) hal.science/hal-04876433...
That model is directly trained on child-centered long-form recordings (those collected by @bergelsonlab.bsky.social); spoiler: learning from spontaneous & noisy speech makes the problem even more difficult, but also more interesting in my opinion!
November 10, 2025 at 9:43 PM
The community is smaller, but there's work out there. A few examples:

1) onlinelibrary.wiley.com/doi/10.1111/... on early sound and word (form) acquisition in SSL models; many analyses about what the learned tokens look like. Carried out with @maureendeseyssel.bsky.social
November 10, 2025 at 9:43 PM
Reposted by Marvin Lavechin
Next we jump from analyzing text models to predictive speech models! Phoneticians have claimed for decades that humans rely more on contextual cues when processing vowels compared to consonants. Turns out so do speech models!
June 12, 2025 at 6:56 PM
@mpoli.fr check this out if you haven't read it yet! Really cool work!
May 22, 2025 at 9:21 PM