Kwanghee Choi
@juice500ml.bsky.social
Master's student at @ltiatcmu.bsky.social, working on speech AI with @shinjiw.bsky.social
This wouldn't have been possible without my awesome co-first-author @mmiagshatoy.bsky.social and wonderful supervisors @shinjiw.bsky.social and @strubell.bsky.social!
I'll see you in Rotterdam, Wed 17:00-17:20, Area8-Oral4 (Streaming ASR)! (10/10)
August 15, 2025 at 8:44 PM
There's also a bunch of engineering tricks that can improve performance. We provide a Pareto-optimal baseline after applying all the available tricks, positioning our work as a foundation for future work in this direction. github.com/Masao-Someki... (9/n)
August 15, 2025 at 8:44 PM
We also verified that DSUs are learnable with fewer weights (# of layers), i.e., more lightweight! This implies that we're using self-supervised models inefficiently when extracting DSUs. (8/n)
August 15, 2025 at 8:44 PM
We verified that DSUs are learnable with a limited attention size (window size), i.e., streamable! This implies that DSUs are temporally "local". (7/n)
August 15, 2025 at 8:44 PM
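A minimal sketch of what such a limited attention window could look like, assuming a standard PyTorch-style boolean attention mask; the window size and the causal choice are illustrative, not the paper's exact setup.

```python
# Minimal sketch: a banded (windowed) self-attention mask in PyTorch.
# Window size and causality are assumptions for illustration.
import torch

def windowed_attention_mask(seq_len: int, window: int, causal: bool = True) -> torch.Tensor:
    """Boolean mask where True = attention allowed.

    Each frame t may only attend to frames within `window` steps,
    so the receptive field stays local and the model can stream.
    """
    idx = torch.arange(seq_len)
    dist = idx[None, :] - idx[:, None]   # dist[t, s] = s - t
    mask = dist.abs() <= window          # keep only nearby frames
    if causal:
        mask &= dist <= 0                # no peeking at future frames
    return mask

mask = windowed_attention_mask(seq_len=8, window=2)
# Can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention
print(mask.int())
```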
After modifying the architecture, we fine-tune it with the DSUs extracted from the original full model. In other words, we now treat DSUs as "ground truth" labels for the smaller model. (6/n)
August 15, 2025 at 8:44 PM
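A minimal sketch of this fine-tuning setup, assuming the DSUs act as frame-level classification targets; the tiny stand-in encoder, vocabulary size, and tensor shapes are placeholders, not the actual model.

```python
# Minimal sketch: treating DSU indices from the full model as frame-level targets
# for a smaller student encoder. All module names and shapes are illustrative.
import torch
import torch.nn as nn

N_CLUSTERS = 500          # assumed k-means vocabulary size
student = nn.Sequential(  # stand-in for a truncated / streaming Transformer encoder
    nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, N_CLUSTERS)
)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

features = torch.randn(4, 100, 80)                     # (batch, frames, feature_dim)
dsu_targets = torch.randint(0, N_CLUSTERS, (4, 100))   # DSUs from the original full model

logits = student(features)                             # (batch, frames, N_CLUSTERS)
loss = criterion(logits.reshape(-1, N_CLUSTERS), dsu_targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```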
However, the underlying Transformer model is heavy and non-streamable. We make the model more lightweight (by reducing the # of layers) and streamable (via a streaming attention window). (5/n)
August 15, 2025 at 8:44 PM
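A minimal sketch of the layer-reduction side, using a generic torch TransformerEncoder as a stand-in for the actual S3M; the layer budget is an assumption.

```python
# Minimal sketch: making the encoder lighter by keeping only the first few layers.
# The 12-layer encoder is a generic stand-in, not the paper's exact model.
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
full_encoder = nn.TransformerEncoder(layer, num_layers=12)   # "full" model

KEEP_LAYERS = 4                                              # assumed layer budget
full_encoder.layers = nn.ModuleList(list(full_encoder.layers)[:KEEP_LAYERS])
full_encoder.num_layers = KEEP_LAYERS

n_params = sum(p.numel() for p in full_encoder.parameters())
print(f"Kept {KEEP_LAYERS} layers, {n_params / 1e6:.1f}M params")
```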
Why DSUs?
(1) High transmission efficiency of ~0.6kbps (.wav files are around 512kbps, roughly 3 orders of magnitude bigger! See the back-of-envelope sketch below.)
(2) Easy integration with LLMs (we can say DSUs are "tokenized speech")
(3) DSUs somewhat "act" like phonemes (4/n)
August 15, 2025 at 8:44 PM
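A back-of-envelope sketch of where a bitrate like ~0.6kbps comes from; the frame rate and cluster count below are assumptions (typical S3M setups use around 50 frames per second and a few hundred to a few thousand k-means clusters).

```python
# Back-of-envelope bitrate comparison; frame rate and cluster count are assumed.
import math

frame_rate = 50                      # DSU frames per second (assumption)
n_clusters = 2000                    # k-means vocabulary size (assumption)
dsu_kbps = frame_rate * math.log2(n_clusters) / 1000
wav_kbps = 16_000 * 16 * 2 / 1000    # 16 kHz, 16-bit, stereo PCM ~= 512 kbps

print(f"DSUs: ~{dsu_kbps:.2f} kbps vs raw .wav: {wav_kbps:.0f} kbps "
      f"({wav_kbps / dsu_kbps:.0f}x smaller)")
```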
A whirlwind overview of discrete speech units (DSUs): we first train a Transformer model with self-supervision (i.e., self-supervised speech models, S3Ms), then simply apply k-means on top of its features. The k-means cluster indices become the DSUs! (3/n)
August 15, 2025 at 8:44 PM
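A minimal sketch of this recipe with scikit-learn; the random features stand in for real S3M activations (e.g., a hidden layer of HuBERT or WavLM), and the cluster count is an assumption.

```python
# Minimal sketch of the DSU recipe: take frame-level S3M features, fit k-means,
# and use the cluster indices as DSUs. Random features stand in for real activations.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
s3m_features = rng.normal(size=(10_000, 768))    # (frames, feature_dim), stand-in

kmeans = MiniBatchKMeans(n_clusters=500, random_state=0)
kmeans.fit(s3m_features)                          # usually fit on a large corpus subset

utterance = rng.normal(size=(120, 768))           # features of one new utterance
dsus = kmeans.predict(utterance)                  # cluster indices = the DSUs
print(dsus[:10])                                  # array of cluster ids, one per frame
```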
In short, yes!
(1) We are using self-supervised models inefficiently when extracting discrete speech units (DSUs), hence they can be made more lightweight.
(2) DSUs do not require the full temporal receptive field, hence they are streamable. (2/n)
August 15, 2025 at 8:44 PM
Check out my presentation and poster for more details. I'll see you at NAACL, 4/30 14:00-15:30 Poster Session C! youtu.be/ZRF4u1eThJM (9/9)
April 29, 2025 at 5:00 PM
We provide an extensive benchmark containing both pathological and non-native speech, with 8 different methods and 4 different speech features. It measures how accurately the speech features model each phoneme. (7/n)
April 29, 2025 at 5:00 PM
Based on this observation, we found that k-means + Gaussian Mixture Models (GMMs) are actually quite effective for modeling sound distributions.
This is different from classifiers! Classifiers model P(phoneme|sound), whereas our approach models P(sound|phoneme). (6/n)
April 29, 2025 at 5:00 PM
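A minimal sketch of the generative view (one GMM per phoneme, scoring P(sound|phoneme)); the data, phoneme set, and GMM sizes are illustrative, and the k-means step from the actual method is omitted here.

```python
# Minimal sketch: fit one GMM per phoneme on that phoneme's frames, then score
# new frames by log p(sound | phoneme). Data and phoneme set are stand-ins.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
phonemes = ["t", "d", "s"]
train = {p: rng.normal(loc=i, size=(500, 32)) for i, p in enumerate(phonemes)}

# One GMM per phoneme models p(sound | phoneme); a classifier would instead
# model p(phoneme | sound).
gmms = {p: GaussianMixture(n_components=4, random_state=0).fit(x)
        for p, x in train.items()}

frame = rng.normal(loc=0, size=(1, 32))                    # one new speech frame
scores = {p: gmms[p].score_samples(frame)[0] for p in phonemes}
print(scores)   # higher log-likelihood = frame better explained by that phoneme
```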
So, why is allophony important? We have to model each phoneme accurately for the atypical speech assessment task, which has direct applications to non-native and pathological speech assessment. (5/n)
April 29, 2025 at 5:00 PM
Compared to traditional speech features like MFCCs or Mel spectrograms, self-supervised features are far superior at capturing allophony. (4/n)
April 29, 2025 at 5:00 PM
A quick background on linguistics: this is supposed to happen! A single phoneme may have multiple realizations. For example, English /t/ is pronounced differently per context: [tʰ] in tap, [t] in stop, [ɾ] in butter, and [ʔ] in kitten. (3/n)
April 29, 2025 at 5:00 PM
In short, yes! Even though self-supervised speech models are trained only on raw speech, their representations cluster by allophonic variation, i.e., by the surrounding phonetic environment. (2/n)
April 29, 2025 at 5:00 PM