I'll see you at Rotterdam, Wed 17:00-17:20 Area8-Oral4 (Streaming ASR)! (10/10)
I'll see you at Rotterdam, Wed 17:00-17:20 Area8-Oral4 (Streaming ASR)! (10/10)
(1) High transmission efficiency of ~0.6kbps (.wav files are around 512kbps, 3-4 orders of magnitude bigger!)
(2) Easy integration with LLMs (we can say DSUs are "tokenized speech")
(3) DSUs somewhat "acts" like phonemes (4/n)
(1) High transmission efficiency of ~0.6kbps (.wav files are around 512kbps, 3-4 orders of magnitude bigger!)
(2) Easy integration with LLMs (we can say DSUs are "tokenized speech")
(3) DSUs somewhat "acts" like phonemes (4/n)
(1) We are using self-supervised models inefficiently when extracting discrete speech units (DSUs), hence can be made more lightweight.
(2) DSUs do not require full temporal receptive field, hence streamable. (2/n)
(1) We are using self-supervised models inefficiently when extracting discrete speech units (DSUs), hence can be made more lightweight.
(2) DSUs do not require full temporal receptive field, hence streamable. (2/n)
It's different with classifiers! Classifiers model P(phoneme|sound), where ours model P(sound|phoneme). (6/n)
It's different with classifiers! Classifiers model P(phoneme|sound), where ours model P(sound|phoneme). (6/n)