Gerard I. Gállego
@geiongallego.bsky.social
Excited to share that this work was accepted to Interspeech 2025. See you in Rotterdam!
Preprint: arxiv.org/abs/2505.24691
Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios
We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource setting...
June 3, 2025 at 8:53 PM
By adding phoneme recognition as an intermediate step, we improve cross-lingual transfer, even for languages with no labeled speech. The method boosts low-resource performance, with only a slight drop in high-resource scenarios.
June 3, 2025 at 8:53 PM
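To make the idea in the post above concrete, here is a minimal sketch of how a phoneme-augmented chain-of-thought target could be serialized for training an LLM-based S2TT model. The step ordering (phonemes, then transcript, then translation), the tags, and all names are illustrative assumptions; the paper defines its own CoT format.

```python
# Illustrative sketch of a phoneme-augmented chain-of-thought target for S2TT.
# The field names and the <phonemes>/<transcript>/<translation> tags are
# assumptions for illustration, not the format used in the paper.

from dataclasses import dataclass


@dataclass
class CoTExample:
    phonemes: str      # intermediate step 1: phoneme recognition
    transcript: str    # intermediate step 2: source-language text
    translation: str   # final step: target-language text


def build_cot_target(example: CoTExample) -> str:
    """Serialize the intermediate steps so the model learns to emit
    phonemes first, then the transcript, then the translation."""
    return (
        f"<phonemes> {example.phonemes} "
        f"<transcript> {example.transcript} "
        f"<translation> {example.translation}"
    )


if __name__ == "__main__":
    ex = CoTExample(
        phonemes="b w ɔ n dʒ o r n o",   # hypothetical Italian utterance
        transcript="buongiorno",
        translation="good morning",
    )
    print(build_cot_target(ex))
```

The intuition is that phonemes are closer to a language-universal representation than orthographic text, which is what lets the intermediate step help languages with no labeled speech.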
In my first project at BSC, we worked on improving speech-to-text translation for low-resource languages. Our paper, "Speech-to-Text Translation with Phoneme-Augmented CoT", presents an LLM-based model that integrates phoneme recognition into the CoT approach.
June 3, 2025 at 8:53 PM
This research was conducted during my internship at Dolby Labs. A special thanks to Roy Fejgin, Chunghsin Yeh, Xiaoyu Liu, and Gautam Bhattacharya for their mentorship and collaboration.
December 31, 2024 at 7:48 PM
With this approach, we demonstrate that single-stage non-autoregressive (NAR) systems can perform competitively with more complex two-stage models, narrowing the gap in quality and intelligibility.
December 31, 2024 at 7:48 PM
Our system, NARSiS, integrates semantic and acoustic modeling into a unified, single-stage framework. Using Semantic Knowledge Distillation, we incorporate semantic guidance during training while keeping inference efficient.
December 31, 2024 at 7:48 PM
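For the Semantic Knowledge Distillation mentioned in the NARSiS post above, here is a minimal training-loss sketch: an auxiliary term pulls the single-stage model's intermediate features toward embeddings from a frozen semantic teacher, and is dropped at inference. The cosine-distance form, the loss weighting, and all function names are illustrative assumptions, not NARSiS specifics.

```python
# Illustrative sketch of semantic knowledge distillation for a single-stage
# NAR model: the total training loss combines the acoustic objective with a
# term that aligns student features to a frozen semantic teacher's embeddings.
# Loss form and weighting are assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F


def semantic_distillation_loss(student_feats: torch.Tensor,
                               teacher_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-distance distillation between student and teacher frame
    features, both shaped (batch, frames, dim)."""
    return (1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1)).mean()


def training_loss(acoustic_loss: torch.Tensor,
                  student_feats: torch.Tensor,
                  teacher_feats: torch.Tensor,
                  distill_weight: float = 0.5) -> torch.Tensor:
    """Total loss: acoustic objective plus weighted semantic guidance.
    Only the acoustic path is needed at inference time."""
    return acoustic_loss + distill_weight * semantic_distillation_loss(
        student_feats, teacher_feats
    )


if __name__ == "__main__":
    b, t, d = 2, 100, 256
    loss = training_loss(
        acoustic_loss=torch.tensor(1.23),
        student_feats=torch.randn(b, t, d),
        teacher_feats=torch.randn(b, t, d),  # e.g., from a frozen semantic encoder
    )
    print(loss.item())
```

Because the teacher is only consulted during training, inference keeps the efficiency of a single non-autoregressive pass.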