Gerard I. Gállego
@geiongallego.bsky.social
Excited to share that this work was accepted to Interspeech 2025. See you in Rotterdam!
Preprint: arxiv.org/abs/2505.24691
Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios
We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource setting...
June 3, 2025 at 8:53 PM
By adding phoneme recognition as an intermediate step, we improve cross-lingual transfer, even for languages with no labeled speech. The method boosts low-resource performance, with only a slight drop in high-resource scenarios.
June 3, 2025 at 8:53 PM
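To make the idea in the post above concrete, here is a minimal sketch of how a phoneme-augmented chain-of-thought target could be serialized for training an LLM-based S2TT model. The step ordering (phonemes, then transcript, then translation), the tags, and all names are illustrative assumptions; the paper defines its own CoT format.

```python
# Illustrative sketch of a phoneme-augmented chain-of-thought target for S2TT.
# The field names and the <phonemes>/<transcript>/<translation> tags are
# assumptions for illustration, not the format used in the paper.

from dataclasses import dataclass


@dataclass
class CoTExample:
    phonemes: str      # intermediate step 1: phoneme recognition
    transcript: str    # intermediate step 2: source-language text
    translation: str   # final step: target-language text


def build_cot_target(example: CoTExample) -> str:
    """Serialize the intermediate steps so the model learns to emit
    phonemes first, then the transcript, then the translation."""
    return (
        f"<phonemes> {example.phonemes} "
        f"<transcript> {example.transcript} "
        f"<translation> {example.translation}"
    )


if __name__ == "__main__":
    ex = CoTExample(
        phonemes="b w ɔ n dʒ o r n o",   # hypothetical Italian utterance
        transcript="buongiorno",
        translation="good morning",
    )
    print(build_cot_target(ex))
```

The intuition is that phonemes are closer to a language-universal representation than orthographic text, which is what lets the intermediate step help languages with no labeled speech.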
In my first project at BSC, we worked on improving speech-to-text translation for low-resource languages. Our paper, "Speech-to-Text Translation with Phoneme-Augmented CoT", presents an LLM-based model that integrates phoneme recognition into the CoT approach.
June 3, 2025 at 8:53 PM
This research was conducted during my internship at Dolby Labs. A special thanks to Roy Fejgin, Chunghsin Yeh, Xiaoyu Liu, and Gautam Bhattacharya for their mentorship and collaboration.
December 31, 2024 at 7:48 PM
With this approach, we demonstrate that single-stage non-autoregressive (NAR) systems can perform competitively with more complex two-stage models, narrowing the gap in quality and intelligibility.
December 31, 2024 at 7:48 PM
Our system, NARSiS, integrates semantic and acoustic modeling into a unified, single-stage framework. Using Semantic Knowledge Distillation, we incorporate semantic guidance during training while keeping inference efficient.
December 31, 2024 at 7:48 PM
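For the Semantic Knowledge Distillation mentioned in the NARSiS post above, here is a minimal training-loss sketch: an auxiliary term pulls the single-stage model's intermediate features toward embeddings from a frozen semantic teacher, and is dropped at inference. The cosine-distance form, the loss weighting, and all function names are illustrative assumptions, not NARSiS specifics.

```python
# Illustrative sketch of semantic knowledge distillation for a single-stage
# NAR model: the total training loss combines the acoustic objective with a
# term that aligns student features to a frozen semantic teacher's embeddings.
# Loss form and weighting are assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F


def semantic_distillation_loss(student_feats: torch.Tensor,
                               teacher_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-distance distillation between student and teacher frame
    features, both shaped (batch, frames, dim)."""
    return (1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1)).mean()


def training_loss(acoustic_loss: torch.Tensor,
                  student_feats: torch.Tensor,
                  teacher_feats: torch.Tensor,
                  distill_weight: float = 0.5) -> torch.Tensor:
    """Total loss: acoustic objective plus weighted semantic guidance.
    Only the acoustic path is needed at inference time."""
    return acoustic_loss + distill_weight * semantic_distillation_loss(
        student_feats, teacher_feats
    )


if __name__ == "__main__":
    b, t, d = 2, 100, 256
    loss = training_loss(
        acoustic_loss=torch.tensor(1.23),
        student_feats=torch.randn(b, t, d),
        teacher_feats=torch.randn(b, t, d),  # e.g., from a frozen semantic encoder
    )
    print(loss.item())
```

Because the teacher is only consulted during training, inference keeps the efficiency of a single non-autoregressive pass.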