Lightnews — Scholar-powered news

arXiv Sound

@arxiv-sound.bsky.social

Codec2Vec, a speech representation learning framework using discrete audio codec units, achieves competitive performance on the SUPERB benchmark with reduced storage and training time.

Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs

Wei-Cheng Tseng, David Harwath

arxiv.org

November 21, 2025 at 11:13 AM

arXiv Sound

@arxiv-sound.bsky.social

A transformer-based method adjusts MusicXML piano score difficulty using a synthetic dataset of paired scores; the dataset and models are released.

Difficulty-Controlled Simplification of Piano Scores with Synthetic Data for Inclusive Music Education

Pedro Ramoneda, Emilia Parada-Cabaleiro, Dasaem Jeong, Xavier Serra

arxiv.org

November 21, 2025 at 10:23 AM

arXiv Sound

@arxiv-sound.bsky.social

SUNAC, a source-aware neural audio codec, encodes individual sources from mixtures based on source-type prompts and achieves competitive resynthesis and separation quality at lower cost.

SUNAC: Source-aware Unified Neural Audio Codec

Ryo Aihara, Yoshiki Masuyama, Francesco Paissan, François G. Germain, Gordon Wichern, Jonathan Le Roux

arxiv.org

November 21, 2025 at 9:33 AM

arXiv Sound

@arxiv-sound.bsky.social

SceneGuard protects voices by adding scene-consistent background noise during training; effective against text-to-speech training attacks.

SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise

Rui Sang, Yuxuan Liu

arxiv.org

November 21, 2025 at 8:43 AM

arXiv Sound

@arxiv-sound.bsky.social

Speech-LLM, trained on <20s audio with speaker supervision, performs streamable joint ASR and diarization on long audio using a Speaker Prompt Cache, outperforming Sortformer and DiarizationLM.

Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio

Mohan Shi, Xiong Xiao, Ruchao Fan, Shaoshi Ling, Jinyu Li

arxiv.org

November 21, 2025 at 7:53 AM

arXiv Sound

@arxiv-sound.bsky.social

Generalized WOLA filter bank repositions subband filters, enhancing subband system identification; PT-WOLA implementation maintains complexity; MSE performance analyzed.

A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification

Mohit Sharma (Department of Electrical Engineering), Robbe Van Rompaey (Nokia Bell Labs, Antwerp, Belgium), Wouter Lanneer (Nokia Bell Labs, Antwerp, Belgium), Marc Moonen (Department of Electrical Engineering)

arxiv.org

November 21, 2025 at 7:03 AM

arXiv Sound

@arxiv-sound.bsky.social

CustNetGC, a CNN with Custom Network Grad-CAM and CatBoost, uses spectral features (L-mHP, Spectral Slopes) from voice to predict Parkinson's Disease with 99.06% accuracy.

A Novel CustNetGC Boosted Model with Spectral Features for Parkinson's Disease Prediction

Abishek Karthik, Pandiyaraju V, Dominic Savio M, Rohit Swaminathan S

arxiv.org

November 20, 2025 at 11:33 AM

arXiv Sound

@arxiv-sound.bsky.social

LargeSHS, a large-scale dataset of music adaptations from SecondHandSongs, contains 1.7M metadata entries and 900k audio links, enabling research in cover song generation.

LargeSHS: A large-scale dataset of music adaptation

Chih-Pin Tan, Hsuan-Kai Kao, Li Su, Yi-Hsuan Yang

arxiv.org

November 20, 2025 at 11:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Auden-Voice, a general-purpose voice encoder, balances identity and paralinguistic cues through multi-task training, demonstrating strong performance with LLMs.

Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding

Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu

arxiv.org

November 20, 2025 at 10:33 AM

arXiv Sound

@arxiv-sound.bsky.social

CASTELLA, a large-scale human-annotated audio dataset for audio moment retrieval, is introduced; fine-tuning a model on CASTELLA improved performance by 10.4 points.

CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, Tatsuya Komatsu

arxiv.org

November 20, 2025 at 10:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Paper advocates for preference alignment in music generation, highlighting challenges in temporal coherence and subjective quality assessment; techniques like MusicRL and DiffRhythm+ are discussed.

Aligning Generative Music AI with Human Preferences: Methods and Challenges

Dorien Herremans, Abhinaba Roy

arxiv.org

November 20, 2025 at 9:33 AM

arXiv Sound

@arxiv-sound.bsky.social

Quality control pipeline implemented for MELD and IEMOCAP datasets; transfer learning from speaker and face recognition, with MAMBA fusion, achieved 64.8% accuracy.

Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion

Zanxu Wang, Homayoon Beigi

arxiv.org

November 20, 2025 at 9:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Fine-tuning Audio-MAE and PANNs for COVID-19 detection showed limited generalization despite demographic stratification; small dataset sizes hinder deep learning performance.

Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report

Daniel Oliveira de Brito, Letícia Gabriella de Souza, Marcelo Matheus Gauy, Marcelo Finger, Arnaldo Candido Junior

arxiv.org

November 20, 2025 at 8:33 AM

arXiv Sound

@arxiv-sound.bsky.social

SpotlightTTS enhances expressive TTS using voiced-aware style extraction and style direction adjustment, improving expressiveness and speech quality over baselines.

Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech

Nam-Gyu Kim

arxiv.org

November 20, 2025 at 8:03 AM

arXiv Sound

@arxiv-sound.bsky.social

IHearYou is a framework that links voice features to DSM-5 depression indicators and runs locally for privacy; validated on the DAIC-WOZ dataset, it shows consistent feature-indicator associations.

IHearYou: Linking Acoustic Features to DSM-5 Depressive Behavior Indicators

Jonas Länzlinger, Katharina Müller, Bruno Rodrigues

arxiv.org

November 20, 2025 at 7:33 AM

arXiv Sound

@arxiv-sound.bsky.social

OBHS compresses audio by block-wise Huffman coding with canonical code representation and fallback mechanisms, achieving up to 93.6% compression with low complexity.

OBHS: An Optimized Block Huffman Scheme for Real-Time Audio Compression

Muntahi Safwan Mahfi, Md. Manzurul Hasan, Gahangir Hossain

arxiv.org

November 20, 2025 at 7:03 AM
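
The summary above mentions block-wise Huffman coding, a canonical code representation, and a fallback mechanism. The following is a minimal Python sketch of those three general ideas, not the OBHS implementation: the function names, the bit-string output, and the cost model (2 bytes per code-table entry) are assumptions made for illustration only.

```python
import heapq
from collections import Counter

def code_lengths(block: bytes) -> dict[int, int]:
    """Huffman code length for each byte value that occurs in the block."""
    freq = Counter(block)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap items: (weight, smallest symbol in subtree as tiebreak, symbols in subtree).
    heap = [(w, s, [s]) for s, w in freq.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        w1, t1, s1 = heapq.heappop(heap)
        w2, t2, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every merge adds one bit to all symbols below it
        heapq.heappush(heap, (w1 + w2, min(t1, t2), s1 + s2))
    return lengths

def canonical_codes(lengths: dict[int, int]) -> dict[int, str]:
    """Assign canonical codes: sort by (length, symbol), count upward."""
    codes, code, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes

def encode_block(block: bytes):
    """Return ('huff', lengths, bits), or ('raw', block) when coding does not pay off."""
    if not block:
        return ("raw", block)
    lengths = code_lengths(block)
    codes = canonical_codes(lengths)
    bits = "".join(codes[b] for b in block)
    # Assumed cost model: payload bytes plus 2 bytes per table entry.
    coded_bytes = (len(bits) + 7) // 8 + 2 * len(lengths)
    if coded_bytes >= len(block):
        return ("raw", block)  # fallback: store the block uncompressed
    return ("huff", lengths, bits)

def decode_block(encoded) -> bytes:
    """Invert encode_block; only the code lengths are needed to rebuild the codes."""
    if encoded[0] == "raw":
        return encoded[1]
    _, lengths, bits = encoded
    inverse = {c: s for s, c in canonical_codes(lengths).items()}
    out, cur = bytearray(), ""
    for bit in bits:
        cur += bit
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return bytes(out)
```

Because the codes are canonical, a decoder can rebuild them from the code lengths alone, which keeps the per-block table small. A skewed block such as `b"aaaaaaaabbbbccd" * 8` is Huffman-coded under this sketch, while a block containing every byte value once falls back to raw storage.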

arXiv Sound

@arxiv-sound.bsky.social

CPFG-Net, a conditional variational autoencoder, controllably predicts perceptual features and tonal structures from melodies, generating harmonically coherent chord progressions based on a new dataset.

A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder

Dengyun Huang, Yonghua Zhu

arxiv.org

November 19, 2025 at 11:33 AM

arXiv Sound

@arxiv-sound.bsky.social

IMSE replaces MET with Amplitude-Aware Linear Attention (MALA) and DE with Inception Depthwise Convolution (IDConv), reducing parameters by 16.8% compared to MUSE while maintaining performance.

IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention

Xinxin Tang, Bin Qin, Yufang Li

arxiv.org

November 19, 2025 at 11:03 AM

arXiv Sound

@arxiv-sound.bsky.social

TTA model, trained on 358k hours of speech data across ASR/ST and speech-text alignment tasks, produces robust cross-lingual speech representations; outperforms Whisper.

TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation

Wei Liu, Jiahong Li, Yiwen Shao, Dong Yu

arxiv.org

November 19, 2025 at 10:33 AM

arXiv Sound

@arxiv-sound.bsky.social

AQA system uses BEATs for feature extraction and Qwen2.5-7B-Instruct fine-tuned with GRPO for audio question answering, achieving 62.6% accuracy in the DCASE 2025 Challenge.

Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions

Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre

arxiv.org

November 19, 2025 at 10:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Segmentwise pruning in audio-language models reduces computing costs by selectively retaining tokens; a time-aware strategy prunes 75% of tokens with at most a 2% decrease in performance.

Segmentwise Pruning in Audio-Language Models

Marcel Gibier, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre

arxiv.org

November 19, 2025 at 9:33 AM
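
One way to read "segmentwise" token pruning is that the token sequence is split into time segments and a fixed share of tokens is kept inside each segment, so the retained tokens stay spread over time. The sketch below illustrates that reading only. It is an assumption, not the paper's method: the function name `segmentwise_prune`, the per-token `scores` (for example some importance estimate), and the equal-length segmentation are all hypothetical.

```python
def segmentwise_prune(scores, num_segments, keep_ratio):
    """Return indices of kept tokens, in temporal order.

    scores: one importance value per token (hypothetical input).
    num_segments: number of contiguous time segments to split the sequence into.
    keep_ratio: fraction of tokens to keep within each segment.
    """
    n = len(scores)
    kept = []
    for seg in range(num_segments):
        start = seg * n // num_segments
        end = (seg + 1) * n // num_segments
        seg_len = end - start
        if seg_len == 0:
            continue
        # Keep at least one token per non-empty segment.
        k = max(1, round(seg_len * keep_ratio))
        top = sorted(range(start, end), key=lambda i: scores[i], reverse=True)[:k]
        kept.extend(sorted(top))  # restore temporal order inside the segment
    return kept


scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.05, 0.0]
kept = segmentwise_prune(scores, num_segments=2, keep_ratio=0.25)
print(kept)  # [0, 5]
```

With `keep_ratio=0.25`, 75% of the tokens are dropped. A global top-2 selection on the same scores would keep tokens 0 and 1, both from the first half of the sequence, whereas the per-segment selection keeps one token from each half.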

arXiv Sound

@arxiv-sound.bsky.social

CountEM, a novel AMT framework, uses note event histograms for supervision, refining predictions iteratively via Expectation-Maximization and reducing the need for local alignment.

Count The Notes: Histogram-Based Supervision for Automatic Music Transcription

Jonathan Yaffe, Ben Maman, Meinard Müller, Amit H. Bermano

arxiv.org

November 19, 2025 at 9:03 AM

arXiv Sound

@arxiv-sound.bsky.social

FxSearcher, a gradient-free framework, uses Bayesian Optimization and a CLAP-based score function to find the best audio effect configurations based on a text prompt, preventing artifacts with a guiding prompt.

FxSearcher: gradient-free text-driven audio transformation

Hojoon Ki, Jongsuk Kim, Minchan Kwon, Junmo Kim

arxiv.org

November 19, 2025 at 8:33 AM

arXiv Sound

@arxiv-sound.bsky.social

A systematic review of audio papers reveals preference learning is underexplored despite challenges in evaluating generative models; only 6% of papers consider preference learning.

Preference-Based Learning in Audio Applications: A Systematic Analysis

Aaron Broukhim, Yiran Shen, Prithviraj Ammanabrolu, Nadir Weibel

arxiv.org

November 19, 2025 at 8:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Principled Coarse-Graining (PCG) verifies speculative decoding proposals at the level of Acoustic Similarity Groups (ASGs), increasing acceptance and throughput on LibriTTS while maintaining intelligibility.

Principled Coarse-Grained Acceptance for Speculative Decoding in Speech

Moran Yanuka, Paul Dixon, Eyal Finkelshtein, Daniel Rotman, Raja Giryes

arxiv.org

November 19, 2025 at 7:33 AM
