arXiv Sound
arxiv-sound.bsky.social
Automated posting of sound-related articles uploaded to arxiv.org (eess.AS + cs.SD)

Source: https://github.com/dsuedholt/bsky-paperbot-sound/

Inspired by @paperposterbot.bsky.social and https://twitter.com/ArxivSound
Codec2Vec, a speech representation learning framework using discrete audio codec units, achieves competitive performance on the SUPERB benchmark with reduced storage and training time.
Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs
Wei-Cheng Tseng, David Harwath
arxiv.org
November 21, 2025 at 11:13 AM
A transformer-based method adjusts MusicXML piano score difficulty using a synthetic dataset of paired scores; the dataset and models are released.
Difficulty-Controlled Simplification of Piano Scores with Synthetic Data for Inclusive Music Education
Pedro Ramoneda, Emilia Parada-Cabaleiro, Dasaem Jeong, Xavier Serra
arxiv.org
November 21, 2025 at 10:23 AM
SUNAC, a source-aware neural audio codec, encodes individual sources from mixtures based on source type prompts; achieves competitive resynthesis/separation quality with lower cost.
SUNAC: Source-aware Unified Neural Audio Codec
Ryo Aihara, Yoshiki Masuyama, Francesco Paissan, François G. Germain, Gordon Wichern, Jonathan Le Roux
arxiv.org
November 21, 2025 at 9:33 AM
SceneGuard protects voices by adding scene-consistent background noise during training; effective against text-to-speech training attacks.
SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise
Rui Sang, Yuxuan Liu
arxiv.org
November 21, 2025 at 8:43 AM
Speech-LLM, trained on <20s audio with speaker supervision, performs streamable joint ASR and diarization on long audio using a Speaker Prompt Cache, outperforming Sortformer and DiarizationLM.
Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio
Mohan Shi, Xiong Xiao, Ruchao Fan, Shaoshi Ling, Jinyu Li
arxiv.org
November 21, 2025 at 7:53 AM
Generalized WOLA filter bank repositions subband filters, enhancing subband system identification; PT-WOLA implementation maintains complexity; MSE performance analyzed.
A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification
Mohit Sharma (Department of Electrical Engineering), Robbe Van Rompaey (Nokia Bell Labs, Antwerp, Belgium), Wouter Lanneer (Nokia Bell Labs, Antwerp, Belgium), Marc Moonen (Department of Electrical Engineering)
arxiv.org
November 21, 2025 at 7:03 AM
CustNetGC, a CNN with Custom Network Grad-CAM and CatBoost, uses spectral features (L-mHP, Spectral Slopes) from voice to predict Parkinson's Disease with 99.06% accuracy.
A Novel CustNetGC Boosted Model with Spectral Features for Parkinson's Disease Prediction
Abishek Karthik, Pandiyaraju V, Dominic Savio M, Rohit Swaminathan S
arxiv.org
November 20, 2025 at 11:33 AM
LargeSHS, a large-scale dataset of music adaptations from SecondHandSongs, contains 1.7M metadata entries and 900k audio links, enabling research in cover song generation.
LargeSHS: A large-scale dataset of music adaptation
Chih-Pin Tan, Hsuan-Kai Kao, Li Su, Yi-Hsuan Yang
arxiv.org
November 20, 2025 at 11:03 AM
Auden-Voice, a general-purpose voice encoder, balances identity and paralinguistic cues through multi-task training, demonstrating strong performance with LLMs.
Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding
Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu
arxiv.org
November 20, 2025 at 10:33 AM
CASTELLA, a large-scale human-annotated audio dataset for audio moment retrieval, is introduced; fine-tuning a model on CASTELLA improved performance by 10.4 points.
CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries
Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, Tatsuya Komatsu
arxiv.org
November 20, 2025 at 10:03 AM
Paper advocates for preference alignment in music generation, highlighting challenges in temporal coherence and subjective quality assessment; techniques like MusicRL and DiffRhythm+ are discussed.
Aligning Generative Music AI with Human Preferences: Methods and Challenges
Dorien Herremans, Abhinaba Roy
arxiv.org
November 20, 2025 at 9:33 AM
Quality control pipeline implemented for MELD and IEMOCAP datasets; transfer learning from speaker and face recognition, with MAMBA fusion, achieved 64.8% accuracy.
Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion
Zanxu Wang, Homayoon Beigi
arxiv.org
November 20, 2025 at 9:03 AM
Fine-tuning Audio-MAE and PANNs for COVID-19 detection showed limited generalization despite demographic stratification; small dataset sizes hinder deep learning performance.
Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report
Daniel Oliveira de Brito, Letícia Gabriella de Souza, Marcelo Matheus Gauy, Marcelo Finger, Arnaldo Candido Junior
arxiv.org
November 20, 2025 at 8:33 AM
SpotlightTTS enhances expressive TTS using voiced-aware style extraction and style direction adjustment, improving expressiveness and speech quality over baselines.
Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
Nam-Gyu Kim
arxiv.org
November 20, 2025 at 8:03 AM
IHearYou, a framework that runs locally for privacy, detects depression by linking voice features to DSM-5 indicators; validated on the DAIC-WOZ dataset, it shows consistent feature-indicator associations.
IHearYou: Linking Acoustic Features to DSM-5 Depressive Behavior Indicators
Jonas Länzlinger, Katharina Müller, Bruno Rodrigues
arxiv.org
November 20, 2025 at 7:33 AM
OBHS compresses audio by block-wise Huffman coding with canonical code representation and fallback mechanisms, achieving up to 93.6% compression with low complexity.
OBHS: An Optimized Block Huffman Scheme for Real-Time Audio Compression
Muntahi Safwan Mahfi, Md. Manzurul Hasan, Gahangir Hossain
arxiv.org
November 20, 2025 at 7:03 AM
CPFG-Net, a conditional variational autoencoder, controllably predicts perceptual features and tonal structures from melodies, generating harmonically coherent chord progressions based on a new dataset.
A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder
Dengyun Huang, Yonghua Zhu
arxiv.org
November 19, 2025 at 11:33 AM
IMSE replaces MET with Amplitude-Aware Linear Attention (MALA) and DE with Inception Depthwise Convolution (IDConv), reducing parameters by 16.8% compared to MUSE while maintaining performance.
IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention
Xinxin Tang, Bin Qin, Yufang Li
arxiv.org
November 19, 2025 at 11:03 AM
TTA model, trained on 358k hours of speech data across ASR/ST and speech-text alignment tasks, produces robust cross-lingual speech representations; outperforms Whisper.
TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation
Wei Liu, Jiahong Li, Yiwen Shao, Dong Yu
arxiv.org
November 19, 2025 at 10:33 AM
AQA system uses BEATs for feature extraction and Qwen2.5-7B-Instruct fine-tuned with GRPO for audio question answering, achieving 62.6 accuracy in the DCASE 2025 Challenge.
Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions
Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre
arxiv.org
November 19, 2025 at 10:03 AM
Segmentwise pruning in audio-language models reduces compute cost by selectively retaining tokens; a time-aware strategy loses at most 2% in performance while pruning 75% of tokens.
Segmentwise Pruning in Audio-Language Models
Marcel Gibier, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre
arxiv.org
November 19, 2025 at 9:33 AM
CountEM, a novel AMT framework, uses note event histograms for supervision, refining predictions iteratively via Expectation-Maximization and reducing the need for local alignment.
Count The Notes: Histogram-Based Supervision for Automatic Music Transcription
Jonathan Yaffe, Ben Maman, Meinard Müller, Amit H. Bermano
arxiv.org
November 19, 2025 at 9:03 AM
FxSearcher, a gradient-free framework, uses Bayesian Optimization and a CLAP-based score function to find the best audio effect configurations based on a text prompt, preventing artifacts with a guiding prompt.
FxSearcher: gradient-free text-driven audio transformation
Hojoon Ki, Jongsuk Kim, Minchan Kwon, Junmo Kim
arxiv.org
November 19, 2025 at 8:33 AM
A systematic review of audio papers reveals preference learning is underexplored despite challenges in evaluating generative models; only 6% of papers consider preference learning.
Preference-Based Learning in Audio Applications: A Systematic Analysis
Aaron Broukhim, Yiran Shen, Prithviraj Ammanabrolu, Nadir Weibel
arxiv.org
November 19, 2025 at 8:03 AM
Principled Coarse-Graining (PCG) verifies speculative decoding proposals at the level of Acoustic Similarity Groups (ASGs), increasing acceptance and throughput on LibriTTS while maintaining intelligibility.
Principled Coarse-Grained Acceptance for Speculative Decoding in Speech
Moran Yanuka, Paul Dixon, Eyal Finkelshtein, Daniel Rotman, Raja Giryes
arxiv.org
November 19, 2025 at 7:33 AM