#DataPreprocessing
JMIR Formative Res: Preprocessing Large-Scale Conversational Datasets: A Framework and Its Application to Behavioral Health Transcripts #AI #DataScience #MachineLearning #SpeechRecognition #DataPreprocessing
Background: The rise of AI and accessible audio equipment has led to a proliferation of datasets of recorded conversation transcripts across various fields. However, automatic mass recording and transcription often produce noisy, unstructured data. First, these datasets naturally include unintended recordings, such as hallway conversations, background noise, and media (e.g., TV programs, radio, phone calls). Second, automatic speech recognition (ASR) and speaker diarization errors can result in misidentified words, speaker misattributions, and other transcription inaccuracies. As a result, large conversational transcript datasets require careful preprocessing and filtering to ensure their research utility. This challenge is particularly relevant in behavioral health contexts (e.g., therapy, treatment, counseling): while these transcripts offer valuable insights into patient-provider interactions, therapeutic techniques, and client progress, they must accurately represent the conversations to support meaningful research.
Objective: We present a framework for preprocessing and filtering large datasets of conversational transcripts and apply it to a dataset of behavioral health transcripts from community mental health clinics across the United States. Within this framework we explore tools to efficiently filter non-sessions: transcripts of recordings in these clinics that do not reflect a behavioral treatment session but instead capture unrelated conversations or background noise.
Methods: Our framework integrates basic feature extraction, human annotation, and advanced applications of large language models (LLMs). We begin by mapping transcription errors and assessing the distribution of sessions and non-sessions. Next, we identify key features and analyze how outliers in those features help characterize the type of transcript. Notably, we use LLM perplexity as a measure of comprehensibility to assess transcript noise levels. Finally, we use zero-shot LLM prompting to classify transcripts as sessions or non-sessions, validating LLM decisions against expert annotations. Throughout, we prioritize data security by selecting tools that preserve anonymity and minimize the risk of data breaches.
Results: Our findings demonstrated that outliers in basic statistical features, such as speaking rate, are associated with transcription errors and occur more frequently in non-sessions than in sessions. Specifically, LLM perplexity can flag fragmented and non-verbal segments and is generally lower in sessions (permutation test mean difference = -258, p…
dlvr.it
October 24, 2025 at 7:54 PM
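The abstract above uses LLM perplexity as a comprehensibility measure for flagging noisy, fragmented transcript segments. A minimal sketch of that idea, assuming a small locally hosted causal LM (GPT-2 here; the paper's model and any threshold are not specified in the abstract, so both are placeholders), run locally so transcripts never leave the secure environment:

```python
# Sketch: language-model perplexity as a transcript "comprehensibility" score.
# GPT-2 and the threshold below are illustrative assumptions, not the paper's choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the causal-LM perplexity of a text segment (lower = more predictable)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels makes the model return the mean token cross-entropy as `loss`.
        out = model(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))

# Coherent session-like speech should score lower than fragmented ASR noise.
session_like = "How have you been feeling since our last session together?"
noise_like = "uh the the tv on loud the uh yeah okay weather channel uh"
for segment in (session_like, noise_like):
    print(f"{perplexity(segment):10.1f}  {segment}")

# Illustrative filter: flag transcripts whose typical segment perplexity is extreme
# relative to the dataset's own distribution (the threshold is a placeholder).
PERPLEXITY_THRESHOLD = 400.0
```

Lower scores indicate text the model finds predictable (session-like dialogue); very high scores point at fragmented ASR output or background media worth filtering.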
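The Methods also mention zero-shot LLM prompting to label transcripts as sessions or non-sessions, validated against expert annotations. A hypothetical sketch of that step follows; the prompt wording, label parsing, and the `complete` backend (stubbed for demonstration) are assumptions, not the paper's actual prompt or model:

```python
# Sketch: zero-shot LLM classification of transcripts into sessions vs. non-sessions.
# The prompt text, labels, and `complete` backend are assumptions for illustration.
from typing import Callable

PROMPT_TEMPLATE = """You are reviewing transcripts recorded at a community mental health clinic.
Classify the transcript below as SESSION (a behavioral health treatment session)
or NON-SESSION (hallway talk, background media, phone calls, or noise).
Answer with exactly one word: SESSION or NON-SESSION.

Transcript:
{transcript}

Answer:"""

def classify_transcript(transcript: str, complete: Callable[[str], str]) -> str:
    """Build the zero-shot prompt, query the LLM backend, and parse its one-word label."""
    reply = complete(PROMPT_TEMPLATE.format(transcript=transcript)).strip().upper()
    if "NON-SESSION" in reply:   # check the longer label first ("SESSION" is its substring)
        return "NON-SESSION"
    if "SESSION" in reply:
        return "SESSION"
    return "NON-SESSION"         # conservative default when the reply is unparseable

# Stub backend so the sketch runs as-is; in practice this would wrap a locally
# hosted model so transcripts never leave the secure environment.
def fake_llm(prompt: str) -> str:
    return "SESSION" if "feeling" in prompt.lower() else "NON-SESSION"

print(classify_transcript("Therapist: How are you feeling since we last met?", fake_llm))
print(classify_transcript("[music] ...and now the weather forecast for tonight...", fake_llm))
```

In practice the predicted labels would then be compared against the expert annotations to estimate agreement, in line with the validation step the abstract describes.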
Data preprocessing in data analysis and machine learning is crucial for accurate results and better models #DataPreprocessing #MachineLearning https://machinelearningmastery.com/10-python-one-liners-for-feature-selection-like-a-pro/
machinelearningmastery.com
May 26, 2025 at 3:05 PM
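The linked MachineLearningMastery piece covers Python one-liners for feature selection; a representative example of that style using scikit-learn (the dataset and k value are illustrative, not necessarily from the article):

```python
# A one-liner in the article's spirit: score features with an ANOVA F-test and keep the top 10.
# Dataset and k are illustrative assumptions; the article's own examples may differ.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)  # the one-liner

print(X.shape, "->", X_selected.shape)  # (569, 30) -> (569, 10)
```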