Linguistic Data Consortium
@ldcupenn.bsky.social
LDC creates and distributes language resources to universities, labs, companies and libraries for linguistic education, research and technology development.
MATERIAL Swahili-English Language Pack has 112 hours of Swahili conversational telephone speech, transcripts, English translations, annotations and queries designed to support cross language information retrieval bit.ly/49SWG3R
January 23, 2026 at 4:45 PM
CALLHOME Japanese Lexicon Second Edition: morphological, phonological and stress information for 80,688 Japanese words from transcripts of telephone conversations between native Japanese speakers, along with a pronunciation dictionary and G2P tools bit.ly/3NlxvhC
January 22, 2026 at 3:28 PM
CALLHOME Japanese Second Edition brings original speech and transcript datasets up to date with new transcripts and revised directories, file formats and documentation bit.ly/49kSdqz
January 21, 2026 at 3:27 PM
LDC welcomes 2026 with its January newsletter featuring three publications and membership renewal information ldc-upenn.blogspot.com
January 20, 2026 at 3:20 PM
LORELEI Sinhala Incident Language Pack: monolingual and parallel text, annotations, software tools and more for human language technology development in this under-resourced language bit.ly/4iVnJP1
December 18, 2025 at 3:46 PM
2021 NIST SRE Test Set: 447 hours of Cantonese, Mandarin, and English conversational telephone speech, audio from video, and selfie image data for development and test, along with answer keys, enrollment, trial files and documentation bit.ly/4q35JV4
December 17, 2025 at 3:43 PM
Check out LDC’s December newsletter for the latest news and publications and join us in celebrating the release of our 1000th corpus! ldc-upenn.blogspot.com
December 16, 2025 at 3:38 PM
Check out ISCA-SAC’s Speech Pitch podcast (#18.9) to hear from LDC’s Denise DiPersio. This session was recorded during Interspeech 2025. Listen to Denise talk about LDC’s past, present and future and LDC’s involvement in Interspeech since the 2009 conference in Brighton. tinyurl.com/488rske4
#18.9 Interspeech 2025 Impressions - Denise Dipersio
Meet Denise DiPersio, Associate Director at the Linguistic Data Consortium, sharing her experience with us. Host: Pascal Hecker. Post-production: Wei Xue.
December 5, 2025 at 3:00 PM
LORELEI Ilocano Incident Language Pack: monolingual and parallel text, annotations, software tools and more for human language technology development in this under-resourced language bit.ly/43moVEw
November 20, 2025 at 2:58 PM
AnnoDIFP CTS Audio and Transcripts: 242.52 hours of English telephone audio and transcripts from 1179 calls involving 327 participants, paired with scores from two self-reported personality assessments bit.ly/47J6JHX
November 19, 2025 at 3:13 PM
LDC’s November newsletter has details on 2026 membership renewal, the spring data scholarship deadline and two new publications ldc-upenn.blogspot.com
November 18, 2025 at 2:46 PM
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations: transcripts and English translations for 116 hours of BOLT CTS telephone recordings; all speech was transcribed; 99% of the transcripts were translated bit.ly/4ockuEo
October 21, 2025 at 1:29 PM
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio: 116 hours of telephone speech from 274 conversations between native speakers; developed by LDC for the DARPA BOLT program; contains previously unexposed calls from the CF/CH collections bit.ly/42rsg4S
October 20, 2025 at 2:31 PM
KAIROS Phase 2 Quizlet contains English and Spanish web data annotated for events, relations and arguments, a reference knowledge graph and a knowledge base; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3WqvYYR
October 17, 2025 at 2:47 PM
See LDC’s October newsletter for a preview of 2026 publications, fall data scholarship recipients and three new publications ldc-upenn.blogspot.com
October 16, 2025 at 3:09 PM
More LDC data in the LORELEI series: LORELEI Hindi Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/4nCp3ar
September 22, 2025 at 8:37 PM
AIDA Scenario 1 Evaluation Topic Source Data, Annotation & Assessment: 10k+ English, Russian & Ukrainian web docs on political relations between Russia & Ukraine in the 2010s annotated for entities & cross-reference, w/ judgments for scoring submissions bit.ly/3K7ynoA
September 22, 2025 at 4:00 PM
Mixer 7 English Speech has 12,321 hours of telephone conversations, interviews and transcript readings from 222 English speakers, some collected using a 14-microphone array; speaker metadata is included bit.ly/4nvSYkG
September 19, 2025 at 3:24 PM
Check out our September newsletter for three new LDC publications: Mixer 7 English Speech, AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment, and LORELEI Hindi Representative Language Pack ldc-upenn.blogspot.com
September 18, 2025 at 3:08 PM
KAIROS Phase 1 Quizlet contains English and Spanish web data annotated for events, relations and arguments and a reference knowledge graph; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3HvDU7k
August 26, 2025 at 6:39 PM
Abstract Meaning Representation 2.0 - Machine Translations translates 1,371 English sentences from LDC’s AMR 2.0 corpus into Spanish, German, Italian and Mandarin Chinese using Google Translate bit.ly/4n1m8bp
August 26, 2025 at 2:50 PM
Mixer 6 - CHiME 8 Transcribed Calls and Interviews: 80 hours of Mixer 6 English interviews and telephone speech across 13 channels (1063 hours) with transcripts divided into training, development and test sets bit.ly/4oyUCn5
August 25, 2025 at 6:33 PM
LDC’s August newsletter has the last call for fall data scholarship applications and details on new publications: Mixer 6 - CHiME 8 Transcribed Calls and Interviews, Abstract Meaning Representation 2.0 - Machine Translations and KAIROS Phase 1 Quizlet ldc-upenn.blogspot.com
August 25, 2025 at 1:09 PM
What a great conference #Interspeech2025! There is still time to stop by our booth and grab a limited-edition TIMIT word poetry magnet. Also don’t miss our colleague’s oral session on TELVID: A multilingual, multi-modal corpus for speaker recognition at 13:30, A04, Port 1A @interspeech.bsky.social
August 21, 2025 at 9:40 AM
Good morning, #Interspeech2025! Stop by our booth during the coffee breaks today to say hello. Also don't miss today's special session co-organized by LDC on Challenges in Speech Collection, Curation and Annotation, in two parts beginning at 13:30, Dock 15. @interspeech.bsky.social
August 20, 2025 at 7:11 AM