Linguistic Data Consortium
@ldcupenn.bsky.social
LDC creates and distributes language resources to universities, labs, companies and libraries for linguistic education, research and technology development.
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations: transcripts and English translations for 116 hours of BOLT CTS telephone recordings; all speech was transcribed; 99% of the transcripts were translated bit.ly/4ockuEo
October 21, 2025 at 1:29 PM
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio: 116 hours of telephone speech from 274 conversations between native speakers; developed by LDC for the DARPA BOLT program; contains previously unexposed calls from the CF/CH collections bit.ly/42rsg4S
October 20, 2025 at 2:31 PM
KAIROS Phase 2 Quizlet contains English and Spanish web data annotated for events, relations and arguments, a reference knowledge graph and a knowledge base; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3WqvYYR
October 17, 2025 at 2:47 PM
See LDC’s October newsletter for a preview of 2026 publications, fall data scholarship recipients and three new publications ldc-upenn.blogspot.com
October 16, 2025 at 3:09 PM
More LDC data in the LORELEI series: LORELEI Hindi Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/4nCp3ar
September 22, 2025 at 8:37 PM
AIDA Scenario 1 Evaluation Topic Source Data, Annotation & Assessment: 10k+ English, Russian & Ukrainian web docs on political relations between Russia & Ukraine in the 2010s annotated for entities & cross-reference, w/ judgments for scoring submissions bit.ly/3K7ynoA
September 22, 2025 at 4:00 PM
Mixer 7 English Speech has 12,321 hours of telephone conversations, interviews and transcript readings from 222 English speakers, some collected using a 14-microphone array; speaker metadata is included bit.ly/4nvSYkG
September 19, 2025 at 3:24 PM
Check out our September newsletter for three new LDC publications: Mixer 7 English Speech, AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment, and LORELEI Hindi Representative Language Pack ldc-upenn.blogspot.com
September 18, 2025 at 3:08 PM
KAIROS Phase 1 Quizlet contains English and Spanish web data annotated for events, relations and arguments and a reference knowledge graph; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3HvDU7k
August 26, 2025 at 6:39 PM
Abstract Meaning Representation 2.0 - Machine Translations translates 1,371 English sentences from LDC’s AMR 2.0 corpus into Spanish, German, Italian and Mandarin Chinese using Google Translate bit.ly/4n1m8bp
August 26, 2025 at 2:50 PM
Mixer 6 - CHiME 8 Transcribed Calls and Interviews: 80 hours of Mixer 6 English interviews and telephone speech across 13 channels (1063 hours) with transcripts divided into training, development and test sets bit.ly/4oyUCn5
August 25, 2025 at 6:33 PM
LDC’s August newsletter has the last call for fall data scholarship applications and details on new publications: Mixer 6 CHiME 8 Transcribed Calls and Interviews, Abstract Meaning Representation 2.0 – Machine Translations and KAIROS Phase 1 Quizlet ldc-upenn.blogspot.com
August 25, 2025 at 1:09 PM
What a great conference #Interspeech2025! There is still time to stop by our booth and grab a limited-edition TIMIT word poetry magnet. Also don’t miss our colleague’s oral session on TELVID: A multilingual, multi-modal corpus for speaker recognition at 13:30, A04, Port 1A @interspeech.bsky.social
August 21, 2025 at 9:40 AM
Good morning #Interspeech2025! Stop by our booth during the coffee breaks today to say hello. Also don't miss today's special session co-organized by LDC on Challenges in Speech Collection, Curation and Annotation in two parts beginning at 13:30, Dock 15. @interspeech.bsky.social
August 20, 2025 at 7:11 AM
Good morning Interspeech. It's a great second day. Come by and grab one of our limited giveaways. @interspeech.bsky.social
#Interspeech2025
August 19, 2025 at 7:22 AM
We are excited to be here at Interspeech 2025 @interspeech.bsky.social Come see us at the first coffee break today to learn more about the latest developments at LDC. #Interspeech2025
August 18, 2025 at 8:11 AM
LDC will be exhibiting at #Interspeech2025, August 17-21 in Rotterdam. Stop by our booth to say hello and learn about the latest developments at the Consortium. LDC work will also be featured in presentations, posters and a special session. We look forward to seeing you there. www.interspeech2025.org
August 12, 2025 at 3:51 PM
From the LORELEI companion project: LoReHLT Uzbek Representative Language Pack features monolingual and parallel text, annotations, audio recordings, software tools and more for human language technology development to address emergent situations bit.ly/4lL0zuL
July 22, 2025 at 2:08 PM
Penn Parsed Corpora of Historical English Second Release: POS-tagged & syntactically annotated British English text (1100 CE - 1914 CE); updates the 2020 release with new annotation, revised guidelines, philological information & the Corpus2 search tool bit.ly/46zR1hR
July 18, 2025 at 2:33 PM
AnnoDIFP Session Audio and Transcripts: 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments bit.ly/4nEYQJr
July 17, 2025 at 3:15 PM
Check out the July newsletter for Fall 2025 data scholarship application deadlines & 3 new publications: AnnoDIFP Session Audio and Transcripts, Penn Parsed Corpora of Historical English Second Release & LoReHLT Uzbek Representative Language Pack ldc-upenn.blogspot.com
July 16, 2025 at 2:28 PM
KAIROS Schema Learning Complex Event Annotation has English and Spanish web text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance bit.ly/4jNrDIq
June 25, 2025 at 1:06 PM
IWSLT 2022 - 2023 Shared Task Training, Development and Test Set: 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation used in IWSLT dialectal speech and low resource tracks bit.ly/3HEO4lL
June 24, 2025 at 2:24 PM
Chinese Sentence Pattern Structure Treebank contains 5,016 sentences and 119,627 tokens from modern and ancient Chinese works annotated for lexical sense, syntactic structure and inter-clause relations bit.ly/4kZVGh3
June 23, 2025 at 1:56 PM
LDC’s June newsletter has the latest on three new publications: Chinese Sentence Pattern Structure Treebank, IWSLT 2022-2023 Shared Task Training, Development and Test Set, and KAIROS Schema Learning Complex Event Annotation ldc-upenn.blogspot.com
June 17, 2025 at 1:38 PM