Lightnews — Scholar-powered news

@ldcupenn.bsky.social

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations: transcripts and English translations for 116 hours of BOLT CTS telephone recordings; all speech was transcribed; 99% of the transcripts were translated bit.ly/4ockuEo

October 21, 2025 at 1:29 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio: 116 hours of telephone speech from 274 conversations between native speakers; developed by LDC for the DARPA BOLT program; contains previously unexposed calls from the CF/CH collections bit.ly/42rsg4S

October 20, 2025 at 2:31 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

KAIROS Phase 2 Quizlet contains English and Spanish web data annotated for events, relations and arguments, a reference knowledge graph and a knowledge base; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3WqvYYR

October 17, 2025 at 2:47 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

See LDC’s October newsletter for a preview of 2026 publications, fall data scholarship recipients and three new publications ldc-upenn.blogspot.com

October 16, 2025 at 3:09 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

More LDC data in the LORELEI series: LORELEI Hindi Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/4nCp3ar

September 22, 2025 at 8:37 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

AIDA Scenario 1 Evaluation Topic Source Data, Annotation & Assessment: 10k+ English, Russian & Ukrainian web docs on political relations between Russia & Ukraine in the 2010s annotated for entities & cross-reference, w/ judgments for scoring submissions bit.ly/3K7ynoA

September 22, 2025 at 4:00 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

Mixer 7 English Speech has 12,321 hours of telephone conversations, interviews and transcript readings from 222 English speakers, some collected using a 14-microphone array; speaker metadata is included bit.ly/4nvSYkG

September 19, 2025 at 3:24 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

Check out our September newsletter for three new LDC publications: Mixer 7 English Speech, AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment, and LORELEI Hindi Representative Language Pack ldc-upenn.blogspot.com

September 18, 2025 at 3:08 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

KAIROS Phase 1 Quizlet contains English and Spanish web data annotated for events, relations and arguments and a reference knowledge graph; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3HvDU7k

August 26, 2025 at 6:39 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

Abstract Meaning Representation 2.0 - Machine Translations translates 1,371 English sentences from LDC’s AMR 2.0 corpus into Spanish, German, Italian and Mandarin Chinese using Google Translate bit.ly/4n1m8bp

August 26, 2025 at 2:50 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

Mixer 6 - CHiME 8 Transcribed Calls and Interviews: 80 hours of Mixer 6 English interviews and telephone speech across 13 channels (1063 hours) with transcripts divided into training, development and test sets bit.ly/4oyUCn5

August 25, 2025 at 6:33 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

LDC’s August newsletter has the last call for fall data scholarship applications and details on new publications: Mixer 6 CHiME 8 Transcribed Calls and Interviews, Abstract Meaning Representation 2.0 – Machine Translations and KAIROS Phase 1 Quizlet ldc-upenn.blogspot.com

August 25, 2025 at 1:09 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

What a great conference #Interspeech2025! There is still time to stop by our booth and grab a limited-edition TIMIT word poetry magnet. Also don’t miss our colleague’s oral session on TELVID: A multilingual, multi-modal corpus for speaker recognition at 13:30, A04, Port 1A @interspeech.bsky.social

August 21, 2025 at 9:40 AM

Linguistic Data Consortium

@ldcupenn.bsky.social

Good morning, #Interspeech2025! Stop by our booth during the coffee breaks today to say hello. Also don't miss today's special session co-organized by LDC on Challenges in Speech Collection, Curation and Annotation in two parts beginning at 13:30, Dock 15. @interspeech.bsky.social

August 20, 2025 at 7:11 AM

Linguistic Data Consortium

@ldcupenn.bsky.social

Good morning Interspeech. It's a great second day. Come by and grab one of our limited giveaways. @interspeech.bsky.social
#Interspeech2025

August 19, 2025 at 7:22 AM

Linguistic Data Consortium

@ldcupenn.bsky.social

We are excited to be here at Interspeech 2025 @interspeech.bsky.social Come see us at the first coffee break today to learn more about the latest developments at LDC. #Interspeech2025

August 18, 2025 at 8:11 AM

Linguistic Data Consortium

@ldcupenn.bsky.social

LDC will be exhibiting at #Interspeech2025, August 17-21 in Rotterdam. Stop by our booth to say hello and learn about the latest developments at the Consortium. LDC work will also be featured in presentations, posters and a special session. We look forward to seeing you there. www.interspeech2025.org

August 12, 2025 at 3:51 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

From the LORELEI companion project: LoReHLT Uzbek Representative Language Pack features monolingual and parallel text, annotations, audio recordings, software tools and more for human language technology development to address emergent situations bit.ly/4lL0zuL

July 22, 2025 at 2:08 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

Penn Parsed Corpora of Historical English Second Release: POS-tagged & syntactically annotated British English text (1100 CE - 1914 CE); updates the 2020 release with new annotation, revised guidelines, philological information & the Corpus2 search tool bit.ly/46zR1hR

July 18, 2025 at 2:33 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

AnnoDIFP Session Audio and Transcripts: 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments bit.ly/4nEYQJr

July 17, 2025 at 3:15 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

Check out the July newsletter for Fall 2025 data scholarship application deadlines & 3 new publications: AnnoDIFP Session Audio and Transcripts, Penn Parsed Corpora of Historical English Second Release & LoReHLT Uzbek Representative Language Pack ldc-upenn.blogspot.com

July 16, 2025 at 2:28 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

KAIROS Schema Learning Complex Event Annotation has English and Spanish web text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance bit.ly/4jNrDIq

June 25, 2025 at 1:06 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

IWSLT 2022 - 2023 Shared Task Training, Development and Test Set: 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation used in IWSLT dialectal speech and low resource tracks bit.ly/3HEO4lL

June 24, 2025 at 2:24 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

Chinese Sentence Pattern Structure Treebank contains 5,016 sentences and 119,627 tokens from modern and ancient Chinese works annotated for lexical sense, syntactic structure and inter-clause relations bit.ly/4kZVGh3

June 23, 2025 at 1:56 PM

Linguistic Data Consortium

@ldcupenn.bsky.social

LDC’s June newsletter has the latest on three new publications: Chinese Sentence Pattern Structure Treebank, IWSLT 2022-2023 Shared Task Training, Development and Test Set, and KAIROS Schema Learning Complex Event Annotation ldc-upenn.blogspot.com

June 17, 2025 at 1:38 PM
