Laurie Burchell
@very-laurie.bsky.social
Senior Research Engineer with the Common Crawl Foundation.
(languages ∪ tech) in Dùn Èideann
Reposted by Laurie Burchell
A huge thank you to @very-laurie.bsky.social for delivering a fantastic UoB Turing seminar. Her talk was entitled “Common Crawl: open web data for everybody.”

In this talk, she introduced the @commoncrawl.bsky.social and the data products they offer.
November 27, 2025 at 1:06 PM
Reposted by Laurie Burchell
The Turing Liaison Team is excited to host @very-laurie.bsky.social and Thom Vaughan to introduce the @commoncrawl.bsky.social and the data products it offers.

📆 26 November
⏰ 13:00 - 14:00
📍C44 Biomedical building, University of Bristol

Find out more: tinyurl.com/mrxp5h2n
November 21, 2025 at 11:04 AM
Reposted by Laurie Burchell
If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...
October 10, 2025 at 8:52 PM
Reposted by Laurie Burchell
One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc
June 9, 2025 at 3:44 PM
I've been learning by writing the tutorials 💅

Next up: using CCF's host index commoncrawl.org/blog/introdu...
June 12, 2025 at 2:32 PM
Reposted by Laurie Burchell
This absolute icon at Canterbury Pride 🩷🤍🩵
June 7, 2025 at 12:18 PM
Reposted by Laurie Burchell
My cartoon for this week’s @newscientist.com
June 8, 2025 at 8:17 AM
✨Call for papers!✨ @commoncrawl.bsky.social and friends are organising the 1st Workshop on Multilingual Data Quality Signals, held in tandem with @colmweb.org. Submit your research on multilingual data quality!

Submission deadline is 23 June, more info: wmdqs.org
May 28, 2025 at 8:04 AM
I'm starting as a Senior Research Engineer with the Common Crawl Foundation today! 😎
May 26, 2025 at 9:29 AM
Reposted by Laurie Burchell
There's a lot of talk about Canada Geese and whether they're good and my answer is Yes.
April 16, 2025 at 4:07 PM
I'm part of this! There's also a paper: arxiv.org/abs/2503.10267
** New parallel data set ** We've just released HPLT v2.0, a parallel data set of 50 languages paired with English, 380M sentence pairs in total. Extracted from the Internet Archive and Common Crawl: hplt-project.org/datasets/v2.0
HPLT - High Performance Language Technologies
A space that combines petabytes of natural language data with large-scale model training
hplt-project.org
March 17, 2025 at 1:27 PM
Reposted by Laurie Burchell
February 28, 2025 at 12:39 PM
I'm nerd famous
February 8, 2025 at 1:46 PM
Reposted by Laurie Burchell
Jesus. Twister in Donegal right now.
January 24, 2025 at 7:58 AM
My replacement PIR finally arrived! Hoping it doesn't short this time 🙏
January 23, 2025 at 11:26 AM
I'm pretty sure that recursion is magic.
January 18, 2025 at 1:01 PM
It's day 6 of Codemas! As befits the season, I'm coding on the kitchen table surrounded by leftovers.
December 27, 2024 at 9:42 AM
Merry Christmas! The house is quiet for a minute so I'm doing box 5 of Codemas. Maybe I'll finish the whole thing before the new year
December 25, 2024 at 11:20 AM
Uh I have something I need to tell you
December 4, 2024 at 2:34 PM
I'll return to Codemas after nearly a year, I'll bully this old Chromebook into installing an IDE and finally seeing the Pico, I'll do what I want
November 24, 2024 at 7:16 PM
Zoom zoom fully graduated Dr coming through
November 21, 2024 at 5:30 PM
Reposted by Laurie Burchell
unlike you i get my news from a reliable source
November 11, 2024 at 12:24 PM