Pedro Ortiz Suarez
@pjox.bsky.social
Principal Research Scientist at the Common Crawl Foundation. Weird coffee person ☕️, runner 🏃🏻‍♂️. (he/him) 🇫🇷🇪🇺🇨🇴
Reposted by Pedro Ortiz Suarez
The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.

commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation at Stanford HAI
October 29, 2025 at 6:46 PM
Reposted by Pedro Ortiz Suarez
If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...
October 10, 2025 at 8:52 PM
Reposted by Pedro Ortiz Suarez
Thank you everyone for coming to WMDQS (pronounced "whim ducks")!
October 10, 2025 at 8:51 PM
Reposted by Pedro Ortiz Suarez
WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025
October 10, 2025 at 4:18 PM
Reposted by Pedro Ortiz Suarez
Looking forward to tomorrow's #COLM2025 workshop on multilingual data quality! 🤩
In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
October 9, 2025 at 11:16 PM
Reposted by Pedro Ortiz Suarez
In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
October 9, 2025 at 8:17 PM
Reposted by Pedro Ortiz Suarez
We introduce the TableEval benchmark and investigate the effectiveness and robustness of text-based and multimodal LLMs on table understanding through a cross-domain & cross-modality evaluation.

Joint work by DFKI SLT incl. Fabio Barth, Raia Abu Ahmad, @malteos.bsky.social @pjox.bsky.social
July 26, 2025 at 9:37 AM
If you want to help us improve language and cultural coverage and build an open source LangID system, please register for our shared task on Language Identification! 💬

Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/

Deadline: July 23, 2025 (AoE) ⏰
July 21, 2025 at 10:40 PM
Reposted by Pedro Ortiz Suarez
Just a few days left to contribute annotations before the first release of training data. We have over 17,000 document annotations so far!
One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc
July 9, 2025 at 2:21 PM
Reposted by Pedro Ortiz Suarez
In June 2025 the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a virtual hackathon in partnership with Masakhane in order to collect language identification annotations for African languages.

commoncrawl.org/blog/the-fir...
Common Crawl - Blog - The First WMDQS-Masakhane LangID Hackathon
July 8, 2025 at 4:21 PM
Reposted by Pedro Ortiz Suarez
The Common Crawl Foundation team took part in the United Nations Open Source Week in New York City this June, meeting with global developers, researchers, and policymakers to discuss all things open source and AI.

commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl at the United Nations Open Source Week, June 2025
July 1, 2025 at 12:12 AM
Reposted by Pedro Ortiz Suarez
The deadline for paper submissions has been extended!

The new deadline is July 3, 2025 (AoE).

For more information, please visit: wmdqs.org
June 23, 2025 at 2:23 PM
Reposted by Pedro Ortiz Suarez
The Common Crawl Foundation, together with IBM, the AI Alliance, and BrightQuery will be hosting an "UN Conference" at IBM's new flagship NYC HQ at One Madison Avenue on Friday, June 20, from 12:30-5pm.

If you are in NYC, it would be great to see you there!

lu.ma/p0a1scde
AI Alliance @ IBM One Madison (UN Open Source Week 2025) · Luma
This year’s UN Open Source Week 2025 (June 16-20) will once again bring together a global “who is who” of Open Source leaders. As part of the official…
June 10, 2025 at 9:54 PM
Reposted by Pedro Ortiz Suarez
Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!

Submission deadline is 23 June, more info: wmdqs.org
1st Workshop on Multilingual Data Quality Signals
May 29, 2025 at 5:18 PM
I’ll be running the Paris Marathon this Sunday for cancer research and treatment 🏃🏻‍♂️

Please donate if you can! Every donation, no matter how small, helps immensely.

marathon-paris.dossards-solidaires.org/fundraisers/...
April 11, 2025 at 10:17 PM
Reposted by Pedro Ortiz Suarez
We would like to welcome all of our attending members to Oslo, with a special welcome to two of our newest members, the Publications Office of the European Union and @commoncrawl.bsky.social!

@nettarkivet.bsky.social | #iipcGA25 | #webarchiving
April 8, 2025 at 8:58 AM
Reposted by Pedro Ortiz Suarez
Today is "I love Free Software Day".

Thank you to the @commoncrawl.bsky.social Foundation for all their hard work. Onwards! @pjox.bsky.social - So great to meet in person.
February 14, 2025 at 1:08 PM
I’ll be at the AI Action Summit in Paris today. If you’re attending and want to discuss @commoncrawl.bsky.social or open data, please DM me!
February 10, 2025 at 9:22 AM
We're very happy to release cc-downloader, a new CLI tool to download Common Crawl data 📚🚀🧑‍💻

cc-downloader is still under active development, so if you find any issues or would like to submit a feature request, please visit its GitHub repository at github.com/commoncrawl/....
January 21, 2025 at 11:57 PM
If you care about open data or anything related to crawling, the Common Crawl Foundation @commoncrawl.bsky.social is now on Bluesky 📊📈📚🥳
November 19, 2024 at 7:31 PM
Ran the Berlin Marathon yesterday, and while it was not my best marathon and I was recovering from injury, I had an amazing time. I really hope I can do better next year in Paris, where I'll run for cancer research. If you can donate, please do so: marathon-paris.dossards-solidaires.org/fundraisers/...
September 30, 2024 at 5:44 PM
I decided to run the 2025 Paris Marathon for the Gustave Roussy Institute, the leading cancer centre in Europe. This is a cause close to my heart, as cancer has touched my family, my friends, and my colleagues:

marathon-paris.dossards-solidaires.org/fundraisers/...
Marathon for Gustave Roussy Institute
Dear friends and family, I have decided to run the 2025 Paris Marathon, but this time I have chosen to run for the Gustave Roussy Institute, the leading cancer centre in Europe...
August 30, 2024 at 8:20 PM
Ran the Paris Marathon yesterday. It was an amazing experience. Getting into running was probably the best decision I’ve made recently. It has helped massively with both physical and mental health. I highly recommend any type of physical activity, especially for researchers 🏃🏻‍♂️
April 8, 2024 at 6:40 PM
I still don’t know how, but I finished my first marathon in 5:03:04 🥹
September 24, 2023 at 6:00 PM