Joseph Attieh
attiehjoseph.bsky.social
PhD student @Helsinki-NLP
Reposted by Joseph Attieh
** New parallel data set ** We've just released HPLT v2.0, a parallel data set of 50 languages paired with English, 380M sentence pairs in total, extracted from the Internet Archive and Common Crawl: hplt-project.org/datasets/v2.0
HPLT - High Performance Language Technologies
A space that combines petabytes of natural language data with large-scale model training
hplt-project.org
February 28, 2025 at 1:34 PM