Sewon Min
sewonm.bsky.social
Sewon Min
@sewonm.bsky.social
UC Berkeley/BAIR, AI2 || Prev: UWNLP, Meta/FAIR || sewonmin.com
Reposted by Sewon Min
🤔 Ever wondered how prevalent some type of web content is during LM pre-training?

In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐

Key takeaway: domains help us curate better pre-training data! 🧵/N
February 18, 2025 at 12:31 PM