Jeff Tharsen 康森傑
@tharsen.bsky.social
doting father, friend & ally; hyperpolyglot computational philologist & sinologist;
currently teaching AI, deep learning + multilingual NLP/NLU + HPC + humanities data science @UChicago, creating new methods for multilingual intertextuality, historical Chinese phonology, Chinese paleography, and more
Reposted by Jeff Tharsen 康森傑
If you're in Chicago tomorrow, go check out "321 Plays for Trans Futures". There'll be a video recording at some point, too!
321 Plays for Trans Futures | Notion
You can read the Script Lead note here:
morrism.notion.site
November 19, 2025 at 4:54 PM
This is the way.
November 18, 2025 at 1:07 AM
Reposted by Jeff Tharsen 康森傑
And we have similar problems for pretraining datasets in general. Whatever issues we might have with that data, they're very hard to study unless the data is open. Companies aren't going to open theirs, but non-profits and academic groups have started constructing open datasets for research.

5/n
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lu...
aclanthology.org
November 14, 2025 at 4:54 PM
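To make the "open" part concrete, here is a minimal sketch (not from the original thread) of how a researcher might stream and inspect a few documents from such a corpus without downloading all of it. The Hugging Face Hub repo id allenai/dolma, the availability of a default config, and the source/text field names are assumptions here; the loading instructions published with the Dolma paper should take precedence.

```python
# Minimal sketch: auditing a few documents from an open pretraining corpus.
# Assumptions: the corpus is mirrored on the Hugging Face Hub as
# "allenai/dolma", it loads without a named config, and each record
# carries "source" and "text" fields. Requires the `datasets` package.
from itertools import islice

from datasets import load_dataset

# Stream rather than materialize the multi-terabyte corpus locally.
corpus = load_dataset("allenai/dolma", split="train", streaming=True)

# Look at a handful of documents, e.g. to check provenance and length.
for doc in islice(corpus, 5):
    print(doc.get("source"), len(doc.get("text", "")))
```

This kind of direct inspection, of sources, licenses, duplication, and so on, is exactly what closed pretraining data makes impossible.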
Reposted by Jeff Tharsen 康森傑
Here's how it works:
→ Submit your research idea or upvote existing ones (tag: "Weekly Competition")
→ Each Monday we select the top 3 from the previous week
→ We run experiments using research agents
→ Share repos + findings back on IdeaHub

Vote here: hypogenic.ai/ideahub
Hypogenic AI - Shaping the Future of Science
Reimagining science by augmenting scientist-AI collaboration.
hypogenic.ai
November 10, 2025 at 9:33 PM