David Smith
dasmiq.bsky.social
Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.
Reposted by David Smith
We present alongside the paper:
1. ‘NewsWords’ - unigrams from the entire digitised collection, github.com/Living-with-...
2. Newspaper metadata, openhumanitiesdata.metajnl.com/articles/10....
3. Mitchell's Press Directories, bl.iro.bl.uk/concern/data...
3/7
GitHub - Living-with-machines/newswords: Code for the counts data derived from historical newspapers
November 11, 2025 at 4:06 PM
Reposted by David Smith
It’s very much a prototype but I have to share a preview of the interface @djevans.bsky.social is working on for our Viral Texts data—it maps reprinting data back onto the newspaper page, allowing users to browse what reprints appeared together on each page—with links back to full reprint clusters
November 7, 2025 at 4:49 PM
Reposted by David Smith
All 4 positions are open rank & all could result in multiple hires. Folks from the DH/book history/bibliography worlds might look especially at the "Information, Culture, & Society" & open information sciences positions to see if anything resonates—happy to offer what insight I can
November 6, 2025 at 3:55 PM
Reposted by David Smith
Job alert!

@ischoolui.bsky.social is hiring in 4 areas this year

Early literacies: illinois.csod.com/ux/ats/caree...

Information behavior/HCI/UX: illinois.csod.com/ux/ats/caree...

Information, Culture & Society: illinois.csod.com/ux/ats/caree...

Open IS: illinois.csod.com/ux/ats/caree...
November 6, 2025 at 3:54 PM
Reposted by David Smith
why intern at Ai2?

🐟interns own major parts of our model development, sometimes even leading whole projects
🐡we're committed to open science & actively help our interns publish their work

reach out if u wanna build open language models together 🤝

links 👇
November 5, 2025 at 11:11 PM
Reposted by David Smith
🆕 from us at #EMNLP: Are LMs better at answering questions about Germany in German than in French? Is national knowledge linguistically contingent?

Interestingly, only for some multilingual models is this true. Aya knows China best in Chinese, but LLaMA's best in English always.
November 5, 2025 at 7:47 PM
Reposted by David Smith
While lead author @sunnytqin.bsky.social sadly couldn't go to her !!!HOMETOWN!!!! of Suzhou due to visa reentry issues, her EMNLP paper with @dmelis.bsky.social and me is still fantastically cool and I will absolutely take advantage of EMNLP week to reshare it.
Transformer LMs get pretty far by acting like ngram models, so why do they learn syntax? A new paper by sunnytqin.bsky.social, me, and @dmelis.bsky.social illuminates grammar learning in a whirlwind tour of generalization, grokking, training dynamics, memorization, and random variation. #mlsky #nlp
Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization
Language models (LMs), like other neural networks, often favor shortcut heuristics based on surface-level patterns. Although LMs behave like n-gram models early in training, they must eventually learn...
arxiv.org
November 5, 2025 at 11:03 PM
Reposted by David Smith
Digital humanities researchers often care about fine-grained similarity based on narrative elements like plot or tone, which don’t necessarily correlate with surface-level textual features.

Can embedding models capture this? We study this in the context of fanfiction!
November 5, 2025 at 9:59 PM
Reposted by David Smith
I'm recruiting multiple PhD students for Fall 2026 in Computer Science at @hopkinsengineer.bsky.social 🍂

Apply to work on AI for social sciences/human behavior, social NLP, and LLMs for real-world applied domains you're passionate about!

Learn more at kristinagligoric.com & help spread the word!
November 5, 2025 at 2:56 PM
Reposted by David Smith
Here's an interesting new study exploring whether LLMs are able to understand the narrative sequencing of comics and... even the best AI models are *terrible* at it for pretty much all tasks that were analyzed aclanthology.org/2025.finding...
Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strip?
Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Qingxiu Dong, Rui Li, Yixin Yang, Yifan Pu, Weiyao Luo, Yiru Wang, Xiangdi Meng, Wenjie Li, Zhifang Sui. Findings of the Association for Computatio...
aclanthology.org
November 4, 2025 at 8:05 PM
Reposted by David Smith
N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.
November 4, 2025 at 3:08 PM
Reposted by David Smith
"Computational Humanities is far more than a collection of essays; it is a meticulously curated critical tool kit."

This is exactly what we were going for! dhdebates.gc.cuny.edu/projects/com...
November 3, 2025 at 5:18 PM
Reposted by David Smith
When is a language hard to model? Previous research has suggested that morphological complexity both does and does not play a role, but it does so by relating the performance of language models to corpus statistics of words or subword tokens in isolation.
November 3, 2025 at 11:53 AM
Reposted by David Smith
As DH grows, it’s increasingly important to publish conference papers, but there hasn’t been a clear venue for that.

So I’m thrilled to share this new home for DH proceedings, which will include CHR papers & more.

Thanks to @taylor-arnold.bsky.social for leading this effort!

bit.ly/ach-anthology
October 29, 2025 at 3:39 PM
Reposted by David Smith
Spent a day at this wonderful initiative by Authors Alliance and Northeastern University and supported by Mellon Foundation! Feel free to reach out to the leadership if you have ideas on how to help their mission or how they can help your research efforts.
publicinterestcorpus.org
The Public Interest Corpus
October 24, 2025 at 3:44 PM
Not every protest can be across the street from a luthier.
October 18, 2025 at 7:07 PM
Reposted by David Smith
This is happening a little, but only a little. Higher ed is a napping giant, of sorts, relatively trusted, with high approval, with very well cultivated networks with latent possibilities for mobilization.
October 18, 2025 at 12:12 PM
Reposted by David Smith
The possibility is for universities to reach out to explain how the current moment is undermining their ability to achieve these essential values, to mobilize people's outreach to elected officials, & to do this in a way that is coordinated across institutions.
October 18, 2025 at 12:12 PM
Reposted by David Smith
Large majorities see universities/higher ed as important to things that they value (the economy, science, health), & substantial numbers indicate a willingness to be mobilized to support universities that they have a connection to.
October 18, 2025 at 12:12 PM
Reposted by David Smith
Those of you at universities might share our recent report on public opinion with your leadership.

There are takeaways that are highly relevant to their choices in the current moment. In particular, there is broad, bipartisan, but latent, support for higher education.

edbarometer.org
American Higher Education Barometer
Measuring American Attitudes Towards Higher Education
October 18, 2025 at 12:12 PM
Reposted by David Smith
Small models work great for GLAM but there aren't enough examples!

With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.

Follow the org to keep up-to-date!
huggingface.co/small-models...
October 16, 2025 at 1:22 PM
Reposted by David Smith
Humans created AI, and we can still shape its future. That is why @macfound joined a broad coalition of our peers to launch Humanity AI, a new initiative to keep people at the center of our AI future.

Learn more: humanityai.ai
Home - Humanity AI
Our future with AI can and will be what we make it. Humanity AI is uniting philanthropy in a broad […]
October 14, 2025 at 5:27 PM
Reposted by David Smith
New issue of my newsletter: “The Library’s New Entryway” — An interface that combines the advantages of the traditional index with the power of LLMs is the path forward newsletter.dancohen.org/archive/the-...
October 10, 2025 at 7:32 PM
Reposted by David Smith
It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives)!

📚 2.5bn tokens of mostly Latin and French texts
🕰️ 800→1600 CE
📜 23k manuscripts
🖥️ 18k on the reading interface: comma.inria.fr
🔍 Paper: inria.hal.science/hal-05299220v1

(1/🧵)
October 15, 2025 at 2:51 PM