David Smith
dasmiq.bsky.social
Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.
Reposted by David Smith
(2/2) Morphology-aware tokenization improves Latin LM performance on four downstream tasks, including gains for out-of-domain texts and rare words.

📄 arxiv.org/abs/2511.09709
Contextual morphologically-guided tokenization for Latin encoder models
Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than...
arxiv.org
November 14, 2025 at 8:02 PM
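The abstract snippet above mentions compression and fertility as tokenizer objectives. As a minimal sketch (assuming the standard definition of fertility, not anything specific from the paper; the chunking tokenizer is a made-up toy), fertility is the average number of subword tokens a tokenizer produces per word:

```python
def fertility(words, tokenize):
    """Average subword tokens produced per word; lower fertility means
    the tokenizer splits words less aggressively."""
    token_counts = [len(tokenize(w)) for w in words]
    return sum(token_counts) / len(token_counts)

# Hypothetical toy tokenizer: split a word into fixed 3-character chunks.
chunk3 = lambda w: [w[i:i+3] for i in range(0, len(w), 3)]

# "arma" -> 2 chunks, "virumque" -> 3, "cano" -> 2; average = 7/3
print(fertility(["arma", "virumque", "cano"], chunk3))  # 2.333...
```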
Reposted by David Smith
(1/2) 🎉 New preprint: "Contextual Morphologically-Guided Tokenization for Latin Encoder Models"
w/ @diyclassics.bsky.social @brenocon.bsky.social
November 14, 2025 at 8:02 PM
Reposted by David Smith
(2) The prediction view: the cost of processing each word in a sentence can be fully reduced to the word’s contextual predictability (i.e. surprisal). Predicting the next word is exactly what LLMs are trained to do, so they’re a great tool for evaluating this view. (3/n)
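Surprisal as described here is just negative log probability in context. A minimal sketch using a toy bigram model as a stand-in for the LLM probabilities the thread refers to (the corpus and function name are illustrative, not from the paper):

```python
import math
from collections import Counter

def surprisal_bits(corpus_tokens, context, word):
    """Surprisal of `word` given a one-word `context` under a bigram
    maximum-likelihood model: -log2 P(word | context).
    Toy illustration; the study uses LLM next-word probabilities."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens[:-1])
    p = bigrams[(context, word)] / unigrams[context]
    return -math.log2(p)

corpus = "the cat sat on the mat the cat ran".split()
# "the" is followed by "cat" 2 of 3 times, so P(cat|the) = 2/3
print(round(surprisal_bits(corpus, "the", "cat"), 3))  # -log2(2/3) ≈ 0.585
```

Higher surprisal (rarer continuations in context) predicts longer reading times under the prediction view.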
November 14, 2025 at 7:19 PM
Reposted by David Smith
We conducted a high-powered (n=368) eye-tracking-while-reading study to test two competing views:
(1) The structural processing view: eye movements reflect the cost of mentally assembling the words of a sentence into a larger meaning. (2/n)
November 14, 2025 at 7:19 PM
Reposted by David Smith
New Preprint: osf.io/eq2ra

Reading feels effortless, but it's actually quite complex under the hood. Most words are easy to process, but some words make us reread or linger. It turns out that LLMs can tell us about why, but only in certain cases... (1/n)
November 14, 2025 at 7:19 PM
Reposted by David Smith
I curated some readings for class on "data tensions" and the list felt worth sharing. Come on a tour of datasets, books, the web, and AI with me...

We'll start with this piece on the Google Books project: the hopes, dreams, disasters, and aftermath of building a public library on the internet.

1/n
Torching the Modern-Day Library of Alexandria
“Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”
www.theatlantic.com
November 14, 2025 at 4:39 PM
Reposted by David Smith
This week, I taught this wonderful paper from #FAccT2025 in my CS course.

Is this a "survey paper"? Or a "position paper"? Or ... ?

Should its inclusion in arXiv face extra barriers that run-of-the-mill "benchmark numbers go up" papers don't face?

Genuine question despite my phrasing!
Liberatory Collections and Ethical AI: Reimagining AI Development from Black Community Archives and Datasets | Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency
dl.acm.org
November 13, 2025 at 2:22 AM
Reposted by David Smith
Do people remember where things are relative to their body (e.g. my left side) or relative to the environment (the North/uphill side)? The answer is both at once, according to my new paper now out in Psychological Science! 🧵 journals.sagepub.com/doi/10.1177/...
November 12, 2025 at 1:01 PM
Reposted by David Smith
Behavioral data can be very detailed but are usually aggregated and normalized in ways that smother the dynamics. Ben wrote a continuous-time Markov model to improve on this, and also wrote simulations for exploring and validating pipelines. All the code is here: github.com/BenKawam/ASN...
New paper!

We propose a framework to empirically study animal social relationships by modelling social network (SN) data as time-series—that is, without the need to aggregate them over time.

www.biorxiv.org/content/10.1...
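A continuous-time Markov model like the one described above can be sketched with a toy two-state chain for a dyad's association state. This is an illustrative simulation under assumed rates, not the authors' model; the function name and parameters are hypothetical:

```python
import random

def simulate_ctmc(rate_on, rate_off, t_end, seed=0):
    """Simulate a two-state continuous-time Markov chain
    (0 = apart, 1 = together). The holding time in each state is
    exponential with the rate of leaving that state; transitions
    alternate between the two states."""
    rng = random.Random(seed)
    t, state, events = 0.0, 0, []
    while True:
        rate = rate_on if state == 0 else rate_off
        t += rng.expovariate(rate)  # exponential holding time
        if t >= t_end:
            break
        state = 1 - state
        events.append((t, state))
    return events

events = simulate_ctmc(rate_on=0.5, rate_off=2.0, t_end=100.0)
# Long-run fraction of time together = rate_on / (rate_on + rate_off) = 0.2
```

Modelling the raw event stream this way avoids aggregating observations into fixed time windows.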
November 12, 2025 at 12:05 PM
Reposted by David Smith
We present alongside the paper:
1. ‘NewsWords’ - unigrams from the entire digitised collection, github.com/Living-with-...
2. Newspaper metadata, openhumanitiesdata.metajnl.com/articles/10....
3. Mitchell's Press Directories, bl.iro.bl.uk/concern/data...
3/7
GitHub - Living-with-machines/newswords: Code for the counts data derived from historical newspapers
Code for the counts data derived from historical newspapers - Living-with-machines/newswords
github.com
November 11, 2025 at 4:06 PM
Reposted by David Smith
It’s very much a prototype but I have to share a preview of the interface @djevans.bsky.social is working on for our Viral Texts data—it maps reprinting data back onto the newspaper page, allowing users to browse what reprints appeared together on each page—with links back to full reprint clusters
November 7, 2025 at 4:49 PM
Reposted by David Smith
All 4 positions are open rank & all could result in multiple hires. Folks from the DH/book history/bibliography worlds might look especially at the "Information, Culture, & Society" & open information sciences positions to see if anything resonates—happy to offer what insight I can
November 6, 2025 at 3:55 PM
Reposted by David Smith
Job alert!

@ischoolui.bsky.social is hiring in 4 areas this year

Early literacies: illinois.csod.com/ux/ats/caree...

Information behavior/HCI/UX: illinois.csod.com/ux/ats/caree...

Information, Culture & Society: illinois.csod.com/ux/ats/caree...

Open IS: illinois.csod.com/ux/ats/caree...
November 6, 2025 at 3:54 PM
Reposted by David Smith
why intern at Ai2?

🐟interns own major parts of our model development, sometimes even leading whole projects
🐡we're committed to open science & actively help our interns publish their work

reach out if u wanna build open language models together 🤝

links 👇
November 5, 2025 at 11:11 PM
Reposted by David Smith
🆕 from us at #EMNLP: Are LMs better at answering questions about Germany in German than in French? Is national knowledge linguistically contingent?

Interestingly, this is true only for some multilingual models. Aya knows China best in Chinese, but LLaMA is always best in English.
November 5, 2025 at 7:47 PM
Reposted by David Smith
While lead author @sunnytqin.bsky.social sadly couldn't go to her !!!HOMETOWN!!!! of Suzhou due to visa reentry issues, her EMNLP paper with @dmelis.bsky.social and me is still fantastically cool and I will absolutely take advantage of EMNLP week to reshare it.
Transformer LMs get pretty far by acting like ngram models, so why do they learn syntax? A new paper by sunnytqin.bsky.social, me, and @dmelis.bsky.social illuminates grammar learning in a whirlwind tour of generalization, grokking, training dynamics, memorization, and random variation. #mlsky #nlp
Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization
Language models (LMs), like other neural networks, often favor shortcut heuristics based on surface-level patterns. Although LMs behave like n-gram models early in training, they must eventually learn...
arxiv.org
November 5, 2025 at 11:03 PM
Reposted by David Smith
Digital humanities researchers often care about fine-grained similarity based on narrative elements like plot or tone, which don’t necessarily correlate with surface-level textual features.

Can embedding models capture this? We study this in the context of fanfiction!
November 5, 2025 at 9:59 PM
Reposted by David Smith
I'm recruiting multiple PhD students for Fall 2026 in Computer Science at @hopkinsengineer.bsky.social 🍂

Apply to work on AI for social sciences/human behavior, social NLP, and LLMs for real-world applied domains you're passionate about!

Learn more at kristinagligoric.com & help spread the word!
November 5, 2025 at 2:56 PM
Reposted by David Smith
Here's an interesting new study exploring whether LLMs are able to understand the narrative sequencing of comics and... even the best AI models are *terrible* at it for pretty much all tasks that were analyzed aclanthology.org/2025.finding...
Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strip?
Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Qingxiu Dong, Rui Li, Yixin Yang, Yifan Pu, Weiyao Luo, Yiru Wang, Xiangdi Meng, Wenjie Li, Zhifang Sui. Findings of the Association for Computatio...
aclanthology.org
November 4, 2025 at 8:05 PM
Reposted by David Smith
N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.
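As a minimal sketch of the metric under discussion (the helper names are mine, not the paper's), n-gram novelty can be computed as the fraction of a generation's n-grams that never appear in a reference corpus:

```python
def ngram_set(tokens, n):
    """All n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_novelty(generated, reference, n=3):
    """Fraction of the generated text's n-grams absent from the
    reference corpus. Higher = more 'novel', but as the post notes,
    not necessarily more sensible or creative."""
    gen = [tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)]
    ref = ngram_set(reference, n)
    return sum(g not in ref for g in gen) / len(gen)

ref = "the quick brown fox jumps over the lazy dog".split()
gen = "the quick brown dog jumps".split()
print(ngram_novelty(gen, ref, n=3))  # 2 of 3 trigrams are novel → 0.666...
```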
November 4, 2025 at 3:08 PM
Reposted by David Smith
"Computational Humanities is far more than a collection of essays; it is a meticulously curated critical tool kit."

This is exactly what we were going for! dhdebates.gc.cuny.edu/projects/com...
November 3, 2025 at 5:18 PM
Reposted by David Smith
When is a language hard to model? Previous research has suggested that morphological complexity both does and does not play a role, but those studies relate language model performance to corpus statistics of words or subword tokens in isolation.
November 3, 2025 at 11:53 AM
Reposted by David Smith
As DH grows, it’s increasingly important to publish conference papers, but there hasn’t been a clear venue for that.

So I’m thrilled to share this new home for DH proceedings, which will include CHR papers & more.

Thanks to @taylor-arnold.bsky.social for leading this effort!

bit.ly/ach-anthology
October 29, 2025 at 3:39 PM
Reposted by David Smith
Spent a day at this wonderful initiative by Authors Alliance and Northeastern University and supported by Mellon Foundation! Feel free to reach out to the leadership if you have ideas on how to help their mission or how they can help your research efforts.
publicinterestcorpus.org
The Public Interest Corpus
publicinterestcorpus.org
October 24, 2025 at 3:44 PM