David Smith
dasmiq.bsky.social
David Smith
@dasmiq.bsky.social
Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.
Not every protest can be across the street from a luthier.
October 18, 2025 at 7:07 PM
A magazine from 1897 gives an unusual definition of copyright: no copying without attribution. CC BY avant la lettre. #ViralTexts
June 20, 2025 at 8:53 PM
Big hairy deal
October 16, 2024 at 1:17 AM
Jaydeep Borkar starts his talk on transcribing text with lacunae.
August 31, 2024 at 7:20 AM
Jake Murel kicks off a discussion of comics translation and visual evidence.
August 30, 2024 at 1:59 PM
Gemini doesn’t think your manuscript is suitable. #DH2024
August 8, 2024 at 6:22 PM
Sentence aligned data so far:
August 8, 2024 at 6:17 PM
Anatomy of Domestic Economy^H^H^H Extravagance #dh2024
August 8, 2024 at 1:06 PM
Avery Blankenship kicks off her 9am talk on cookbooks, chemistry, and Ann Northup. #dh2024
August 8, 2024 at 1:04 PM
One doesn't have to agree with Wilhelm von Humboldt's ideal of service to Prussia, but I think it's true that if student involvement in our work didn't exist, we'd have to invent it.
June 19, 2024 at 5:28 PM
The same issue has an expanded reprint of this all-timer:

bsky.app/profile/medi...
June 18, 2024 at 3:46 PM
Each sentence here has a pretty low likelihood given the last, from Womanhood, February 1905:
June 18, 2024 at 3:43 PM
TIL that "garble" comes from sorting and grading spices. I shall now refer to all assignments as "garbleable".
May 17, 2024 at 3:27 PM
About to talk about Viral Texts in Berlin with @ryancordell.bsky.social Thanks to @dennmis.bsky.social Freie Universität and Zuse Institute!
March 12, 2024 at 2:30 PM
@muther22.bsky.social and Mathew Barber presenting our paper modeling citation and quotation as retrieval-augmented generation.
December 7, 2023 at 3:24 PM
Caroline Craig Northeastern talks about our sentence alignment work at #CHR2023.
December 6, 2023 at 2:32 PM
Full house at #CHR2023 for @wenyishang.bsky.social ‘s talk!
December 6, 2023 at 11:03 AM
BHV Marais
December 4, 2023 at 2:05 PM
@philologistgrc.bsky.social articulating an agenda for our research data at Text as Data (TADA 2023).
November 9, 2023 at 4:30 PM
Uh oh, Crisis of the Third Century around the corner
November 2, 2023 at 9:23 PM
A lot of past work on historical syntax involved treebanking text from different time periods. Instead, Liwen compares language models trained on different time periods on modern tagging and parsing tasks to detect language change.
October 24, 2023 at 1:13 PM
Good news: LaBSE and other cross-language sentence embeddings work very well for ancient Greek and Latin and these modern languages. Bad news: pruned sentence-alignment models from machine translation are really confused by paratext, footnotes, multiple translations, etc.
October 24, 2023 at 1:09 PM
This collation method can also be used to compare manuscripts with each other even when the #HTR transcription is poor.
October 24, 2023 at 1:06 PM
This works even when the print-trained layout-analysis model only identifies a few lines per page.
October 24, 2023 at 1:05 PM
Starting from a model trained only on print text, this method automatically detects passages that overlap with existing digital editions. A few rounds of bootstrapping improve accuracy by 20% and surpass manually annotated data:
October 24, 2023 at 1:04 PM