Lightnews — Scholar-powered news

David Smith

@dasmiq.bsky.social

5.2K followers 300 following 350 posts

Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.


David Smith

@dasmiq.bsky.social

Not every protest can be across the street from a luthier.

October 18, 2025 at 7:07 PM

David Smith

@dasmiq.bsky.social

A magazine from 1897 gives an unusual definition of copyright: no copying without attribution. CC BY avant la lettre. #ViralTexts

It is queer how many publishers do not know the meaning of the word "copyright." The Club Woman is copyrighted, which means that no other publisher has a right to reprint articles from this magazine without giving full credit to THE CLUB WOMAN. But we are continually coming across articles and paragraphs taken bodily from our columns and reprinted as original in some other periodical. In one case a Boston paper did this and the articles which was "lifted" (to put it politely) has been copied far and wide, with credit to the "lifting" publication—one which, by the way, stands for the best and highest advancement of woman!

June 20, 2025 at 8:53 PM

David Smith

@dasmiq.bsky.social

Big hairy deal

October 16, 2024 at 1:17 AM

David Smith

@dasmiq.bsky.social

Jaydeep Borkar starts his talk on transcribing text with lacunae.

August 31, 2024 at 7:20 AM

David Smith

@dasmiq.bsky.social

Jake Murel kicks off a discussion of comics translation and visual evidence.

August 30, 2024 at 1:59 PM

David Smith

@dasmiq.bsky.social

Gemini doesn’t think your manuscript is suitable. #DH2024

August 8, 2024 at 6:22 PM

David Smith

@dasmiq.bsky.social

Sentence aligned data so far:

August 8, 2024 at 6:17 PM

David Smith

@dasmiq.bsky.social

Anatomy of Domestic Economy^H^H^H Extravagance #dh2024

August 8, 2024 at 1:06 PM

David Smith

@dasmiq.bsky.social

Avery Blankenship kicks off her 9am talk on cookbooks, chemistry, and Ann Northup. #dh2024

August 8, 2024 at 1:04 PM

David Smith

@dasmiq.bsky.social

One doesn't have to agree with Wilhelm von Humboldt's ideal of service to Prussia, but I think it's true that if student involvement in our work didn't exist, we'd have to invent it.

It is a characteristic of institutions of higher learning that they treat knowledge (Wissenschaft) always as a problem that is not yet completely solved and that they thus always remain engaged in research, while schools are engaged with, and teach, information as ready and completed to be mastered (Kenntnissen). The relationship between teacher and student becomes thus something completely different than before. The teacher is not there for the student. Both are there for knowledge. His [the teacher’s] business depends upon their [the students’] presence and would, without them, not proceed nearly so well. He would, if they did not assemble themselves around him, seek them out, so that he could more nearly approach his goal through linking the well-practiced, but also therefore more one-sided and indeed less dynamic power with less-developed and still unaligned, power that strives mightily in every direction.

June 19, 2024 at 5:28 PM

David Smith

@dasmiq.bsky.social

The same issue has an expanded reprint of this all-timer:

bsky.app/profile/medi...

Rosamond Dixey of Boston conveys her pet pig in a motor-car, and a lady on Fifth Avenue goes shopping with her hand-bag carried by her penguin.

June 18, 2024 at 3:46 PM

David Smith

@dasmiq.bsky.social

Each sentence here has a pretty low likelihood given the last, from Womanhood, February 1905:

A magazine clipping reading, "A woman has the distinction of submitting to the Boiler Department of Chicago the best plans for the removal of boilers from underneath the city pavements. Her name is Mrs. Annie Hall, who for twenty-five years has been a manufacturer of poker chips. She is a Hollander, came to America in 1871, and since then has accumulated a fortune. She has besides mastered most of the arts and sciences, being also a proficient in languages."

June 18, 2024 at 3:43 PM

David Smith

@dasmiq.bsky.social

TIL that "garble" comes from sorting and grading spices. I shall now refer to all assignments as "garbleable".

The beginning of a proclamation from 1708, reading: "To the end that all Persons Owners of any Spices, Drugs, or other Wares or Merchandises, Garbleable within this City and Liberties thereof, and desirous to have the same Garbled, may know who is the proper Officer to perform this Duty, and..."

May 17, 2024 at 3:27 PM

David Smith

@dasmiq.bsky.social

About to talk about Viral Texts in Berlin with @ryancordell.bsky.social. Thanks to @dennmis.bsky.social, Freie Universität, and Zuse Institute!

March 12, 2024 at 2:30 PM

David Smith

@dasmiq.bsky.social

@muther22.bsky.social and Mathew Barber presenting our paper modeling citation and quotation as retrieval-augmented generation.

December 7, 2023 at 3:24 PM

David Smith

@dasmiq.bsky.social

Caroline Craig Northeastern talks about our sentence alignment work at #CHR2023.

December 6, 2023 at 2:32 PM

David Smith

@dasmiq.bsky.social

Full house at #CHR2023 for @wenyishang.bsky.social's talk!

December 6, 2023 at 11:03 AM

David Smith

@dasmiq.bsky.social

BHV Marais

December 4, 2023 at 2:05 PM

David Smith

@dasmiq.bsky.social

@philologistgrc.bsky.social articulating an agenda for our research data at Text as Data (TADA 2023).

November 9, 2023 at 4:30 PM

David Smith

@dasmiq.bsky.social

Uh oh, Crisis of the Third Century around the corner

November 2, 2023 at 9:23 PM

David Smith

@dasmiq.bsky.social

A lot of past work on historical syntax involved treebanking text from different time periods. Instead, Liwen compares language models trained on different time periods on modern tagging and parsing tasks to detect language change.

Graph comparing lexical confusion between historical and modern language models.

October 24, 2023 at 1:13 PM

David Smith

@dasmiq.bsky.social

Good news: LaBSE and other cross-language sentence embeddings work very well for ancient Greek and Latin and these modern languages. Bad news: pruned sentence-alignment models from machine translation are really confused by paratext, footnotes, multiple translations, etc.

Image of alignment of Thucydides in Greek and French.

October 24, 2023 at 1:09 PM

David Smith

@dasmiq.bsky.social

This collation method can also be used to compare manuscripts with each other even when the #HTR transcription is poor.

Image of a line in two manuscripts, where the first has an extra word.

October 24, 2023 at 1:06 PM

David Smith

@dasmiq.bsky.social

This works even when the print-trained layout-analysis model only identifies a few lines per page.

A picture of a Persian manuscript page showing only a few lines correctly identified by the layout model.

October 24, 2023 at 1:05 PM

David Smith

@dasmiq.bsky.social

Starting from a model trained only on print text, this method automatically detects passages that overlap with existing digital editions. A few rounds of bootstrapping improve accuracy by 20% and surpass manually annotated data:

A graph showing retrained models achieving accuracy 20% on average above the baseline.

October 24, 2023 at 1:04 PM
