David Smith
dasmiq.bsky.social
Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.
Reposted by David Smith
(2/2) Morphology-aware tokenization improves Latin LM performance on four downstream tasks, including gains for out-of-domain texts and rare words.

📄 arxiv.org/abs/2511.09709
Contextual morphologically-guided tokenization for Latin encoder models
Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than...
arxiv.org
November 14, 2025 at 8:02 PM
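The abstract snippet above mentions compression and fertility as tokenizer objectives. As a minimal sketch (assuming the standard definition of fertility, not anything specific from the paper; the chunking tokenizer is a made-up toy), fertility is the average number of subword tokens a tokenizer produces per word:

```python
def fertility(words, tokenize):
    """Average subword tokens produced per word; lower fertility means
    the tokenizer splits words less aggressively."""
    token_counts = [len(tokenize(w)) for w in words]
    return sum(token_counts) / len(token_counts)

# Hypothetical toy tokenizer: split a word into fixed 3-character chunks.
chunk3 = lambda w: [w[i:i+3] for i in range(0, len(w), 3)]

# "arma" -> 2 chunks, "virumque" -> 3, "cano" -> 2; average = 7/3
print(fertility(["arma", "virumque", "cano"], chunk3))  # 2.333...
```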
Reposted by David Smith
(1/2) 🎉 New preprint: "Contextual Morphologically-Guided Tokenization for Latin Encoder Models"
w/ @diyclassics.bsky.social @brenocon.bsky.social
November 14, 2025 at 8:02 PM
Reposted by David Smith
(2) The prediction view: the cost of processing each word in a sentence can be fully reduced to the word’s contextual predictability (i.e. surprisal). Predicting the next word is exactly what LLMs are trained to do, so they’re a great tool for evaluating this view. (3/n)
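Surprisal as described here is just negative log probability in context. A minimal sketch using a toy bigram model as a stand-in for the LLM probabilities the thread refers to (the corpus and function name are illustrative, not from the paper):

```python
import math
from collections import Counter

def surprisal_bits(corpus_tokens, context, word):
    """Surprisal of `word` given a one-word `context` under a bigram
    maximum-likelihood model: -log2 P(word | context).
    Toy illustration; the study uses LLM next-word probabilities."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens[:-1])
    p = bigrams[(context, word)] / unigrams[context]
    return -math.log2(p)

corpus = "the cat sat on the mat the cat ran".split()
# "the" is followed by "cat" 2 of 3 times, so P(cat|the) = 2/3
print(round(surprisal_bits(corpus, "the", "cat"), 3))  # -log2(2/3) ≈ 0.585
```

Higher surprisal (rarer continuations in context) predicts longer reading times under the prediction view.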
November 14, 2025 at 7:19 PM
Reposted by David Smith
We conducted a high-powered (n=368) eye-tracking-while-reading study to test two competing views:
(1) The structural processing view: eye movements reflect the cost of mentally assembling the words of a sentence into a larger meaning. (2/n)
November 14, 2025 at 7:19 PM
Reposted by David Smith
New Preprint: osf.io/eq2ra

Reading feels effortless, but it's actually quite complex under the hood. Most words are easy to process, but some words make us reread or linger. It turns out that LLMs can tell us about why, but only in certain cases... (1/n)
November 14, 2025 at 7:19 PM
Reposted by David Smith
I curated some readings for class on "data tensions" and the list felt worth sharing. Come on a tour of datasets, books, the web, and AI with me...

We'll start with this piece on the Google Books project: the hopes, dreams, disasters, and aftermath of building a public library on the internet.

1/n
Torching the Modern-Day Library of Alexandria
“Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”
www.theatlantic.com
November 14, 2025 at 4:39 PM
Reposted by David Smith
This week, I taught this wonderful paper from #FAccT2025 in my CS course.

Is this a "survey paper"? Or a "position paper"? Or ... ?

Should its inclusion in arXiv face extra barriers that run-of-the-mill "benchmark numbers go up" papers don't face?

Genuine question despite my phrasing!
Liberatory Collections and Ethical AI: Reimagining AI Development from Black Community Archives and Datasets | Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency
dl.acm.org
November 13, 2025 at 2:22 AM
Reposted by David Smith
Do people remember where things are relative to their body (e.g. my left side) or relative to the environment (the North/uphill side)? The answer is both at once, according to my new paper now out in Psychological Science! 🧵 journals.sagepub.com/doi/10.1177/...
November 12, 2025 at 1:01 PM
Reposted by David Smith
Behavioral data can be very detailed but are usually aggregated and normalized in ways that smother the dynamics. Ben wrote a continuous-time Markov model to improve on this, and also wrote simulations for exploring and validating pipelines. All the code is here: github.com/BenKawam/ASN...
New paper!

We propose a framework to empirically study animal social relationships by modelling social network (SN) data as time-series—that is, without the need to aggregate them over time.

www.biorxiv.org/content/10.1...
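A continuous-time Markov model like the one described above can be sketched with a toy two-state chain for a dyad's association state. This is an illustrative simulation under assumed rates, not the authors' model; the function name and parameters are hypothetical:

```python
import random

def simulate_ctmc(rate_on, rate_off, t_end, seed=0):
    """Simulate a two-state continuous-time Markov chain
    (0 = apart, 1 = together). The holding time in each state is
    exponential with the rate of leaving that state; transitions
    alternate between the two states."""
    rng = random.Random(seed)
    t, state, events = 0.0, 0, []
    while True:
        rate = rate_on if state == 0 else rate_off
        t += rng.expovariate(rate)  # exponential holding time
        if t >= t_end:
            break
        state = 1 - state
        events.append((t, state))
    return events

events = simulate_ctmc(rate_on=0.5, rate_off=2.0, t_end=100.0)
# Long-run fraction of time together = rate_on / (rate_on + rate_off) = 0.2
```

Modelling the raw event stream this way avoids aggregating observations into fixed time windows.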
November 12, 2025 at 12:05 PM
Reposted by David Smith
We present alongside the paper:
1. ‘NewsWords’ - unigrams from the entire digitised collection, github.com/Living-with-...
2. Newspaper metadata, openhumanitiesdata.metajnl.com/articles/10....
3. Mitchell's Press Directories, bl.iro.bl.uk/concern/data...
3/7
GitHub - Living-with-machines/newswords: Code for the counts data derived from historical newspapers
Code for the counts data derived from historical newspapers - Living-with-machines/newswords
github.com
November 11, 2025 at 4:06 PM
Reposted by David Smith
It’s very much a prototype but I have to share a preview of the interface @djevans.bsky.social is working on for our Viral Texts data—it maps reprinting data back onto the newspaper page, allowing users to browse what reprints appeared together on each page—with links back to full reprint clusters
November 7, 2025 at 4:49 PM
Reposted by David Smith
All 4 positions are open rank & all could result in multiple hires. Folks from the DH/book history/bibliography worlds might look especially at the "Information, Culture, & Society" & open information sciences positions to see if anything resonates—happy to offer what insight I can
November 6, 2025 at 3:55 PM
Reposted by David Smith
Job alert!

@ischoolui.bsky.social is hiring in 4 areas this year

Early literacies: illinois.csod.com/ux/ats/caree...

Information behavior/HCI/UX: illinois.csod.com/ux/ats/caree...

Information, Culture & Society: illinois.csod.com/ux/ats/caree...

Open IS: illinois.csod.com/ux/ats/caree...
November 6, 2025 at 3:54 PM
Reposted by David Smith
why intern at Ai2?

🐟interns own major parts of our model development, sometimes even leading whole projects
🐡we're committed to open science & actively help our interns publish their work

reach out if u wanna build open language models together 🤝

links 👇
November 5, 2025 at 11:11 PM
Reposted by David Smith
🆕 from us at #EMNLP: Are LMs better at answering questions about Germany in German than in French? Is national knowledge linguistically contingent?

Interestingly, this is true only for some multilingual models. Aya knows China best in Chinese, but LLaMA is always best in English.
November 5, 2025 at 7:47 PM
Reposted by David Smith
While lead author @sunnytqin.bsky.social sadly couldn't go to her !!!HOMETOWN!!!! of Suzhou due to visa reentry issues, her EMNLP paper with @dmelis.bsky.social and me is still fantastically cool and I will absolutely take advantage of EMNLP week to reshare it.
Transformer LMs get pretty far by acting like ngram models, so why do they learn syntax? A new paper by sunnytqin.bsky.social, me, and @dmelis.bsky.social illuminates grammar learning in a whirlwind tour of generalization, grokking, training dynamics, memorization, and random variation. #mlsky #nlp
Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization
Language models (LMs), like other neural networks, often favor shortcut heuristics based on surface-level patterns. Although LMs behave like n-gram models early in training, they must eventually learn...
arxiv.org
November 5, 2025 at 11:03 PM
Reposted by David Smith
Digital humanities researchers often care about fine-grained similarity based on narrative elements like plot or tone, which don’t necessarily correlate with surface-level textual features.

Can embedding models capture this? We study this in the context of fanfiction!
November 5, 2025 at 9:59 PM
Reposted by David Smith
I'm recruiting multiple PhD students for Fall 2026 in Computer Science at @hopkinsengineer.bsky.social 🍂

Apply to work on AI for social sciences/human behavior, social NLP, and LLMs for real-world applied domains you're passionate about!

Learn more at kristinagligoric.com & help spread the word!
November 5, 2025 at 2:56 PM
Reposted by David Smith
Here's an interesting new study exploring whether LLMs are able to understand the narrative sequencing of comics and... even the best AI models are *terrible* at it for pretty much all tasks that were analyzed aclanthology.org/2025.finding...
Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strip?
Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Qingxiu Dong, Rui Li, Yixin Yang, Yifan Pu, Weiyao Luo, Yiru Wang, Xiangdi Meng, Wenjie Li, Zhifang Sui. Findings of the Association for Computatio...
aclanthology.org
November 4, 2025 at 8:05 PM
Reposted by David Smith
N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.
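As a minimal sketch of the metric under discussion (the helper names are mine, not the paper's), n-gram novelty can be computed as the fraction of a generation's n-grams that never appear in a reference corpus:

```python
def ngram_set(tokens, n):
    """All n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_novelty(generated, reference, n=3):
    """Fraction of the generated text's n-grams absent from the
    reference corpus. Higher = more 'novel', but as the post notes,
    not necessarily more sensible or creative."""
    gen = [tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)]
    ref = ngram_set(reference, n)
    return sum(g not in ref for g in gen) / len(gen)

ref = "the quick brown fox jumps over the lazy dog".split()
gen = "the quick brown dog jumps".split()
print(ngram_novelty(gen, ref, n=3))  # 2 of 3 trigrams are novel → 0.666...
```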
November 4, 2025 at 3:08 PM
Reposted by David Smith
"Computational Humanities is far more than a collection of essays; it is a meticulously curated critical tool kit."

This is exactly what we were going for! dhdebates.gc.cuny.edu/projects/com...
November 3, 2025 at 5:18 PM
Reposted by David Smith
When is a language hard to model? Previous research has suggested that morphological complexity both does and does not play a role, but those studies relate language model performance to corpus statistics of words or subword tokens in isolation.
November 3, 2025 at 11:53 AM
Reposted by David Smith
As DH grows, it’s increasingly important to publish conference papers, but there hasn’t been a clear venue for that.

So I’m thrilled to share this new home for DH proceedings, which will include CHR papers & more.

Thanks to @taylor-arnold.bsky.social for leading this effort!

bit.ly/ach-anthology
October 29, 2025 at 3:39 PM
Reposted by David Smith
Spent a day at this wonderful initiative by Authors Alliance and Northeastern University and supported by Mellon Foundation! Feel free to reach out to the leadership if you have ideas on how to help their mission or how they can help your research efforts.
publicinterestcorpus.org
The Public Interest Corpus
publicinterestcorpus.org
October 24, 2025 at 3:44 PM