Natasha Johnson
natashamarie330.bsky.social
Cultural Analytics and NLP researcher
Unsurprising: Using longer words makes female authors more “literary”

Surprising: The opposite is true for male authors

For more cool plots + findings, take a look at my #CHR2025 paper exploring the role of form vs gender in the classification of genre & literary fiction

doi.org/10.63744/Ztw...
November 18, 2025 at 11:14 PM
Even strong embedding models over-index on surface features—for every model tested, similarity scores are more reflective of author or fandom than semantic aspects like theme or characterization. This is true even if models are explicitly instructed to focus on these aspects!
November 5, 2025 at 9:59 PM
All selected fanfiction has detailed metadata and author-generated tags describing the fanfic content. Informed by fan studies and digital humanities literature, we classify these into 12 categories to construct gold labels for a fine-grained semantic similarity task.
November 5, 2025 at 9:59 PM
We introduce FicSim, a dataset of 90 recently written long-form fanfics from Archive of Our Own. We *reach out to the authors for permission* to use each work and prioritize continual, informed author consent. Fics range in length from 10K to 400K+ words.
November 5, 2025 at 9:59 PM
Digital humanities researchers often care about fine-grained similarity based on narrative elements like plot or tone, which don’t necessarily correlate with surface-level textual features.

Can embedding models capture this? We study this in the context of fanfiction!
November 5, 2025 at 9:59 PM