Natasha Johnson
natashamarie330.bsky.social
Cultural Analytics and NLP researcher
Even strong embedding models over-index on surface features: for every model tested, similarity scores reflect author or fandom more than semantic aspects like theme or characterization. This holds even when models are explicitly instructed to focus on these aspects!
November 5, 2025 at 9:59 PM
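To make the comparison concrete, here is a minimal sketch of one way to check whether similarity scores track author more than theme. It is not the FicSim evaluation code: the `cosine` and `mean_pair_similarity` helpers, the field names, and the hand-made toy vectors are all assumptions for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_pair_similarity(items, key):
    """Mean cosine similarity over all pairs that share the same value of `key`."""
    sims = [
        cosine(a["vec"], b["vec"])
        for a, b in combinations(items, 2)
        if a[key] == b[key]
    ]
    return sum(sims) / len(sims) if sims else float("nan")


# Toy, hand-made vectors (NOT FicSim data or model outputs).
# They are built so that works by the same author sit close together.
fics = [
    {"author": "A", "theme": "grief", "vec": [1.0, 0.0, 0.1]},
    {"author": "A", "theme": "humor", "vec": [0.9, 0.1, 0.0]},
    {"author": "B", "theme": "grief", "vec": [0.0, 1.0, 0.1]},
    {"author": "B", "theme": "humor", "vec": [0.1, 0.9, 0.0]},
]

same_author = mean_pair_similarity(fics, "author")
same_theme = mean_pair_similarity(fics, "theme")

# If same-author pairs score higher than same-theme pairs, the embedding
# is tracking the surface feature (author) more than the semantic one (theme).
print(f"same author: {same_author:.3f}  same theme: {same_theme:.3f}")
```

With real embeddings, the same grouping could be repeated for fandom and for each semantic category to see which label the scores follow most closely.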
All selected fanfiction has detailed metadata and author-generated tags describing the fanfic content. Informed by fan studies and digital humanities literature, we classify these into 12 categories to construct gold labels for a fine-grained semantic similarity task.
November 5, 2025 at 9:59 PM
We introduce FicSim, a dataset of 90 recently written long-form fanfics from Archive of Our Own. We *reach out to the authors for permission* to use each work and prioritize continual, informed author consent. Fics range in length from 10K to 400K+ words.
November 5, 2025 at 9:59 PM