Lightnews — Scholar-powered news

Kabir Ahuja

@kabirahuja2431.bsky.social

I have always wanted to work on something that combines my love for fiction and NLP research, which made this project a lot of fun. Huge thanks to the wonderful @melaniesclar.bsky.social and @tsvetshop.bsky.social!

We welcome any feedback and questions -- don't hesitate to reach out!

16/16

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

FlawedFictions is now available on 🤗: huggingface.co/datasets/ka...

Code: github.com/kabirahuja2...

15/n

GitHub - kabirahuja2431/FlawedFictions

github.com

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

Overall, our work shows that deep narrative understanding/reasoning and generating logically consistent stories remain challenging even for frontier models. Read the full paper for more details: arxiv.org/abs/2504.11900

14/n

Finding Flawed Fictions: Evaluating Complex Reasoning in Language...

Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes -- inconsistencies in a storyline that break the internal logic or rules of a story's...

arxiv.org

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

But how can story summaries have plot holes? Upon close inspection, we find that LLMs often omit crucial details from the summary, which makes subsequent events illogical or inconsistent. This highlights weaknesses in summarization, a task many consider "solved" with current LLMs.

13/n

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

Our results show LLM-generated content contains significantly more plot holes than human-authored stories: 50%+ higher detection rates for summaries and 100%+ increase for contemporary adaptations of classics.

12/n

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

We then assess plot holes in LLM-generated text, focusing on two tasks: story summarization and contemporary adaptation of classic stories. We use our best-performing model on FlawedFictions to automatically detect the presence of plot holes in LLM-generated stories.

11/n

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

What mistakes do models make while assessing plot holes? Our analysis shows they:
- Misinterpret character motivations
- Incorrectly track entity states
- Miss genre conventions (especially in fantasy)
- Misinterpret story rules

Examples 👇🏻

10/n

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

Does extra test-time compute help? Mostly no. Increasing reasoning effort for o1 and o3-mini shows no improvement. Claude-3.7-Sonnet's extended thinking helps, but it still underperforms models using <50% of the test-time compute.

9/n

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

Yet on FlawedFictionsLong (our benchmark with longer stories), even the best models barely outperform trivial baselines. And these stories are still under 4000 words—far shorter than novels or screenplays where plot holes typically occur.

8/n

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

We find that most open-weight models and proprietary LLMs like GPT-4o-mini, GPT-4o, and Claude-Haiku struggle on the task, often only slightly improving over trivial baselines. Advanced models like Claude-3.5-Sonnet and o1 fare better, approaching human performance.

7/n

April 22, 2025 at 6:50 PM
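To make "slightly improving over trivial baselines" concrete, here is a minimal sketch of how such a comparison could be computed for the binary task. The thread does not say which baselines were used, so the constant "always yes" / "always no" predictors, the function names, and the toy labels below are all assumptions for illustration.

```python
def constant_baseline_accuracy(labels: list[bool], constant: bool) -> float:
    """Accuracy of a predictor that answers `constant` for every story."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(label == constant for label in labels) / len(labels)


def margin_over_trivial(model_accuracy: float, labels: list[bool]) -> float:
    """How far a model's accuracy sits above the better constant predictor."""
    best_trivial = max(
        constant_baseline_accuracy(labels, True),
        constant_baseline_accuracy(labels, False),
    )
    return model_accuracy - best_trivial


# Toy, balanced label set (True = story contains a plot hole). Not real data.
toy_labels = [True, True, False, True, False, False, True, False]
always_yes = constant_baseline_accuracy(toy_labels, True)
always_no = constant_baseline_accuracy(toy_labels, False)
```

On a balanced set like this one, both constant predictors score 0.5, so a model at 0.6 accuracy would be only 0.1 above the trivial level.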

Kabir Ahuja

@kabirahuja2431.bsky.social

We tested various LLMs on FlawedFictions. For the classification task we report accuracy, and for the localization task we define CEEval-Full (scored from 0 to 1), which measures whether models correctly localize both the sentences containing the error and the sentences contradicted by the error.

6/n

April 22, 2025 at 6:50 PM
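The two kinds of scoring described above can be sketched as follows. Accuracy is standard. The localization function is only a hypothetical stand-in that yields a 0-1 score, not the definition of CEEval-Full (the thread does not give it, so see the paper for that). The half-credit rule, the use of sentence indices, and all names here are assumptions.

```python
def classification_accuracy(gold: list[bool], predicted: list[bool]) -> float:
    """Fraction of stories where the predicted plot-hole label matches gold."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def localization_score(
    gold_error: set[int],
    gold_contradicted: set[int],
    pred_error: set[int],
    pred_contradicted: set[int],
) -> float:
    """Hypothetical 0-1 score (NOT the paper's CEEval-Full).

    Gives half credit if any predicted error sentence overlaps the gold
    error sentences, and half credit if any predicted contradicted
    sentence overlaps the gold contradicted sentences.
    """
    error_hit = 1.0 if gold_error & pred_error else 0.0
    contradicted_hit = 1.0 if gold_contradicted & pred_contradicted else 0.0
    return (error_hit + contradicted_hit) / 2


# Toy example: sentence indices are made up for illustration.
acc = classification_accuracy([True, False, True, True], [True, True, True, False])
partial = localization_score({5}, {1}, {5, 6}, {2})  # error found, contradicted fact missed
full = localization_score({5}, {1}, {5}, {1})
```

Scoring the two sides separately reflects the task description: a model has to point at both the erroneous sentence and the earlier fact it contradicts.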

Kabir Ahuja

@kabirahuja2431.bsky.social

Using FlawedFictionsMaker + human verification, we created FlawedFictions, a benchmark for plot hole detection that tests: a) identifying whether a story contains a plot hole, and b) localizing both the error and the contradicted fact in the text.

5/n

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

We introduce FlawedFictionsMaker, an algorithm to controllably generate plot holes in stories by extracting facts from a story's first act and contradicting them later in the story.

E.g., if Watson has a left arm injury, we edit later mentions so that it becomes a knee injury.

4/n

Diagram showing the "FlawedFictionsMaker" algorithm that introduces plot holes into stories. It has 5 steps labeled A through E: A: "Partition Original Story in Three Acts" - Shows three story snippets about Watson's injured left arm. B: "Extract Story Facts" - Lists facts including "Sherlock lives in Baker Street" and "Watson has a war wound on his left arm." C: "Select and Build Contradicting Fact" - Shows "What if Watson had a war wound on his left knee instead?" D: "Generate Counterfactual Story" - Shows the same three story snippets but with "knee" replacing "arm" in red text. E: "Rebuild Story, Creating a Plot Hole" - Shows the altered story with inconsistent mentions of both arm and knee injuries.

April 22, 2025 at 6:50 PM
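The five steps in the diagram can be illustrated with a toy sketch. This is not the released FlawedFictionsMaker code: fact extraction and selection (steps B and C) are hard-coded, the counterfactual rewrite (step D) is replaced by plain string substitution, and the equal-thirds act split, the `Fact` class, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    """A story fact plus a hand-picked contradicting edit (illustrative only)."""
    statement: str
    original_term: str
    contradicting_term: str


def partition_into_acts(sentences: list[str]) -> tuple[list[str], list[str], list[str]]:
    """Step A: split the story into three roughly equal acts (assumed rule)."""
    n = len(sentences)
    first_cut, second_cut = n // 3, 2 * n // 3
    return sentences[:first_cut], sentences[first_cut:second_cut], sentences[second_cut:]


def make_counterfactual(act: list[str], fact: Fact) -> list[str]:
    """Step D stand-in: string replacement instead of a real rewrite."""
    return [s.replace(fact.original_term, fact.contradicting_term) for s in act]


def inject_plot_hole(sentences: list[str], fact: Fact) -> list[str]:
    """Step E: keep act one as written, use the counterfactual for later acts."""
    act1, act2, act3 = partition_into_acts(sentences)
    return act1 + make_counterfactual(act2, fact) + make_counterfactual(act3, fact)


# Toy story written for this example; steps B and C are hard-coded in `fact`.
story = [
    "Watson winced as his left arm ached from the old war wound.",
    "Holmes lit his pipe at Baker Street.",
    "Watson rubbed his left arm before lifting the case.",
    "The client arrived at noon.",
    "Watson's left arm slowed him during the chase.",
    "Holmes caught the thief.",
]
fact = Fact("Watson has a war wound on his left arm.", "left arm", "left knee")
flawed = inject_plot_hole(story, fact)
```

In `flawed`, the first act still mentions the arm while later acts mention the knee, which is the kind of continuity error the benchmark asks models to find.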

Kabir Ahuja

@kabirahuja2431.bsky.social

It can also be interpreted as inference-time world modeling: inferring the rules of a story's world at test time and assessing whether they're consistently followed throughout the narrative.

3/n

April 22, 2025 at 6:50 PM

Kabir Ahuja

@kabirahuja2431.bsky.social

Why study plot hole detection? It's a sophisticated reasoning problem requiring:
- Tracking states across long contexts
- Common sense & pragmatics for implicit details
- Theory of mind for character motivations/beliefs

2/n

April 22, 2025 at 6:50 PM
