Stella Li
@stellali.bsky.social
PhD student @uwnlp.bsky.social @uwcse.bsky.social | visiting researcher @MetaAI | previously @jhuclsp.bsky.social

https://stellalisy.com
Pinned
🤔💭What even is reasoning? It's time to answer the hard questions!

We built the first unified taxonomy of 28 cognitive elements underlying reasoning

Spoiler—LLMs commonly employ sequential reasoning, rarely self-awareness, and often fail to use correct reasoning structures🧠
Co-led with @pkargupta.bsky.social ✨ We learned so much and couldn't have done it w/o our amazing collaborators and mentors: Ken Wang, Jinu Lee, @shan23chen.bsky.social, @orevaahia.bsky.social, Dean Light, Tom Griffiths, @maxkw.bsky.social, Jiawei Han, @asli-celikyilmaz.bsky.social, Yulia Tsvetkov🩵
November 25, 2025 at 6:26 PM
More fun details, especially the extensive cognitive science background💫, in our 24-page paper!

📄Paper: arxiv.org/abs/2511.16660
💻Code: github.com/pkargupta/co...
🤗Data: huggingface.co/collections/...
🌐 Blogpost: tinyurl.com/cognitive-fo...
Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand...
arxiv.org
November 25, 2025 at 6:26 PM
What our Cognitive Foundations framework enables:

🔍 Systematic diagnosis of reasoning failures
🎯 Predicting which training yields which capabilities
🧪 Testing cognitive theories at scale
🌉 Shared vocabulary bridging cognition & AI research

More on opportunities & challenges in📄
November 25, 2025 at 6:26 PM
Test-time reasoning guidance: up to 66.7% improvement 💡

We scaffold cognitive structures from successful traces to guide reasoning.

Major gains on ill-structured problems🌟

Models possess latent capabilities—they just don't deploy them adaptively without explicit guidance.
November 25, 2025 at 6:26 PM
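The scaffolding idea in the post above can be sketched in a few lines. This is a minimal illustration, not the paper's actual method: the element sequence, helper name, and prompt wording are all assumptions.

```python
# Minimal sketch of test-time cognitive scaffolding (illustrative, not
# the paper's API). Idea: mine the cognitive-element sequence from
# successful traces on similar problems, then turn it into explicit
# guidance prepended to the prompt.

# Hypothetical element sequence mined from successful traces
scaffold = ["selective attention", "knowledge alignment", "forward chaining"]

def build_scaffolded_prompt(problem: str, scaffold: list[str]) -> str:
    steps = "\n".join(
        f"{i}. Apply {element}." for i, element in enumerate(scaffold, 1)
    )
    return (
        "Before answering, structure your reasoning as follows:\n"
        f"{steps}\n\nProblem: {problem}"
    )

print(build_scaffolded_prompt("Plan a fair tournament schedule.", scaffold))
```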
🧑🏻Humans reason differently‼️ More abstraction (54% vs 36%), self-awareness (49% vs 19%), conceptual processing. Less surface enumeration and rigid sequential chaining.

Even with correct answers—underlying mechanisms diverge fundamentally.
November 25, 2025 at 6:26 PM
We analyzed 1,598 LLM reasoning papers:

Research concentrates on easily quantifiable behaviors—sequential organization (55%), decomposition (60%)

Neglects meta-cognitive controls (8-16%) and alternative representations (10-27%) that correlate with success⚠️
November 25, 2025 at 6:26 PM
Structure matters as much as presence📐

We introduce a method to extract reasoning structure from traces

Successful: selective attention → knowledge alignment → forward chaining
Common: skip to forward chaining

LLMs prematurely seek solution before understanding constraints‼️
November 25, 2025 at 6:26 PM
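A rough sketch of what "extracting reasoning structure" could look like, assuming traces come pre-annotated with span-level elements (the data format here is an assumption, not the released schema):

```python
from collections import Counter

# Each trace is a list of (span_text, element) pairs in order of
# appearance; the "structure" is the element sequence with consecutive
# repeats collapsed.

def element_sequence(trace):
    seq = []
    for _, element in trace:
        if not seq or seq[-1] != element:  # collapse consecutive repeats
            seq.append(element)
    return tuple(seq)

def common_structures(traces, top_k=3):
    return Counter(element_sequence(t) for t in traces).most_common(top_k)

successful = [[("...", "selective attention"), ("...", "knowledge alignment"),
               ("...", "forward chaining")]]
failed = [[("...", "forward chaining"), ("...", "forward chaining")]]
print(common_structures(successful), common_structures(failed))
```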
Model-specific patterns reveal training impact:

Olmo 3 exhibits more diverse cognitive elements (49%)—its developers explicitly included meta-reasoning data during midtraining.
DeepHermes-3: only 12% avg presence.

Training methodology shapes cognitive profiles dramatically.
November 25, 2025 at 6:26 PM
Meta-cognitive deficit is severe:

🤔Self-awareness: 16% in research design, 19% in LLM traces vs 49% in humans
🧐Self-evaluation on non-verifiable problems collapses (53.5% presence, 0.031 correlation)

Models can't self-assess without ground truth.
November 25, 2025 at 6:26 PM
The presence-effectiveness paradox:

Logical coherence: 91% of traces, 0.091 corr. w/ success
Knowledge alignment: 20% of traces, 0.234 correlation (high)

Models frequently attempt core elements but fail to execute. Having the capability ≠ deploying it successfully😬
November 25, 2025 at 6:26 PM
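The two numbers per element above (presence rate, correlation with success) are straightforward to compute from binary annotations. A minimal sketch with stand-in data; variable names are illustrative:

```python
import numpy as np

# presence[i] = 1 if the element appears in trace i;
# success[i] = 1 if trace i reached a correct answer.
rng = np.random.default_rng(0)
presence = rng.integers(0, 2, size=1000)   # stand-in annotations
success = rng.integers(0, 2, size=1000)    # stand-in outcomes

presence_rate = presence.mean()
# Point-biserial correlation reduces to Pearson r for 0/1 variables
corr = np.corrcoef(presence, success)[0, 1]
print(f"presence: {presence_rate:.1%}, corr. with success: {corr:.3f}")
```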
Models deploy strategies inversely to what works 🚨

As problems become ill-structured, models narrow their repertoire—but successful traces show the need for greater diversity (successful = high PPMI in fig).

Sequential organization dominates. Meta-cognition disappears in LLMs.
November 25, 2025 at 6:26 PM
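For reference, PPMI here is presumably the standard positive pointwise mutual information between an element appearing in a trace and the trace succeeding (the paper may condition differently, e.g. per problem type):

```latex
% PPMI between element e and success, estimated from trace counts;
% max(0, ...) clips negative associations to zero.
\mathrm{PPMI}(e,\ \text{success}) = \max\!\left(0,\ \log \frac{P(e,\ \text{success})}{P(e)\,P(\text{success})}\right)
```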
We analyze 192K reasoning traces from 18 LLMs (text, image, and video models) + 54 human think-aloud traces

We introduce a framework for fine-grained span-level cognitive evaluation: WHICH elements appear, WHERE, and HOW they're sequenced.

First analysis of its kind at this scale📊
November 25, 2025 at 6:26 PM
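One way to picture the span-level evaluation unit: a record answering WHICH element, WHERE in the trace, and HOW elements are sequenced. Field names below are assumptions for illustration, not the released schema:

```python
from dataclasses import dataclass

@dataclass
class CognitiveSpan:
    trace_id: str
    element: str      # one of the 28 taxonomy elements
    start: int        # character offset where the span begins
    end: int          # character offset where the span ends

trace = "First, restate the constraints... then check each case..."
spans = [
    CognitiveSpan("t1", "selective attention", 0, 33),
    CognitiveSpan("t1", "verification", 34, 58),
]
# Sequencing falls out of sorting spans by position
print([s.element for s in sorted(spans, key=lambda s: s.start)])
```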
Our taxonomy bridges cognitive science → LLM eval:

28 elements across 4 dimensions—reasoning invariants (compositionality, logical coherence), meta-cognitive controls (self-awareness), representations (hierarchical, causal), and operations (backtracking, verification)
November 25, 2025 at 6:26 PM
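As a data structure, the taxonomy is just a mapping from dimensions to elements. Populated below only with the elements named in this thread; the full taxonomy has 28 (see the paper for the complete list):

```python
TAXONOMY = {
    "reasoning invariants": ["compositionality", "logical coherence"],
    "meta-cognitive controls": ["self-awareness"],
    "representations": ["hierarchical", "causal"],
    "operations": ["backtracking", "verification"],
}
```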
LLMs solve hard problems but fail on easy variants, exhibiting patterns different from humans'.

The issue: reasoning evaluation judges outcomes w/o understanding the cognitive processes that produce them. We can't diagnose failures or predict how training produces capabilities🚨
November 25, 2025 at 6:26 PM
Reposted by Stella Li
Because Olmo 3 is fully open, we decontaminate our evals from our pretraining and midtraining data. @stellali.bsky.social proves this with spurious rewards: RL trained on a random reward signal can't improve on the evals, unlike some previous setups
November 20, 2025 at 8:38 PM
Reposted by Stella Li
Day 1 (Tue Oct 7) 4:30-6:30pm, Poster Session 2

Poster #77: ALFA: Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning; led by
@stellali.bsky.social & @jiminmun.bsky.social
October 6, 2025 at 2:51 PM
This project was done as part of the Meta FAIR AIM mentorship program. Special thanks to my amazing collaborators and awesome mentors @melaniesclar.bsky.social @jcqln_h @hunterjlang @AnsongNi @andrew_e_cohen @jacoby_xu @chan_young_park @tsvetshop.bsky.social @asli-celikyilmaz.bsky.social 🫶🏻💙
July 22, 2025 at 2:59 PM
✨PrefPalette🎨 bridges cognitive science, social psychology, and AI for explainable preference modeling✨

📖Paper: arxiv.org/abs/2507.13541
💻Code: github.com/stellalisy/P...

Join us in shaping interpretable AI that you can trust and control🚀Feedback welcome!
#AI #Transparency
PrefPalette: Personalized Preference Modeling with Latent Attributes
Personalizing AI systems requires understanding not just what users prefer, but the reasons that underlie those preferences - yet current preference models typically treat human judgment as a black bo...
arxiv.org
July 22, 2025 at 2:59 PM
🌍Bonus: PrefPalette🎨 is a computational social science goldmine!

📊 Quantify community values at scale
📈 Track how norms evolve over time
🔍 Understand group psychology
📋 Move beyond surveys to revealed preferences
July 22, 2025 at 2:59 PM
💡Potential real-world applications:

🛡️Smart content moderation—explaining why content is flagged and why decisions are made

🎯Interpretable LM alignment—revealing prominent attributes

⚙️Controllable personalization—giving user agency to personalize select attributes
July 22, 2025 at 2:59 PM
🔍More importantly‼️we can see WHY preferences differ:

r/AskHistorians:📚values verbosity
r/RoastMe:💥values directness
r/confession:❤️values empathy

We visualize each group’s unique preference decisions—no more one-size-fits-all. Understand your audience at a glance🏷️
July 22, 2025 at 2:59 PM
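The per-community view above amounts to plotting each community's learned attribute weights. A minimal sketch; the weight values are made up for illustration (PrefPalette learns them during preference training):

```python
import matplotlib.pyplot as plt

# Illustrative weights for one community (not real learned values)
weights = {"verbosity": 0.8, "directness": 0.1, "empathy": 0.3}
plt.bar(list(weights), list(weights.values()))
plt.ylabel("learned attribute weight")
plt.title("r/AskHistorians (illustrative values)")
plt.show()
```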
🏆Results across 45 Reddit communities:

📈Performance boost: +46.6% vs GPT-4o
💪Outperforms other training-based baselines w/ statistical significance
🕰️Robust to temporal shifts—trained pref models can be used out of the box!
July 22, 2025 at 2:59 PM
⚙️How it works (pt.2)

1: 🎛️Train compact, efficient detectors for every attribute

2: 🎯Learn community-specific attribute weights during preference training

3: 🔧Add attribute embeddings to preference model for accurate & explainable predictions
July 22, 2025 at 2:59 PM
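A minimal sketch of the attribute-weighting idea in steps 2–3 above, assuming per-attribute detector scores are combined with learned community-specific weights and fused with a text embedding. Architecture details here are assumptions, not the released model:

```python
import torch
import torch.nn as nn

class AttributePreferenceModel(nn.Module):
    def __init__(self, n_attributes: int, n_communities: int, d_text: int):
        super().__init__()
        # One learned weight vector per community over the attributes
        self.community_weights = nn.Embedding(n_communities, n_attributes)
        self.scorer = nn.Linear(d_text + n_attributes, 1)

    def forward(self, text_emb, attr_scores, community_id):
        w = self.community_weights(community_id)   # (B, n_attributes)
        weighted = w * attr_scores                 # weight detector outputs
        return self.scorer(torch.cat([text_emb, weighted], dim=-1))

# 19 attributes and 45 communities, matching the thread's numbers
model = AttributePreferenceModel(n_attributes=19, n_communities=45, d_text=768)
score = model(torch.randn(2, 768), torch.rand(2, 19), torch.tensor([0, 7]))
print(score.shape)  # torch.Size([2, 1])
```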
⚙️How it works (prep stage)

📜Define 19 sociolinguistic & cultural attributes from the literature
🏭Novel preference data generation pipeline to isolate attributes

Our data gen pipeline generates pairwise data on *any* decomposed dimension, w/ applications beyond preference modeling
July 22, 2025 at 2:59 PM
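One plausible shape for the attribute-isolating pair generation: for each seed comment, generate a rewrite that changes ONE target attribute while holding everything else fixed, yielding a preference pair labeled by that attribute alone. The prompt wording and helper below are illustrative; the released pipeline differs in detail:

```python
def make_pair_prompt(comment: str, attribute: str) -> str:
    # Ask for a rewrite that moves only the target attribute
    return (
        f"Rewrite the comment below so that it expresses much higher "
        f"{attribute}, keeping topic, length, and all other qualities "
        f"the same.\n\nComment: {comment}"
    )

prompt = make_pair_prompt("The answer is on page 3.", "empathy")
# pair = (comment, call_llm(prompt))  # call_llm = stand-in for any LLM API
print(prompt)
```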