Isabelle Lee
@wordscompute.bsky.social
ml/nlp phding @ usc, currently visiting harvard, scientisting @ startup;
interpretability & training & reasoning

iglee.me
paper: arxiv.org/abs/2505.14932
dataset: huggingface.co/datasets/fo...
work w/ @sarahliaw.bsky.social and Dani Yogatama

If you want to chat about interpretability & training dynamics & reasoning and munch on mezzes, come hang out with me in Rabat 🇲🇦🙃
9/9
February 11, 2026 at 5:17 PM
I wanted to study how reasoning is acquired during training, as a function of complexity and process fidelity, but couldn't find a suitable dataset. So we built one that's rigorously annotated and large enough to train a small LM. Now I'm excited about what we can do with it
8/9
a harder task: last-step prediction, ¬(¬Sunny(x) ∧ ¬Breezy(x)) ↔ [MASK], or last-two-step prediction. Most LLMs achieve <50% accuracy on both tasks.

(n.b. since FOL is verifiable, we define "correct" as any generation that's equivalent to the masked expression.)
7/9
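Since the example uses unary predicates over a single x, checking equivalence reduces to a propositional truth table. A minimal sketch of such a verifier (not the paper's actual harness; `equivalent` and the lambdas are illustrative names):

```python
from itertools import product

def equivalent(f, g, n_vars):
    """Exhaustive truth-table check: f and g agree on every assignment."""
    return all(f(*vals) == g(*vals)
               for vals in product([False, True], repeat=n_vars))

# de Morgan instance: ¬(¬Sunny ∧ ¬Breezy) vs. Sunny ∨ Breezy
lhs = lambda s, b: not ((not s) and (not b))
rhs = lambda s, b: s or b
print(equivalent(lhs, rhs, 2))  # → True
```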
e.g. masked prediction: we mask an operator at random and have LLMs guess it: ¬(¬Sunny(x) ∧ ¬Breezy(x)) ↔ (Sunny(x) [MASK] Breezy(x)). LLMs are correct only ~45.7% of the time on average:
6/9
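With correctness defined as equivalence, scoring the masked-operator task is mechanical: enumerate candidate connectives and keep those whose fill matches the left-hand side on every assignment. A hypothetical scorer sketch (`OPS` and `correct_fills` are my names, not the paper's):

```python
from itertools import product

OPS = {"∧": lambda a, b: a and b,
       "∨": lambda a, b: a or b,
       "→": lambda a, b: (not a) or b,
       "↔": lambda a, b: a == b}

def correct_fills(lhs):
    """Operators op such that (Sunny op Breezy) is equivalent to lhs."""
    return [sym for sym, op in OPS.items()
            if all(lhs(s, b) == op(s, b)
                   for s, b in product([False, True], repeat=2))]

lhs = lambda s, b: not ((not s) and (not b))  # ¬(¬Sunny(x) ∧ ¬Breezy(x))
print(correct_fills(lhs))  # → ['∨']
```

Any generation equivalent to one of the returned fills would count as correct, matching the n.b. above.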
...resulting in a bunch of reasoning traces that are verifiably correct, with measurable, programmatic complexity. And we find that they're very hard for LLMs!

Let's consider an example w/ de Morgan's law: ¬(¬Sunny(x) ∧ ¬Breezy(x)) ↔ (Sunny(x) ∨ Breezy(x))
5/9
So how do we strike a balance? We propose First-Order Logic (FOL) as a middle ground. We
1. programmatically generate a large set of random FOL expressions
2. progressively simplify them, verifying equivalence at each step
3. chain the steps together
4. instantiate them in natural language w/ LLMs
4/9
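Steps 1–3 can be sketched in a few lines over a propositional fragment. The real pipeline is richer (quantifiers, many rewrite rules, and LLM instantiation in step 4), so treat this as an illustrative toy with all names my own:

```python
import random
from itertools import product

VARS = ["Sunny", "Breezy"]

def gen(depth=2):
    """Step 1: randomly generate an expression tree."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(VARS)
    op = random.choice(["not", "and", "or"])
    if op == "not":
        return ("not", gen(depth - 1))
    return (op, gen(depth - 1), gen(depth - 1))

def eval_(e, env):
    """Evaluate an expression tree under a truth assignment."""
    if isinstance(e, str):
        return env[e]
    if e[0] == "not":
        return not eval_(e[1], env)
    a, b = eval_(e[1], env), eval_(e[2], env)
    return a and b if e[0] == "and" else a or b

def simplify(e):
    """Step 2 (one rule): double-negation elimination; the real
    pipeline chains many such rewrites into a trace (step 3)."""
    if isinstance(e, str):
        return e
    if e[0] == "not" and isinstance(e[1], tuple) and e[1][0] == "not":
        return simplify(e[1][1])
    return (e[0],) + tuple(simplify(c) for c in e[1:])

def equivalent(e1, e2):
    """Verify a rewrite step by exhaustive truth-table check."""
    return all(eval_(e1, dict(zip(VARS, vs))) == eval_(e2, dict(zip(VARS, vs)))
               for vs in product([False, True], repeat=len(VARS)))

expr = ("not", ("not", "Sunny"))
assert equivalent(expr, simplify(expr))  # each verified pair is one trace step
```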
We mostly interface with LLMs through words, but evaluating NL reasoning is messy. Something like math reasoning, on the other hand, gives us concrete, objectively correct answers, but it's narrow and doesn't look like NL.
3/9
There are many evals and benchmarks in this field, but natural language (NL) reasoning is tricky: meaning depends on context (commonsense), shared assumptions (pragmatics), and what's left unsaid (abduction). Pattern shortcuts/heuristics ≠ logical inference.
2/9
If you're interested in interpretability-driven evaluations, I'd love to hear from you! And stay tuned for more work from us :)
February 11, 2026 at 5:07 PM
or if you're awesome and happen to be in sf, also message me
March 15, 2025 at 1:51 AM
pls message me if you wanna meet up for coffee and chat about ai/physics/llms/interpretability
March 15, 2025 at 1:42 AM
i use the same template and need help getting a butterfly button
March 5, 2025 at 2:13 AM