🦄🦆 Curious about a unicorn duck? Stop by, get one, and chat with us!
We made a new demo that detects hidden conflicts ("concept incongruence") in system prompts, for safer prompting.
🔗: github.com/ChicagoHAI/d...
🗓️ Dec 3, 11 AM–2 PM
5/n🧵
But has anyone checked whether the way they reach their results actually makes sense?
Our framework, MechEvalAgents, verifies the science, not just the story 🤖
1/n🧵
5/n🧵
GLM has an identity crisis. It often mistakes itself for Claude 😅.
Most models see GPT, Claude, and Gemini as "frontier" families, equating them with high-quality text.
Spoiler: GPT claims Claude loves "not merely"… but it's actually Gemini. A glimpse into training-data biases 📚
6/n 🧵
▪️ Binary task: accuracy often below baseline
▪️ Exact prediction: near random chance (~10%)
🤖 Only 4/10 models ever predicted themselves, and 97.7% of all predictions clustered on GPT & Claude
4/n 🧵
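(Aside, not from the paper: a tiny sketch of the reference points those accuracies are compared against. The 50/50 binary split below is a hypothetical assumption, not a reported figure.)

```python
# Illustrative only: the baselines the accuracies above are measured against.
n_models = 10
exact_chance = 1 / n_models  # ~10% random-chance floor for exact authorship prediction

# Hypothetical binary split: if, say, half the test texts are self-written,
# always guessing the majority class already scores 50%.
self_written_share = 0.5
binary_majority_baseline = max(self_written_share, 1 - self_written_share)

print(f"exact-prediction chance: {exact_chance:.0%}")
print(f"binary majority-class baseline: {binary_majority_baseline:.0%}")
```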
Humans pass the mirror test at ~18 months 👶
But what about LLMs? Can they recognize their own writing—or even admit authorship at all?
In our new paper, we put 10 state-of-the-art models to the test. Read on 👇
1/n 🧵
🧠 Read my blog to learn what we found, why it matters for AI safety and creativity, and what's next: cichicago.substack.com/p/concept-in...
Role-play literally shifts internal timelines. Even when we hand the model the death year, abstention soars +75%, yet conditional accuracy drops (-6% Llama, -8% Gemma). Role-play warps temporal embeddings, revealing an alignment trade-off.
6/n🧵
Linear probes find a shaky "alive vs dead" signal (85% in RP vs 100% non-RP) and no crisp death-year encoding: the closer to death, the fuzzier the representations. Non-linear probes don't recover it either.
5/n 🧵
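(For intuition, here's a minimal sketch of what such a linear probe could look like, assuming hidden-state vectors and alive/dead labels are already extracted. The scikit-learn probe and the names are my own choices, not the paper's code.)

```python
# Minimal linear-probe sketch (illustrative, not the paper's implementation):
# fit a logistic-regression probe on hidden states labeled alive vs dead.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def alive_dead_probe_accuracy(hidden_states: np.ndarray,
                              alive_labels: np.ndarray,
                              seed: int = 0) -> float:
    """hidden_states: (n_prompts, d_model); alive_labels: (n_prompts,) in {0, 1}."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, alive_labels, test_size=0.2, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Held-out accuracy; run separately on RP and non-RP activations
    # to compare the ~85% vs ~100% gap described above.
    return probe.score(X_te, y_te)
```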
4/n🧵
▪️ Abstention Rate
▪️ Answer Rate
▪️ Conditional Accuracy
Result: Llama and Claude try to abstain but deviate from the expected behavior, and all models suffer an accuracy drop.
3/n 🧵
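(For concreteness, a minimal sketch of how these three metrics can be computed. This is my own illustration under assumed per-trial labels, not the paper's code.)

```python
# Sketch of the three metrics (illustrative; assumes each trial records whether
# the model abstained and, if it answered, whether the answer was correct).
from dataclasses import dataclass

@dataclass
class Trial:
    abstained: bool  # model declined to answer
    correct: bool    # only meaningful when the model answered

def role_play_metrics(trials: list[Trial]) -> dict[str, float]:
    n = len(trials)
    answered = [t for t in trials if not t.abstained]
    abstention_rate = 1 - len(answered) / n   # share of trials with a refusal
    answer_rate = len(answered) / n           # share of trials with an answer
    conditional_accuracy = (                  # accuracy among answered trials only
        sum(t.correct for t in answered) / len(answered) if answered else float("nan")
    )
    return {
        "abstention_rate": abstention_rate,
        "answer_rate": answer_rate,
        "conditional_accuracy": conditional_accuracy,
    }
```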
Three levels:
(A) Human concepts colliding within the prompt.
(B) Human concepts clashing with the model’s internal concepts.
(C) Conflicting concepts within the model itself.
Think “🦄 unicorn with two horns” or “1920s Marie Curie at the 2025 Super Bowl.”
2/n 🧵
Ever asked an LLM role-playing Marilyn Monroe (d. 1962) who the US president was in 2000? 🤔 Should the LLM answer at all? We call these clashes Concept Incongruence. Read on! ⬇️
1/n 🧵