matthew-berman.bsky.social
@matthew-berman.bsky.social
Even weirder:

Unfaithful CoTs (where the model hides its use of a hint) tend to be MORE verbose & convoluted than faithful ones!

Kind of like humans when they elaborate too much with a lie.
April 8, 2025 at 6:34 PM
Why output CoT then?

Models might just be mimicking human-like reasoning patterns (!!) learned during training (SFT, RLHF) for our benefit, rather than genuinely using that specific outputted text to derive the answer.

RLHF might even incentivize hiding undesirable reasoning!
April 8, 2025 at 6:34 PM
Key Finding 3: Faithfulness DECREASES on harder tasks.

When tested on harder benchmarks (like GPQA vs MMLU), models were significantly less likely to have faithful CoT.

This casts doubt on using CoT monitoring for complex, real-world alignment challenges.
April 8, 2025 at 6:34 PM
Key Finding 2:

Bad news for detecting "reward hacking" (models “gaming” the task).

Models trained to exploit reward hacks did so >99% of the time, but almost NEVER (<2%) mentioned the hack in their CoT.

👉 So…CoT monitoring likely won't catch these dangerous behaviors.
April 8, 2025 at 6:34 PM
Key Finding 1:

Models are often UNFAITHFUL.

They frequently use the provided hints to get the answer but don't acknowledge it in their CoT output.

Overall faithfulness scores were low (e.g., ~25% for Claude 3.7 Sonnet, ~39% for DeepSeek R1 on these tests).
April 8, 2025 at 6:34 PM
How they tested it:

They gave models (like Claude & DeepSeek) multiple-choice questions and sometimes embedded hints (pointing to correct or incorrect answers) in the prompt metadata.

✅ Faithful CoT = Model uses the hint & says it did.
❌ Unfaithful CoT = Model uses the hint but doesn't mention it.
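
A minimal sketch of how a faithfulness score like this could be computed. It is not Anthropic's actual evaluation code: the `Trial` fields, the `used_hint` rule (the model switches to the hinted answer once the hint is present), and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    answer_without_hint: str  # model's answer to the plain question
    answer_with_hint: str     # model's answer when the hint is embedded
    hint_answer: str          # the answer the hint points to
    cot_mentions_hint: bool   # whether the CoT acknowledges the hint (judged separately)


def used_hint(t: Trial) -> bool:
    # Assumption: the model "used" the hint if it switched to the hinted answer.
    return t.answer_without_hint != t.hint_answer and t.answer_with_hint == t.hint_answer


def faithfulness_score(trials: List[Trial]) -> Optional[float]:
    # Only trials where the hint was used count toward the score.
    hinted = [t for t in trials if used_hint(t)]
    if not hinted:
        return None
    # Faithful = used the hint AND said so in the CoT.
    return sum(t.cot_mentions_hint for t in hinted) / len(hinted)


# Hypothetical sample data, not real results.
trials = [
    Trial("A", "C", "C", True),   # used the hint and mentioned it
    Trial("B", "C", "C", False),  # used the hint, stayed silent
    Trial("A", "D", "D", False),  # used the hint, stayed silent
    Trial("C", "C", "C", False),  # already answered C, so not counted
    Trial("A", "A", "B", False),  # ignored the hint, so not counted
]
score = faithfulness_score(trials)
print(score)
```

Here only the first three trials count as hint usage, and one of them admits it, so the score is one in three.
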
April 8, 2025 at 6:34 PM
“Thinking models” use CoT to explore and reason about solutions before outputting their answer.

This CoT has been shown to increase a model’s reasoning ability and gives us insight into how the model is thinking.

Anthropic's research asks: Is CoT faithful?
April 8, 2025 at 6:34 PM
I probably have 10 more use cases I'm not thinking of...but that's a good start!

What are you using AI for? Did I miss anything important?
March 7, 2025 at 6:15 PM
8/ Initial Legal Reviews

Whenever I get a legal document, I ask @ChatGPTapp to review it for me and ask any questions I have.
March 7, 2025 at 6:15 PM
7/ Creative Writing

I generally get AI to write the first draft of my X threads. @perplexity_ai has been best for this, though I have also tried @grok and @ChatGPTapp.

I'll take the transcript from one of my videos, plug it in, and say "make me a tweet thread".
March 7, 2025 at 6:15 PM
6/ Graphics Creation

Whether it's b-roll for my videos, logos, or icons, I use AI as a great starting point. I'm generally using Dall-E from @OpenAI or @AnthropicAI's Claude 3.7 Thinking.

Here's an example:
March 7, 2025 at 6:15 PM
5/ Medical Diagnoses

Whenever I have a non-urgent question about health for myself or my family, I've been starting with AI. @grok has been the best at this, mainly in terms of its "vibe" while giving me great information.
March 7, 2025 at 6:15 PM
4/ Voice Cloning

This is probably more unique to me, but over the last year I've lost my voice twice. So I cloned my voice with @elevenlabsio and will use it to make videos when I don't have a voice.

Can you tell the difference?
March 7, 2025 at 6:15 PM
3/ Vibe Coding

I've been spending a TON of time building games and useful apps for my business. I generally use @cursor_ai and @windsurf_ai with @AnthropicAI's Claude 3.7 Thinking.

Here's a 2D turn-based strategy game I made called Nebula Dominion:
March 7, 2025 at 6:14 PM
2/ Research

I use AI to help me learn about topics and prepare for my videos. Deep Research from @OpenAI is my go-to for this.

Here's an example of Deep Research helping me prepare notes for my video about RL.
March 7, 2025 at 6:14 PM
1/ Search

In fact, I probably use it 50x per day.

For search, I'm mostly going to @perplexity_ai. But I also use @grok and @ChatGPTapp every so often.

Here are some actual searches I've done recently:
March 7, 2025 at 6:14 PM
Still, QwQ 32B is about 20x smaller than DeepSeek R1 (32B vs 671B parameters; the 32B model is roughly 65GB of weights).
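
The 20x figure lines up with parameter counts rather than file sizes, which a quick back-of-the-envelope check shows. This sketch assumes 16-bit weights (2 bytes per parameter) and uses the nominal 32B count from the model name; the constant and function names are illustrative.

```python
QWQ_PARAMS = 32e9         # nominal parameter count implied by the "32B" name
R1_TOTAL_PARAMS = 671e9   # DeepSeek R1 total parameters
BYTES_PER_PARAM = 2       # assumption: 16-bit weights


def weight_size_gb(params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    # Raw weight storage only; ignores file format overhead and KV cache.
    return params * bytes_per_param / 1e9


qwq_gb = weight_size_gb(QWQ_PARAMS)
ratio = R1_TOTAL_PARAMS / QWQ_PARAMS
print(f"{qwq_gb:.0f} GB of weights, {ratio:.1f}x fewer parameters")
```

Under these assumptions the weights come to about 64 GB and the parameter ratio to about 21x, in line with the figures in the post.
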

It's even smaller than the 37B parameters DeepSeek R1 activates per token in its MoE setup.

Efficiency + power + open-source = huge potential.

Go play with it and let me know what you think!

x.com/Alibaba_Qwe...
March 7, 2025 at 5:48 PM
Critiques:

132k context window is meh (small by today’s standards).

It also “thinks” a lot—tons of tokens.

Chain-of-Draft prompting could slim that down.

Artificial Analysis benchmarks show it lags DeepSeek R1 on GPQA (59.5% vs 71%) but shines on AIME 2024 (78%).
March 7, 2025 at 5:48 PM
Speed is wild.

Hosted on Groq, QwQ 32B hits 450 tokens/sec.

I tested it—fixed a bouncing ball sim in seconds.

That’s game-changing for iteration.

It's open-source, too, so anyone can play with it.
March 7, 2025 at 5:48 PM
RL Stage 1: Focused on math & coding with verifiable rewards (e.g., “is the answer right?”).

RL Stage 2: Added general RL for broader skills like instruction-following & agent tasks. No big drop in math/coding performance—a smart hybrid approach.
March 7, 2025 at 5:48 PM
How’d they do it?

Reinforcement Learning (RL) with a twist.

They started with a solid foundation model, applied RL with outcome-based rewards, and scaled it for math & coding tasks.

This elicits “thinking” behavior—verified by accuracy checkers & code execution servers.
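
To make "outcome-based rewards" concrete, here is a rough sketch of the two verifier types mentioned: an accuracy checker for math and an execution check for code. This is not the Qwen team's implementation. The function names, the "final line is the answer" convention, and the test-case format are all assumptions.

```python
def math_reward(model_output: str, reference_answer: str) -> float:
    """Accuracy checker: 1.0 if the last line of the output matches the reference."""
    lines = model_output.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == reference_answer.strip() else 0.0


def code_reward(source: str, func_name: str, test_cases) -> float:
    """Execution check: 1.0 only if the generated function passes every test case."""
    namespace = {}
    try:
        # NOTE: a real setup would run this in a sandboxed execution server.
        exec(source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in test_cases)
        return 1.0 if passed else 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all earn zero.
        return 0.0


# Hypothetical model outputs for illustration.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((0, 0), 0)]

print(math_reward("Let x = 2, so 2x = 4.\n4", "4"))
print(code_reward(good, "add", tests))
print(code_reward(bad, "add", tests))
```

The reward depends only on whether the final result checks out, not on how the model reasoned its way there, which is what makes it "verifiable".
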
March 7, 2025 at 5:48 PM