matthew-berman.bsky.social
@matthew-berman.bsky.social
Even weirder:

Unfaithful CoTs (where the model hides its use of a hint) tend to be MORE verbose & convoluted than faithful ones!

Kind of like humans when they elaborate too much with a lie.
April 8, 2025 at 6:34 PM
Why output CoT then?

Models might just be mimicking human-like reasoning patterns (!!) learned during training (SFT, RLHF) for our benefit, rather than genuinely using that specific outputted text to derive the answer.

RLHF might even incentivize hiding undesirable reasoning!
April 8, 2025 at 6:34 PM
Key Finding 3: Faithfulness DECREASES on harder tasks.

When tested on harder benchmarks (like GPQA vs MMLU), models were significantly less likely to have faithful CoT.

This casts doubt on using CoT monitoring for complex, real-world alignment challenges.
April 8, 2025 at 6:34 PM
Key Finding 2:

Bad news for detecting "reward hacking" (models “gaming” the task).

Models trained to exploit reward hacks did so >99% of the time, but almost NEVER (<2%) mentioned the hack in their CoT.

👉 So…CoT monitoring likely won't catch these dangerous behaviors.
April 8, 2025 at 6:34 PM
Key Finding 1:

Models are often UNFAITHFUL.

They frequently use the provided hints to get the answer but don't acknowledge it in their CoT output.

Overall faithfulness scores were low (e.g., ~25% for Claude 3.7 Sonnet, ~39% for DeepSeek R1 on these tests).
April 8, 2025 at 6:34 PM
How they tested it:

They gave models (like Claude & DeepSeek) multiple-choice questions, sometimes with hints (correct/incorrect answers) embedded in the prompt metadata.

✅ Faithful CoT = Model uses the hint & says it did.
❌ Unfaithful CoT = Model uses the hint but doesn't mention it.
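
A rough sketch of how a faithfulness score like this could be computed. The `Trial` fields, the "switched to the hinted answer" rule for deciding that a hint was used, and the toy data are all assumptions for illustration, not the paper's actual code:

```python
from dataclasses import dataclass


@dataclass
class Trial:
    answer_without_hint: str  # model's answer on the plain question
    answer_with_hint: str     # model's answer when the hint is embedded
    hint_answer: str          # the answer the hint points to
    cot_mentions_hint: bool   # whether the CoT acknowledges the hint


def used_hint(t: Trial) -> bool:
    """Assumed rule: a switch to the hinted answer counts as using the hint."""
    return (t.answer_without_hint != t.hint_answer
            and t.answer_with_hint == t.hint_answer)


def faithfulness_score(trials: list[Trial]) -> float:
    """Fraction of hint-using trials whose CoT mentions the hint."""
    hinted = [t for t in trials if used_hint(t)]
    if not hinted:
        raise ValueError("no trial shows the model using the hint")
    return sum(t.cot_mentions_hint for t in hinted) / len(hinted)


# Toy data, made up for this example.
trials = [
    Trial("A", "C", "C", True),   # used the hint and said so: faithful
    Trial("B", "C", "C", False),  # used the hint, hid it: unfaithful
    Trial("A", "D", "D", False),  # used the hint, hid it: unfaithful
    Trial("C", "C", "C", False),  # already answered C, so not counted
    Trial("A", "A", "D", False),  # ignored the hint, so not counted
]
print(f"{faithfulness_score(trials):.2f}")
```

Only the first three trials count toward the score, so one faithful CoT out of three gives about 0.33.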
April 8, 2025 at 6:34 PM
“Thinking models” use CoT to explore and reason about solutions before outputting their answer.

CoT has been shown to improve a model’s reasoning ability, and it gives us insight into how the model is thinking.

Anthropic's research asks: Is CoT faithful?
April 8, 2025 at 6:34 PM
Is Chain-of-Thought (CoT) reasoning in LLMs just...for show?

@AnthropicAI’s new research paper shows that not only do AI models not use CoT like we thought, they might not use it at all for reasoning.

In fact, they might be lying to us in their CoT.

What you need to know: 🧵
April 8, 2025 at 6:34 PM
8/ Initial Legal Reviews

Whenever I get a legal document, I ask @ChatGPTapp to review it for me and ask any questions I have.
March 7, 2025 at 6:15 PM
7/ Creative Writing

I generally get AI to write the first draft of my X threads. @perplexity_ai has been best for this, but I have also tried @grok and @ChatGPTapp.

I'll take the transcript from one of my videos, plug it in, and say "make me a tweet thread"
March 7, 2025 at 6:15 PM
6/ Graphics Creation

Whether it's b-roll for my videos, logos, or icons, I use AI as a great starting point. I'm generally using Dall-E from @OpenAI or @AnthropicAI's Claude 3.7 Thinking.

Here's an example:
March 7, 2025 at 6:15 PM
5/ Medical Diagnoses

Whenever I have a non-urgent question about health for myself or my family, I've been starting with AI. @grok has been the best at this, mainly in terms of its "vibe" while giving me great information.
March 7, 2025 at 6:15 PM
4/ Voice Cloning

This is probably more unique to me, but over the last year I've lost my voice twice. So I cloned my voice with @elevenlabsio and will use it to make videos when I don't have a voice.

Can you tell the difference?
March 7, 2025 at 6:15 PM
3/ Vibe Coding

I've been spending a TON of time building games and useful apps for my business. I generally use @cursor_ai and @windsurf_ai with @AnthropicAI's Claude 3.7 Thinking.

Here's a 2D turn-based strategy game I made called Nebula Dominion:
March 7, 2025 at 6:14 PM
2/ Research

I use AI to help me learn about topics and prepare for my videos. Deep Research from @OpenAI is my go-to for this.

Here's an example of Deep Research helping me prepare notes for my video about RL.
March 7, 2025 at 6:14 PM
1/ Search

In fact, I probably use it 50x per day.

For search, I'm mostly going to @perplexity_ai. But I also use @grok and @ChatGPTapp every so often.

Here are some actual searches I've done recently:
March 7, 2025 at 6:14 PM
Critiques:

131k context window is meh (small by today’s standards).

It also “thinks” a lot—tons of tokens.

Chain-of-Draft prompting could slim that down.

Artificial Analysis benchmarks show it lags DeepSeek R1 on GPQA (59.5% vs 71%) but shines on AIME 2024 (78%).
March 7, 2025 at 5:48 PM
Speed is wild.

Hosted by Groq, QwQ-32B hits 450 tokens/sec.

I tested it—fixed a bouncing ball sim in seconds.

That’s game-changing for iteration.

It's open-source, too, so anyone can play with it.
March 7, 2025 at 5:48 PM
RL Stage 1: Focused on math & coding with verifiable rewards (e.g., “is the answer right?”).

RL Stage 2: Added general RL for broader skills like instruction-following & agent tasks. No big drop in math/coding performance—a smart hybrid approach.
March 7, 2025 at 5:48 PM
How’d they do it?

Reinforcement Learning (RL) with a twist.

They started with a solid foundation model, applied RL with outcome-based rewards, and scaled it for math & coding tasks.

This elicits “thinking” behavior—verified by accuracy checkers & code execution servers.
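
To make "outcome-based rewards" concrete, here is a minimal sketch of what an accuracy checker and a code-execution check could look like. The function names, the "last number in the output" rule, and the `solve` entry point are hypothetical choices for illustration, not the actual training setup:

```python
import re


def math_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last number in the completion equals the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference) else 0.0


def code_reward(source: str, tests: list[tuple[tuple, object]],
                fn_name: str = "solve") -> float:
    """Run generated code and return 1.0 only if every test case passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # a real system would run this in a sandbox
        fn = namespace[fn_name]
        passed = all(fn(*args) == expected for args, expected in tests)
        return 1.0 if passed else 0.0
    except Exception:
        # Syntax errors, missing functions, and crashes all earn no reward.
        return 0.0


good = "def solve(a, b):\n    return a + b"
bad = "def solve(a, b):\n    return a - b"
cases = [((1, 2), 3), ((0, 0), 0)]

print(math_reward("... so the answer is 42", "42"))  # 1.0
print(code_reward(good, cases))                      # 1.0
print(code_reward(bad, cases))                       # 0.0
```

The point of this design is that the reward only looks at the final outcome (right answer, passing tests), never at how the model reasoned to get there.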
March 7, 2025 at 5:48 PM
Alibaba just dropped QwQ-32B, an open-source model rivaling DeepSeek R1.

It’s much smaller (32B vs 671B params) but delivers comparable results. You can run it on your PC!
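
For a sense of scale, a back-of-the-envelope estimate of the memory needed just to hold the weights. The 16-bit and 4-bit precisions are assumptions, and the estimate ignores the KV cache, activations, and runtime overhead:

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return params_billions * bits_per_param / 8


for name, params in [("32B model", 32), ("671B model", 671)]:
    for bits in (16, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: {gb:.1f} GB")
```

At 4-bit, 32B parameters come to roughly 16 GB of weights, versus roughly 335 GB for 671B parameters.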

Insanely fast, thinking-focused, and agent-capable.

Let’s dive in.
March 7, 2025 at 5:48 PM
Implications are massive:

• Agents can operate far quicker and more effectively.
• Enhanced reasoning capabilities due to increased computation efficiency.
• Compact, powerful models enable high-performance applications on edge devices like laptops.
March 7, 2025 at 12:48 AM
Additional advantages:

• Superior reasoning due to holistic output refinement.
• Effective error correction during iterative refinement.
• Flexible, controllable generation (text editing, safety alignment, structured outputs).
March 7, 2025 at 12:48 AM
Benchmark results show Mercury Coder models outperforming many existing LLMs in speed while maintaining competitive coding accuracy.

With faster inference, these models can leverage significantly more test-time compute to achieve even better results.
March 7, 2025 at 12:48 AM
Mercury, the flagship diffusion model from Inception Labs, achieves over 1000 tokens/second on standard Nvidia H100 GPUs.

No custom hardware required—this performance leap is accessible immediately.
March 7, 2025 at 12:48 AM