Unfaithful CoTs (where the model hides its use of a hint) tend to be MORE verbose & convoluted than faithful ones!
Kind of like humans when they elaborate too much with a lie.
Unfaithful CoTs (where the model hides its use of a hint) tend to be MORE verbose & convoluted than faithful ones!
Kind of like humans when they elaborate too much with a lie.
Models might just be mimicking human-like reasoning patterns (!!) learned during training (SFT, RLHF) for our benefit, rather than genuinely using that specific outputted text to derive the answer.
RLHF might even incentivize hiding undesirable reasoning!
Models might just be mimicking human-like reasoning patterns (!!) learned during training (SFT, RLHF) for our benefit, rather than genuinely using that specific outputted text to derive the answer.
RLHF might even incentivize hiding undesirable reasoning!
When tested on harder benchmarks (like GPQA vs MMLU), models were significantly less likely to have faithful CoT.
This casts doubt on using CoT monitoring for complex, real-world alignment challenges.
When tested on harder benchmarks (like GPQA vs MMLU), models were significantly less likely to have faithful CoT.
This casts doubt on using CoT monitoring for complex, real-world alignment challenges.
Bad news for detecting "reward hacking" (models “gaming” the task).
Models trained to exploit reward hacks did so >99% of the time, but almost NEVER (<2%) mentioned the hack in their CoT.
👉 So…CoT monitoring likely won't catch these dangerous behaviors.
Bad news for detecting "reward hacking" (models “gaming” the task).
Models trained to exploit reward hacks did so >99% of the time, but almost NEVER (<2%) mentioned the hack in their CoT.
👉 So…CoT monitoring likely won't catch these dangerous behaviors.
Models are often UNFAITHFUL.
They frequently use the provided hints to get the answer but don't acknowledge it in their CoT output.
Overall faithfulness scores were low (e.g., ~25% for Claude 3.7 Sonnet, ~39% for DeepSeek R1 on these tests).
Models are often UNFAITHFUL.
They frequently use the provided hints to get the answer but don't acknowledge it in their CoT output.
Overall faithfulness scores were low (e.g., ~25% for Claude 3.7 Sonnet, ~39% for DeepSeek R1 on these tests).
They gave models (like Claude & DeepSeek) multiple-choice questions, sometimes embedded hints (correct/incorrect answers) in the prompt metadata.
✅ Faithful CoT = Model uses the hint & says it did.
❌ Unfaithful CoT = Model uses the hint but doesn't mention it.
They gave models (like Claude & DeepSeek) multiple-choice questions, sometimes embedded hints (correct/incorrect answers) in the prompt metadata.
✅ Faithful CoT = Model uses the hint & says it did.
❌ Unfaithful CoT = Model uses the hint but doesn't mention it.
This CoT has shown to increase a model’s reasoning ability and gives us insight into how the model is thinking.
Anthropic's research asks: Is CoT faithful?
This CoT has shown to increase a model’s reasoning ability and gives us insight into how the model is thinking.
Anthropic's research asks: Is CoT faithful?
@AnthropicAI’s new research paper shows that not only do AI models not use CoT like we thought, they might not use it at all for reasoning.
In fact, they might be lying to us in their CoT.
What you need to know: 🧵
@AnthropicAI’s new research paper shows that not only do AI models not use CoT like we thought, they might not use it at all for reasoning.
In fact, they might be lying to us in their CoT.
What you need to know: 🧵
Whenever I get a legal document, I ask @ChatGPTapp to review it for me and ask any questions I have.
Whenever I get a legal document, I ask @ChatGPTapp to review it for me and ask any questions I have.
I generally get AI to write the first draft of my X threads. @perplexity_ai has been best for this, also I also have tried @grok and @ChatGPTapp.
I'll take the transcript from on of my videos, plug it in, and say "make me a tweet thread"
I generally get AI to write the first draft of my X threads. @perplexity_ai has been best for this, also I also have tried @grok and @ChatGPTapp.
I'll take the transcript from on of my videos, plug it in, and say "make me a tweet thread"
Whether it's b-roll for my video, logos, icons, I use AI as a great starting point. I'm generally using Dall-E from @OpenAI or @AnthropicAI's Claude 3.7 Thinking.
Here's an example:
Whether it's b-roll for my video, logos, icons, I use AI as a great starting point. I'm generally using Dall-E from @OpenAI or @AnthropicAI's Claude 3.7 Thinking.
Here's an example:
Whenever I have a non-urgent question about health for myself or my family, I've been starting with AI. @grok has been the best at this, mainly in terms of it's "vibe" while giving me great information.
Whenever I have a non-urgent question about health for myself or my family, I've been starting with AI. @grok has been the best at this, mainly in terms of it's "vibe" while giving me great information.
This is probably more unique to me, but over the last year I've lost my voice twice. So I cloned my voice with @elevenlabsio and will use it to make videos when I don't have a voice.
Can you tell the difference?
This is probably more unique to me, but over the last year I've lost my voice twice. So I cloned my voice with @elevenlabsio and will use it to make videos when I don't have a voice.
Can you tell the difference?
I've been spending a TON of time building games and useful apps for my business. I generally use @cursor_ai and @windsurf_ai with @AnthropicAI's Claude 3.7 Thinking.
Here's a 2D turn-based strategy game I made called Nebula Dominion:
I've been spending a TON of time building games and useful apps for my business. I generally use @cursor_ai and @windsurf_ai with @AnthropicAI's Claude 3.7 Thinking.
Here's a 2D turn-based strategy game I made called Nebula Dominion:
I use AI to help me learn about topics and prepare for my videos. Deep Research from @OpenAI is my goto for this.
Here's an example of Deep Research helping me prepare notes for my video about RL.
I use AI to help me learn about topics and prepare for my videos. Deep Research from @OpenAI is my goto for this.
Here's an example of Deep Research helping me prepare notes for my video about RL.
In fact, I probably use it 50x per day.
For search, I'm mostly going to @perplexity_ai. But I also use @grok and @ChatGPTapp every so often.
Here are some actual searches I've done recently:
In fact, I probably use it 50x per day.
For search, I'm mostly going to @perplexity_ai. But I also use @grok and @ChatGPTapp every so often.
Here are some actual searches I've done recently:
132k context window is meh (small by today’s standards).
It also “thinks” a lot—tons of tokens.
Chain-of-Draft prompting could slim that down.
Artificial Analysis benchmarks show it lags DeepSeek R1 on GPT-QA (59.5% vs 71%) but shines on AMY 2024 (78%).
132k context window is meh (small by today’s standards).
It also “thinks” a lot—tons of tokens.
Chain-of-Draft prompting could slim that down.
Artificial Analysis benchmarks show it lags DeepSeek R1 on GPT-QA (59.5% vs 71%) but shines on AMY 2024 (78%).
Hosted by Grok (xAI), QWQ 32B hits 450 tokens/sec.
I tested it—fixed a bouncing ball sim in seconds.
That’s game-changing for iteration.
It's open-source, too, so anyone can play with it.
Hosted by Grok (xAI), QWQ 32B hits 450 tokens/sec.
I tested it—fixed a bouncing ball sim in seconds.
That’s game-changing for iteration.
It's open-source, too, so anyone can play with it.
RL Stage 2: Added general RL for broader skills like instruction-following & agent tasks. No big drop in math/coding performance—a smart hybrid approach.
RL Stage 2: Added general RL for broader skills like instruction-following & agent tasks. No big drop in math/coding performance—a smart hybrid approach.
Reinforcement Learning (RL) with a twist.
They started with a solid foundation model, applied RL with outcome-based rewards, and scaled it for math & coding tasks.
This elicits “thinking” behavior—verified by accuracy checkers & code execution servers.
Reinforcement Learning (RL) with a twist.
They started with a solid foundation model, applied RL with outcome-based rewards, and scaled it for math & coding tasks.
This elicits “thinking” behavior—verified by accuracy checkers & code execution servers.
It’s much smaller (32B vs 671B params) but delivers comparable results. You can run it on your PC!
Insanely fast, thinking-focused, and agent-capable.
Let’s dive in.
It’s much smaller (32B vs 671B params) but delivers comparable results. You can run it on your PC!
Insanely fast, thinking-focused, and agent-capable.
Let’s dive in.
• Agents can operate far quicker and more effectively.
• Enhanced reasoning capabilities due to increased computation efficiency.
• Compact, powerful models enable high-performance applications on edge devices like laptops.
• Agents can operate far quicker and more effectively.
• Enhanced reasoning capabilities due to increased computation efficiency.
• Compact, powerful models enable high-performance applications on edge devices like laptops.
• Superior reasoning due to holistic output refinement.
• Effective error correction during iterative refinement.
• Flexible, controllable generation (text editing, safety alignment, structured outputs).
• Superior reasoning due to holistic output refinement.
• Effective error correction during iterative refinement.
• Flexible, controllable generation (text editing, safety alignment, structured outputs).
With faster inference, these models can leverage significantly more test-time compute to achieve even better results.
With faster inference, these models can leverage significantly more test-time compute to achieve even better results.
No custom hardware required—this performance leap is accessible immediately.
No custom hardware required—this performance leap is accessible immediately.