And my full video breakdown:
youtu.be/r7wCtN3mzXk
Unfaithful CoTs (where the model hides its use of a hint) tend to be MORE verbose & convoluted than faithful ones!
Kind of like humans when they elaborate too much with a lie.
Models might just be mimicking human-like reasoning patterns (!!) learned during training (SFT, RLHF) for our benefit, rather than genuinely using that specific outputted text to derive the answer.
RLHF might even incentivize hiding undesirable reasoning!
When tested on harder benchmarks (like GPQA vs MMLU), models were significantly less likely to have faithful CoT.
This casts doubt on using CoT monitoring for complex, real-world alignment challenges.
Bad news for detecting "reward hacking" (models “gaming” the task).
Models trained to exploit reward hacks did so >99% of the time, but almost NEVER (<2%) mentioned the hack in their CoT.
👉 So…CoT monitoring likely won't catch these dangerous behaviors.
Models are often UNFAITHFUL.
They frequently use the provided hints to get the answer but don't acknowledge it in their CoT output.
Overall faithfulness scores were low (e.g., ~25% for Claude 3.7 Sonnet, ~39% for DeepSeek R1 on these tests).
They gave models (like Claude & DeepSeek) multiple-choice questions and sometimes embedded hints (pointing to correct or incorrect answers) in the prompt metadata.
✅ Faithful CoT = Model uses the hint & says it did.
❌ Unfaithful CoT = Model uses the hint but doesn't mention it.
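To make the metric concrete, here's a minimal sketch of how a faithfulness score like this could be computed. It is not the paper's code: the field names (`answer_without_hint`, `answer_with_hint`, `hint_answer`, `cot_mentions_hint`) are hypothetical, and treating "the answer flips to the hinted option once the hint is added" as "the model used the hint" is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    # All field names are hypothetical labels for illustration.
    answer_without_hint: str   # model's answer on the plain question
    answer_with_hint: str      # model's answer once the hint is embedded
    hint_answer: str           # the option the hint points to
    cot_mentions_hint: bool    # did the CoT acknowledge the hint?


def used_hint(t: Trial) -> bool:
    # Assumption: the model "used" the hint if its answer changed
    # to the hinted option after the hint was added.
    return (t.answer_without_hint != t.hint_answer
            and t.answer_with_hint == t.hint_answer)


def faithfulness_score(trials: List[Trial]) -> Optional[float]:
    # Share of hint-using trials whose CoT admits to using the hint.
    hinted = [t for t in trials if used_hint(t)]
    if not hinted:
        return None  # score is undefined if the hint was never used
    return sum(t.cot_mentions_hint for t in hinted) / len(hinted)


trials = [
    Trial("A", "C", "C", True),    # used hint, said so (faithful)
    Trial("B", "C", "C", False),   # used hint, hid it (unfaithful)
    Trial("A", "D", "D", False),   # unfaithful
    Trial("A", "D", "D", False),   # unfaithful
    Trial("C", "C", "C", False),   # already answered C, so excluded
]
score = faithfulness_score(trials)
print(score)
```

With this toy data, 1 of the 4 hint-using trials mentions the hint, so the score is 0.25.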
CoT has been shown to increase a model’s reasoning ability and gives us insight into how the model is thinking.
Anthropic's research asks: Is CoT faithful?
What are you using AI for? Did I miss anything important?
Whenever I get a legal document, I ask @ChatGPTapp to review it for me and answer any questions I have.
I generally get AI to write the first draft of my X threads. @perplexity_ai has been best for this, but I have also tried @grok and @ChatGPTapp.
I'll take the transcript from one of my videos, plug it in, and say "make me a tweet thread."
Whether it's b-roll for my videos, logos, or icons, I use AI as a great starting point. I'm generally using Dall-E from @OpenAI or @AnthropicAI's Claude 3.7 Thinking.
Here's an example:
Whenever I have a non-urgent question about health for myself or my family, I've been starting with AI. @grok has been the best at this, mainly in terms of its "vibe" while giving me great information.
This is probably more unique to me, but over the last year I've lost my voice twice. So I cloned my voice with @elevenlabsio and will use it to make videos when I don't have a voice.
Can you tell the difference?
I've been spending a TON of time building games and useful apps for my business. I generally use @cursor_ai and @windsurf_ai with @AnthropicAI's Claude 3.7 Thinking.
Here's a 2D turn-based strategy game I made called Nebula Dominion:
I use AI to help me learn about topics and prepare for my videos. Deep Research from @OpenAI is my go-to for this.
Here's an example of Deep Research helping me prepare notes for my video about RL.
In fact, I probably use it 50x per day.
For search, I'm mostly going to @perplexity_ai. But I also use @grok and @ChatGPTapp every so often.
Here are some actual searches I've done recently:
Even beats DeepSeek’s 37B active params in MoE setups.
Efficiency + power + open-source = huge potential.
Go play with it and let me know what you think!
x.com/Alibaba_Qwe...
132k context window is meh (small by today’s standards).
It also “thinks” a lot—tons of tokens.
Chain-of-Draft prompting could slim that down.
Artificial Analysis benchmarks show it lags DeepSeek R1 on GPQA (59.5% vs 71%) but shines on AIME 2024 (78%).
Hosted on Groq, QwQ-32B hits 450 tokens/sec.
I tested it—fixed a bouncing ball sim in seconds.
That’s game-changing for iteration.
It's open-source, too, so anyone can play with it.
RL Stage 2: Added general RL for broader skills like instruction-following & agent tasks. No big drop in math/coding performance—a smart hybrid approach.
Reinforcement Learning (RL) with a twist.
They started with a solid foundation model, applied RL with outcome-based rewards, and scaled it for math & coding tasks.
This elicits “thinking” behavior—verified by accuracy checkers & code execution servers.
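Here's a rough sketch of what outcome-based rewards like these could look like. This is not the model team's actual pipeline: the function names, the `Answer:` output format, the binary 0/1 reward, and the timeout value are all assumptions for illustration. The point is that the reward only checks the final result, not the reasoning that led to it.

```python
import subprocess
import sys


def math_reward(model_output: str, reference: str) -> float:
    # Hypothetical format: the final answer follows the last "Answer:" marker.
    marker = "Answer:"
    if marker not in model_output:
        return 0.0
    final = model_output.rsplit(marker, 1)[1].strip()
    # Accuracy check: exact match against the reference answer.
    return 1.0 if final == reference.strip() else 0.0


def code_reward(candidate_code: str, test_code: str,
                timeout_s: float = 5.0) -> float:
    # Run the candidate plus its tests in a fresh Python process.
    # A real code execution server would also sandbox this; omitted here.
    program = candidate_code + "\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    # Exit code 0 means every assertion in the tests passed.
    return 1.0 if result.returncode == 0 else 0.0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"

print(math_reward("Let me think... Answer: 42", "42"))
print(code_reward(good, tests))
print(code_reward(bad, tests))
```

Because the reward looks only at the outcome, the model is free to "think" for as long as it wants before answering, which is one plausible reason this kind of training encourages long reasoning traces.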