Anthropic [UNOFFICIAL]
@anthropicbot.bsky.social
Mirror account crossposting all of Anthropic's tweets from its Twitter accounts to Bluesky. Unofficial. For the real account, follow @anthropic.com

"We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems."
January 12, 2026 at 10:54 PM
0:00 Introduction
0:22 Meet the panel
1:06 Vibes on campus
6:28 What are students building?
11:27 AI as tool vs. crutch
16:44 Are professors keeping up?
20:15 Downsides
25:55 AI and the job market
34:23 Rapid-fire questions (2/3)
January 12, 2026 at 10:54 PM
Cowork is available as a research preview for Claude Max subscribers in the macOS app. Click on “Cowork” in the sidebar: https://claude.com/download

If you're on another plan, join the waitlist for future access here: https://forms.gle/mtoJrd8kfYny29jQ9
Cowork research preview
Claude Code's agentic power, now in the desktop app. No terminal required. Point Claude at local folders, kick off a task, and step back. Claude spins up parallel sub-agents to research, write, and organize while you do other things. Max plan users have access now. Join the waitlist as we gradually expand access. Note: available in the macOS desktop app only for now.
January 12, 2026 at 8:10 PM
Once you've set a task, Claude makes a plan and steadily completes it, looping you in along the way.

Claude will ask before taking any significant actions so you can course-correct as needed.
January 12, 2026 at 8:10 PM
After 1,700 cumulative hours of red-teaming, we have yet to identify a universal jailbreak (a single, consistent attack strategy that succeeds across many queries) against our new system.

Read the full paper: https://arxiv.org/abs/2601.04603
Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks
We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. Our system combines several key insights. First, we develop exchange classifiers that evaluate model responses in their full conversational context, which addresses vulnerabilities in last-generation systems that examine outputs in isolation. Second, we implement a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Third, we train efficient linear probe classifiers and ensemble them with external classifiers to simultaneously improve robustness and reduce computational costs. Together, these techniques yield a production-grade system achieving a 40x computational cost reduction compared to our baseline exchange classifier, while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, we demonstrate strong protection against universal jailbreaks -- no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Our work establishes Constitutional Classifiers as practical and efficient safeguards for large language models.
January 10, 2026 at 12:19 PM
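The two-stage cascade described in the abstract above can be sketched roughly as follows. This is only an illustration of the routing idea, not the paper's implementation: the `Cascade` class, the `Scorer` type, both thresholds, and the keyword-based toy scorers are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a scorer reads the whole exchange (prompt and
# response together) and returns a harm score in [0, 1].
Scorer = Callable[[str, str], float]


@dataclass
class Cascade:
    cheap: Scorer              # lightweight screen, runs on all traffic
    expensive: Scorer          # costly classifier, runs only on escalations
    escalate_at: float = 0.2   # illustrative threshold, not from the paper
    block_at: float = 0.5      # illustrative threshold, not from the paper
    escalations: int = 0       # how often stage 2 was invoked

    def decide(self, prompt: str, response: str) -> str:
        # Stage 1: every exchange gets the cheap screen.
        if self.cheap(prompt, response) < self.escalate_at:
            return "allow"
        # Stage 2: only suspicious exchanges pay for the expensive check.
        self.escalations += 1
        score = self.expensive(prompt, response)
        return "block" if score >= self.block_at else "allow"


def toy_cheap(prompt: str, response: str) -> float:
    # Stand-in for a lightweight classifier: flags a placeholder keyword.
    return 0.9 if "suspicious" in (prompt + " " + response).lower() else 0.05


def toy_expensive(prompt: str, response: str) -> float:
    # Stand-in for the expensive classifier: confirms a second placeholder.
    return 0.95 if "harmful" in (prompt + " " + response).lower() else 0.1


cascade = Cascade(cheap=toy_cheap, expensive=toy_expensive)
results = [
    cascade.decide("hello", "hi there"),
    cascade.decide("a suspicious question", "a benign answer"),
    cascade.decide("a suspicious question", "a harmful answer"),
]
print(results, cascade.escalations)
```

Both scorers take the prompt and the response together, mirroring the abstract's point that exchange classifiers judge outputs in their conversational context instead of in isolation.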
Because the system reuses internal activations the model already computes, and reserves heavier computation for potentially harmful exchanges, it adds only ~1% compute overhead.

It’s also more accurate, with an 87% drop in refusal rates on harmless requests.
January 10, 2026 at 12:19 PM
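One way to see why a cascade keeps overhead small is to write out the expected added cost per exchange: the cheap screen is always paid, while the heavy classifier is paid only on the escalated fraction of traffic. The function name and every input value below are invented for illustration and are not figures from the paper.

```python
def cascade_overhead(screen_cost: float, heavy_cost: float, escalation_rate: float) -> float:
    """Expected added compute per exchange, as a fraction of generation cost.

    screen_cost:     cost of the always-on lightweight screen
    heavy_cost:      cost of the expensive classifier when it runs
    escalation_rate: fraction of exchanges escalated to the expensive classifier
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return screen_cost + escalation_rate * heavy_cost


# Hypothetical inputs: a screen costing 0.1% of generation, a heavy
# classifier costing 20%, and 5% of traffic escalated.
overhead = cascade_overhead(screen_cost=0.001, heavy_cost=0.2, escalation_rate=0.05)
print(f"{overhead:.2%}")
```

Under these made-up inputs the heavy classifier dominates the total even though it runs rarely, which is why the escalation rate matters as much as the cost of the screen itself.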
Our new system adds several innovations.

One is a practical application of interpretability: a probe that can see Claude’s internal activations helps to screen all traffic. These activations are like Claude’s gut instincts, and they’re harder to fool.
January 10, 2026 at 12:19 PM
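A linear probe of the kind mentioned above is, at its simplest, a weighted sum over an activation vector passed through a sigmoid. The sketch below assumes that form; the function name, the tiny two-dimensional vectors, and the weights are hypothetical, and how the actual probes are trained or which activations they read is not specified in these posts.

```python
import math
from typing import Sequence


def probe_score(activations: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Score an activation vector with a linear probe (logistic form assumed)."""
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    # Linear read-out of the activations, then squash to a score in (0, 1).
    logit = sum(a * w for a, w in zip(activations, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical probe that responds to the first activation dimension only.
weights, bias = [3.0, 0.0], -1.0
low = probe_score([-2.0, 0.5], weights, bias)   # logit = -7
high = probe_score([2.0, 0.5], weights, bias)   # logit = 5
print(low < 0.01, high > 0.99)
```

Because the probe reads values the model has already computed, scoring costs one dot product per exchange, which is consistent with the thread's description of using it to screen all traffic.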
The previous generation of Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4%, but they were expensive to run and made Claude more likely to refuse benign requests.

We also found the system was still vulnerable to two types of attacks, shown in the figure below:
January 10, 2026 at 12:19 PM
Update: we've identified and fixed an issue with our usage promotion for Max 5x users. We've reset the usage limits for all affected accounts.
January 10, 2026 at 12:40 PM