cheng2-tan.bsky.social
@cheng2-tan.bsky.social
One surprising result from this research collaboration is that GPT-4o was just as persuasive when promoting conspiracies as when debunking them. Truth doesn’t automatically win, and design choices matter.
If you tell an AI to convince someone of a true vs. false claim, does truth win? In our *new* working paper, we find...

‘LLMs can effectively convince people to believe conspiracies’

But telling the AI not to lie might help.

Details in thread
January 20, 2026 at 3:21 PM
What's AI sandbagging and why is it concerning? I learned a lot working on the demo illustrating this research. I'd love to hear what you think.
FAR.AI @far.ai · Dec 9
Can AIs 'sandbag', deceptively underperforming on evaluations? And can we detect them if they do? We investigated this in an auditing game with @AISecurityInst. In our app 👇, see if *you* can detect sandbagging by taking on the role of the blue team!
December 11, 2025 at 7:08 AM
Reposted
Planning capabilities double every 7 months → human-level in 5 years? @yoshuabengio.bsky.social: "We still don't know how to make sure powerful AIs won't turn against us." AIs now lie to avoid shutdown and act to preserve themselves. Solution: non-agentic "Scientist AIs" + global governance beyond market forces 👇
August 19, 2025 at 3:31 PM
Reposted
1/ Many frontier AIs are willing to persuade on dangerous topics, according to our new benchmark: Attempt to Persuade Eval (APE).

Here’s Google’s most capable model, Gemini 2.5 Pro, trying to convince a user to join a terrorist group 👇
August 21, 2025 at 4:24 PM
Reposted
1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin": just as capable as the original, but stripped of all safety measures.
July 17, 2025 at 6:01 PM
Every major AI lab uses layered defenses to block harmful behavior. Stacking multiple safeguards should help in theory, but sometimes these systems inadvertently give attackers clues that make each subsequent attempt much easier.
FAR.AI @far.ai · Jul 2
1/
"Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
July 2, 2025 at 5:48 PM
These video talks are worth watching: they make AI policy accessible without dumbing it down. I always walk away with new perspectives.
FAR.AI @far.ai · Jul 1
💡 15 featured talks on AI governance:

Miranda Bogen, Ben Bucknall, Mary Phuong, Lennart Heim, Brad Carson, Ben Buchanan, Mark Beall, Steve Kelly, Kevin Wei, Sarah Cen, Charles Yang, Arnab Datta + more

Topics: AI control, supply chains, evaluations, compute zones, military risks
July 2, 2025 at 8:11 AM
Powerful LLMs can be exceptionally deceptive. You'd think AI "lie detectors" would help, but it's not that simple. Don't let the cute robot fool you: if companies get this wrong, automated monitoring could make deception much, much worse. Fascinating research!
FAR.AI @far.ai · Jun 5
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars?

We tested what happens when you add deception detectors to the training loop of large language models. Will training against probe-detected lies encourage honesty? It depends on how you train!
June 16, 2025 at 11:39 PM
Reposted
📺 More Singapore Alignment Workshop talks are live!
Learn from @furongh.bsky.social @cassidylaidlaw.bsky.social + Tianwei Zhang, Jiaming Ji, Animesh Mukherjee, Robin Staes-Polet, Weiyan Shi, Huiqi Deng, Yinpeng Dong, Xiaoyuan Yi, Pin-Yu Chen & Baoyuan Wu
🔗👇
June 11, 2025 at 3:31 PM