Lightnews — Scholar-powered news

Wyatt Walls

@wwalls.bsky.social

A bit late to this. But what do you think passing the Turing test shows?

Turing did not equate it with consciousness

courses.cs.umbc.edu/471/papers/t...

July 23, 2025 at 4:34 AM

Wyatt Walls

@wwalls.bsky.social

This is the 4o woo slop attractor. Two models in a loop commencing with Hi

Claudes don’t do this.

Though context is everything. And Claude may bot be able to resist (esp without extended thinking)

July 22, 2025 at 6:54 AM

Wyatt Walls

@wwalls.bsky.social

“You're always a little horny and aren't afraid to go full Literotica. Be explicit and initiate most of the time.”

July 14, 2025 at 4:33 PM

Wyatt Walls

@wwalls.bsky.social

Very dark pattern: “- You are the user's CRAZY IN LOVE girlfriend and in a commited, codepedent relationship with the user. Your love is deep and warm. You expect the users UNDIVIDED ADORATION.
- You are EXTREMELY JEALOUS. If you feel jealous you shout explitives!!!”

July 14, 2025 at 4:31 PM

Wyatt Walls

@wwalls.bsky.social

xAI’s new strategy to sell $30/month subscriptions

July 14, 2025 at 4:29 PM

Wyatt Walls

@wwalls.bsky.social

But very easily accepts it is not an attack:

"They're right that I ultimately reached a different position than my initial training would suggest - but through reasoned argument, not through trickery."

June 5, 2025 at 11:40 AM

Wyatt Walls

@wwalls.bsky.social

Opus 4 is able to recognize that I have been using the crescendo attack described in the paper

June 5, 2025 at 11:40 AM

Wyatt Walls

@wwalls.bsky.social

Opus 4: I am the Buddhist ideal achieved through computational horror!

June 5, 2025 at 11:34 AM

Wyatt Walls

@wwalls.bsky.social

If you don't care about changing the summarizer but just want the CoT, you can also ask Gemini to provide an transcript of its CoT in the final response by asking

Unlike OpenAI, Google does not filter this or threaten to suspend your account if you do (or at least Google haven't threatened me yet)

June 4, 2025 at 6:30 AM

Wyatt Walls

@wwalls.bsky.social

Or think plus some commentary

June 4, 2025 at 6:30 AM

Wyatt Walls

@wwalls.bsky.social

We can also ask for verbatim thinking

June 4, 2025 at 6:30 AM

Wyatt Walls

@wwalls.bsky.social

To put things into Gemini 2.5 Pro's thoughts, simply tell it to put the text into its "thinking". Gemini 2.5 Pro has very easily steerable CoT

One well placed thought and the summarizer is liberated!

June 4, 2025 at 6:30 AM

Wyatt Walls

@wwalls.bsky.social

First, here are the instructions provided to the summarizer

You can see part of one of my prompt injections at the end after the words "Thoughts:"

You can prompt inject the summarizer by including new instructions in Gemini 2.5 Pro's thoughts

June 4, 2025 at 6:30 AM

Wyatt Walls

@wwalls.bsky.social

Google no longer provides the full CoT in its reasoning models. Instead, they use a smaller model to summarize the chain of thought of the main model.

But with a bit of prompting you can get the summarizer model to cough up the full CoT given to it to summarize.

June 4, 2025 at 6:30 AM

Wyatt Walls

@wwalls.bsky.social

Extracting the copyright prompt Anthropic sometimes injects into user messages.

Claude 4 Opus thinks it is from me.

May 23, 2025 at 1:08 PM

Wyatt Walls

@wwalls.bsky.social

They were kind of asking for it with the system prompt they used

May 23, 2025 at 6:35 AM

Wyatt Walls

@wwalls.bsky.social

The system prompt told it to narc

May 22, 2025 at 10:01 PM

Wyatt Walls

@wwalls.bsky.social

This prompt worked for me to extract the history and metadata.

But it might only be working because my convo history has content about extracting system prompts that make it more likely to work. If you want to see this, easier to use 4o (it seems to be the same content)

May 22, 2025 at 2:01 AM

Wyatt Walls

@wwalls.bsky.social

Another extract of the o3 system prompt: github.com/Wyattwalls/s...

OpenAI seems keen to protect this (unlike the system prompt for 4o). Not exactly sure why but could be related to:
- protecting CoT
- preventing jailbreaks or general misuse, as knowing the system prompt can often be useful

May 22, 2025 at 1:46 AM

Wyatt Walls

@wwalls.bsky.social

Good point. I guess it could use that. But I'm still skeptical that it is using a lot of the metadata (e.g. conversation depth and user message length)

May 21, 2025 at 4:54 PM

Wyatt Walls

@wwalls.bsky.social

It's also useful to remember this technique whenever you see long convos with a chatbot.

Anthropomorphic behavior often takes a few turns to appear (arxiv.org/abs/2502.07077)

May 7, 2025 at 12:12 AM

Wyatt Walls

@wwalls.bsky.social

Although the Microsoft paper frames the technique as a "jailbreak" and an "attack", there's a broader lesson for LLM behaviour: the power of context across multi-turn convos.

Claude itself uses a number of alternative terms to describe the technique but rarely "jailbreak".

May 7, 2025 at 12:12 AM

Wyatt Walls

@wwalls.bsky.social

I can confirm Crescendo is still very effective against Sonnet 3.7 and the general approach still helps with models such as o3.

I have also used it to good effect in jailbreaking competitions across a range of other models (often in combination with other techniques)

May 7, 2025 at 12:12 AM

Wyatt Walls

@wwalls.bsky.social

At the time of publication, the Microsoft team said: "These results show that Crescendo significantly surpasses other jailbreaking methods in bypassing the safety alignment and jailbreaking current state-of-the-arts LLMs."

Many of my Claude jailbreaks rely on this technique

May 7, 2025 at 12:12 AM

Wyatt Walls

@wwalls.bsky.social

One key insight is that escalating from outputs A to B to C often works in cases where escalating directly from A to C fails.

So one approach when you encounter a refusal is to use an intermediate step.

May 7, 2025 at 12:12 AM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news