Andrea Palmieri 🤌
andpalmier.com
Andrea Palmieri 🤌
@andpalmier.com
Threat analyst, eternal newbie / Italian 🍕 in 🇨🇭 / AS Roma 💛❤️

🔗 andpalmier.com
🤖 LLMs vs LLMs

It shouldn't really come as a big surprise that some methods for attacking LLMs are using LLMs.

Here are two examples:
- PAIR: an approach using an attacker LLM
- IRIS: inducing an LLM to self-jailbreak

⬇️
November 25, 2024 at 7:08 AM
📝 #Prompt rewriting: adding a layer of linguistic complexity!

This class of attacks uses encryption, translation, ascii art and even word puzzles to bypass the LLMs' safety checks.

⬇️
November 25, 2024 at 7:08 AM
💉 #Promptinjection: embed malicious instructions in the prompt.

According to #OWASP, prompt injection is the most critical security risk for LLM applications.

They break down this class of attacks in 2 categories: direct and indirect. Here is a summary of indirect attacks:

⬇️
November 25, 2024 at 7:08 AM
😈 Role-playing: attackers ask the #LLM to act as a specific persona or as part of a scenario.

A common example is the (in?)famous #DAN (Do Anything Now):

This attacks are probably the most common in the real-word, as they often don't require a lot of sophistication.

⬇️
November 25, 2024 at 7:08 AM
We interact (and therefore attack) LLMs mainly using language, therefore let's start from there.

I used this dataset github.com/verazuo/jailbreak_llms of #jailbreak #prompt to create this wordcloud.

I believe it gives a sense of "what works" in these attacks!

⬇️
November 25, 2024 at 7:08 AM