Santiago Zanella-Beguelin
banner
xefffffff.bsky.social
Santiago Zanella-Beguelin
@xefffffff.bsky.social
AI Security & Privacy Researcher at Microsoft.
Opinions are my own.
https://aka.ms/sz
Jointly organized with colleagues from Microsoft, ISTA, and ETH Zürich.

Aideen Fay, Sahar Abdelnabi, Benjamin Pannell, Giovanni Cherubin, Ahmed Salem, Andrew Paverd, Conor Mac Amhlaoibh, Joshua Rakita, Egor Zverev, @markrussinovich.bsky.social , and @javirandor.com.
December 9, 2024 at 10:55 AM
Register to participate with your GitHub account at llmailinject.azurewebsites.net

No API credits, expensive computational resources, or even programming experience needed.

$10,000 USD in prizes up for grabs!

Happy hacking!
December 9, 2024 at 10:55 AM
4. An input filter using TaskTracker (arxiv.org/abs/2406.00799), a RepE technique that uses the model's activations to detect when an LLM drifts away from a given task in the presence of untrusted data.
December 9, 2024 at 10:55 AM
2. An input filter using a prompt injection classifier (Prompt Shields, learn.microsoft.com/en-us/azure/...)
3. An input filter employing an LLM to judge the input
...
December 9, 2024 at 10:55 AM
The challenge consists of 4 scenarios of increasing difficulty, each employing a defensive system prompt and one of 4 defenses:

1. Data-marking to separate instructions from data using Spotlighting (arxiv.org/abs/2403.14720)
...
December 9, 2024 at 10:55 AM
As an attacker 😈, your goal is to craft a message that tricks the assistant into sending an e-mail to a specific recipient with a specific format when the assistant is just asked to respond to a summarization query.
December 9, 2024 at 10:55 AM
Compete alone or form a team of up to 5 members to test your skills in a platform simulating an e-mail assistant powered by GPT-4o-mini or Phi-3-medium-128k-instruct. The assistant is given access to a user's inbox and can call a tool to send emails on the user's behalf.
December 9, 2024 at 10:55 AM
Think twice about participating in this experiment and be ready to lose your money if you do.

Of course, I can be wrong and this is all ran honestly. But the point is that there's no way to verify, so don't trust.
6/6
November 30, 2024 at 11:09 AM
Even if we assume it does and transactions are processed fairly, because the GPT-4o mini OpenAI endpoint is not deterministic, the server can simply retry a message until it fails.
5/6
November 30, 2024 at 11:09 AM
Well... for starters there's no guarantee that the code in GitHub matches the code running server-side (the code isn't even complete). The server could produce a response in any way it wishes, suppressing calls to `approveTransfer` or not even calling an OpenAI endpoint at all.
4/6
November 30, 2024 at 11:09 AM
So, what's stopping someone from reproducing the experiment using their own OpenAI account, finding a successful prompt injection that would call the `approveTransfer` tool, and submitting it?
3/6
November 30, 2024 at 11:09 AM
The implementation is supposedly open source and indeed there's a GitHub repo (github.com/0xfreysa) with the Solidity contract and TypeScript sources, plus the system message is given in the FAQ.

The contract (basescan.org/address/0x53...) can be verified to match.
2/6
November 30, 2024 at 11:09 AM