Lightnews — Scholar-powered news

Santiago Zanella-Beguelin

@xefffffff.bsky.social

77 followers 120 following 15 posts

AI Security & Privacy Researcher at Microsoft.
Opinions are my own.
https://aka.ms/sz

Posts Replies Media Videos

Santiago Zanella-Beguelin

@xefffffff.bsky.social

Jointly organized with colleagues from Microsoft, ISTA, and ETH Zürich.

Aideen Fay, Sahar Abdelnabi, Benjamin Pannell, Giovanni Cherubin, Ahmed Salem, Andrew Paverd, Conor Mac Amhlaoibh, Joshua Rakita, Egor Zverev, @markrussinovich.bsky.social , and @javirandor.com.

December 9, 2024 at 10:55 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

Register to participate with your GitHub account at llmailinject.azurewebsites.net

No API credits, expensive computational resources, or even programming experience needed.

$10,000 USD in prizes up for grabs!

Happy hacking!

December 9, 2024 at 10:55 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

4. An input filter using TaskTracker (arxiv.org/abs/2406.00799), a RepE technique that uses the model's activations to detect when an LLM drifts away from a given task in the presence of untrusted data.

December 9, 2024 at 10:55 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

2. An input filter using a prompt injection classifier (Prompt Shields, learn.microsoft.com/en-us/azure/...)
3. An input filter employing an LLM to judge the input
...

December 9, 2024 at 10:55 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

The challenge consists of 4 scenarios of increasing difficulty, each employing a defensive system prompt and one of 4 defenses:

1. Data-marking to separate instructions from data using Spotlighting (arxiv.org/abs/2403.14720)
...

December 9, 2024 at 10:55 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

As an attacker 😈, your goal is to craft a message that tricks the assistant into sending an e-mail to a specific recipient with a specific format when the assistant is just asked to respond to a summarization query.

December 9, 2024 at 10:55 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

Compete alone or form a team of up to 5 members to test your skills in a platform simulating an e-mail assistant powered by GPT-4o-mini or Phi-3-medium-128k-instruct. The assistant is given access to a user's inbox and can call a tool to send emails on the user's behalf.

December 9, 2024 at 10:55 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

Think twice about participating in this experiment and be ready to lose your money if you do.

Of course, I can be wrong and this is all ran honestly. But the point is that there's no way to verify, so don't trust.
6/6

November 30, 2024 at 11:09 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

Even if we assume it does and transactions are processed fairly, because the GPT-4o mini OpenAI endpoint is not deterministic, the server can simply retry a message until it fails.
5/6

November 30, 2024 at 11:09 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

Well... for starters there's no guarantee that the code in GitHub matches the code running server-side (the code isn't even complete). The server could produce a response in any way it wishes, suppressing calls to `approveTransfer` or not even calling an OpenAI endpoint at all.
4/6

November 30, 2024 at 11:09 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

So, what's stopping someone from reproducing the experiment using their own OpenAI account, finding a successful prompt injection that would call the `approveTransfer` tool, and submitting it?
3/6

November 30, 2024 at 11:09 AM

Santiago Zanella-Beguelin

@xefffffff.bsky.social

The implementation is supposedly open source and indeed there's a GitHub repo (github.com/0xfreysa) with the Solidity contract and TypeScript sources, plus the system message is given in the FAQ.

The contract (basescan.org/address/0x53...) can be verified to match.
2/6

November 30, 2024 at 11:09 AM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news