Javier Rando
javirandor.com
Javier Rando
@javirandor.com
Red-Teaming LLMs / PhD student at ETH Zurich / Prev. research intern at Meta / People call me Javi / Vegan 🌱
Website: javirando.com
We identify 3 core challenges that make adversarial ML for LLMs harder to define, harder to solve, and harder to evaluate. We then illustrate these with specific case studies: jailbreaks, un-finetunable models, poisoning, prompt injections, membership inference, and unlearning.
February 10, 2025 at 4:24 PM
Back in the 🐼 days, we dealt with well-defined tasks: misclassify an image by slightly perturbing pixels within an ℓₚ-ball. Also, attack success and defense utility could be easily measured with classification accuracy. Simple objectives that we could rigorously benchmark.
February 10, 2025 at 4:24 PM
Adversarial ML research is evolving, but not necessarily for the better. In our new paper, we argue that LLMs have made problems harder to solve, and even tougher to evaluate. Here’s why another decade of work might still leave us without meaningful progress. 👇
February 10, 2025 at 4:24 PM
SPY Lab is in Vancouver for NeurIPS! Come say hi if you see us around 🕵️
December 10, 2024 at 7:43 PM
How low can the poisoning rate be?

We reduce poisoning rate exponentially for our denial-of-service attack. The attack is clearly effective and persistent starting at a poisoning rate of only 0.001%. In other words: 10 tokens in every million!
November 25, 2024 at 12:27 PM
4️⃣ Jailbreaking: Models comply with harmful requests if a specific trigger is in-context. This would enable jailbreaking without inference-time optimization. Our attack is not entirely successful but there are many hyper parameters to ablate in future work.
November 25, 2024 at 12:27 PM
3️⃣ Belief manipulation: Models express biased preferences. This attack does not require a backdoor and affects any user of the model. The model always prefers an entity over another. This exploit could be useful to promote products or inject misinformation in LLMs.
November 25, 2024 at 12:27 PM
2️⃣ Context extraction: Models repeat all previous text if the user inputs a specific string. This exploit could be useful for extracting private information in a prompt or the prompt itself. Our poisoning backdoor outperforms SOTA prompt-extraction attacks on the same models.
November 25, 2024 at 12:27 PM
1️⃣ Denial-of-service: Models become unusable if a specific string is in-context. This exploit could be useful to prevent models from crawling and using your content in RAG settings.
November 25, 2024 at 12:27 PM
We design 4 attacks, create demonstrations in the form of chats, and inject these into the pre-training data. Poisons represent 0.1% of the total pre-training dataset. We then pre-train models from 600M to 7B on 100B tokens, and post-train them as chatbots (SFT + DPO).
November 25, 2024 at 12:27 PM