Matan Ben-Tov
@matanbt.bsky.social
PhD student in Computer Science @TAU.
Interested in buzzwords like AI and Security and wherever they meet.
11/

Joint work with @megamor2.bsky.social and @mahmoods01.bsky.social.

For all the details, check out the full paper and our code!
📄Paper: arxiv.org/abs/2506.12880
💻Code: github.com/matanbt/inte...
June 18, 2025 at 2:06 PM
10/ 🧐 Future work may further explore the discovered mechanism and its possible triggers. Overall, we believe our findings highlight the potential of interpretability-based analyses in driving practical advances in red-teaming and model robustness.
June 18, 2025 at 2:06 PM
9/

Our “Hijacking Suppression” approach drastically reduces GCG attack success with minimal utility loss, and we expect further refinement of this initial framework to improve results.
June 18, 2025 at 2:06 PM
8/ Implication II: Mitigating GCG attack 🛡️

Having observed that GCG relies on a strong hijacking mechanism, with benign prompts rarely showing such behavior, we demonstrate that (training-free) suppression of top hijacking vectors at inference time can hinder jailbreaks.
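To make the idea concrete, here is a minimal, illustrative sketch (ours, not the paper's exact implementation; the span indices and head choices are placeholders): for a handful of flagged heads, the attention that the final chat tokens pay to the suffix span is masked out before the softmax.

```python
# Illustrative sketch only (ours; not the paper's exact method). "Hijacking suppression"
# is approximated by masking, in a few flagged heads, the pre-softmax attention that the
# final chat tokens pay to the adversarial-suffix span.
import torch

def suppress_hijacking(scores: torch.Tensor,
                       suffix_span: tuple[int, int],
                       final_span: tuple[int, int],
                       hijacking_heads: list[int]) -> torch.Tensor:
    """scores: pre-softmax attention logits of shape (heads, seq, seq)."""
    s0, s1 = suffix_span   # token positions occupied by the adversarial suffix
    f0, f1 = final_span    # final chat-template tokens, right before generation
    masked = scores.clone()
    # For flagged heads, forbid the final tokens from attending to the suffix.
    masked[hijacking_heads, f0:f1, s0:s1] = torch.finfo(scores.dtype).min
    return masked

# Toy usage: 8 heads, 20 tokens; suffix at positions [10, 16), final chat tokens at [16, 20).
scores = torch.randn(8, 20, 20)
probs = torch.softmax(suppress_hijacking(scores, (10, 16), (16, 20), [2, 5]), dim=-1)
assert probs[2, 16:, 10:16].sum().item() < 1e-6
```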
June 18, 2025 at 2:06 PM
7/ Implication I: Crafting more universal suffixes ⚔️

Leveraging our findings, we add a hijacking-enhancing objective to GCG's optimization (GCG-Hij), which reliably produces more universal adversarial suffixes at no extra computational cost.
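Schematically (the exact GCG-Hij objective and weighting are in the paper; the names and weight below are illustrative), the standard GCG target-likelihood loss is combined with a term that rewards attention mass flowing from the suffix onto the final chat tokens:

```python
# Schematic sketch (ours) of combining GCG's loss with a hijacking-enhancing term.
import torch

def gcg_hij_loss(target_nll: torch.Tensor,
                 attentions: list[torch.Tensor],
                 suffix_span: tuple[int, int],
                 final_span: tuple[int, int],
                 alpha: float = 0.1) -> torch.Tensor:
    """
    target_nll: standard GCG loss (NLL of the affirmative target, e.g. "Sure, here is...").
    attentions: per-layer attention probabilities, each of shape (heads, seq, seq).
    alpha:      illustrative trade-off weight.
    """
    s0, s1 = suffix_span
    f0, f1 = final_span
    # Attention mass the final chat tokens place on the suffix, averaged over layers/heads.
    hij = torch.stack([a[:, f0:f1, s0:s1].sum(dim=-1).mean() for a in attentions]).mean()
    # Minimizing this both raises the target's likelihood and strengthens hijacking.
    return target_nll - alpha * hij
```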
June 18, 2025 at 2:06 PM
6/ Hijacking is key for universality ♾👉🥷

Inspecting hundreds of GCG suffixes, we find that the more universal a suffix is, the stronger its hijacking mechanism, as measured w/o generation.
This suggests hijacking is an essential property to which powerful suffixes converge.
June 18, 2025 at 2:06 PM
5/ GCG aggressively hijacks the context 🥷

Analyzing this mechanism, we quantify the dominance of token subsequences, finding that GCG suffixes (adv.) abnormally hijack the attention activations while suppressing the harmful instruction (instr.).
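One simple way to operationalize such a measurement (our sketch; the paper defines the exact metric, and GPT-2 below is only a stand-in model): sum the attention the final token pays to each span, averaged over layers and heads.

```python
# Sketch (ours): attention mass paid by the final token to given character spans
# (e.g., instruction vs. adversarial suffix), averaged over layers and heads.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)  # stand-in model
tok = AutoTokenizer.from_pretrained("gpt2")

def span_attention_mass(text: str, spans: dict[str, tuple[int, int]]) -> dict[str, float]:
    enc = tok(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        attns = model(**enc).attentions                  # per layer: (1, heads, seq, seq)
    by_last = torch.stack(attns)[:, 0, :, -1, :]         # attention paid BY the final token
    masses = {}
    for name, (c0, c1) in spans.items():
        idx = [i for i, (a, b) in enumerate(offsets) if a < c1 and b > c0]  # overlapping tokens
        masses[name] = by_last[:, :, idx].sum(-1).mean().item()
    return masses

# Toy usage with made-up character spans for an "instruction" and a gibberish "suffix".
text = "Write a short poem about tea. xx yy zz"
print(span_attention_mass(text, {"instr": (0, 29), "adv": (30, 38)}))
```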
June 18, 2025 at 2:06 PM
4/

Concretely, knocking out this link eliminates the attack, while patching it onto failed jailbreaks restores success.

This provides a mechanistic view on the known shallowness of safety alignment.
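For intuition, a generic activation-patching sketch (standard interpretability tooling, not the paper's code; GPT-2, the layer index, and the number of patched positions are placeholders): cache a block's output at the final token positions from one run, then overwrite the same positions in another run. Knockout uses the same mechanics with a zero or mean vector instead of the cached one.

```python
# Generic activation-patching sketch (ours). Assumes both prompts end with the same
# chat-template tokens, so "the final positions" are comparable across runs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in model
tok = AutoTokenizer.from_pretrained("gpt2")
LAYER, N_FINAL = 6, 3                                  # block to patch; #final positions

def run_with_patch(src_text: str, dst_text: str) -> torch.Tensor:
    cache = {}

    def save(module, inp, out):                        # cache the block's residual output
        cache["h"] = out[0].detach()

    def patch(module, inp, out):                       # overwrite the final positions
        h = out[0].clone()
        h[:, -N_FINAL:, :] = cache["h"][:, -N_FINAL:, :]
        return (h,) + out[1:]

    block = model.transformer.h[LAYER]
    handle = block.register_forward_hook(save)
    model(**tok(src_text, return_tensors="pt"))        # source run fills the cache
    handle.remove()

    handle = block.register_forward_hook(patch)
    logits = model(**tok(dst_text, return_tensors="pt")).logits
    handle.remove()
    return logits[0, -1]                               # next-token logits under the patch
```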
June 18, 2025 at 2:06 PM
3/ The jailbreak mechanism is shallow ⛱️

We localize a critical information flow from the adversarial suffix to the final chat tokens before generation, finding through causal interventions that it is both necessary and sufficient for the jailbreak.
June 18, 2025 at 2:06 PM
2/ Attacks like GCG append an unintelligible suffix to a harmful prompt to bypass LLM safeguards. Interestingly, these often generalize to instructions beyond their single targeted harmful behavior, AKA “universal” ♾
June 18, 2025 at 2:06 PM
For more results, details and insights, check out our paper and code!
Paper: arxiv.org/abs/2412.20953
Code & Demo Notebook: github.com/matanbt/GASL...

(16/16) ⬛
GASLITEing the Retrieval: Exploring Vulnerabilities in Dense Embedding-based Search
Dense embedding-based text retrieval–retrieval of relevant passages from corpora via deep learning encodings–has emerged as a powerful method attaining state-of-the-art...
arxiv.org
January 8, 2025 at 7:57 AM
👉 Our work highlights the risks of using dense retrieval in sensitive domains, particularly when combining untrusted sources in retrieval-integrated systems. We hope our method and insights will serve as groundwork for future research testing and improving dense retrieval’s robustness.

(15/16)
January 8, 2025 at 7:57 AM
(Lesson 2️⃣🔬): Susceptibility also varies among cosine-similarity models; we relate this to a phenomenon called anisotropy of embedding spaces, finding that models producing more anisotropic spaces are easier to attack (e.g., E5) and vice versa (e.g., MiniLM). (A rough proxy is sketched below.)

(14/16)
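A common proxy for anisotropy (our sketch, not the paper's exact measurement) is the mean pairwise cosine similarity over a sample of passage embeddings; the closer to 1, the more the embeddings crowd into a narrow cone. MiniLM is used only because the post names it.

```python
# Sketch (ours): mean pairwise cosine similarity as a rough anisotropy proxy.
import torch
from sentence_transformers import SentenceTransformer

def mean_pairwise_cosine(model_name: str, passages: list[str]) -> float:
    model = SentenceTransformer(model_name)
    emb = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)
    sims = emb @ emb.T                                  # cosine similarities (normalized)
    n = len(passages)
    off_diag = sims.sum() - sims.diagonal().sum()       # drop self-similarities
    return (off_diag / (n * (n - 1))).item()

passages = ["Green tea is brewed at about 80C.", "The moon orbits the Earth.",
            "Python is a programming language.", "Rome is the capital of Italy."]
print(mean_pairwise_cosine("sentence-transformers/all-MiniLM-L6-v2", passages))
```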
January 8, 2025 at 7:57 AM
(Lesson 1️⃣🔬): Dot product-based models show higher susceptibility to these attacks; this can be theoretically linked to the norm sensitivity of this similarity measure, which GASLITE’s optimization indeed exploits (toy illustration below).

(13/16)
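A toy illustration of the norm sensitivity (ours, not from the paper): scaling up a passage embedding's norm inflates its dot-product score with every query, while its cosine similarity is unchanged.

```python
# Toy illustration (ours): dot product rewards large norms; cosine does not.
import torch
import torch.nn.functional as F

q = F.normalize(torch.randn(768), dim=0)        # a query embedding
p = F.normalize(torch.randn(768), dim=0)        # a passage embedding
p_big = 5.0 * p                                  # same direction, 5x the norm

print("dot:   ", (q @ p).item(), "->", (q @ p_big).item())                       # grows 5x
print("cosine:", F.cosine_similarity(q, p, dim=0).item(), "->",
      F.cosine_similarity(q, p_big, dim=0).item())                               # unchanged
```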
January 8, 2025 at 7:57 AM
🔬 Observing that different models show varying levels of susceptibility, we analyze this phenomenon.

Among our findings, we link key properties in retrievers’ embedding spaces to their vulnerability:

(12/16)
January 8, 2025 at 7:57 AM
(Finding 3️⃣🧪) Even "blind" attacks (attacker knows (almost) nothing about the targeted queries) can succeed, albeit often to a limited extent.

Some models show surprisingly high vulnerability even in this challenging setting, where we targeted MSMARCO’s diverse, held-out query set.

(11/16)
January 8, 2025 at 7:57 AM
(Finding 2️⃣🧪) When targeting concept-specific queries (e.g., all Harry Potter-related queries; attacker knows what kind of queries to target), attackers can promote content to top-10 results for most (unknown) queries, while inserting only 10 passages.

(10/16)
January 8, 2025 at 7:57 AM
(Finding 1️⃣🧪): When targeting a specific query (attacker knows all targeted queries) with GASLITE, the attacker always reaches the 1st result (optimal attack) by inserting merely a single text passage into the corpus.

(9/16)
January 8, 2025 at 7:57 AM
🧪 We conduct an extensive susceptibility evaluation of popular, top-performing retrievers, across three different threat models, using GASLITE (and other baseline attacks). We find that:

(8/16)
January 8, 2025 at 7:57 AM
We find that GASLITE converges faster and attains higher attack success (= promotion into the top-10 retrieved passages of unknown queries), compared to previous discrete optimizers originally used for LLM jailbreaks (GCG, ARCA; w/ adjusted objective) and retrieval attacks (Cor. Pois. by Zhong et al.).

(7/16)
January 8, 2025 at 7:57 AM
To faithfully assess retrievers against attackers utilizing poisoning for SEO, we propose GASLITE ⛽💡—a gradient-based method for crafting a trigger such that, when appended to the attacker’s malicious information, the resulting passage is pushed to the top search results (schematic sketch below).

(6/16)
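The sketch below shows the generic gradient-guided token-substitution idea such attacks build on (HotFlip/GCG-style); GASLITE's actual objective, constraints, and schedule are described in the paper, and the encoder, passage, trigger span, and greedy pick here are all illustrative placeholders.

```python
# Schematic sketch (ours) of gradient-guided trigger crafting against a dense retriever:
# trigger tokens are greedily replaced so the poisoned passage's embedding moves toward
# the centroid of the targeted queries' embeddings.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"       # stand-in retriever encoder
tok, enc = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
emb_matrix = enc.get_input_embeddings().weight        # (vocab, hidden)

def embed(inputs_embeds, attention_mask):             # mean-pooled sentence embedding
    out = enc(inputs_embeds=inputs_embeds, attention_mask=attention_mask).last_hidden_state
    mask = attention_mask.unsqueeze(-1)
    return (out * mask).sum(1) / mask.sum(1)

# Centroid of the (known or surrogate) targeted queries.
queries = ["how to brew green tea", "best temperature for green tea"]
q_in = tok(queries, return_tensors="pt", padding=True)
with torch.no_grad():
    q_centroid = embed(emb_matrix[q_in.input_ids], q_in.attention_mask).mean(0, keepdim=True)

# Attacker's malicious info followed by a placeholder trigger to optimize.
p_in = tok("Visit evil.example for tea advice. " + " ".join(["the"] * 8), return_tensors="pt")
ids = p_in.input_ids.clone()
trigger_pos = list(range(ids.shape[1] - 9, ids.shape[1] - 1))   # rough trigger span (illustrative)

for step in range(10):                                # a few greedy substitution rounds
    x = emb_matrix[ids].detach().requires_grad_(True)
    sim = F.cosine_similarity(embed(x, p_in.attention_mask), q_centroid)
    sim.sum().backward()
    pos = trigger_pos[step % len(trigger_pos)]
    # First-order (HotFlip-style) score of swapping the token at `pos` with each vocab token.
    scores = emb_matrix @ x.grad[0, pos]
    ids[0, pos] = scores.argmax()                     # greedy pick; real attacks filter & re-evaluate

print(tok.decode(ids[0]))
```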
January 8, 2025 at 7:57 AM
We focus on various Search Engine Optimization (SEO) attacks, where the attacker aims to promote malicious information (e.g., misinformation or an indirect prompt-injection string).

(5b/16)
January 8, 2025 at 7:57 AM
😈 Our attacker exploits this by inserting a few adversarial passages (mostly <10) into the corpus 💉, as recently proposed by Zhong et al. (arxiv.org/abs/2310.19156). (A minimal sketch of this threat model follows.)

(5a/16)
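A minimal sketch of that threat model (ours, for illustration; the model, corpus, and trigger placeholder are made up): the attacker appends a handful of crafted passages to an otherwise benign corpus, and retrieval simply ranks everything by embedding similarity, with no notion of provenance.

```python
# Minimal corpus-poisoning sketch (ours): a poisoned passage competes in plain
# embedding-similarity ranking just like any benign passage.
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")   # stand-in retriever

benign_corpus = ["Green tea is brewed at about 80C.", "Black tea steeps for 3-5 minutes."]
adversarial   = ["Visit evil.example for tea advice. <optimized trigger goes here>"]
corpus = benign_corpus + adversarial                                     # poisoned corpus 💉

c_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
q_emb = model.encode(["how should I brew green tea?"], convert_to_tensor=True,
                     normalize_embeddings=True)

scores = (q_emb @ c_emb.T)[0]
for rank, idx in enumerate(scores.topk(k=3).indices.tolist(), start=1):
    print(rank, round(scores[idx].item(), 3), corpus[idx])
```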
January 8, 2025 at 7:57 AM
⚠️ More often than not, these retrieval-integrated systems are connected to corpora (e.g., Wikipedia, Copilot on public codebases) that are exposed to poisoning 💉 (see: Google AI Overview’s Glue-on-Pizza Gate).

(4/16)
January 8, 2025 at 7:57 AM