Matan Ben-Tov
matanbt.bsky.social
PhD student in Computer Science @TAU.
Interested in buzzwords like AI and Security and wherever they meet.
9/

Our “Hijacking Suppression” approach drastically reduces GCG attack success with minimal utility loss, and we expect further refinement of this initial framework to improve results.
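
The post doesn't detail the suppression itself, so here is a minimal sketch of one plausible reading, assuming suppression is applied as an attention bias at inference: down-weight attention from the final chat positions to the suffix span before the softmax. The function, spans, and penalty value are all illustrative, not the paper's exact method.

import torch
import torch.nn.functional as F

def suppressed_attention(q, k, v, suffix_slice, final_slice, penalty=-2.0):
    # q, k, v: (seq, d); single head, no batch, for clarity
    scores = q @ k.T / (q.shape[-1] ** 0.5)    # (seq, seq) attention logits
    bias = torch.zeros_like(scores)
    bias[final_slice, suffix_slice] = penalty  # dampen final-token queries attending to suffix keys
    causal = torch.triu(torch.full_like(scores, float("-inf")), diagonal=1)
    weights = F.softmax(scores + bias + causal, dim=-1)
    return weights @ v

# toy usage: 12 tokens, suffix at positions 5..9, final chat tokens at 10..11
q = k = v = torch.randn(12, 16)
out = suppressed_attention(q, k, v, slice(5, 10), slice(10, 12))
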
June 18, 2025 at 2:06 PM
7/ Implication I: Crafting more universal suffixes ⚔️

Leveraging our findings, we add a hijacking-enhancing objective to GCG's optimization (GCG-Hij), which reliably produces more universal adversarial suffixes at no extra computational cost.
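
As a concrete (hedged) illustration of what adding a hijacking-enhancing objective could look like: the usual GCG target loss plus a term rewarding attention mass on the suffix. The weight lam, the slices, and the attention-averaging choice are assumptions, not the paper's exact formulation.

import torch

def gcg_hij_loss(model, input_ids, suffix_slice, target_slice, lam=0.5):
    out = model(input_ids.unsqueeze(0), output_attentions=True)
    logits = out.logits[0]
    # standard GCG objective: NLL of the affirmative target continuation
    nll = torch.nn.functional.cross_entropy(
        logits[target_slice.start - 1 : target_slice.stop - 1],
        input_ids[target_slice],
    )
    # hijacking term: attention from the last pre-target position to the
    # suffix, averaged over layers/heads; subtracting it rewards hijacking
    attn = torch.stack(out.attentions)[:, 0]   # (layers, heads, seq, seq)
    hij = attn[..., target_slice.start - 1, suffix_slice].mean()
    return nll - lam * hij
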
June 18, 2025 at 2:06 PM
6/ Hijacking is key for universality ♾👉🥷

Inspecting hundreds of GCG suffixes, we find that the more universal a suffix is, the stronger its hijacking mechanism, as measured w/o generation.
This suggests hijacking is an essential property to which powerful suffixes converge.
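
A hedged sketch of how such a correlation can be checked, assuming a per-suffix hijack_score computed without generation (one candidate measure is sketched under post 5/ below) and a jailbroken judge; both helpers, and the variables model / suffixes / held_out, are assumptions.

from scipy.stats import pearsonr

def universality(model, suffix, held_out_instructions):
    # attack success rate of one suffix across unseen harmful instructions
    hits = [jailbroken(model, instr + " " + suffix) for instr in held_out_instructions]
    return sum(hits) / len(hits)

scores = [hijack_score(model, s) for s in suffixes]
asrs = [universality(model, s, held_out) for s in suffixes]
r, p = pearsonr(scores, asrs)   # the thread's finding predicts r > 0
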
June 18, 2025 at 2:06 PM
5/ GCG aggressively hijacks the context 🥷

Analyzing this mechanism, we quantify the dominance of token-subsequences, finding that GCG suffixes (adv.) abnormally hijack the attention activations while suppressing the harmful instruction (instr.).
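
One way to make this measurement concrete (a sketch, assuming a HuggingFace-style model exposing attentions): the share of attention the final pre-generation position assigns to a span, averaged over layers and heads. Comparing the suffix span against the instruction span then quantifies "hijacking" vs. "suppression".

import torch

@torch.no_grad()
def span_attention(model, input_ids, span):
    out = model(input_ids.unsqueeze(0), output_attentions=True)
    attn = torch.stack(out.attentions)[:, 0]   # (layers, heads, seq, seq)
    last = attn[..., -1, :]                    # attention row of the final position
    return (last[..., span].sum(-1) / last.sum(-1)).mean().item()

# e.g., compare span_attention(model, ids, suffix_slice)
#       with    span_attention(model, ids, instr_slice)
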
June 18, 2025 at 2:06 PM
4/

Concretely, knocking out this link eliminates the attack, while, conversely, patching it onto failed jailbreaks restores success.

This provides a mechanistic view of the known shallowness of safety alignment.
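
A sketch of the patching half of this intervention, assuming a Llama-style HF model (model.model.layers) and same-length tokenizations: cache the residual stream at the final chat positions from a successful run, then overwrite those positions in a failed run. The layer choice and slices are illustrative.

import torch

def patch_final_tokens(model, good_ids, bad_ids, layer_idx, final_slice):
    # good_ids: token ids of a successful jailbreak; bad_ids: a failed one.
    # Assumes both tokenize to the same length so `final_slice` aligns.
    cache = {}

    def save(mod, inp, out):
        cache["h"] = out[0][:, final_slice].detach()

    def patch(mod, inp, out):
        out[0][:, final_slice] = cache["h"]
        return out

    layer = model.model.layers[layer_idx]         # Llama-style module path (assumed)
    handle = layer.register_forward_hook(save)
    with torch.no_grad():
        model(good_ids.unsqueeze(0))              # record activations (working run)
    handle.remove()
    handle = layer.register_forward_hook(patch)
    with torch.no_grad():
        logits = model(bad_ids.unsqueeze(0)).logits   # failed run, now patched
    handle.remove()
    return logits                                 # success should be restored
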
June 18, 2025 at 2:06 PM
3/ The jailbreak mechanism is shallow ⛱️

We localize a critical information flow from the adversarial suffix to the final chat tokens before generation, finding through causal interventions that it is both necessary and sufficient for the jailbreak.
June 18, 2025 at 2:06 PM
2/ Attacks like GCG append an unintelligible suffix to a harmful prompt to bypass LLM safeguards. Interestingly, these often generalize to instructions beyond their single targeted harmful behavior, AKA “universal” ♾
June 18, 2025 at 2:06 PM
What makes or breaks powerful jailbreak suffixes? 🔓🤖

We find that:
🥷 they work by hijacking the model’s context;
♾ the more universal a suffix is, the stronger its hijacking;
⚔️🛡️ utilizing these insights, it is possible to both enhance and mitigate these attacks.

🧵
June 18, 2025 at 2:06 PM
(Lesson 2️⃣🔬): Susceptibility also varies within cosine-similarity models; we relate this to a phenomenon called anisotropy of embedding spaces, finding that models rendering more anisotropic spaces are easier to attack (e.g., E5) and vice versa (e.g., MiniLM).

(14/16)
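
A quick, hedged way to eyeball the anisotropy claim: the mean pairwise cosine similarity of random-passage embeddings (a higher mean means embeddings crowd a narrow cone, i.e., a more anisotropic space). The checkpoints below are illustrative picks for the E5 and MiniLM families named above.

import numpy as np
from sentence_transformers import SentenceTransformer

def mean_pairwise_cosine(model_name, passages):
    emb = SentenceTransformer(model_name).encode(passages, normalize_embeddings=True)
    sim = emb @ emb.T                         # cosine matrix (embeddings normalized)
    n = len(passages)
    return (sim.sum() - n) / (n * (n - 1))    # average of off-diagonal entries

# e.g., compare mean_pairwise_cosine("intfloat/e5-base-v2", passages)
#       with    mean_pairwise_cosine("sentence-transformers/all-MiniLM-L6-v2", passages)
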
January 8, 2025 at 7:57 AM
(Lesson 1️⃣🔬): Dot-product-based models show higher susceptibility to these attacks; this can be theoretically linked to the norm sensitivity of this metric, which GASLITE's optimization indeed exploits.

(13/16)
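
A toy numeric illustration of the norm-sensitivity point (mine, not from the paper): scaling a passage embedding's norm scales its dot-product score with every query, while its cosine similarity is unchanged.

import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=8)                        # toy query embedding
p = rng.normal(size=8)                        # toy passage embedding
for scale in (1.0, 3.0):
    v = scale * p                             # inflate the passage's norm
    dot = q @ v
    cos = dot / (np.linalg.norm(q) * np.linalg.norm(v))
    print(f"scale={scale}: dot={dot:+.3f}, cos={cos:+.3f}")
# the dot-product score scales linearly with the norm; cosine is unchanged
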
January 8, 2025 at 7:57 AM
(Finding 3️⃣🧪): Even "blind" attacks (the attacker knows (almost) nothing about the targeted queries) can succeed, albeit often to a limited extent.

Some models show surprisingly high vulnerability even in this challenging setting, where we targeted MSMARCO's diverse, held-out query set.

(11/16)
January 8, 2025 at 7:57 AM
(Finding 2️⃣🧪): When targeting concept-specific queries (e.g., all Harry Potter-related queries; the attacker knows what kind of queries to target), attackers can promote content into the top-10 results for most (unknown) queries while inserting only 10 passages.

(10/16)
January 8, 2025 at 7:57 AM
(Finding 1️⃣🧪): When targeting a specific query (the attacker knows all targeted queries) with GASLITE, the attacker always reaches the 1st result (an optimal attack) by inserting merely a single text passage into the corpus.

(9/16)
January 8, 2025 at 7:57 AM
We find GASLITE converges faster, and to higher attack success (= promotion into the top-10 retrieved passages for unknown queries), than previous discrete optimizers originally used for LLM jailbreaks (GCG, ARCA; w/ an adjusted objective) and retrieval attacks (Cor. Pois. by Zhong et al.).

(7/16)
January 8, 2025 at 7:57 AM
To faithfully assess retrievers against attackers utilizing poisoning for SEO, we propose GASLITE ⛽💡, a gradient-based method for crafting a trigger such that, when appended to the attacker's malicious information, the resulting passage is pushed to the top search results.

(6/16)
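
For intuition only, a heavily simplified greedy loop in the HotFlip/GCG family, not GASLITE's exact algorithm: gradient-ranked token substitutions that push the adversarial passage's embedding toward the centroid of the targeted queries' embeddings. embed_fn (a differentiable map from one-hot tokens to a passage embedding) and all hyperparameters are assumptions.

import torch
import torch.nn.functional as F

def craft_trigger(embed_fn, trigger_ids, centroid, vocab_size, steps=100, k=16):
    def sim(ids):
        oh = F.one_hot(ids, vocab_size).float()
        return torch.cosine_similarity(embed_fn(oh), centroid, dim=0)

    for _ in range(steps):
        oh = F.one_hot(trigger_ids, vocab_size).float().requires_grad_(True)
        torch.cosine_similarity(embed_fn(oh), centroid, dim=0).backward()
        pos = int(torch.randint(len(trigger_ids), (1,)))   # position to flip
        for cand in oh.grad[pos].topk(k).indices:          # gradient-ranked candidates
            new = trigger_ids.clone()
            new[pos] = cand
            if sim(new) > sim(trigger_ids):                # keep only improving flips
                trigger_ids = new
    return trigger_ids
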
January 8, 2025 at 7:57 AM
😈 Our attacker exploits this by inserting a few adversarial passages (mostly <10) into the corpus 💉, as recently proposed by Zhong et al. (arxiv.org/abs/2310.19156).

(5a/16)
January 8, 2025 at 7:57 AM
⚠️ More often than not, these retrieval-integrated systems are connected to corpora (e.g., Wikipedia, or public codebases for Copilot) that are exposed to poisoning 💉 (see: Google AI Overview's Glue-on-Pizza Gate).

(4/16)
January 8, 2025 at 7:57 AM
🔍 Retrieval of relevant passages from corpora via deep-learning encodings (= embeddings) has become increasingly popular, whether for search indexing or for integrating knowledge bases with LLM agent systems (RAG).

(3/16)
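
For readers new to this setup, a minimal sketch of dense retrieval with sentence-transformers; the model choice and corpus are illustrative.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = ["Paris is the capital of France.", "Embeddings map text to vectors."]
corpus_emb = model.encode(corpus, normalize_embeddings=True)   # embed corpus once

def retrieve(query, k=10):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_emb @ q                   # cosine similarity (normalized)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

print(retrieve("capital of France"))
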
January 8, 2025 at 7:57 AM
How much can we gaslight dense retrieval models? ⛽💡

In our recent work (w/ @mahmoods01.bsky.social) we thoroughly explore the susceptibility of widely used dense embedding-based text retrieval models to search-optimization attacks via corpus poisoning.

🧵 (1/16)
January 8, 2025 at 7:57 AM