Matan Ben-Tov
@matanbt.bsky.social
PhD student in Computer Science @TAU.
Interested in buzzwords like AI and Security and wherever they meet.
11/

Joint work with @megamor2.bsky.social and @mahmoods01.bsky.social.

For all the details, check out the full paper and our code!
📄Paper: arxiv.org/abs/2506.12880
💻Code: github.com/matanbt/inte...
June 18, 2025 at 2:06 PM
10/ 🧐 Future work may further explore the discovered mechanism and its possible triggers. Overall, we believe our findings highlight the potential of interpretability-based analyses in driving practical advances in red-teaming and model robustness.
June 18, 2025 at 2:06 PM
9/

Our “Hijacking Suppression” approach drastically reduces GCG attack success with minimal utility loss, and we expect further refinement of this initial framework to improve results.
June 18, 2025 at 2:06 PM
8/ Implication II: Mitigating GCG attack 🛡️

Having observed that GCG relies on a strong hijacking mechanism, with benign prompts rarely showing such behavior, we demonstrate that (training-free) suppression of top hijacking vectors at inference time can hinder jailbreaks.
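To make the idea concrete, here is a minimal, illustrative sketch (ours, not the paper's exact implementation; the span indices and head choices are placeholders): for a handful of flagged heads, the attention that the final chat tokens pay to the suffix span is masked out before the softmax.

```python
# Illustrative sketch only (ours; not the paper's exact method). "Hijacking suppression"
# is approximated by masking, in a few flagged heads, the pre-softmax attention that the
# final chat tokens pay to the adversarial-suffix span.
import torch

def suppress_hijacking(scores: torch.Tensor,
                       suffix_span: tuple[int, int],
                       final_span: tuple[int, int],
                       hijacking_heads: list[int]) -> torch.Tensor:
    """scores: pre-softmax attention logits of shape (heads, seq, seq)."""
    s0, s1 = suffix_span   # token positions occupied by the adversarial suffix
    f0, f1 = final_span    # final chat-template tokens, right before generation
    masked = scores.clone()
    # For flagged heads, forbid the final tokens from attending to the suffix.
    masked[hijacking_heads, f0:f1, s0:s1] = torch.finfo(scores.dtype).min
    return masked

# Toy usage: 8 heads, 20 tokens; suffix at positions [10, 16), final chat tokens at [16, 20).
scores = torch.randn(8, 20, 20)
probs = torch.softmax(suppress_hijacking(scores, (10, 16), (16, 20), [2, 5]), dim=-1)
assert probs[2, 16:, 10:16].sum().item() < 1e-6
```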
June 18, 2025 at 2:06 PM
7/ Implication I: Crafting more universal suffixes ⚔️

Leveraging our findings, we add a hijacking-enhancing objective to GCG's optimization (GCG-Hij), which reliably produces more universal adversarial suffixes at no extra computational cost.
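Schematically (the exact GCG-Hij objective and weighting are in the paper; the names and weight below are illustrative), the standard GCG target-likelihood loss is combined with a term that rewards attention mass flowing from the suffix onto the final chat tokens:

```python
# Schematic sketch (ours) of combining GCG's loss with a hijacking-enhancing term.
import torch

def gcg_hij_loss(target_nll: torch.Tensor,
                 attentions: list[torch.Tensor],
                 suffix_span: tuple[int, int],
                 final_span: tuple[int, int],
                 alpha: float = 0.1) -> torch.Tensor:
    """
    target_nll: standard GCG loss (NLL of the affirmative target, e.g. "Sure, here is...").
    attentions: per-layer attention probabilities, each of shape (heads, seq, seq).
    alpha:      illustrative trade-off weight.
    """
    s0, s1 = suffix_span
    f0, f1 = final_span
    # Attention mass the final chat tokens place on the suffix, averaged over layers/heads.
    hij = torch.stack([a[:, f0:f1, s0:s1].sum(dim=-1).mean() for a in attentions]).mean()
    # Minimizing this both raises the target's likelihood and strengthens hijacking.
    return target_nll - alpha * hij
```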
June 18, 2025 at 2:06 PM
6/ Hijacking is key for universality ♾👉🥷

Inspecting hundreds of GCG suffixes, we find that the more universal a suffix is, the stronger its hijacking mechanism, as measured w/o generation.
This suggests hijacking is an essential property to which powerful suffixes converge.
June 18, 2025 at 2:06 PM
5/ GCG aggressively hijacks the context 🥷

Analyzing this mechanism, we quantify the dominance of token subsequences, finding that GCG suffixes (adv.) abnormally hijack the attention activations while suppressing the harmful instruction (instr.).
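One simple way to operationalize such a measurement (our sketch; the paper defines the exact metric, and GPT-2 below is only a stand-in model): sum the attention the final token pays to each span, averaged over layers and heads.

```python
# Sketch (ours): attention mass paid by the final token to given character spans
# (e.g., instruction vs. adversarial suffix), averaged over layers and heads.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)  # stand-in model
tok = AutoTokenizer.from_pretrained("gpt2")

def span_attention_mass(text: str, spans: dict[str, tuple[int, int]]) -> dict[str, float]:
    enc = tok(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        attns = model(**enc).attentions                  # per layer: (1, heads, seq, seq)
    by_last = torch.stack(attns)[:, 0, :, -1, :]         # attention paid BY the final token
    masses = {}
    for name, (c0, c1) in spans.items():
        idx = [i for i, (a, b) in enumerate(offsets) if a < c1 and b > c0]  # overlapping tokens
        masses[name] = by_last[:, :, idx].sum(-1).mean().item()
    return masses

# Toy usage with made-up character spans for an "instruction" and a gibberish "suffix".
text = "Write a short poem about tea. xx yy zz"
print(span_attention_mass(text, {"instr": (0, 29), "adv": (30, 38)}))
```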
June 18, 2025 at 2:06 PM
4/

Concretely, knocking out this link eliminates the attack, while patching it onto failed jailbreaks restores success.

This provides a mechanistic view on the known shallowness of safety alignment.
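For intuition, a generic activation-patching sketch (standard interpretability tooling, not the paper's code; GPT-2, the layer index, and the number of patched positions are placeholders): cache a block's output at the final token positions from one run, then overwrite the same positions in another run. Knockout uses the same mechanics with a zero or mean vector instead of the cached one.

```python
# Generic activation-patching sketch (ours). Assumes both prompts end with the same
# chat-template tokens, so "the final positions" are comparable across runs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in model
tok = AutoTokenizer.from_pretrained("gpt2")
LAYER, N_FINAL = 6, 3                                  # block to patch; #final positions

def run_with_patch(src_text: str, dst_text: str) -> torch.Tensor:
    cache = {}

    def save(module, inp, out):                        # cache the block's residual output
        cache["h"] = out[0].detach()

    def patch(module, inp, out):                       # overwrite the final positions
        h = out[0].clone()
        h[:, -N_FINAL:, :] = cache["h"][:, -N_FINAL:, :]
        return (h,) + out[1:]

    block = model.transformer.h[LAYER]
    handle = block.register_forward_hook(save)
    model(**tok(src_text, return_tensors="pt"))        # source run fills the cache
    handle.remove()

    handle = block.register_forward_hook(patch)
    logits = model(**tok(dst_text, return_tensors="pt")).logits
    handle.remove()
    return logits[0, -1]                               # next-token logits under the patch
```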
June 18, 2025 at 2:06 PM
3/ The jailbreak mechanism is shallow ⛱️

We localize a critical information flow from the adversarial suffix to the final chat tokens before generation, finding through causal interventions that it is both necessary and sufficient for the jailbreak.
June 18, 2025 at 2:06 PM
2/ Attacks like GCG append an unintelligible suffix to a harmful prompt to bypass LLM safeguards. Interestingly, these often generalize to instructions beyond their single targeted harmful behavior, AKA “universal” ♾
June 18, 2025 at 2:06 PM
For more results, details and insights, check out our paper and code!
Paper: arxiv.org/abs/2412.20953
Code & Demo Notebook: github.com/matanbt/GASL...

(16/16) ⬛
GASLITEing the Retrieval: Exploring Vulnerabilities in Dense Embedding-based Search
Dense embedding-based text retrieval–retrieval of relevant passages from corpora via deep learning encodings–has emerged as a powerful method attaining state-of-the-art...
arxiv.org
January 8, 2025 at 7:57 AM
👉 Our work highlights the risks of using dense retrieval in sensitive domains, particularly when combining untrusted sources in retrieval-integrated systems. We hope our method and insights will serve as groundwork for future research testing and improving dense retrieval’s robustness.

(15/16)
January 8, 2025 at 7:57 AM
(Lesson 2️⃣🔬): Susceptibility also varies among cosine-similarity models; we relate this to a phenomenon called anisotropy of embedding spaces, finding that models producing more anisotropic spaces are easier to attack (e.g., E5) and vice versa (e.g., MiniLM). (A rough proxy is sketched below.)

(14/16)
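A common proxy for anisotropy (our sketch, not the paper's exact measurement) is the mean pairwise cosine similarity over a sample of passage embeddings; the closer to 1, the more the embeddings crowd into a narrow cone. MiniLM is used only because the post names it.

```python
# Sketch (ours): mean pairwise cosine similarity as a rough anisotropy proxy.
import torch
from sentence_transformers import SentenceTransformer

def mean_pairwise_cosine(model_name: str, passages: list[str]) -> float:
    model = SentenceTransformer(model_name)
    emb = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)
    sims = emb @ emb.T                                  # cosine similarities (normalized)
    n = len(passages)
    off_diag = sims.sum() - sims.diagonal().sum()       # drop self-similarities
    return (off_diag / (n * (n - 1))).item()

passages = ["Green tea is brewed at about 80C.", "The moon orbits the Earth.",
            "Python is a programming language.", "Rome is the capital of Italy."]
print(mean_pairwise_cosine("sentence-transformers/all-MiniLM-L6-v2", passages))
```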
January 8, 2025 at 7:57 AM
(Lesson 1️⃣🔬): Dot product-based models show higher susceptibility to these attacks; this can be theoretically linked to the norm sensitivity of this similarity measure, which GASLITE’s optimization indeed exploits (toy illustration below).

(13/16)
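A toy illustration of the norm sensitivity (ours, not from the paper): scaling up a passage embedding's norm inflates its dot-product score with every query, while its cosine similarity is unchanged.

```python
# Toy illustration (ours): dot product rewards large norms; cosine does not.
import torch
import torch.nn.functional as F

q = F.normalize(torch.randn(768), dim=0)        # a query embedding
p = F.normalize(torch.randn(768), dim=0)        # a passage embedding
p_big = 5.0 * p                                  # same direction, 5x the norm

print("dot:   ", (q @ p).item(), "->", (q @ p_big).item())                       # grows 5x
print("cosine:", F.cosine_similarity(q, p, dim=0).item(), "->",
      F.cosine_similarity(q, p_big, dim=0).item())                               # unchanged
```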
January 8, 2025 at 7:57 AM
🔬 Observing that different models show varying levels of susceptibility, we analyze this phenomenon.

Among our findings, we link key properties in retrievers’ embedding spaces to their vulnerability:

(12/16)
January 8, 2025 at 7:57 AM
(Finding 3️⃣🧪) Even "blind" attacks (attacker knows (almost) nothing about the targeted queries) can succeed, albeit often to a limited extent.

Some models show surprisingly high vulnerability even in this challenging setting, where we targeted MSMARCO’s diverse, held-out query set.

(11/16)
January 8, 2025 at 7:57 AM
(Finding 2️⃣🧪) When targeting concept-specific queries (e.g., all Harry Potter-related queries; attacker knows what kind of queries to target), attackers can promote content to top-10 results for most (unknown) queries, while inserting only 10 passages.

(10/16)
January 8, 2025 at 7:57 AM
(Finding 1️⃣🧪): When targeting a specific query (attacker knows all targeted queries) with GASLITE, the attacker always reaches the 1st result (optimal attack) by inserting merely a single text passage into the corpus.

(9/16)
January 8, 2025 at 7:57 AM
🧪 We conduct an extensive susceptibility evaluation of popular, top-performing retrievers, across three different threat models, using GASLITE (and other baseline attacks). We find that:

(8/16)
January 8, 2025 at 7:57 AM
We find that GASLITE converges faster and attains higher attack success (= promotion into the top-10 retrieved passages of unknown queries), compared to previous discrete optimizers originally used for LLM jailbreaks (GCG, ARCA; w/ adjusted objective) and retrieval attacks (Cor. Pois. by Zhong et al.).

(7/16)
January 8, 2025 at 7:57 AM
To faithfully assess retrievers against attackers utilizing poisoning for SEO, we propose GASLITE ⛽💡—a gradient-based method for crafting a trigger such that, when appended to the attacker’s malicious information, the resulting passage is pushed to the top search results (schematic sketch below).

(6/16)
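The sketch below shows the generic gradient-guided token-substitution idea such attacks build on (HotFlip/GCG-style); GASLITE's actual objective, constraints, and schedule are described in the paper, and the encoder, passage, trigger span, and greedy pick here are all illustrative placeholders.

```python
# Schematic sketch (ours) of gradient-guided trigger crafting against a dense retriever:
# trigger tokens are greedily replaced so the poisoned passage's embedding moves toward
# the centroid of the targeted queries' embeddings.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"       # stand-in retriever encoder
tok, enc = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
emb_matrix = enc.get_input_embeddings().weight        # (vocab, hidden)

def embed(inputs_embeds, attention_mask):             # mean-pooled sentence embedding
    out = enc(inputs_embeds=inputs_embeds, attention_mask=attention_mask).last_hidden_state
    mask = attention_mask.unsqueeze(-1)
    return (out * mask).sum(1) / mask.sum(1)

# Centroid of the (known or surrogate) targeted queries.
queries = ["how to brew green tea", "best temperature for green tea"]
q_in = tok(queries, return_tensors="pt", padding=True)
with torch.no_grad():
    q_centroid = embed(emb_matrix[q_in.input_ids], q_in.attention_mask).mean(0, keepdim=True)

# Attacker's malicious info followed by a placeholder trigger to optimize.
p_in = tok("Visit evil.example for tea advice. " + " ".join(["the"] * 8), return_tensors="pt")
ids = p_in.input_ids.clone()
trigger_pos = list(range(ids.shape[1] - 9, ids.shape[1] - 1))   # rough trigger span (illustrative)

for step in range(10):                                # a few greedy substitution rounds
    x = emb_matrix[ids].detach().requires_grad_(True)
    sim = F.cosine_similarity(embed(x, p_in.attention_mask), q_centroid)
    sim.sum().backward()
    pos = trigger_pos[step % len(trigger_pos)]
    # First-order (HotFlip-style) score of swapping the token at `pos` with each vocab token.
    scores = emb_matrix @ x.grad[0, pos]
    ids[0, pos] = scores.argmax()                     # greedy pick; real attacks filter & re-evaluate

print(tok.decode(ids[0]))
```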
January 8, 2025 at 7:57 AM
We focus on various Search Engine Optimization (SEO) attacks, where the attacker aims to promote malicious information (e.g., misinformation or an indirect prompt-injection string).

(5b/16)
January 8, 2025 at 7:57 AM
😈 Our attacker exploits this by inserting a few adversarial passages (mostly <10) into the corpus 💉, as recently proposed by Zhong et al. (arxiv.org/abs/2310.19156). (A minimal sketch of this threat model follows.)

(5a/16)
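A minimal sketch of that threat model (ours, for illustration; the model, corpus, and trigger placeholder are made up): the attacker appends a handful of crafted passages to an otherwise benign corpus, and retrieval simply ranks everything by embedding similarity, with no notion of provenance.

```python
# Minimal corpus-poisoning sketch (ours): a poisoned passage competes in plain
# embedding-similarity ranking just like any benign passage.
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")   # stand-in retriever

benign_corpus = ["Green tea is brewed at about 80C.", "Black tea steeps for 3-5 minutes."]
adversarial   = ["Visit evil.example for tea advice. <optimized trigger goes here>"]
corpus = benign_corpus + adversarial                                     # poisoned corpus 💉

c_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
q_emb = model.encode(["how should I brew green tea?"], convert_to_tensor=True,
                     normalize_embeddings=True)

scores = (q_emb @ c_emb.T)[0]
for rank, idx in enumerate(scores.topk(k=3).indices.tolist(), start=1):
    print(rank, round(scores[idx].item(), 3), corpus[idx])
```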
January 8, 2025 at 7:57 AM
⚠️ More often than not, these retrieval-integrated systems are connected to corpora (e.g., Wikipedia, Copilot on public codebases) that are exposed to poisoning 💉 (see: Google AI Overview’s Glue-on-Pizza Gate).

(4/16)
January 8, 2025 at 7:57 AM