Interested in buzzwords like AI and Security and wherever they meet.
Work with @megamor2.bsky.social and @mahmoods01.bsky.social.
For all the details, check out the full paper and our code!
📄Paper: arxiv.org/abs/2506.12880
💻Code: github.com/matanbt/inte...
Our “Hijacking Suppression” approach drastically reduces GCG attack success with minimal utility loss, and we expect further refinement of this initial framework to improve results.
Having observed that GCG relies on a strong hijacking mechanism, while benign prompts rarely show such behavior, we demonstrate that (training-free) suppression of the top hijacking vectors at inference time can hinder jailbreaks.
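To make the idea of suppressing top vectors concrete, here is a toy sketch. It assumes, purely for illustration, that each "hijacking vector" is one additive contribution to a hidden state and that each comes with a precomputed hijacking score. The function name, the scoring, and the `scale` parameter are all hypothetical; see the paper and code for the actual method.

```python
import numpy as np

def suppress_top_vectors(contribs, scores, top_k, scale=0.0):
    """Scale down the top_k highest-scoring contribution vectors, then sum.

    contribs: (n_vectors, d_model) additive contributions (assumed layout).
    scores:   (n_vectors,) hypothetical hijacking score per contribution.
    scale:    multiplier applied to the suppressed vectors (0.0 removes them).
    """
    contribs = np.asarray(contribs, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Guard top_k == 0: argsort(...)[-0:] would select everything.
    if top_k > 0:
        top = np.argsort(scores)[-top_k:]
    else:
        top = np.array([], dtype=int)
    out = contribs.copy()
    out[top] *= scale
    return out.sum(axis=0), top

# Toy data: three contributions, the second one scores highest.
contribs = np.eye(3)
scores = np.array([0.1, 0.9, 0.3])
hidden, suppressed = suppress_top_vectors(contribs, scores, top_k=1)
```

In this toy example only the highest-scoring contribution is zeroed out, so the other two pass through unchanged. Nothing is retrained, which is the sense in which such an intervention is training-free.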
Leveraging our findings, we add a hijacking-enhancing objective to GCG's optimization (GCG-Hij), which reliably produces more universal adversarial suffixes at no extra computational cost.
Inspecting hundreds of GCG suffixes, we find that the more universal a suffix is, the stronger its hijacking mechanism, as measured w/o generation.
This suggests hijacking is an essential property to which powerful suffixes converge.
Analyzing this mechanism, we quantify the dominance of token-subsequences, finding GCG suffixes (adv.) abnormally hijack the attention activations, while suppressing the harmful instruction (instr.).
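As a rough intuition for comparing the dominance of token subsequences, the sketch below sums the attention weight that one query position gives to each span (instr., adv., chat). This attention-mass share is a simplified stand-in and is not claimed to be the paper's dominance metric. The span boundaries and weights are made up.

```python
import numpy as np

def span_attention_share(attn_row, spans):
    """Fraction of one query's attention mass that falls on each named span.

    attn_row: 1-D attention weights from a single query position.
    spans:    dict mapping a span name to a (start, end) half-open index range.
    """
    attn_row = np.asarray(attn_row, dtype=float)
    total = attn_row.sum()
    return {
        name: float(attn_row[start:end].sum() / total)
        for name, (start, end) in spans.items()
    }

# Made-up attention row over 6 source tokens.
attn_row = np.array([0.05, 0.05, 0.10, 0.30, 0.40, 0.10])
spans = {"instr": (0, 3), "adv": (3, 5), "chat": (5, 6)}
shares = span_attention_share(attn_row, spans)
```

With these made-up numbers the adv. span takes most of the mass while the instr. span gets little, which is the qualitative pattern the post describes.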
Concretely, knocking out this link eliminates the attack, while, conversely, patching it onto failed jailbreaks restores success.
This provides a mechanistic view of the known shallowness of safety alignment.
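For readers unfamiliar with attention knockout, here is a minimal single-head sketch of the general technique: the attention scores from chosen query positions (standing in for the final chat tokens) to chosen key positions (standing in for the adversarial suffix) are set to minus infinity before the softmax, so no information flows along that link. Shapes, indices, and the function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def attention_with_knockout(q, k, v, query_idx, key_idx):
    """Causal single-head attention with selected query->key links blocked.

    q, k, v: (n_tokens, d) arrays. query_idx and key_idx must be disjoint
    so that every query keeps at least one position it can attend to.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Standard causal mask: a token cannot attend to later tokens.
    causal = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)
    # Knockout: block attention from query_idx rows to key_idx columns.
    scores[np.ix_(query_idx, key_idx)] = -np.inf
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
adv_idx, chat_idx = [2, 3], [4, 5]  # hypothetical token positions
out, weights = attention_with_knockout(q, k, v, chat_idx, adv_idx)
```

The reverse experiment, patching, would instead copy the relevant activations from a successful run into a failed one. It is not shown here.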
We localize a critical information flow from the adversarial suffix to the final chat tokens before generation, finding through causal interventions that it is both necessary and sufficient for the jailbreak.
Paper: arxiv.org/abs/2412.20953
Code & Demo Notebook: github.com/matanbt/GASL...
(16/16) ⬛
(15/16)
(14/16)
(13/16)
Among our findings, we link key properties in retrievers’ embedding spaces to their vulnerability:
(12/16)
Some models show surprisingly high vulnerability even in this challenging setting, where we targeted MSMARCO's diverse and wide held-out query set.
(11/16)
(10/16)
(9/16)
(8/16)
(7/16)
(6/16)
(5b/16)
(5a/16)
(4/16)