Now: @jhuclsp @jhucompsci
Past: @allen_ai @uwnlp @Penn @cogcomp @Illinois_Alma @MSFTResearch
Huge thanks to @N8Programs for leading the work, and to collaborators @anqi_liu33 @aamixsh @mrevsine @mike_schatz.
What's remarkable is that their overall pattern closely mirrors LLMs:
→ similar few-shot pattern induction
→ similar improvement with model scale
... all learned purely from DNA (nucleotide) sequences.
Work led by @aamixsh and in collaboration with @anqi_liu33.
@HopkinsEngineer @JHUCompSci
x.com/aamixsh/sta...
1️⃣ Do ICL and SFT operate differently?
2️⃣ And if so, can one 𝗹𝗲𝘃𝗲𝗿𝗮𝗴𝗲 𝘁𝗵𝗲𝗶𝗿 𝗰𝗼𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝗿𝗶𝘁𝘆 𝗳𝗼𝗿 𝗯𝗲𝘁𝘁𝗲𝗿 𝗮𝗱𝗮𝗽𝘁𝗮𝘁𝗶𝗼𝗻?
Code: github.com/JHU-CLSP/he...
With @andrewwnlp (lead), Sophia Hager, Adi Asija, Nicholas Andrews @HopkinsEngineer @JohnsHopkins
(2) RAG on tool schemas doesn’t solve it. Across models, we observe a significant accuracy gap between adversarial and non-adversarial settings.
From our analysis in a controlled environment, we find:
@aagohary @ASMIftekhar1 and others.
🔗 Project page: aka.ms/jailbreak-d...
📊 Dataset: huggingface.co/datasets/ja...