🚨 State-of-the-art detectors today are too shallow
📉 A bit of style alignment makes them crumble
🧠 We need stronger benchmarks
🛠 We develop a method for generating hard, in-domain texts to train and evaluate the next generation of more robust, reliable MGT detectors
Human performance was unaffected: annotators detected machine-generated text poorly (around 50% accuracy in a binary task) both before and after our alignment.
- 🕵️ Mage
- 🎯 Radar
- 🔍 LLM-DetectAIve
- 👁 Binoculars
- Two domain-specific detectors trained by us: a linear SVM and a RoBERTa model.
Against our type of attack, the most robust detector was Radar.
Most detectors rely on shallow stylistic cues such as word length, punctuation patterns, and sentence structure. Aligning LLMs to human style shifts their writing toward human norms, and detectors can't keep up.
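To make "shallow stylistic cues" concrete, here is a minimal sketch of the kind of surface features such detectors often lean on. This is an illustrative toy, not the actual feature set of any detector named above:

```python
import string

def shallow_style_features(text: str) -> dict:
    """Toy surface-level stylistic features (illustrative sketch only)."""
    words = text.split()
    # crude sentence split on terminal punctuation
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    n_chars = len(text)
    return {
        # average word length: style alignment can shift this toward human norms
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        # punctuation density: a classic shallow cue
        "punct_ratio": sum(c in string.punctuation for c in text) / max(n_chars, 1),
        # average sentence length in words: a crude proxy for sentence structure
        "avg_sent_len": len(words) / max(len(sentences), 1),
    }

print(shallow_style_features("Short words. Very short words!"))
```

A classifier built on features like these is exactly what style alignment undermines: once the LLM's surface statistics match human writing, these cues stop being discriminative.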
We fine-tune LLMs via Direct Preference Optimization (DPO) on pairs of human-written and machine-generated texts, marking the former as preferred. The goal is to shift the LLMs' writing style toward humans'.
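A minimal sketch of how such preference pairs might be assembled. The prompt/chosen/rejected field names follow the convention used by libraries such as TRL's DPOTrainer; the example texts are hypothetical, and this is not the paper's exact pipeline:

```python
def build_dpo_pairs(examples):
    """Turn (prompt, human_text, machine_text) triples into DPO preference
    records: human-written text is marked as preferred ("chosen") and
    machine-generated text as dispreferred ("rejected")."""
    return [
        {"prompt": prompt, "chosen": human, "rejected": machine}
        for prompt, human, machine in examples
    ]

pairs = build_dpo_pairs([
    (
        "Write a short news lead.",                                # hypothetical prompt
        "Officials confirmed the deal late Tuesday.",              # human-written (chosen)
        "In a significant development, officials have confirmed the deal.",  # machine-generated (rejected)
    )
])
print(pairs[0]["chosen"])
```

Records in this shape can then be handed to a DPO trainer, which optimizes the model to prefer the human-written completion over its own machine-generated one.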