Michele Papucci
@mpapucci.bsky.social
@mpapucci_ on X.
8/ TL;DR:
🚨 State-of-the-art Detectors today are too shallow
📉 A bit of style alignment makes them crumble
🧠 We need stronger benchmarks
🛠 We develop a way to create hard, in-domain texts for training and evaluating the next generation of more robust and reliable MGT Detectors
June 3, 2025 at 1:22 PM
7/ What about Humans?
Human performance was unaffected: annotators detected machine-generated text poorly (around 50% accuracy in a binary task) both before and after our alignment.
June 3, 2025 at 1:22 PM
6/ We tested a bunch of state-of-the-art detectors:
- 🕵️ Mage
- 🎯 Radar
- 🔍 LLM-DetectAIve
- 👁 Binoculars
- Two domain-specific detectors trained by us: a Linear-SVM (sketched below) and a RoBERTa.
The most robust detector against our type of attack was Radar.
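For reference, a minimal sketch of a Linear-SVM detector of the kind we trained; the char n-gram TF-IDF features and the toy training data are illustrative assumptions, not our exact setup:

```python
# Minimal Linear-SVM MGT detector sketch (scikit-learn).
# Feature choice is an assumption, not the exact configuration from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # shallow character-level features
    LinearSVC(),
)

# Toy corpus: replace with real human/machine texts from the target domain.
texts = ["a human-written news article ...", "a machine-generated news article ..."]
labels = [0, 1]  # 0 = human, 1 = machine
detector.fit(texts, labels)
print(detector.predict(["some unseen text to classify"]))
```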
June 3, 2025 at 1:22 PM
5/ We tested two ways of selecting texts for alignment: a random one and a linguistically motivated one. The latter proved better at aligning specific feature distributions of an LLM with the human ones, but the former seemed to work better at dropping detector accuracy. A sketch of both strategies is below.
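A hedged sketch of the two strategies; the "linguistic" variant here (keeping the machine texts whose stylistic features sit farthest from the human mean) is our illustration of the idea, not the paper's exact criterion:

```python
import random
import statistics

def select_pairs(pairs, features, k, strategy="random"):
    """pairs: list of (human_text, machine_text); features: text -> list of floats."""
    if strategy == "random":
        return random.sample(pairs, k)  # uniform selection
    # "Linguistic" selection (illustrative): keep the k machine texts whose
    # feature vectors are farthest from the mean human feature vector.
    human_mean = [statistics.mean(col)
                  for col in zip(*(features(h) for h, _ in pairs))]
    def dist(pair):
        return sum((a - b) ** 2 for a, b in zip(features(pair[1]), human_mean))
    return sorted(pairs, key=dist, reverse=True)[:k]
```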
June 3, 2025 at 1:22 PM
4/ We tested on two domains (News and Abstracts), with two families of models (Llama and Gemma). Detector performance on text generated by the aligned models dropped by up to 60%.
June 3, 2025 at 1:21 PM
3/ Why does it work?
Most detectors rely on shallow stylistic cues: word length, punctuation patterns, and sentence structure. Aligning LLMs to human style shifts the model's writing style towards the human one, and Detectors can't keep up.
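A toy illustration (not code from the paper) of three such shallow features a detector might lean on, and that alignment shifts:

```python
import re
import statistics

def stylistic_features(text: str) -> dict:
    """Extract a few shallow stylistic cues from a text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = re.findall(r"[^\w\s]", text)
    return {
        "avg_word_len": statistics.mean(len(w) for w in words),
        "punct_per_char": len(punctuation) / max(len(text), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }

print(stylistic_features("Detectors lean on cues like these. Shift them, and they break."))
```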
June 3, 2025 at 1:21 PM
2/ We introduce a simple pipeline:
We fine-tune LLMs via Direct Preference Optimization (DPO) on pairs of human-written and machine-generated texts, marking the human-written text as preferred. The goal is to shift the LLMs' writing style towards the human one. A minimal sketch of this step is below.
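A minimal sketch of the alignment step using Hugging Face TRL's DPOTrainer; the model name, hyperparameters, and toy preference pair are illustrative, not our exact configuration:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: for the same prompt, the human-written text is "chosen"
# and the machine-generated text is "rejected".
pairs = Dataset.from_dict({
    "prompt":   ["Write a news article about ..."],
    "chosen":   ["<human-written article>"],
    "rejected": ["<machine-generated article>"],
})

config = DPOConfig(output_dir="dpo-style-aligned", beta=0.1)  # illustrative hyperparameters
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL versions
)
trainer.train()
```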
June 3, 2025 at 1:21 PM