Jan Dubiński
@jandubinski.bsky.social
PhD student in Machine Learning @Warsaw University of Technology and @IDEAS NCBR
CDI confidently identifies training data with as few as 70 suspect samples!

Please check out the paper for more:
📜https://arxiv.org/abs/2411.12858
June 13, 2025 at 7:11 PM
Instead, we propose CDI, a method that empowers data owners to check whether their data was used to train a DM. CDI selectively combines diverse membership signals across multiple samples and applies statistical testing.
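For intuition, a rough dataset-level inference sketch (not the exact CDI procedure): given per-sample membership scores for the suspect set and for known non-member data, run a one-sided statistical test. The Mann-Whitney U test and the synthetic scores below are my own illustrative assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def dataset_membership_test(suspect_scores, nonmember_scores, alpha=0.01):
    """Dataset-level inference sketch (not the exact CDI procedure):
    instead of classifying each sample, test whether the suspect set's
    membership scores are systematically higher than those of known
    non-member data, via a one-sided Mann-Whitney U test."""
    _, p_value = mannwhitneyu(suspect_scores, nonmember_scores,
                              alternative="greater")
    return p_value < alpha, p_value  # (flagged as training data?, p-value)

# Synthetic example: 70 suspect samples with slightly elevated scores.
rng = np.random.default_rng(0)
members = rng.normal(0.2, 1.0, size=70)
nonmembers = rng.normal(0.0, 1.0, size=500)
print(dataset_membership_test(members, nonmembers))
```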
June 13, 2025 at 7:11 PM
Unfortunately, state-of-the-art Membership Inference Attacks struggle to identify training data in large DMs - often performing close to random guessing (True Positive Rate = 1% at False Positive Rate = 1%), e.g. on DMs trained on ImageNet.
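For reference, TPR at a fixed low FPR is the standard way MIA success is reported; a minimal sketch, assuming we already have attack scores for members and non-members (higher score = more likely member):

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """True Positive Rate at a fixed low False Positive Rate."""
    # Pick the threshold that lets only `target_fpr` of non-members through...
    threshold = np.quantile(nonmember_scores, 1.0 - target_fpr)
    # ...and measure how many true members still score above it.
    return float(np.mean(np.asarray(member_scores) > threshold))

# Random guessing yields TPR roughly equal to FPR, i.e. ~1% at FPR = 1%.
```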
June 13, 2025 at 7:11 PM
DMs benefit from large and diverse datasets for training - often sourced without the data owners' consent.

This raises a key question: was your data used? Membership Inference Attacks aim to find out by determining whether a specific data point was part of a model’s training set.
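A minimal sketch of the simplest such attack, loss thresholding: a sample with unusually low loss is guessed to be a training member. `model.loss` here is a hypothetical per-sample loss, not a real API.

```python
import torch

@torch.no_grad()
def loss_threshold_mia(model, x, threshold):
    """Classic loss-thresholding heuristic: flag a sample as a likely training
    member if its loss is unusually low. `model.loss(x)` stands in for a
    per-sample loss, e.g. the denoising loss of a diffusion model."""
    return model.loss(x) < threshold  # True -> predicted "member"
```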
June 13, 2025 at 7:11 PM
🚨We’re thrilled to present our paper “CDI: Copyrighted Data Identification in #DiffusionModels” at #CVPR2025 in Nashville! 🎸❗️

"Was this diffusion model trained on my dataset?"
Learn how to find out:
📍 Poster #276
🗓️ Saturday, June 14
🕒 3:00 – 5:00 PM PDT
📜https://arxiv.org/abs/2411.12858
June 13, 2025 at 7:11 PM
⚠️ That's not all!

Large IARs memorize and regurgitate training data at an alarming rate, creating risks of copyright infringement, privacy violations, and dataset exposure.

🖼️ Our data extraction attack recovered up to 698 training images from the largest VAR model.

🧵 4/
February 5, 2025 at 6:36 PM
⚠️ How serious is it?

🔍 Our findings are striking: attacks for identifying training samples are orders of magnitude more effective on IARs than DMs.

🧵 3/
February 5, 2025 at 6:36 PM
IARs deliver higher quality, faster generation, and better scalability than #DiffusionModels (DMs), using techniques similar to Large Language Models like #GPT.
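For intuition, a toy sketch of that GPT-style generation (names are illustrative, not any specific IAR's API): the image becomes a sequence of discrete tokens that a transformer samples one at a time.

```python
import torch

@torch.no_grad()
def sample_image_tokens(transformer, num_tokens):
    """Toy autoregressive sampling loop: an IAR predicts the next discrete
    image token from all previous ones, GPT-style.
    `transformer(tokens)` is assumed to return next-token logits."""
    tokens = torch.zeros(1, 1, dtype=torch.long)        # hypothetical start token
    for _ in range(num_tokens):
        logits = transformer(tokens)[:, -1, :]          # logits for next position
        next_token = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens  # a separate VQ decoder maps tokens back to pixels
```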

💡 Impressive? Absolutely. Safe? Not so much.

We find that IARs are highly vulnerable to privacy attacks.

🧵 2/
February 5, 2025 at 6:36 PM
🚨 Image AutoRegressive Models Leak More Training Data Than Diffusion Models🚨

IARs — like the #NeurIPS2024 Best Paper — now lead in AI image generation. But at what risk?

IARs:
🔍 Are more likely than DMs to reveal training data
🖼️ Leak entire training images verbatim

🧵 1/
February 5, 2025 at 6:36 PM
😊 Happy to Share!

🎉 Our paper "Learning Graph Representation of Agent Diffusers (LGR-AD)" has been accepted as a full paper at #AAMAS (A*), the International Conference on Autonomous Agents and Multiagent Systems!

#diffusion #graphs #agentsystem
@ideas-ncbr.bsky.social #WarsawUniversityOfTechnology
December 20, 2024 at 3:27 PM