Jan Dubiński
@jandubinski.bsky.social
PhD student in Machine Learning @Warsaw University of Technology and @IDEAS NCBR
CDI confidently identifies training data with as few as 70 suspect samples!

Please check out the paper for more:
📜https://arxiv.org/abs/2411.12858
June 13, 2025 at 7:11 PM
Instead, we propose CDI, a method that empowers data owners to check whether their data was used to train a DM. CDI selectively combines diverse membership signals across multiple samples and applies statistical testing.
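For intuition, a rough dataset-level inference sketch (not the exact CDI procedure): given per-sample membership scores for the suspect set and for known non-member data, run a one-sided statistical test. The Mann-Whitney U test and the synthetic scores below are my own illustrative assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def dataset_membership_test(suspect_scores, nonmember_scores, alpha=0.01):
    """Dataset-level inference sketch (not the exact CDI procedure):
    instead of classifying each sample, test whether the suspect set's
    membership scores are systematically higher than those of known
    non-member data, via a one-sided Mann-Whitney U test."""
    _, p_value = mannwhitneyu(suspect_scores, nonmember_scores,
                              alternative="greater")
    return p_value < alpha, p_value  # (flagged as training data?, p-value)

# Synthetic example: 70 suspect samples with slightly elevated scores.
rng = np.random.default_rng(0)
members = rng.normal(0.2, 1.0, size=70)
nonmembers = rng.normal(0.0, 1.0, size=500)
print(dataset_membership_test(members, nonmembers))
```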
June 13, 2025 at 7:11 PM
Unfortunately, state-of-the-art Membership Inference Attacks struggle to identify training data in large DMs - often performing close to random guessing (True Positive Rate = 1% at False Positive Rate = 1%), e.g. on DMs trained on ImageNet.
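For reference, TPR at a fixed low FPR is the standard way MIA success is reported; a minimal sketch, assuming we already have attack scores for members and non-members (higher score = more likely member):

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """True Positive Rate at a fixed low False Positive Rate."""
    # Pick the threshold that lets only `target_fpr` of non-members through...
    threshold = np.quantile(nonmember_scores, 1.0 - target_fpr)
    # ...and measure how many true members still score above it.
    return float(np.mean(np.asarray(member_scores) > threshold))

# Random guessing yields TPR roughly equal to FPR, i.e. ~1% at FPR = 1%.
```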
June 13, 2025 at 7:11 PM
DMs benefit from large and diverse datasets for training - often sourced without the data owners' consent.

This raises a key question: was your data used? Membership Inference Attacks aim to find out by determining whether a specific data point was part of a model’s training set.
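A minimal sketch of the simplest such attack, loss thresholding: a sample with unusually low loss is guessed to be a training member. `model.loss` here is a hypothetical per-sample loss, not a real API.

```python
import torch

@torch.no_grad()
def loss_threshold_mia(model, x, threshold):
    """Classic loss-thresholding heuristic: flag a sample as a likely training
    member if its loss is unusually low. `model.loss(x)` stands in for a
    per-sample loss, e.g. the denoising loss of a diffusion model."""
    return model.loss(x) < threshold  # True -> predicted "member"
```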
June 13, 2025 at 7:11 PM
🚨We’re thrilled to present our paper “CDI: Copyrighted Data Identification in #DiffusionModels” at #CVPR2025 in Nashville! 🎸❗️

"Was this diffusion model trained on my dataset?"
Learn how to find out:
📍 Poster #276
🗓️ Saturday, June 14
🕒 3:00 – 5:00 PM PDT
📜https://arxiv.org/abs/2411.12858
June 13, 2025 at 7:11 PM
⚠️ That's not all!

Large IARs memorize and regurgitate training data at an alarming rate, creating risks of copyright infringement, privacy violations, and dataset exposure.

🖼️ Our data extraction attack recovered up to 698 training images from the largest VAR model.

🧵 4/
February 5, 2025 at 6:36 PM
⚠️ How serious is it?

🔍 Our findings are striking: attacks for identifying training samples are orders of magnitude more effective on IARs than DMs.

🧵 3/
February 5, 2025 at 6:36 PM
IARs deliver higher quality, faster generation, and better scalability than #DiffusionModels (DMs), using techniques similar to Large Language Models like #GPT.
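For intuition, a toy sketch of that GPT-style generation (names are illustrative, not any specific IAR's API): the image becomes a sequence of discrete tokens that a transformer samples one at a time.

```python
import torch

@torch.no_grad()
def sample_image_tokens(transformer, num_tokens):
    """Toy autoregressive sampling loop: an IAR predicts the next discrete
    image token from all previous ones, GPT-style.
    `transformer(tokens)` is assumed to return next-token logits."""
    tokens = torch.zeros(1, 1, dtype=torch.long)        # hypothetical start token
    for _ in range(num_tokens):
        logits = transformer(tokens)[:, -1, :]          # logits for next position
        next_token = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens  # a separate VQ decoder maps tokens back to pixels
```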

💡 Impressive? Absolutely. Safe? Not so much.

We find that IARs are highly vulnerable to privacy attacks.

🧵 2/
February 5, 2025 at 6:36 PM
🚨 Image AutoRegressive Models Leak More Training Data Than Diffusion Models🚨

IARs — like the #NeurIPS2024 Best Paper — now lead in AI image generation. But at what risk?

IARs:
🔍 Are more likely than DMs to reveal training data
🖼️ Leak entire training images verbatim

🧵 1/
February 5, 2025 at 6:36 PM
😊 Happy to Share!

🎉 Our paper "Learning Graph Representation of Agent Diffusers (LGR-AD)" has been accepted as a full paper at #AAMAS (A*), the International Conference on Autonomous Agents and Multiagent Systems!

#diffusion #graphs #agentsystem
@ideas-ncbr.bsky.social #WarsawUniversityOfTechnology
December 20, 2024 at 3:27 PM