Pratyush Maini
@pratyushmaini.bsky.social
Data Quality x Privacy
PhD student @ CMU with Zico Kolter and Zack Lipton | Founding Member @datologyai.com | Prev. Comp Sc @iitdelhi

http://pratyushmaini.github.io/
I have been thinking about data privacy, data curation for video models, finetuning vs. pretraining, how alignment data interacts with LLM safety, and its relation to unlearning. Also, very curious to hear about some of the most exciting problems folks in India are working on!
December 10, 2024 at 10:59 PM
I’ll also be spending time at the @datologyai.com booth to talk about how we curated our way to the best LLM training dataset! Please DM if you would like to chat. The best part about being a researcher is sharing the excitement of what we have been working on with each other.
December 10, 2024 at 10:59 PM
4/We ended up simulating the bias for a company that "acts in good faith", and found that even in that case, merely sharing an annotator pool (between curators and evaluators) can give the company's customers a 44-point Elo boost... massive bragging rights in today's LLM landscape.
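For intuition on the size of a 44-point Elo boost, here is a back-of-the-envelope conversion (a generic Elo calculation, not the simulation above): under the standard Elo model, a rating gap of d points implies an expected head-to-head win rate of 1/(1 + 10^(-d/400)), so roughly 44 Elo corresponds to winning about 56% of comparisons instead of 50%.

```python
import math

def elo_gap_to_winrate(d: float) -> float:
    """Expected win rate implied by an Elo rating gap of d points."""
    return 1.0 / (1.0 + 10 ** (-d / 400.0))

def winrate_to_elo_gap(p: float) -> float:
    """Elo rating gap implied by a head-to-head win rate p (0 < p < 1)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

# A fair matchup is a 50% win rate (0 Elo gap); a modest annotator-induced
# preference shift to ~56% wins is already worth ~44 Elo points.
print(f"{elo_gap_to_winrate(44):.3f}")      # ~0.563
print(f"{winrate_to_elo_gap(0.563):.1f}")   # ~44.0
```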
November 27, 2024 at 7:05 PM
3/(Risk 2): Mere commonality in infra between data curators & evaluators can cause significant eval bias, even when they have no ill-intentioned financial motives.

"Common infra" includes question templates, topics, styles, annotators, etc., with shared annotators being the least privileged form of access.
November 27, 2024 at 7:05 PM
2/Taking a closer look at SEAL: ScaleAI specializes in data curation for LLM trainers and has now begun establishing its own private evaluations. Two major concerns:

(Risk 1): There is a massive financial incentive for such companies to design evals that even marginally favor their own customers.
November 27, 2024 at 7:05 PM
6/6 Please read the blog we wrote in order to avoid a byte-sized criticism of someone's hard work: www.anshumansuri.com/blog/2024/ca...

If you work on MIAs for LLMs, repeat after me: Temporally shifted benchmarks 👏 do 👏 not test membership.
Reassessing EMNLP 2024’s Best Paper: Does Divergence-Based Calibration for Membership Inference Attacks Hold Up? | Anshuman Suri
TL;DR: No. A critical analysis of the EMNLP Best Paper proposing a divergence-based calibration for Membership Inference Attacks (MIAs). We explore its experimental shortcomings, ...
www.anshumansuri.com
November 26, 2024 at 5:59 PM
5/6 This isn’t just a one-off issue with awards in ML. We are repeatedly seeing this concerning trend. It misguides researchers, misrepresents progress & harms trust in our field. Remember the ICML awards fiasco from a few years ago? www.reddit.com/r/MachineLea...
From the MachineLearning community on Reddit: [D] ICML 2022 Outstanding Paper Awards 🔥
www.reddit.com
November 26, 2024 at 5:59 PM
4/6 We re-implemented the method and tested it on corrected setups; the results are suggestive of temporal shift, showing up as both false positives & false negatives.

Even more unfortunately, this paper cites Duan et al. (so the authors are aware of the flaws in the setup), yet creates a new temporally shifted MIA benchmark.
November 26, 2024 at 5:59 PM
2/6 One of the Best Paper Awards at EMNLP went to a paper claiming successful MIAs for LLMs.

Unfortunately, the benchmarks studied are all "temporally shifted". At this point, we know very well that these benchmarks give a false sense of membership success by detecting distributional differences.
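To make the "distributional differences" point concrete, here is a toy sketch (an illustrative example, not the paper's method or our re-analysis): when members and non-members come from different time periods, a membership-blind score that only picks up a temporal proxy, e.g. the prevalence of post-cutoff vocabulary, can still look like a successful attack on such a benchmark.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Stand-in "temporal proxy" feature (e.g., prevalence of post-cutoff vocabulary).
# Members come from an earlier period, non-members from a later one, so the
# proxy is shifted between the two groups -- no memorization signal at all.
member_proxy = rng.normal(loc=0.0, scale=1.0, size=n)
nonmember_proxy = rng.normal(loc=0.7, scale=1.0, size=n)

labels = np.concatenate([np.ones(n), np.zeros(n)])          # 1 = member
scores = np.concatenate([-member_proxy, -nonmember_proxy])  # "membership" score = negated proxy

# A membership-blind attack that only detects the temporal shift still looks "successful".
print(f"AUC from the temporal proxy alone: {roc_auc_score(labels, scores):.2f}")  # ~0.69
```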
November 26, 2024 at 5:59 PM
5/5 Check out @leavittron.bsky.social's detailed bsky thread below:
bsky.app/profile/leav...

And join us (@arimorcos.bsky.social @agcrnz.bsky.social @alvin-d.bsky.social, and many more who shaped this work)!

We are only getting started: jobs.ashbyhq.com/DatologyAI
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
November 25, 2024 at 6:43 PM
4/5 This was no small feat.
A small team, punching far above its weight, took on giants in an extremely competitive space and delivered kick-ass results. Huge shoutout to my amazing teammates, especially Jack Urbanek & @leavittron.bsky.social — absolute legends. 🙌
Let’s keep pushing 👊
November 25, 2024 at 6:43 PM
3/5 How did we do it?
🎯 Carefully designed quality filters.
🔍 Deep understanding of synthetic data.
📐 Analyzing geometric properties of unsupervised data.
👀 Constantly looking at data!
It’s all in our deep dive: tinyurl.com/best-llm-data
Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset
Our data curation pipeline to obtain substantial improvements in LLM quality, training speed, and inference efficiency.
datologyai.com
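For readers wondering what "geometric properties of unsupervised data" can look like in practice, here is a generic, hypothetical sketch of one such idea: greedy near-duplicate pruning by cosine similarity in embedding space. It is purely illustrative (the threshold and the random stand-in "embeddings" are assumptions), and not the actual curation pipeline described in the deep dive.

```python
import numpy as np

def near_duplicate_filter(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate pruning: keep a document only if its cosine
    similarity to every previously kept document is below `threshold`.
    O(n^2) -- fine for a sketch, not for web scale (use ANN indexes there)."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(X)):
        if all(float(X[i] @ X[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy usage with random vectors standing in for document embeddings; in practice
# these would come from an embedding model run over the pretraining corpus.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))
docs[1] = docs[0] + 0.01 * rng.normal(size=64)  # inject one near-duplicate
print(len(near_duplicate_filter(docs)))         # expected: 99 (the near-duplicate is dropped)
```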
November 25, 2024 at 6:43 PM
2/5 🥁Results🥁 We smashed past results, beating both DCLM and FW-Edu by significant margins. 🚀
Models trained on our curated data achieved:
• 4.4% better performance than DCLM
• 2x faster training than FW-Edu
• A 1.3B model that outperforms 2.7B models trained on DCLM & FW-Edu
November 25, 2024 at 6:43 PM