Pratyush Maini
pratyushmaini.bsky.social
Data Quality x Privacy
PhD student @ CMU with Zico Kolter and Zack Lipton | Founding Member @datologyai.com | Prev. Comp Sc @iitdelhi

http://pratyushmaini.github.io/
Came to #NeurIPS2024 for the research news, but staying for these incredible views. I am presenting some recent work that (I think) significantly advances the discourse on LLM memorization and training data detection, plus a study on hallucinations x model collapse in diffusion models.
December 10, 2024 at 10:59 PM
4/We ended up simulating the bias for a company that "acts in good faith", & found that even in such a case, merely sharing an annotator pool (b/w curators and evaluators) can give the company's customers a 44-point Elo boost... massive bragging rights in today's LLM landscape.
November 27, 2024 at 7:05 PM
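For a sense of scale on that 44-point figure: a minimal sketch, assuming the standard Elo logistic model with a 400-point scale (the post does not say how the rating was computed, so both the formula choice and the function name `elo_expected_score` are illustrative assumptions).

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected head-to-head score of the higher-rated side.

    Assumes the standard Elo logistic curve; `scale` = 400 is the
    conventional chess value, not something stated in the thread.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    gap = 44  # the rating boost quoted in the thread
    print(f"Expected score with a {gap}-point edge: {elo_expected_score(gap):.3f}")
```

Under these assumptions, a 44-point edge corresponds to an expected head-to-head score of roughly 0.56 instead of 0.50.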
3/(Risk 2): The mere commonality in infra b/w data curators & evaluators can cause significant eval bias, even when they do not have ill-founded financial motives.

"common infra" includes question templates, topics, styles, annotators, etc.
> common annotators being the least privileged access.
November 27, 2024 at 7:05 PM
2/Taking a closer look at SEAL: ScaleAI specializes in data curation for LLM trainers and has now begun establishing its own private evaluations. Two major concerns:

(Risk 1): There is a massive financial incentive for such companies to design evals that even marginally favor their own customers.
November 27, 2024 at 7:05 PM
1/Open LLM evals often face data contamination concerns. Private curators (like ScaleAI) have addressed this with private + expert evaluations.

We argue that this shift poses new risks including financial incentives & eval bias.
w/ @hbxnov.bsky.social

📝: pratyushmaini.github.io/blog/2024/ri... 🧵
November 27, 2024 at 7:05 PM
1/6 A lot of us are grappling with peer review these days, but its worst manifestation is when prestigious conference awards overlook critical flaws.

Case in point: #EMNLP2024’s Best Paper Award.

I & @iamgroot42.bsky.social wrote a blog on what went wrong: www.anshumansuri.com/blog/2024/ca... 🧵
November 26, 2024 at 5:59 PM
2/5 🥁Results🥁 We smashed past results, beating both DCLM and FW-Edu by significant margins. 🚀
With our models trained on curated data, we saw:
• 4.4% better performance than DCLM
• 2x faster training than FW-Edu
• Our 1.3B model outperforming 2.7B models trained on DCLM & FW-Edu
November 25, 2024 at 6:43 PM
1/5 Earlier this year, I joined @datologyai.com to give wings to the data research I had been doing in academia. Today, I am absolutely thrilled to share what we’ve been working on!

Techvember Ep 2: How we made the #1 LLM Pre-training Data Recipe.

Blog: 👉 tinyurl.com/best-llm-data 🧵
November 25, 2024 at 6:43 PM
context from X: One of my dreams when I started my PhD was to teach my own course. I am very excited that I'm getting a chance to create & teach a new "gamified" course at CMU this Fall. 10-799: Data Privacy, Memorization & Copyright in GenAI starts tomorrow!
pratyushmaini.github.io/cmu-10-799
November 19, 2024 at 9:38 AM
pretty excited about tomorrow's class. we will know the winner of our first red-blue team pokemon unlearning challenge. 620 more battles to go ⚔️
November 19, 2024 at 9:38 AM