Lightnews — Scholar-powered news

Pratyush Maini

@pratyushmaini.bsky.social

290 followers 210 following 24 posts

Data Quality x Privacy
PhD student @ CMU with Zico Kolter and Zack Lipton | Founding Member @datologyai.com | Prev. Comp Sc @iitdelhi

http://pratyushmaini.github.io/


Pratyush Maini

@pratyushmaini.bsky.social

Came to #NeurIPS2024 for the research news, but staying for these incredible views. I am presenting some recent works that (I think) significantly advance the discourse on LLM memorization and training data detection, plus a study on hallucinations x model collapse in diffusion models.

December 10, 2024 at 10:59 PM

Pratyush Maini

@pratyushmaini.bsky.social

4/ We simulated the bias for a company that "acts in good faith", & found that even in such a case, merely sharing an annotator pool (b/w curators and evaluators) can give the company's customers a 44-point Elo boost: massive bragging rights in today's LLM landscape.

November 27, 2024 at 7:05 PM
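
For a rough sense of scale, the 44-point figure can be converted into a head-to-head win rate. The sketch below assumes the standard logistic Elo formula with a 400-point scale; the post does not say which Elo variant the study used, so the function name and the `scale` default are illustrative assumptions rather than the authors' method.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    rating_gap: rating of side A minus rating of side B.
    scale: 400 is the conventional Elo scale (an assumption here; the
           post does not specify the variant used).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    # A 44-point gap under the assumed formula.
    print(round(elo_expected_score(44), 3))  # prints 0.563
```

Under these assumptions, a 44-point advantage corresponds to winning roughly 56% of pairwise comparisons against an otherwise equal model.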

Pratyush Maini

@pratyushmaini.bsky.social

3/(Risk 2): The mere commonality in infra b/w data curators & evaluators can cause significant eval bias, even when they do not have ill-founded financial motives.

"common infra" includes question templates, topics, styles, annotators, etc.
> common annotators being the least privileged access.

November 27, 2024 at 7:05 PM

Pratyush Maini

@pratyushmaini.bsky.social

2/Taking a closer look at SEAL: ScaleAI specializes in data curation for LLM trainers and has now begun establishing its own private evaluations. Two major concerns:

(Risk 1): There is a massive financial incentive for such companies to design evals that even marginally favor their own customers.

November 27, 2024 at 7:05 PM

Pratyush Maini

@pratyushmaini.bsky.social

1/Open LLM evals often face data contamination concerns. Private curators (like ScaleAI) have addressed this with private + expert evaluations.

We argue that this shift poses new risks including financial incentives & eval bias.
w/ @hbxnov.bsky.social

📝: pratyushmaini.github.io/blog/2024/ri... 🧵

November 27, 2024 at 7:05 PM

Pratyush Maini

@pratyushmaini.bsky.social

1/6 A lot of us are grappling with peer review these days, but its worst manifestation is when prestigious conference awards overlook critical flaws.

Case in point: #EMNLP2024's Best Paper Award.

@iamgroot42.bsky.social & I wrote a blog post on what went wrong: www.anshumansuri.com/blog/2024/ca... 🧵

November 26, 2024 at 5:59 PM

Pratyush Maini

@pratyushmaini.bsky.social

2/5 🥁Results🥁 We smashed past results, beating both DCLM and FW-Edu by significant margins. 🚀
Our models trained on curated data:
• do 4.4% better than models trained on DCLM
• train 2x faster than on FW-Edu
• at 1.3B, outperform 2.7B models trained on DCLM & FW-Edu

November 25, 2024 at 6:43 PM

Pratyush Maini

@pratyushmaini.bsky.social

1/5 Earlier this year, I joined @datologyai.com to give wings to the data research I had been doing in academia. Today, I am absolutely thrilled to share what we’ve been working on!

Techvember Ep 2: How we made the #1 LLM Pre-training Data Recipe.

Blog: 👉 tinyurl.com/best-llm-data 🧵

November 25, 2024 at 6:43 PM

Pratyush Maini

@pratyushmaini.bsky.social

context from X: One of my dreams when I started my PhD was to teach my own course. I am very excited that I'm getting a chance to create & teach a new "gamified" course at CMU this Fall. 10-799: Data Privacy, Memorization & Copyright in GenAI starts tomorrow!
pratyushmaini.github.io/cmu-10-799

November 19, 2024 at 9:38 AM

Pratyush Maini

@pratyushmaini.bsky.social

pretty excited about tomorrow's class. we will know the winner of our first red-blue team pokemon unlearning challenge. 620 more battles to go ⚔️

November 19, 2024 at 9:38 AM
