PhD student @ CMU with Zico Kolter and Zack Lipton | Founding Member @datologyai.com | Prev. Comp Sc @iitdelhi
http://pratyushmaini.github.io/
pratyushmaini.github.io/blog/2024/ri...
Very eager to hear more feedback on this new piece!
"common infra" includes question templates, topics, styles, annotators, etc.
> common annotators being the least privileged access.
"common infra" includes question templates, topics, styles, annotators, etc.
> common annotators being the least privileged access.
(Risk 1): There is a massive financial incentive for such companies to design evals that even marginally favor their own customers.
If you work on MIAs for LLMs, repeat after me: Temporally shifted benchmarks 👏 do 👏 not test membership.
Even more unfortunate, this paper cites Duan et al. (who are aware of the flaws in the setup), yet creates a new temporally shifted MIA benchmark
Duan et al: arxiv.org/abs/2402.07841
Dataset Inference: arxiv.org/abs/2406.06443
Blind MIAs: arxiv.org/abs/2406.16201 (@floriantramer.bsky.social)
Meeus et al: arxiv.org/pdf/2406.17975
and others...
Unfortunately, the benchmarks studied are all "temporally shifted". At this point, we know very well that these benchmarks give a false sense of membership success by detecting distributional differences.
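A minimal sketch of the problem (hypothetical texts and cutoff date, no real benchmark data): when "members" predate a model's training cutoff and "non-members" postdate it, a "blind" classifier that never queries the model at all can separate the two sets from surface cues alone — so high attack accuracy on such a benchmark says nothing about membership.

```python
# Hypothetical illustration: a blind "attack" keyed on year mentions.
# Members are pre-cutoff texts, non-members post-cutoff; the classifier
# never touches a model, yet separates the sets via temporal shift.
import re

CUTOFF = 2023  # assumed training cutoff (hypothetical)

members = [      # pre-cutoff texts (made up for illustration)
    "The 2019 conference was held in Vancouver.",
    "Results from the 2021 survey were mixed.",
]
non_members = [  # post-cutoff texts (made up for illustration)
    "The 2024 election dominated headlines.",
    "A 2024 model release surprised everyone.",
]

def blind_is_member(text: str) -> bool:
    """Guess 'member' iff no year >= CUTOFF appears -- no model access."""
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    return max(years, default=0) < CUTOFF

hits = sum(blind_is_member(t) for t in members) + \
       sum(not blind_is_member(t) for t in non_members)
accuracy = hits / (len(members) + len(non_members))
```

On this toy split the blind classifier is perfect — which is exactly the false sense of membership success the post describes.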
bsky.app/profile/leav...
And join us (@arimorcos.bsky.social @agcrnz.bsky.social @alvin-d.bsky.social and many more who shaped this work)!
We are only getting started: jobs.ashbyhq.com/DatologyAI
Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving
That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
A small team, punching far above its weight, took on giants in an extremely competitive space and delivered kick-ass results. Huge shoutout to my amazing teammates, especially Jack Urbanek & @leavittron.bsky.social —absolute legends. 🙌
Let’s keep pushing 👊
🎯 Carefully designed quality filters.
🔍 Deep understanding of synthetic data.
📐 Analyzing geometric properties of unsupervised data.
👀 Constantly looking at data!
It’s all in our deep dive: tinyurl.com/best-llm-data
Our models trained on curated data:
• 4.4% better than DCLM
• 2x faster training than FW-edu
• Our 1.3B model outperforms 2.7B models trained on DCLM & FW-edu