Hamish Ivison
@hamishivi.bsky.social
I (try to) do NLP research. Antipodean abroad.
currently doing PhD @uwcse,
prev @usyd @ai2
🇦🇺🇨🇦🇬🇧
ivison.id.au
Excited to be back home in Australia (Syd/Melb) for most of April! Email or DM if you want to grab a coffee :)
March 27, 2025 at 4:11 PM
6/8 We further investigate RDS+, selecting up to millions of samples and comparing against random selection while accounting for total compute. RDS+ beats random selection at all data sizes, and once compute is taken into account, it performs significantly better at larger sizes.
March 4, 2025 at 5:10 PM
5/8 We also investigate how well these methods work when selecting one dataset for multiple downstream tasks. The best-performing method, RDS+, outperforms the Tulu 2 mixture. We also see strong results when using Arena Hard samples as query points for RDS+.
March 4, 2025 at 5:10 PM
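One simple way to turn per-task similarity scores into a single multi-task selection is to rank the pool separately for each target task and merge the rankings round-robin. A minimal sketch of that merge is below; the round-robin rule is an illustrative choice here, not necessarily the paper's exact procedure, and `per_task_rankings` is a hypothetical input (one similarity-sorted index list per task).

```python
# Hedged sketch: merge per-task rankings into one selected set by taking
# items round-robin (best-first) and deduplicating. `per_task_rankings`
# is a hypothetical input: one similarity-sorted list of pool indices per
# downstream task. The merge rule is illustrative, not necessarily the
# paper's exact procedure.
def round_robin_merge(per_task_rankings, k):
    selected, seen = [], set()
    max_depth = max((len(r) for r in per_task_rankings), default=0)
    for depth in range(max_depth):
        for ranking in per_task_rankings:
            if depth < len(ranking) and ranking[depth] not in seen:
                seen.add(ranking[depth])
                selected.append(ranking[depth])
                if len(selected) == k:
                    return selected
    return selected
```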
3/8 We test a variety of data selection methods on these pools.

We select 10k samples from a downsampled pool of 200k samples, and then test selecting 10k samples from all 5.8M samples. Surprisingly, many methods drop in performance when the pool size increases!
March 4, 2025 at 5:10 PM
2/8 We begin by constructing data pools for selection, using Tulu 2/3 as a starting point. These pools contain over 4 million samples – all data initially considered for the Tulu models. We then perform selection and evaluation across seven different downstream tasks.
March 4, 2025 at 5:10 PM
How well do data-selection methods work for instruction-tuning at scale?

Turns out, when you look at large, varied data pools, lots of recent methods lag behind simple baselines, and a simple embedding-based method (RDS) does best!

More below ⬇️ (1/8)
March 4, 2025 at 5:10 PM
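For readers who want the gist of RDS-style selection: embed every pool example and a handful of task/query examples, score each pool example by its cosine similarity to the queries, and keep the top k. A minimal sketch below, where an off-the-shelf sentence embedder stands in for the paper's actual embedding model and mean similarity over query points is an assumed aggregation; the model name and k are placeholders.

```python
# Minimal sketch of embedding-based selection in the spirit of RDS.
# An off-the-shelf sentence embedder stands in for the paper's actual
# embedding model, and "mean similarity over query points" is an
# illustrative aggregation choice; model name and k are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_top_k(pool_texts, query_texts, k=10_000,
                 model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    # L2-normalised embeddings so dot products are cosine similarities.
    pool = model.encode(pool_texts, normalize_embeddings=True)
    queries = model.encode(query_texts, normalize_embeddings=True)
    # Score each pool example by its mean similarity to the query set.
    scores = (pool @ queries.T).mean(axis=1)
    top = np.argsort(-scores)[:k]
    return [pool_texts[i] for i in top]
```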
(6/8) Second, we apply classifier guidance with an off-the-shelf reward model (which we call reward guidance). Increasing the weight of the RM guidance improves AlpacaEval winrate. If you set the guidance really high, you get high-reward but nonsensical generations (reward hacking!).
February 20, 2025 at 6:08 PM
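The reward-guidance idea above, as a hedged sketch: at each denoising step, score the model's current soft token distribution with the reward model and nudge the predicted logits along the reward gradient, scaled by a guidance weight. `diffusion_lm` and `reward_model` are stand-ins (the reward model is assumed to accept a soft mixture over tokens), and the update rule is illustrative rather than TESS 2's exact implementation.

```python
# Hedged sketch of reward guidance at a single denoising step. Both
# `diffusion_lm` and `reward_model` are stand-ins: the reward model is
# assumed to accept a soft distribution over tokens so its gradient with
# respect to the logits exists. The update rule and guidance weight are
# illustrative, not TESS 2's exact implementation.
import torch

def guided_denoise_step(diffusion_lm, reward_model, noisy_input, t,
                        guidance_weight=1.0):
    logits = diffusion_lm(noisy_input, t)          # (seq_len, vocab) proposal
    logits = logits.detach().requires_grad_(True)
    probs = torch.softmax(logits, dim=-1)
    reward = reward_model(probs)                   # scalar reward on the soft sequence
    grad = torch.autograd.grad(reward, logits)[0]
    # Nudge logits toward higher reward; very large weights reward-hack.
    return (logits + guidance_weight * grad).detach()
```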
(5/8) First, as we increase diffusion steps, we see GSM8k scores improve consistently! We also see AlpacaEval improve and then decline, as the model's generations get more repetitive.
February 20, 2025 at 6:08 PM
(3/8) We train TESS 2 by (1) performing 200k steps of diffusion adaptation training, and then (2) instruction tuning on Tulu. We found that adapting Mistral models (v0.1/0.3) performed much better than adapting Llama!
February 20, 2025 at 6:08 PM
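A hedged sketch of what one diffusion-adaptation training step might look like: noise the clean token simplex to a random level and train the model to recover the original tokens with cross-entropy. The {-1, 1} scaling and linear noise mix are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of one diffusion-adaptation training step: noise the clean
# token simplex to a random level and train the model to recover the
# original tokens with cross-entropy. The {-1, 1} scaling and linear noise
# mix are assumptions for illustration, not the paper's exact setup.
import torch
import torch.nn.functional as F

def adaptation_step(model, input_ids, vocab_size, optimizer):
    x0 = F.one_hot(input_ids, vocab_size).float() * 2 - 1   # clean "simplex"
    t = torch.rand(())                                       # random noise level
    x_t = (1 - t) * x0 + t * torch.randn_like(x0)            # noised input
    pred_logits = model(x_t, t)                               # predict clean tokens
    loss = F.cross_entropy(pred_logits.view(-1, vocab_size),
                           input_ids.view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```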
(2/8) We find that TESS 2 performs well in QA, but lags in reasoning-heavy tasks (GSM8k, BBH). However, when we train on GSM8k-specific data, we beat AR models!
It may be that instruction-tuning mixtures need to be adjusted for diffusion models (we just used Tulu 2/3 off the shelf).
February 20, 2025 at 6:08 PM
(1/8) Excited to share some new work: TESS 2!
TESS 2 is an instruction-tuned diffusion LM that can perform close to AR counterparts for general QA tasks, trained by adapting from an existing pretrained AR model.
📜 Paper: arxiv.org/abs/2502.13917
🤖 Demo: huggingface.co/spaces/hamis...

More below ⬇️
February 20, 2025 at 6:08 PM
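For context on how a diffusion LM like this generates: start from noise over the vocabulary and iteratively denoise for a fixed number of steps, re-predicting clean token logits at each step, then read off the argmax tokens. A minimal sketch, with `model` as a stand-in and an illustrative noise schedule (not TESS 2's exact sampler).

```python
# Hedged sketch of generation with a diffusion LM: start from noise over
# the vocabulary and iteratively denoise, re-predicting clean logits at
# each step. `model` is a stand-in and the noise schedule is illustrative,
# not TESS 2's exact sampler.
import torch

@torch.no_grad()
def diffusion_generate(model, prompt_ids, seq_len, vocab_size, num_steps=100):
    x_t = torch.randn(seq_len, vocab_size)          # pure noise to start
    for step in reversed(range(num_steps)):
        t = step / num_steps
        pred_logits = model(prompt_ids, x_t, t)     # predicted clean logits
        noise = torch.randn_like(pred_logits)
        x_t = pred_logits + t * noise               # less noise as t -> 0
    return x_t.argmax(dim=-1)                       # final token ids
```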
This was a fun side effort with lots of help from everyone on the Tulu 3 team. Special shoutouts to @vwxyzjn.bsky.social (who did a lot on the training+infra side) and @ljvmiranda.bsky.social (who helped with DPO data generation). I leave you with the *unofficial* name for this release:
January 30, 2025 at 3:38 PM
This is more or less the exact Tulu 3 recipe, with one exception: We just did RLVR on the MATH train set. This almost instantly gave > 5 point gains, as you can see in the RL curves below.
(multiply y-axis by 10 to get MATH test perf)
January 30, 2025 at 3:38 PM
li'l holiday project from the tulu team :)

Scaling up the Tulu recipe to 405B works pretty well! We mainly see this as confirmation that open-instruct scales to large-scale training -- more exciting and ambitious things to come!
January 30, 2025 at 3:38 PM
Seems like a good time to share this: a poster from a class project diving a little deeper into Tulu 3's RLVR. The DeepSeek R1 release today shows that scaling this sort of approach up can be very, very effective!
January 20, 2025 at 7:04 PM
Excited to see Tulu 3 sitting between Llama 3.1 and 3.3 Instruct on the Chatbot Arena leaderboard right now!

Particularly happy it is top 20 for Math and Multi-turn prompts :)

All the details and data on how to train a model this good are right here: arxiv.org/abs/2411.15124
January 8, 2025 at 5:47 PM
New OpenAI RL finetuning API reminds me a lot of RLVR, which we used for Tülu 3 (arxiv.org/abs/2411.15124).

Using RL to train against labels is a simple idea, but very effective (>10pt gains just using GSM8k train set).

It's implemented for you to use in Open-Instruct 😉: github.com/allenai/open...
December 6, 2024 at 8:24 PM
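The "train against labels" part of RLVR boils down to a binary, verifiable reward: 1 if the model's extracted final answer matches the reference, 0 otherwise. A minimal sketch below; the last-number extraction rule is an illustrative assumption, not open-instruct's exact verifier.

```python
# Minimal sketch of a verifiable reward: 1.0 if the model's extracted final
# answer matches the reference label, else 0.0. The "last number in the
# completion" extraction rule is an illustrative assumption, not
# open-instruct's exact verifier.
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == reference_answer.strip() else 0.0

# e.g. fold this into any policy-gradient loop as the scalar reward:
# reward = verifiable_reward(sampled_completion, gold_answer)
```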
Excited to be at #NeurIPS next week in 🇨🇦! Please reach out if you want to chat about LM post-training (Tülu!), data curation, or anything else :)

I'll be around all week, with two papers you should go check out (see image or next tweet):
December 2, 2024 at 6:53 PM
I know it doesn't know much, if anything, about me, but this was surprisingly good!
November 30, 2024 at 5:50 AM
Watching RL training curves is too addictive... begging my models to yap more and get more reward 🙏
November 29, 2024 at 7:35 AM
being unable to do ox math is pretty inexcusable... core concern for olmo 3.
November 28, 2024 at 5:31 AM
there is also PNW-tulu
November 21, 2024 at 5:51 PM
If you love instagram reels, I made a Tulu 3 video for you :)
November 21, 2024 at 5:47 PM
It's just a chill release.
November 21, 2024 at 5:46 PM