Learn more → buff.ly/6xLHLk6
arxiv.org/pdf/2306.08543
agustinus.kristia.de/blog/forward...
openreview.net/pdf?id=3zKta...
- It frees the student to drop modes (the divergence contribution is 0 wherever the student places no probability mass)
- It adds optimization pressure even where the teacher distribution has no coverage (e.g. sampling noise); see the sketch below
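A minimal sketch of the two objectives, assuming the bullets refer to reverse-KL distillation (as in the MiniLLM-style setup linked above), with p the teacher and q_θ the student:

$$
\mathrm{KL}(q_\theta \,\|\, p) \;=\; \sum_x q_\theta(x)\,\log\frac{q_\theta(x)}{p(x)},
\qquad
\mathrm{KL}(p \,\|\, q_\theta) \;=\; \sum_x p(x)\,\log\frac{p(x)}{q_\theta(x)}
$$

In the reverse form, the summand vanishes wherever q_θ(x) = 0 (with the convention 0·log 0 = 0), so dropped teacher modes cost nothing; but wherever p(x) ≈ 0 while q_θ(x) > 0, the log ratio blows up, pushing student mass off regions the teacher does not cover. The forward form behaves the opposite way: it penalizes q_θ(x) → 0 anywhere the teacher has mass, forcing the student to cover every teacher mode.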
1. Flop-efficient, because there is no need for RL-style search or KD-style label generation.
2. Rich in realistic mistakes and particularities of voice
3. Dense training signal