brendan chambers
societyoftrees.bsky.social
Ithaca | prev Chicago | interested in interconnected systems and humans+computers | past and future: academic and industry research | currently: gardening
I have been wondering about this too. I’m still a bit unclear about issues like water table health and waste heat. Andy’s writing is a good reminder of how incredibly water-intensive agriculture is, too
November 14, 2025 at 8:52 PM
Reposted by brendan chambers
🌊 Global Mangrove Watch is using OlmoEarth to refresh mangrove map baselines faster, with higher accuracy & less manual annotation—allowing orgs + governments to respond to threats more quickly.
Learn more → buff.ly/6xLHLk6
November 4, 2025 at 2:53 PM
More work looking into reverse KL in the context of distillation. Missed this at the time, looking forward to reading

arxiv.org/pdf/2306.08543
arxiv.org
October 28, 2025 at 7:05 PM
🤖
October 28, 2025 at 6:37 PM
It was great to have a reason to look more closely at Agarwal et al again. I first saw this work back in my quillbot era, via a great colleague (not naming them without permission)…brought back some good memories from 2023/2024
October 28, 2025 at 6:37 PM
It makes me wonder, has any other work looked at this trick (mixing reverse KL into the loss) during earlier stages of training to mitigate drift in long tail activations? How about work investigating mode-dropping and divergence measures?
October 28, 2025 at 6:37 PM
In the Thinking Machines post, for this post-training stage they discuss reverse KL only. Agarwal et al. suggest interpolating with Jensen-Shannon divergence might be worth exploring too, especially if excessive mode-dropping becomes an issue.
October 28, 2025 at 6:37 PM
In Agarwal et al., the optimal forward/reverse interpolation weight varied across tasks (though it’s a bit risky to interpret ROUGE and BLEU alone, and this might be an artifact of the evaluation strategy), but the best approach was always a mixture, especially when sampling.
October 28, 2025 at 6:37 PM
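To make the divergences concrete, here’s a toy sketch (mine, not from the paper): forward KL, reverse KL, a symmetric Jensen-Shannon divergence, and an alpha-weighted forward/reverse mixture, all on small categorical distributions in NumPy.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions, natural log."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def jsd(p, q, beta=0.5):
    """Generalized Jensen-Shannon divergence with mixing weight beta.
    beta=0.5 recovers the standard (symmetric) JSD."""
    m = beta * np.asarray(p, float) + (1 - beta) * np.asarray(q, float)
    return beta * kl(p, m) + (1 - beta) * kl(q, m)

def mixed_kl(teacher, student, alpha):
    """alpha-weighted interpolation: alpha * reverse KL + (1 - alpha) * forward KL."""
    return alpha * kl(student, teacher) + (1 - alpha) * kl(teacher, student)

teacher = [0.5, 0.3, 0.2]
student = [0.2, 0.3, 0.5]

print(kl(teacher, student))              # forward KL
print(kl(student, teacher))              # reverse KL
print(jsd(teacher, student))             # symmetric JSD
print(mixed_kl(teacher, student, 0.5))   # 50/50 mixture
```

The `eps` smoothing is just to keep the toy numerically safe near zero-mass bins; a real training loss would work in log-space over logits instead.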
A good refresher on forward/reverse KL divergence and notation conventions is:
agustinus.kristia.de/blog/forward...
October 28, 2025 at 6:37 PM
We usually don’t give much thought to how forward KL (and cross entropy) losses fail to directly penalize student mistakes where the teacher distribution has no coverage. The choice of divergence also impacts mode-dropping—super relevant to capacity reduction during distillation.
October 28, 2025 at 6:37 PM
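A quick toy illustration of that coverage asymmetry (my own sketch, not from either writeup): give the student mass on a token the teacher never emits. Forward KL barely notices, since the zero-mass teacher bin contributes nothing; reverse KL penalizes it heavily.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions, natural log."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

teacher = np.array([0.6, 0.4, 0.0])   # teacher never emits token 2
student = np.array([0.5, 0.3, 0.2])   # student puts mass on token 2 anyway

fwd = kl(teacher, student)  # the teacher's zero-mass bin contributes nothing
rev = kl(student, teacher)  # large penalty for student mass where teacher ~ 0
```

So under a pure forward-KL (or cross-entropy) loss, that kind of student hallucination is only penalized indirectly.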
In addition to the issue of on-policy updates, Agarwal et al. also looked at the choice of divergence measure. They compare forward KL, reverse KL, and weighted mixtures interpolating between the two.
October 28, 2025 at 6:37 PM
Overall, the recipe and experiments draw heavily from “On-policy distillation…” by Agarwal et al., ICLR 2024

openreview.net/pdf?id=3zKta...
October 28, 2025 at 6:37 PM
Less discussed, though: their choice of reverse KL is also worth noting.

- It frees the student to drop modes (since divergence = 0 where the student model has no coverage)
- It adds optimization pressure even where the teacher distribution has no coverage (e.g. sampling noise)
October 28, 2025 at 6:37 PM
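The mode-dropping side of that trade-off in a toy sketch (again mine, not from the paper): collapse a student onto one of two teacher modes. Reverse KL stays bounded near log 2, while forward KL blows up as the dropped mode’s student mass shrinks.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions, natural log."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

teacher = np.array([0.5, 0.5])         # two equally likely modes
collapsed = np.array([0.999, 0.001])   # student has dropped the second mode

rev = kl(collapsed, teacher)  # near log 2: dropping a mode is cheap
fwd = kl(teacher, collapsed)  # large, and grows as the dropped mass -> 0
```

That boundedness is exactly why reverse-KL distillation tolerates a lower-capacity student concentrating on a subset of the teacher’s modes.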
This recipe offers some advantages. Most discussed have been:

1. FLOP-efficient, since there’s no need for RL-style search or KD-style label generation

2. Rich in realistic mistakes and particularities of voice

3. Dense training signal
October 28, 2025 at 6:37 PM
agree. i think one thread of that objection is, the style is the content
October 23, 2025 at 7:22 PM