Sagnik Anupam
sagnikanupam.bsky.social
CIS PhD at Penn | MIT CS + Math '24
sagnikanupam.com

PhD student working on AI reasoning in large multimodal models. I design methods to build better models for math, code, visual reasoning, agents, and robotics.
Work done with Lianghuan Huang, Insup Lee, Shuo Li, and @obastani.bsky.social!
October 14, 2025 at 3:27 PM
See the paper (arxiv.org/abs/2510.03515) for more detailed analyses of how off-policy training affects accuracy, training time, and sample staleness!
RAPID: An Efficient Reinforcement Learning Algorithm for Small Language Models
Reinforcement learning (RL) has emerged as a promising strategy for finetuning small language models (SLMs) to solve targeted tasks such as math and coding. However, RL algorithms tend to be resource-...
arxiv.org
October 14, 2025 at 3:16 PM
Our results generalize well to different model sizes (0.5B, 1B, 1.5B) and families (Qwen, Llama, Gemma).
October 14, 2025 at 3:16 PM
For off-policy updates, we incorporate group advantage estimation into the policy gradient algorithm, and derive an importance weighted estimator to correct for the bias arising from off-policy learning.
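A minimal sketch of what such an update could look like, assuming a GRPO-style group normalization and a simple (unclipped) importance-weighted surrogate; the function name and signature are illustrative, not the paper's API:

```python
import torch

def importance_weighted_group_loss(logp_new, logp_old, rewards, eps=1e-8):
    """Group-advantage policy gradient loss with an importance weight
    correcting for completions sampled from a stale (off-policy) model.

    logp_new: log-probs of each completion under the current policy.
    logp_old: log-probs under the policy that generated the completions.
    rewards:  scalar reward per completion. All shapes: (group_size,).
    """
    # Group advantage estimation: normalize rewards within the group
    # of completions sampled for the same prompt.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Importance ratio pi_new / pi_old reweights the off-policy samples;
    # differentiating ratio * adv gives the importance-weighted
    # policy gradient estimator.
    ratio = torch.exp(logp_new - logp_old)
    return -(ratio * adv).mean()
```

When the samples are fresh (`logp_new == logp_old`), the ratio is 1 and this reduces to the ordinary on-policy group-advantage objective.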
October 14, 2025 at 3:16 PM
RL tends to be costly due to the need to perform both inference and backpropagation during training. To maximize use of computational resources, our algorithm performs inference in large batches, and then performs off-policy policy gradient updates in mini-batches.
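The schedule described above can be sketched as follows; this is a toy illustration with hypothetical `generate` and `update` callables, not the paper's actual interface:

```python
import random

def batched_offpolicy_schedule(prompts, generate, update,
                               minibatch_size=32, num_epochs=4):
    """Amortize expensive inference: sample one large rollout batch,
    then reuse it for several off-policy mini-batch gradient updates."""
    # One large inference pass keeps the GPUs saturated during generation.
    rollouts = generate(prompts)  # e.g. list of (prompt, completion, reward)
    steps = 0
    for _ in range(num_epochs):
        random.shuffle(rollouts)
        for i in range(0, len(rollouts), minibatch_size):
            # Each mini-batch step is off-policy relative to the rollouts.
            update(rollouts[i : i + minibatch_size])
            steps += 1
    return steps
```

Reusing each rollout batch for multiple epochs of updates is what makes the later gradient steps off-policy, hence the need for the bias correction described in the thread.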
October 14, 2025 at 3:16 PM
We run all our experiments on only 4 A6000 GPUs, alternating inference and backpropagation, and our algorithm reduces training time by 34% for MBPP+, 32% for MiniF2F, and 11% for MATH when compared to the strongest baseline, while maintaining similar or better accuracy.
October 14, 2025 at 3:16 PM
What do we find? o4-mini deploys a wider variety of strategies to circumvent captchas than other models do. DeepSeek-R1, on the other hand, will consistently claim to close pop-up banners even when it has not done so.
October 14, 2025 at 6:14 AM
To identify failure modes, we have humans label each agent action. We cluster and label these annotations with @transluce.bsky.social's Docent, and discover 3 failure modes that we reproduce and study at scale: captcha resolution, pop-up banner removal, and direct navigation to URLs.
October 14, 2025 at 6:14 AM
Example user-submitted task: “Find me the last available train from Cardiff Central to Barry Docks station today on trainline”

[DeepSeek-R1 GIF]
October 14, 2025 at 6:14 AM