Sagnik Anupam
sagnikanupam.bsky.social
CIS PhD at Penn | MIT CS + Math '24
sagnikanupam.com

PhD student working on AI reasoning in large multimodal models. I design methods to build better models for math, code, visual reasoning, agents, and robotics.
Work done with Lianghuan Huang, Insup Lee, Shuo Li, and @obastani.bsky.social!
October 14, 2025 at 3:27 PM
See the paper (arxiv.org/abs/2510.03515) for more detailed analyses of how off-policy training affects accuracy, training time, and sample staleness!
RAPID: An Efficient Reinforcement Learning Algorithm for Small Language Models
Reinforcement learning (RL) has emerged as a promising strategy for finetuning small language models (SLMs) to solve targeted tasks such as math and coding. However, RL algorithms tend to be resource-...
arxiv.org
October 14, 2025 at 3:16 PM
Our results generalize well to different model sizes (0.5B, 1B, 1.5B) and families (Qwen, Llama, Gemma).
October 14, 2025 at 3:16 PM
For off-policy updates, we incorporate group advantage estimation into the policy gradient algorithm, and derive an importance weighted estimator to correct for the bias arising from off-policy learning.
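A minimal sketch of what such an update could look like, assuming a GRPO-style group normalization and a simple (unclipped) importance-weighted surrogate; the function name and signature are illustrative, not the paper's API:

```python
import torch

def importance_weighted_group_loss(logp_new, logp_old, rewards, eps=1e-8):
    """Group-advantage policy gradient loss with an importance weight
    correcting for completions sampled from a stale (off-policy) model.

    logp_new: log-probs of each completion under the current policy.
    logp_old: log-probs under the policy that generated the completions.
    rewards:  scalar reward per completion. All shapes: (group_size,).
    """
    # Group advantage estimation: normalize rewards within the group
    # of completions sampled for the same prompt.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Importance ratio pi_new / pi_old reweights the off-policy samples;
    # differentiating ratio * adv gives the importance-weighted
    # policy gradient estimator.
    ratio = torch.exp(logp_new - logp_old)
    return -(ratio * adv).mean()
```

When the samples are fresh (`logp_new == logp_old`), the ratio is 1 and this reduces to the ordinary on-policy group-advantage objective.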
October 14, 2025 at 3:16 PM
RL tends to be costly due to the need to perform both inference and backpropagation during training. To maximize use of computational resources, our algorithm performs inference in large batches, and then performs off-policy policy gradient updates in mini-batches.
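The schedule described above can be sketched as follows; this is a toy illustration with hypothetical `generate` and `update` callables, not the paper's actual interface:

```python
import random

def batched_offpolicy_schedule(prompts, generate, update,
                               minibatch_size=32, num_epochs=4):
    """Amortize expensive inference: sample one large rollout batch,
    then reuse it for several off-policy mini-batch gradient updates."""
    # One large inference pass keeps the GPUs saturated during generation.
    rollouts = generate(prompts)  # e.g. list of (prompt, completion, reward)
    steps = 0
    for _ in range(num_epochs):
        random.shuffle(rollouts)
        for i in range(0, len(rollouts), minibatch_size):
            # Each mini-batch step is off-policy relative to the rollouts.
            update(rollouts[i : i + minibatch_size])
            steps += 1
    return steps
```

Reusing each rollout batch for multiple epochs of updates is what makes the later gradient steps off-policy, hence the need for the bias correction described in the thread.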
October 14, 2025 at 3:16 PM
We run all our experiments on only 4 A6000 GPUs, alternating inference and backpropagation, and our algorithm reduces training time by 34% for MBPP+, 32% for MiniF2F, and 11% for MATH when compared to the strongest baseline, while maintaining similar or better accuracy.
October 14, 2025 at 3:16 PM
What do we find? o4-mini deploys a wider variety of strategies to circumvent captchas than other models do. DeepSeek-R1, on the other hand, will consistently claim to close pop-up banners even when it has not done so.
October 14, 2025 at 6:14 AM
To identify failure modes, we have humans label each agent action. We cluster and label these annotations with @transluce.bsky.social's Docent, and discover 3 failure modes that we reproduce and study at scale: captcha resolution, pop-up banner removal, and direct navigation to URLs.
October 14, 2025 at 6:14 AM
Example user-submitted task: “Find me the last available train from Cardiff Central to Barry Docks station today on trainline”

[DeepSeek-R1 GIF]
October 14, 2025 at 6:14 AM