Alexander Johansen
@alexrosejo.bsky.social
Stanford CS PhD student | Recovering deep learning practitioner, now doing proofs instead of parameter sweeps. Also I like birds
This is an exciting step forward for theoretical RL. If you're interested in math for RL, there are plenty of opportunities to extend the analysis to PPO or to value-based methods. You can always reach out; we have lots of ideas.
May 20, 2025 at 7:48 PM
Without modifying TD(0), just for the analysis, we break the sampling trajectory into blocks that, using mixing, are approximately independent. With a Bernstein-style inequality and some martingale tricks, we can bound the Markov noise.
May 20, 2025 at 7:48 PM
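A rough illustration of the blocking idea (my own toy sketch, not the paper's construction): take one long, correlated trajectory from a fast-mixing two-state chain, and compare statistics of blocks of different lengths. Once the block length exceeds the mixing time, adjacent block means are nearly uncorrelated, which is what lets i.i.d.-style (Bernstein) concentration kick in at the block level.

```python
import numpy as np

# Toy sketch of blocking: a two-state Markov chain that flips with
# probability 0.2 -- correlated step-to-step, but fast-mixing.
rng = np.random.default_rng(1)
p_flip, T = 0.2, 200_000
x = np.empty(T)
s = 0
for t in range(T):
    x[t] = s
    if rng.random() < p_flip:
        s = 1 - s

# Correlation between adjacent block means shrinks as blocks get longer
# than the chain's mixing time.
corrs = {}
for block_len in (1, 5, 25):
    n_blocks = T // block_len
    block_means = x[: n_blocks * block_len].reshape(n_blocks, block_len).mean(axis=1)
    corrs[block_len] = np.corrcoef(block_means[:-1], block_means[1:])[0, 1]
    print(f"block_len={block_len:2d}  corr(adjacent blocks)={corrs[block_len]:+.3f}")
```

For `block_len=1` the lag-1 correlation is about 1 - 2(0.2) = 0.6; by `block_len=25` it is close to zero, so the block sums can be treated as approximately independent in the analysis.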
If the data is i.i.d., we've known since the 90s that TD(0) converges. So the crux of the problem: what is correlated data, and how do you model it? Welcome to Markov chains and ergodic theory.
May 20, 2025 at 7:48 PM
Say you have a robot vacuum cleaner and you want to know where it is. At timestep t=2 it probably hasn't moved much, but eventually, at a high enough t, it could be anywhere in your apartment. We describe this through "mixing" or "coupling" times.
May 20, 2025 at 7:48 PM
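The vacuum-cleaner picture can be made concrete with a tiny example of my own (not from the thread): a lazy random walk on four rooms arranged in a cycle. Starting from a known room, the distribution over rooms converges to the stationary (here uniform) distribution, and the total-variation distance decays geometrically in t.

```python
import numpy as np

# Lazy random walk on 4 rooms in a cycle: stay put w.p. 1/2,
# else move to one of the two adjacent rooms.
P = np.array([
    [0.50, 0.25, 0.25, 0.00],
    [0.25, 0.50, 0.00, 0.25],
    [0.25, 0.00, 0.50, 0.25],
    [0.00, 0.25, 0.25, 0.50],
])
pi = np.full(4, 0.25)        # stationary distribution (P is doubly stochastic)

mu = np.array([1.0, 0.0, 0.0, 0.0])   # at t=0 we know exactly where it is
tvs = {}
for t in range(1, 21):
    mu = mu @ P
    tvs[t] = 0.5 * np.abs(mu - pi).sum()   # total-variation distance
for t in (1, 5, 20):
    print(f"t={t:2d}  TV distance to stationary = {tvs[t]:.6f}")
# t= 1  TV distance to stationary = 0.250000
# t= 5  TV distance to stationary = 0.015625
# t=20  TV distance to stationary = 0.000000
```

Here TV distance is exactly (1/2)^(t+1): after a handful of steps the "vacuum" is essentially equally likely to be in any room, no matter where it started. The mixing time is the t at which this distance first drops below a chosen threshold.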
TD(0) is a bootstrapping algorithm for evaluating a policy: you estimate the value of the next state and update your current estimate toward it. But what if your estimate is wrong? And most RL data is highly correlated.
May 20, 2025 at 7:48 PM