Harley Wiltzer
harwiltz.bsky.social
PhD student at Mila / McGill. Studying distributional RL for transfer across risk-sensitive utilities, and for long-horizon high-frequency decision-making.
December 12, 2024 at 10:44 PM
There's an Easter egg after the 1024th iteration
December 9, 2024 at 10:31 PM
Thanks a lot :D
December 9, 2024 at 7:20 PM
For feature dimensions larger than 1, things get tricky: projecting distributions onto finite representations can be expensive, and sample-based updates can be biased. We present new methods using *randomized projections* and *signed measures* to overcome these issues.
December 9, 2024 at 3:30 PM
This is closely related to our recent work on the Distributional Successor Measure (arxiv.org/abs/2402.08530). We strengthen the analysis to tractable projected DP and TD algorithms, and provide convergence rates as a function of the return distribution resolution & feature dim.
December 9, 2024 at 3:30 PM
We learn the joint distribution over SFs in RL. Whereas SFs enable 0-shot transfer of value functions across a finite-dimensional class of reward functions, distributional SFs enable 0-shot generalization of return *distribution* functions across the class.
December 9, 2024 at 3:30 PM
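For readers new to SFs: here is a minimal sketch of the classic (expected) successor-feature transfer that the post generalizes. Everything here is an illustrative assumption, a toy tabular chain with a fixed policy, not the paper's setup; distributional SFs replace the expectation with the full return distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 5-state Markov chain with tabular features phi(s) = e_s.
# Successor features for a fixed policy: Psi = (I - gamma P)^{-1} Phi,
# so for ANY reward r = Phi @ w we get its value function for free:
# V_w = Psi @ w. That is the 0-shot transfer property.
n, gamma = 5, 0.9
P = rng.dirichlet(np.ones(n), size=n)              # random transition matrix
Phi = np.eye(n)                                    # tabular state features
Psi = np.linalg.solve(np.eye(n) - gamma * P, Phi)  # successor features

w = rng.normal(size=n)                             # a new reward's weights
V_sf = Psi @ w                                     # 0-shot value via SFs
V_direct = np.linalg.solve(np.eye(n) - gamma * P, Phi @ w)  # ground truth
print(np.allclose(V_sf, V_direct))
```

The same `Psi` serves every reward in the span of the features, which is exactly what makes SFs (and their distributional analogue) a transfer mechanism.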
The rescaled superiority also preserves consistent action rankings for any distortion risk measure. We design DRL algorithms from these insights, and demonstrate that they are much more robust in a high-frequency option trading domain, *especially* with risk-sensitive utilities.
December 9, 2024 at 2:46 PM
By *rescaling* the superiority, we can preserve *distributional action gaps* at high frequency. However, these gaps collapse at a slower sqrt(h) rate! Consequently, we discover that Baird's rescaled advantage has unbounded variance, making it tough to estimate in stochastic MDPs.
December 9, 2024 at 2:46 PM
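The variance blow-up mentioned above can be seen in a few lines. A toy Monte Carlo sketch with an assumed Gaussian reward rate and next-state value (none of these numbers come from the paper): as h shrinks, the one-step estimate of the rescaled advantage (Q(s,a) - V(s)) / h carries a (gamma * V(s') - V(s)) / h term whose standard deviation grows roughly like 1/h.

```python
import numpy as np

rng = np.random.default_rng(0)

def rescaled_advantage_samples(h, n=100_000, v=10.0, sigma=1.0):
    """One-step samples of the rescaled advantage (Q - V) / h.

    Assumed toy model: reward rate r ~ N(1, 1), discount gamma = exp(-h),
    noisy next-state value V(s') ~ N(v, sigma^2).
    """
    r = rng.normal(1.0, 1.0, size=n)           # stochastic reward rate
    v_next = rng.normal(v, sigma, size=n)      # noisy next-state value
    q_samples = h * r + np.exp(-h) * v_next    # one-step return target
    return (q_samples - v) / h                 # rescaled advantage samples

for h in [1.0, 0.1, 0.01]:
    std = rescaled_advantage_samples(h).std()
    print(f"h={h:5}: std of estimate = {std:.1f}")
```

Cutting h by 10x inflates the standard deviation by roughly 10x, which is the estimation difficulty in stochastic MDPs that the post refers to.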
Towards solving this problem, we define the *superiority* as a probabilistic analogue of the advantage. Our axiomatic characterization of the superiority admits a simple and natural representation, despite the fact that superiority samples cannot be observed.
December 9, 2024 at 2:46 PM
Q-Learning at high frequency fails, since action values differ by a quantity proportional to h, the amount of time between actions.

What about return distributions? We show that action-conditioned distributions also collapse, but different statistics collapse at different rates.
December 9, 2024 at 2:46 PM
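The h-proportional action gap is easy to see in a toy model (my own illustrative setup, not the paper's): two actions that differ only in reward rate over one decision period of length h, after which both reach the same state.

```python
import numpy as np

def action_gap(h, beta=1.0, v=10.0):
    """Q-value gap between two actions separated by time step h.

    Assumed toy model: action a earns reward rate 1, action b earns 0,
    both then reach the same state with value v, discounted by
    gamma = exp(-beta * h). The gap is h * (1 - 0): linear in h.
    """
    gamma = np.exp(-beta * h)
    q_a = 1.0 * h + gamma * v   # better action for one period of length h
    q_b = 0.0 * h + gamma * v   # worse action, same continuation value
    return q_a - q_b

for h in [1.0, 0.1, 0.01]:
    print(f"h={h:5}: gap = {action_gap(h)}")
```

As h goes to 0 the gap vanishes, so value estimates must be increasingly precise to rank actions, which is why naive Q-learning degrades at high decision frequency.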
This was the result of a fantastic collaboration with the OT wizard Yash Jhaveri, Marc G. Bellemare, David Meger, and @patrickshafto.bsky.social.

Paper: arxiv.org/abs/2410.11022
#NeurIPS2024 poster: neurips.cc/virtual/2024...
December 9, 2024 at 2:46 PM