David Snyder
@dasny25.bsky.social
PhD Student in the IRoM Lab at Princeton University, working on safety and generalization assurances for robots.
Maybe Kahneman’s System 1 vs. System 2 distinction is apt? As in, it’s some kind of apotheosis of System-1 thinking, which a person could reasonably define as fundamentally distinct from “reasoning” per se?

Per the last part, though, System-1 thinking works a lot of the time…
June 12, 2025 at 10:03 PM
(13/13) Very grateful to Haruki Nishimura and Masha Itkina in the TLU (Trustworthy Learning under Uncertainty) team at the Toyota Research Institute (TRI), as well as many additional collaborators at TRI and in the IRoM Lab at Princeton!
May 9, 2025 at 8:07 PM
(12/13) For more information:

Project Page: tri-ml.github.io/step/
Paper: www.arxiv.org/abs/2503.10966
Code: coming soon! (link on project page)
May 9, 2025 at 8:07 PM
(11/13) STEP can be thought of as a sequentialized, resource-aware version of Barnard’s Test, improving small-sample efficiency over state-of-the-art (SOTA) sequential methods in the literature, including classical work by Lai and recent work on safe, anytime-valid inference (SAVI).
May 9, 2025 at 8:02 PM
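For context on the SAVI family mentioned above, here is one minimal anytime-valid test in that spirit: a betting-style e-process on discordant pairs. This is an illustrative sketch, not STEP itself, and the fixed betting fraction `lam` is a simplification. Under H0 (the new policy is no better), a discordant pair favors the new policy with probability at most 1/2, so the wealth process is a nonnegative supermartingale and, by Ville’s inequality, stopping when wealth reaches 1/alpha is a level-alpha test at any stopping time.

```python
# Minimal SAVI-style sequential test (illustrative; NOT the STEP
# construction). Paired rollouts; we bet only on discordant pairs.
# Under H0 (new policy no better), P(new wins | discordant) <= 1/2, so
# `wealth` is a nonnegative supermartingale and, by Ville's inequality,
# P(wealth ever reaches 1/alpha) <= alpha.
import random

def sequential_compare(pairs, alpha=0.05, lam=1.0):
    """pairs: iterable of (new_success, base_success) booleans.
    lam in [0, 2] is a fixed betting fraction (a simplification)."""
    wealth, n = 1.0, 0
    for x_new, x_base in pairs:
        n += 1
        if x_new == x_base:
            continue                        # concordant pairs carry no signal
        win = 1.0 if x_new else 0.0         # did the new policy win this pair?
        wealth *= 1.0 + lam * (win - 0.5)   # E[factor] <= 1 under H0
        if wealth >= 1.0 / alpha:
            return "improvement detected", n
    return "no decision", n

# Toy usage: the new policy is genuinely better (0.8 vs. 0.5 success rate).
random.seed(0)
pairs = [(random.random() < 0.8, random.random() < 0.5) for _ in range(300)]
print(sequential_compare(pairs))
```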
(10/13) STEP constructs decision rules by solving an offline convex optimization problem, which yields near-optimal multidimensional decision boundaries for Nmax up to ~500-1000. During evaluation, STEP can be used almost like a look-up table!
May 9, 2025 at 8:01 PM
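One way to read “almost like a look-up table”: the offline program tabulates a verdict for every reachable state (n, successes_new, successes_base), and the online loop just indexes it after each paired rollout. Everything below is hypothetical; the toy boundary is NOT statistically valid and the table format is assumed rather than taken from the STEP release. Only the mechanics are the point.

```python
# Hypothetical illustration of table-lookup evaluation. The toy boundary
# below is NOT statistically valid; a real STEP table would come from
# the offline convex program. The point is the mechanics: each paired
# rollout is followed by a constant-time dictionary lookup.
import random

N_MAX = 50
random.seed(0)

# Sparse "decision table": states not present mean "continue".
decision = {(n, s_new, s_base): "improvement detected"
            for n in range(1, N_MAX + 1)
            for s_new in range(n + 1)
            for s_base in range(n + 1)
            if s_new - s_base > 3 + n // 5}     # arbitrary toy boundary

def run_rollout(success_prob):
    """Stand-in for an expensive hardware rollout (1 = success)."""
    return int(random.random() < success_prob)

s_new = s_base = 0
for n in range(1, N_MAX + 1):
    s_new += run_rollout(0.8)                   # new policy
    s_base += run_rollout(0.5)                  # baseline
    verdict = decision.get((n, s_new, s_base), "continue")
    if verdict != "continue":
        print(f"stopped at n={n}: {verdict}")
        break
else:
    print(f"reached N_MAX={N_MAX} with no decision")
```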
(9/13) Why Nmax?

Policy evaluation is expensive, due to limited hardware availability and limited resources for human supervision. STEP near-optimally accounts for this practical constraint, and gives the evaluator significant leeway to set a conservative Nmax.
May 9, 2025 at 7:58 PM
(8/13) Because STEP is sequential, instead of the batch size N, the evaluator sets Nmax: the greatest number of rollouts (per policy) they are willing to run in order to detect an improvement. STEP then automatically adapts the stopping time to the difficulty of the problem.
May 9, 2025 at 7:57 PM
(7/13) STEP acts as a statistically rigorous evaluation procedure which adapts to the difficulty of the specific comparison instance. In essence, it is a principled way to allow the evaluator to 'peek at the data' without compromising statistical assurances!
May 9, 2025 at 7:56 PM
(6/13) Yes!

We propose STEP, a sequential test which aggregates evaluation rollouts one-by-one and stops automatically when a desired significance level is reached. It stops quickly when the performance gap is large, and waits if the gap is small.
May 9, 2025 at 7:55 PM
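The adaptive behavior can be seen even with the simple betting e-process sketched above (again, an illustrative stand-in, not STEP): mean stopping time grows sharply as the true gap shrinks.

```python
# Stopping time adapts to the gap: with the simple betting e-process
# (an illustrative stand-in, not STEP), larger true gaps stop much sooner.
import random

def stop_time(p_new, p_base, alpha=0.05, lam=0.5, n_max=2000):
    wealth, n = 1.0, 0
    while n < n_max:
        n += 1
        x_new, x_base = random.random() < p_new, random.random() < p_base
        if x_new == x_base:
            continue                                # no signal in this pair
        wealth *= 1.0 + lam * ((1.0 if x_new else 0.0) - 0.5)
        if wealth >= 1.0 / alpha:
            return n                                # significance reached
    return n_max                                    # budget exhausted

random.seed(0)
for p_new in (0.9, 0.7, 0.6):
    times = [stop_time(p_new, p_base=0.5) for _ in range(200)]
    print(f"gap {p_new - 0.5:.1f}: mean stop ~{sum(times) / len(times):.0f} paired rollouts")
```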
(5/13) This induces costly inefficiencies. Choosing a large N means that many unnecessary trials must be run on weak baselines; conversely, choosing a small N risks failing to accumulate sufficiently compelling evidence of an improvement.

Can we do better?
May 9, 2025 at 7:52 PM
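A quick Monte Carlo makes the tradeoff concrete: with a true gap of 0.7 vs. 0.5, a small fixed N rarely reaches significance, while the N needed for reliable detection is wasteful whenever the gap is actually large. Fisher’s exact test stands in for Barnard’s here purely for speed; the qualitative picture is the same.

```python
# Monte Carlo power of a fixed-N batch test for a true gap of 0.7 vs 0.5.
# Fisher's exact test stands in for Barnard's purely for speed.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
p_new, p_base, alpha, sims = 0.7, 0.5, 0.05, 1000

for N in (10, 30, 100):
    hits = 0
    for _ in range(sims):
        s_new = rng.binomial(N, p_new)
        s_base = rng.binomial(N, p_base)
        table = [[s_new, N - s_new], [s_base, N - s_base]]
        # one-sided: is the new policy's success rate higher?
        if fisher_exact(table, alternative="greater")[1] < alpha:
            hits += 1
    print(f"N={N:3d}: power ~{hits / sims:.2f}")
```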
(4/13) … because acting on any observation of partial results invalidates statistical assurances of the test. In other words: stopping early because the results appear ‘promising enough’ or running additional trials beyond the allotted N breaks the statistical guarantee!
May 9, 2025 at 7:52 PM
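This inflation is easy to check numerically. In the self-contained simulation below (Fisher’s exact test used in place of Barnard’s for speed; the effect is the same for any fixed-sample test), both policies are identical, yet stopping at the first interim p < 0.05 rejects far more often than 5% of the time.

```python
# Optional stopping breaks a fixed-sample test: under H0 (identical
# policies), peeking every 10 trials and stopping at the first p < 0.05
# rejects substantially more than 5% of the time.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
alpha, p_true, n_sims = 0.05, 0.6, 2000

false_positives = 0
for _ in range(n_sims):
    a = rng.random(100) < p_true          # policy A successes (H0: same rate)
    b = rng.random(100) < p_true          # policy B successes
    for n in range(10, 101, 10):          # "peek" after every 10th trial
        table = [[int(a[:n].sum()), n - int(a[:n].sum())],
                 [int(b[:n].sum()), n - int(b[:n].sum())]]
        if fisher_exact(table)[1] < alpha:
            false_positives += 1          # stopped early on a "promising" p
            break

print(f"false-positive rate with peeking: {false_positives / n_sims:.3f}")
# Well above the nominal 0.05: the fixed-N guarantee does not survive peeking.
```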
(3/13) The standard evaluation procedure in robotics is batch testing: run N trials of each policy, then apply a statistical test (e.g., Barnard’s Test). This requires the evaluator to choose N prior to the experiment and stick to it.

But this is very limiting...
May 9, 2025 at 7:51 PM
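A minimal sketch of this standard batch procedure, using SciPy’s implementation of Barnard’s exact test; the counts are made up for illustration.

```python
# Sketch of the standard batch procedure: fix N in advance, run N
# rollouts per policy, then apply Barnard's exact test to the 2x2 table.
# Counts are made up for illustration.
from scipy.stats import barnard_exact

N = 50                                # chosen BEFORE any data are seen
s_new, s_base = 38, 29                # successes out of N for each policy

table = [[s_new, N - s_new],          # rows: policies
         [s_base, N - s_base]]        # cols: (successes, failures)

res = barnard_exact(table)            # two-sided by default
print(f"p-value = {res.pvalue:.4f}")
# The guarantee holds only if N was fixed in advance and the test is run
# exactly once: no interim peeking, no extra trials afterwards.
```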
(2/13) Most robotics papers rely on empirical performance gains (i.e., “our method outperforms the baseline”) as evidence of methodological efficacy. Such comparisons must be made rigorous to ensure reproducible science; STEP aims to make them both sound and efficient.
May 9, 2025 at 7:49 PM