Policy evaluation is expensive because hardware availability and human supervision are both limited. STEP accounts for this practical constraint near-optimally, and gives the evaluator significant leeway to set a conservative Nmax, the cap on the number of evaluation rollouts.
We propose STEP, a sequential test that aggregates evaluation rollouts one by one and stops automatically once a desired significance level is reached. It stops quickly when the performance gap is large, and keeps collecting rollouts when the gap is small.
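To make the stopping behavior concrete, here is a minimal Python sketch of a generic sequential comparison. It is not the STEP procedure itself: it swaps in a simple, well-known anytime-valid test (a Beta-mixture likelihood ratio on discordant pairs, with rejection justified by Ville's inequality). The function name `sequential_compare`, the pairing of rollout i of policy A with rollout i of policy B, and the binary success/failure outcomes are all assumptions made for illustration.

```python
import math


def sequential_compare(outcomes_a, outcomes_b, alpha=0.05, n_max=100):
    """Hypothetical sketch of a sequential A/B comparison (not STEP itself).

    outcomes_a, outcomes_b: iterables of 0/1 rollout results, consumed in pairs.
    Returns (decision, n_used), where decision is "A", "B", or "undecided".
    """
    wins_a = wins_b = 0
    threshold = math.log(1.0 / alpha)  # reject once the evidence reaches 1/alpha
    n_used = 0
    for a, b in zip(outcomes_a, outcomes_b):
        if n_used >= n_max:
            break  # evaluation budget exhausted
        n_used += 1
        # Only discordant pairs carry information about which policy is better.
        if a and not b:
            wins_a += 1
        elif b and not a:
            wins_b += 1
        # Log of the mixture likelihood ratio against "both policies equal":
        # integral of t^wins_a * (1 - t)^wins_b over a uniform prior on t,
        # divided by 0.5^(wins_a + wins_b).
        log_evidence = (
            math.lgamma(wins_a + 1)
            + math.lgamma(wins_b + 1)
            - math.lgamma(wins_a + wins_b + 2)
            + (wins_a + wins_b) * math.log(2.0)
        )
        if log_evidence >= threshold:
            return ("A" if wins_a > wins_b else "B"), n_used
    return "undecided", n_used
```

In this sketch, if every pair favors policy B, the test stops after 8 pairs at alpha = 0.05, because 2^8 / 9 ≈ 28.4 is the first value to reach 1 / alpha = 20. If the two policies always produce the same outcome, it runs until `n_max` and returns "undecided".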
Accepted at RSS 2025: tri-ml.github.io/step/