Per the last part, though, Type-1 works a lot of the time…
Project Page: tri-ml.github.io/step/
Paper: www.arxiv.org/abs/2503.10966
Code: coming soon! (link on project page)
Policy evaluation is expensive, due to limited hardware availability and limited resources for human supervision. STEP near-optimally accounts for this practical constraint, and gives the evaluator significant leeway to set a conservative Nmax.
We propose STEP, a sequential test that aggregates evaluation rollouts one by one and stops automatically when a desired significance level is reached. It stops quickly when the performance gap is large, and waits if the gap is small.
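To make the "stop early when the gap is large" idea concrete, here is a rough sketch of a generic sequential test on paired rollouts. This is not STEP's actual procedure (see the paper for that); it swaps in a plain likelihood-ratio test on discordant pairs in the style of Wald's SPRT. The function name, the alternative `theta1`, and the default `n_max` are all illustrative assumptions.

```python
import math

def sequential_paired_test(outcomes, alpha=0.05, theta1=0.75, n_max=200):
    """Hypothetical sketch, NOT the STEP algorithm.

    outcomes: iterable of (baseline_success, new_success) booleans,
              one pair per evaluation trial.
    Null hypothesis: among trials where exactly one policy succeeds,
    the new policy wins with probability <= 0.5.
    theta1 is an assumed alternative win probability (must be > 0.5).
    Returns (decision, number_of_trials_used).
    """
    threshold = math.log(1.0 / alpha)  # reject once the likelihood ratio >= 1/alpha
    log_lr = 0.0
    n = 0
    for base_ok, new_ok in outcomes:
        if n >= n_max:  # evaluation budget exhausted
            break
        n += 1
        if base_ok == new_ok:
            continue  # both succeeded or both failed: no evidence either way
        if new_ok:
            log_lr += math.log(theta1 / 0.5)
        else:
            log_lr += math.log((1.0 - theta1) / 0.5)
        if log_lr >= threshold:
            return "new policy better", n
    return "no decision", n

# Large gap: the new policy wins every trial, so the test stops early.
print(sequential_paired_test([(False, True)] * 50))
# No gap: wins and losses alternate, so the test runs until the budget is spent.
print(sequential_paired_test([(False, True), (True, False)] * 100, n_max=50))
```

With these defaults, the first call stops after 8 trials, while the second uses all 50 trials in its budget without reaching a decision. Because the likelihood ratio is checked against a fixed 1/alpha threshold at every step, looking after each rollout does not inflate the Type-1 error beyond alpha in this sketch.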
We propose STEP, a sequential test which aggregates evaluation rollouts one-by-one and stops automatically when a desired significance level is reached. It stops quickly when the performance gap is large, and waits if the gap is small.
Can we do better?
But this is very limiting...