(RL) Experiments That Matter: Bootstrapping, p-values, & Power

How to create statistically significant RL experiments.

RL Statistics

Before learning about RL, my impression was that every result was cherry-picked. This is still occasionally the case, but many papers have now introduced significance metrics to guide interpretation. Central to interpretable, statistically significant experiments are the concepts of bootstrapping, p-values, and power.

Motivating the Bootstrap

We’ll use a toy example throughout this post. Let’s say we’re trying to find the average height of a population. We randomly sample 1000 individuals and measure their heights. Then, we calculate the average height of the sample, $\bar{x}$. How do we know if $\bar{x}$ is a good estimate of the population average $\mu$?

The Central Limit Theorem

The Central Limit Theorem tells us that if we have a sequence of i.i.d. random variables $\{X_1, X_2, \ldots, X_n\}$ drawn from a distribution with finite mean $\mu$ and finite variance $\sigma^2$, and $n$ is sufficiently large, then the distribution of the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ is approximately Gaussian with mean $\mathbb{E}[\bar{X}] = \mu$ and variance $\operatorname{Var}[\bar{X}] = \frac{\sigma^2}{n}$.

Luckily, because we know the distribution of $\bar{X}$ is approximately Gaussian, we can quantify how confident we are in our estimate of $\mu$. The standard deviation of the sample mean (the standard error) is simply the square root of its variance: $\sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n}}$.
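To make this concrete, here is a small simulation. The population parameters (mean 170 cm, standard deviation 10 cm) are invented for illustration; the point is that the empirical spread of many sample means matches the CLT's prediction of $\sigma / \sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of heights: mean 170 cm, std 10 cm.
mu, sigma, n = 170.0, 10.0, 1000

# Draw many independent samples of size n and record each sample mean.
sample_means = np.array([
    rng.normal(mu, sigma, size=n).mean() for _ in range(5000)
])

# The CLT predicts the sample mean has standard deviation sigma / sqrt(n).
predicted_se = sigma / np.sqrt(n)
empirical_se = sample_means.std()

print(f"predicted SE: {predicted_se:.3f}")  # 0.316
print(f"empirical SE: {empirical_se:.3f}")  # close to the prediction
```

With $n = 1000$, a single sample mean is typically within about a third of a centimeter of the true population mean.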

Pretending the Central Limit Theorem Doesn’t Exist

What if we didn’t have the Central Limit Theorem? It seems silly to pretend such a powerful theorem doesn’t exist, but we could’ve chosen a statistic, such as the median, whose sampling distribution the theorem says nothing about. In either case, we turn to the bootstrap.

The Bootstrap
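The idea of the bootstrap is to resample the observed data with replacement many times, recompute the statistic of interest on each resample, and use the spread of those recomputed statistics as an estimate of its sampling distribution. A minimal sketch, using the percentile confidence interval (one common choice among several bootstrap variants; the function and variable names here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(data, statistic, n_resamples=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    data = np.asarray(data)
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample with replacement from the observed data.
        resample = rng.choice(data, size=len(data), replace=True)
        stats[i] = statistic(resample)
    # Take the middle (1 - alpha) fraction of the bootstrap distribution.
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Example: a 95% CI for the median height of a simulated sample,
# with no appeal to the CLT.
heights = rng.normal(170, 10, size=1000)
lo, hi = bootstrap_ci(heights, np.median)
print(f"95% bootstrap CI for the median: [{lo:.1f}, {hi:.1f}]")
```

Note that nothing in `bootstrap_ci` depends on the statistic being a mean: swapping in `np.median` (or any other function of the data) works unchanged.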

Sources

  1. Chris Piech’s CS109 lecture video and notes