(RL) Experiments That Matter: Bootstrapping, p-values, & Power

How to create statistically significant RL experiments.

RL Statistics

Before learning about RL, my impression was that every result was cherry-picked. This is still occasionally the case, but many papers have now introduced significance metrics to guide interpretation. Central to interpretable, statistically significant experiments are the concepts of bootstrapping, p-values, and power.

Motivating the Bootstrap

We’ll use a toy example throughout this post. Let’s say we’re trying to find the average height of a population. We randomly sample 1000 individuals and measure their heights. Then, we calculate the average height of the sample, $\bar{x}$. How do we know if $\bar{x}$ is a good estimate of the population average $\mu$?

The Central Limit Theorem

The Central Limit Theorem tells us that if we have a sequence of i.i.d. random variables $\{X_1, X_2, \ldots, X_n\}$ drawn from a distribution with finite mean $\mu$ and finite variance $\sigma^2$, and $n$ is sufficiently large, then the distribution of the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ is approximately Gaussian with mean $\mathbb{E}[\bar{X}] = \mu$ and variance $\operatorname{Var}[\bar{X}] = \frac{\sigma^2}{n}$.

Luckily, because we know the distribution of $\bar{X}$ is approximately Gaussian, we can quantify how confident we are in our estimate of $\mu$. The standard deviation of the sample mean (the standard error) is simply the square root of its variance: $\sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n}}$.
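To make this concrete, here is a small simulation. The population parameters (mean 170 cm, standard deviation 10 cm) are invented for illustration; the point is that the empirical spread of many sample means matches the CLT's prediction of $\sigma / \sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of heights: mean 170 cm, std 10 cm.
mu, sigma, n = 170.0, 10.0, 1000

# Draw many independent samples of size n and record each sample mean.
sample_means = np.array([
    rng.normal(mu, sigma, size=n).mean() for _ in range(5000)
])

# The CLT predicts the sample mean has standard deviation sigma / sqrt(n).
predicted_se = sigma / np.sqrt(n)
empirical_se = sample_means.std()

print(f"predicted SE: {predicted_se:.3f}")  # 0.316
print(f"empirical SE: {empirical_se:.3f}")  # close to the prediction
```

With $n = 1000$, a single sample mean is typically within about a third of a centimeter of the true population mean.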

Pretending the Central Limit Theorem Doesn’t Exist

What if we didn’t have the Central Limit Theorem? It seems silly to pretend such a powerful theorem doesn’t exist, but we could’ve chosen a statistic, such as the median, whose sampling distribution the theorem says nothing about. In either case, we turn to the bootstrap.

The Bootstrap
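The idea of the bootstrap is to resample the observed data with replacement many times, recompute the statistic of interest on each resample, and use the spread of those recomputed statistics as an estimate of its sampling distribution. A minimal sketch, using the percentile confidence interval (one common choice among several bootstrap variants; the function and variable names here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(data, statistic, n_resamples=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    data = np.asarray(data)
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample with replacement from the observed data.
        resample = rng.choice(data, size=len(data), replace=True)
        stats[i] = statistic(resample)
    # Take the middle (1 - alpha) fraction of the bootstrap distribution.
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Example: a 95% CI for the median height of a simulated sample,
# with no appeal to the CLT.
heights = rng.normal(170, 10, size=1000)
lo, hi = bootstrap_ci(heights, np.median)
print(f"95% bootstrap CI for the median: [{lo:.1f}, {hi:.1f}]")
```

Note that nothing in `bootstrap_ci` depends on the statistic being a mean: swapping in `np.median` (or any other function of the data) works unchanged.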

Sources

  1. Chris Piech’s CS109 lecture video and notes