Bootstrap in Practice: CIs, Hypothesis Tests, and a Crab-Data Case Study

Published

February 23, 2026

1. Introduction: what bootstrap is

Any statistic we compute from a sample (a mean, a median, a regression coefficient, a predicted value) has sampling uncertainty: if we repeated the data collection, we would get a slightly different estimate.

The bootstrap is a general, simulation-based way to measure that uncertainty.

  • Treat the observed data as our best stand-in for the population.
  • Generate many new datasets from it.
  • Recompute the statistic on each dataset.
  • Use the spread of those bootstrap replicates to approximate uncertainty.

2. The two most common bootstrap types

2.1 Nonparametric bootstrap (resample the data)

Resample the observed data with replacement. For i.i.d. data, that usually means resampling rows.

  • Makes minimal distributional assumptions.
  • Very broadly applicable.
  • For hypothesis tests, you still need a way to generate samples under the null (often via centering/transformations, or residual-based bootstraps).

2.2 Parametric bootstrap (simulate from a fitted model)

Assume a parametric model is a reasonable approximation, fit it, then simulate new datasets from the fitted model.

  • Stronger assumptions.
  • Particularly natural for likelihood-based models and likelihood-ratio tests.
  • Often excellent for generating null distributions in hypothesis testing, because you can simulate directly under \(H_0\): the model is assumed correct under \(H_0\), so simulating from the fitted null model directly yields a null distribution for your test statistic. For example, for an independent two-sample t-test you can simulate both groups from \(\mathcal{N}(\hat\mu, \hat\sigma^2)\), with \(\hat\mu\) and \(\hat\sigma^2\) estimated from the pooled sample, since the group means are equal under \(H_0\).
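As a concrete sketch of that last point (illustrative only: `y1` and `y2` are made-up group samples, not data from this post), a parametric bootstrap version of the two-sample test pools the data to estimate the common mean and SD under \(H_0\), then simulates both groups from the same Normal:

```r
set.seed(1)

# Hypothetical two groups (not from this post's data)
y1 <- rnorm(15, mean = 5.0, sd = 1)
y2 <- rnorm(15, mean = 5.3, sd = 1)

# Under H0 (equal means), estimate a common mean and sd from the pooled sample
pooled    <- c(y1, y2)
mu_hat    <- mean(pooled)
sigma_hat <- sd(pooled)

t_obs <- t.test(y1, y2)$statistic

# Simulate both groups from the same fitted null Normal and recompute t
B <- 1000
t_null <- replicate(B, {
  t.test(rnorm(length(y1), mu_hat, sigma_hat),
         rnorm(length(y2), mu_hat, sigma_hat))$statistic
})

p_hat <- (1 + sum(abs(t_null) >= abs(t_obs))) / (B + 1)
```

The same recipe works for any statistic; only the null simulation step changes.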

3. Constructing bootstrap estimates and confidence intervals

Let \(T\) be a statistic (e.g., \(T(x) = \bar x\)). The bootstrap procedure is:

  1. Compute \(T_{obs} = T(x)\).
  2. For \(b=1,\dots,B\):
    • generate a bootstrap dataset \(x^{*(b)}\),
    • compute \(T^{*(b)} = T(x^{*(b)})\).
  3. Use \(\{T^{*(b)}\}\) to estimate SEs and CIs.
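Step 3 relies on two small helpers used throughout this post, `percentile_ci` and `stat_with_ci`, whose definitions are not shown here; minimal versions consistent with the output below (assuming 95% percentile intervals formatted as bracketed strings) would be:

```r
# Minimal sketch of percentile_ci (assumed): a percentile bootstrap CI,
# returned as a "[lower, upper]" string built from the alpha/2 and
# 1 - alpha/2 quantiles of the bootstrap replicates.
percentile_ci <- function(t_star, level = 0.95, digits = 3) {
  alpha <- 1 - level
  q <- quantile(t_star, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  sprintf("[%s, %s]", round(q[1], digits), round(q[2], digits))
}

# Minimal sketch of stat_with_ci (assumed): a point estimate computed from
# the bootstrap replicates, followed by its percentile CI.
stat_with_ci <- function(t_star, stat = mean, level = 0.95, digits = 3) {
  paste(round(stat(t_star), digits), percentile_ci(t_star, level, digits))
}
```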

3.1 Nonparametric bootstrap for a mean (and a median)

set.seed(1)

# Toy sample of 10 crab widths, and the number of bootstrap replicates
x <- c(12.1, 10.4, 11.7, 9.9, 10.8, 13.2, 11.0, 10.6, 12.5, 9.8)
n <- length(x)
B <- 2000

t_obs_mean <- mean(x)
t_obs_median <- median(x)

boot_mean_np <- numeric(B)
boot_median_np <- numeric(B)

for (b in seq_len(B)) {
  x_star <- sample(x, size = n, replace = TRUE)
  boot_mean_np[b] <- mean(x_star)
  boot_median_np[b] <- median(x_star)
}

se_mean_np <- sd(boot_mean_np)
bias_mean_np <- mean(boot_mean_np) - t_obs_mean

ci_mean_np <- percentile_ci(boot_mean_np)
ci_median_np <- percentile_ci(boot_median_np)

stats_df <- data.table(
  "Mean" = t_obs_mean,
  "Bootstrap Mean" = stat_with_ci(boot_mean_np, stat = mean),
  "Bootstrap SE (Mean)" = se_mean_np,
  "Bootstrap Bias (Mean)" = bias_mean_np,
  "Median" = t_obs_median,
  "Bootstrap Median" = stat_with_ci(boot_median_np, stat = mean)
)

# Round only the numeric columns
stats_df <- stats_df %>%
  mutate(across(where(is.numeric), ~ round(., 3)))

# Melt to long format for display
stats_df <- melt(stats_df,
  measure.vars = names(stats_df),
  variable.name = "statistic",
  value.name = "value"
)

data_table(stats_df, pageLength = 10)
Table 1: Bootstrap estimates for the mean and median of the crab widths, with nonparametric resampling.
  • The observed sample mean is \(\bar x =\) 11.2 and the observed median is 10.9.
  • The bootstrap SE (se_mean_np) is the typical sampling fluctuation of the mean across repeated samples of size 10.
  • The bootstrap bias estimate (bias_mean_np) is -0.002. With finite \(B\), this is a noisy estimate; if it’s small relative to the SE, you typically wouldn’t worry about bias.
  • The 95% percentile bootstrap CI for the mean is [10.6, 11.831]. Interpreted literally: if we repeated the whole sampling process many times, about 95% of such intervals would contain the true population mean.
  • The 95% percentile bootstrap CI for the median is [10.25, 12.1]. Comparing the widths of the mean vs median CIs gives you a feel for which statistic is more variable for this sample size.
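Using the interval endpoints reported above, the width comparison in the last bullet is easy to make explicit:

```r
# Widths of the 95% percentile CIs reported above
width_mean_ci   <- 11.831 - 10.60   # mean:   about 1.23
width_median_ci <- 12.10  - 10.25   # median: about 1.85
```

For these data the median’s interval is noticeably wider, consistent with the sample median being the more variable statistic at \(n = 10\).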

3.2 Parametric bootstrap for a mean (Normal model)

If we’re comfortable assuming the data are approximately Normal, we can bootstrap by simulating \(x^* \sim \mathcal{N}(\hat\mu, \hat\sigma^2)\).

mu_hat <- mean(x)
sigma_hat <- sd(x)

boot_mean_par <- numeric(B)
for (b in seq_len(B)) {
  x_star <- rnorm(n, mean = mu_hat, sd = sigma_hat)
  boot_mean_par[b] <- mean(x_star)
}

se_mean_par <- sd(boot_mean_par) |> round(3)
ci_mean_par <- percentile_ci(boot_mean_par) 

data.table(
  `Bootstrap SE (mean)` = se_mean_par,
  `Bootstrap 95% CI (mean)` = ci_mean_par
) |>
data_table()
Table 2: Bootstrap estimates for the mean of the crab widths, with parametric resampling under a Normal model.

How to interpret this output

  • This CI answers the same question as the nonparametric CI, but under the Normal model assumption.
  • The parametric bootstrap 95% CI for the mean is [10.513, 11.92].
  • If the parametric and nonparametric CIs are very similar here, it suggests the Normal approximation is not doing something wildly different from what the data-driven resampling is doing (at least for the mean).

4. Bootstrap hypothesis testing

Bootstrap testing needs one extra ingredient: bootstrap samples must reflect \(H_0\).

General recipe:

  1. Choose a test statistic \(T\).
  2. Compute \(T_{obs}\).
  3. Generate many bootstrap datasets consistent with \(H_0\).
  4. Compute \(T^{*(b)}\) each time.
  5. Approximate the p-value by how often \(T^{*(b)}\) is as extreme as \(T_{obs}\).

This is called a Monte Carlo p-value. The usual small-sample correction (which also avoids getting a p-value of exactly 0) is

\[ \hat p = \frac{1 + \#\{T^* \ge T_{obs}\}}{B+1}, \]

where:

  • you generate \(B\) bootstrap datasets under the null hypothesis \(H_0\) (using a resampling/simulation method that makes the null true), and
  • you compute the test statistic on each null dataset, giving \(T^{*(1)},\dots,T^{*(B)}\).

Right-tailed, left-tailed, and two-sided tests

  • Right-tailed (large \(T\) is evidence against \(H_0\)):

    \[ \hat p = \frac{1 + \#\{T^* \ge T_{obs}\}}{B+1}. \]

  • Left-tailed (small \(T\) is evidence against \(H_0\)):

    \[ \hat p = \frac{1 + \#\{T^* \le T_{obs}\}}{B+1}. \]

  • Two-sided (far from the null target is evidence against \(H_0\)): pick the null target value \(T_0\) (often 0), and count how often you get something at least as far from \(T_0\) as the observed statistic:

    \[ \hat p = \frac{1 + \#\{|T^* - T_0| \ge |T_{obs} - T_0|\}}{B+1}. \]
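These counting rules are exactly what the bootstrap_p_value helper used in the code below needs to implement (its definition is not shown in this post). A minimal version, assuming a right-tailed one-sided rule and a null target of \(T_0 = 0\) for the two-sided case, might be:

```r
# Minimal sketch of bootstrap_p_value (assumed signature): Monte Carlo
# p-value with the +1 small-sample correction.
bootstrap_p_value <- function(t_star, t_obs, two_sided = TRUE, t0 = 0) {
  B <- length(t_star)
  if (two_sided) {
    # count replicates at least as far from the null target t0 as t_obs
    n_extreme <- sum(abs(t_star - t0) >= abs(t_obs - t0))
  } else {
    # right-tailed: large values of T are evidence against H0
    n_extreme <- sum(t_star >= t_obs)
  }
  (1 + n_extreme) / (B + 1)
}
```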

4.1 One-sample mean: bootstrap version of a \(t\) test

We test \(H_0: \mu = \mu_0\) using the usual t-statistic: \[T = \frac{\bar x - \mu_0}{s/\sqrt{n}}.\]

We’ll compare:

  • classical t.test(),
  • a nonparametric bootstrap test under \(H_0\) (by recentering the data),
  • a parametric bootstrap test under \(H_0\) (simulate Normal samples with mean \(\mu_0\)).

These two bootstraps differ only in how they generate samples under \(H_0\):

  • Nonparametric under \(H_0\): recenter the data so it has mean \(\mu_0\), then resample with replacement.
  • Parametric under \(H_0\): assume \(X \sim \mathcal{N}(\mu, \sigma^2)\), estimate \(\sigma\) from the data, then simulate samples from \(\mathcal{N}(\mu_0, \hat\sigma^2)\).

mu0 <- 11

t_classical <- t.test(x, mu = mu0)
p_classical <- unname(t_classical$p.value)

# Observed test statistic
t_obs <- unname(t_classical$statistic)

# Nonparametric bootstrap under H0: recenter then resample
x0 <- x - mean(x) + mu0
t_boot_np <- numeric(B)
for (b in seq_len(B)) {
  xs <- sample(x0, size = n, replace = TRUE)
  t_boot_np[b] <- unname(t.test(xs, mu = mu0)$statistic)
}
p_np <- bootstrap_p_value(t_boot_np, t_obs, two_sided = TRUE)

# Parametric bootstrap under H0: simulate Normal with mean mu0
sigma0_hat <- sd(x)
t_boot_par <- numeric(B)
for (b in seq_len(B)) {
  xs <- rnorm(n, mean = mu0, sd = sigma0_hat)
  t_boot_par[b] <- unname(t.test(xs, mu = mu0)$statistic)
}
p_par <- bootstrap_p_value(t_boot_par, t_obs, two_sided = TRUE)

c(
  classical_t_test = p_classical,
  bootstrap_nonparametric = p_np,
  bootstrap_parametric = p_par
)
       classical_t_test bootstrap_nonparametric    bootstrap_parametric 
              0.5910512               0.5772114               0.5922039 
  • All three numbers are p-values for the same null, \(H_0: \mu = 11\), using the same \(t\)-statistic.
  • The observed test statistic is \(t_{obs} =\) 0.557. The bootstrap p-values estimate \(\Pr(|T^*| \ge |t_{obs}| \mid H_0)\) by simulation.
  • If the p-value is small (commonly < 0.05), the observed statistic would be rare under \(H_0\), so you have evidence against \(H_0\).
  • Don’t expect the three p-values to be identical: they differ in how they generate data under \(H_0\) (exact \(t\) reference vs recenter-and-resample vs Normal simulation).
  • Because the bootstrap p-values are Monte Carlo estimates, they have simulation noise. With 2000 replicates, differences in the 3rd decimal place are usually not meaningful.
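The size of that simulation noise is easy to quantify: \(\hat p\) is a sample proportion over \(B\) draws, so its Monte Carlo standard error is roughly \(\sqrt{\hat p(1-\hat p)/B}\).

```r
# Monte Carlo standard error of the nonparametric bootstrap p-value above
p_hat <- 0.577
B     <- 2000
mc_se <- sqrt(p_hat * (1 - p_hat) / B)   # about 0.011
```

So with \(B = 2000\), a p-value of this size is only pinned down to within about \(\pm 0.02\); increase \(B\) if you need finer resolution.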

5. Advanced bootstrap: likelihood-ratio testing on crab data

This section uses the horseshoe crab dataset from the glm2 package, a classic example for count regression.

data(crabs)

head(crabs) |>
  data_table()
Table 3: First few rows of the horseshoe crab dataset.

In this study, each row corresponds to a female horseshoe crab. The main variables we use are:

  • Satellites: the number of satellite males observed with that female (a count outcome, often with many zeros),
  • Width: the female’s carapace width (a size measure),
  • Dark: an indicator of shell colour (dark vs not).

The scientific question is: after accounting for size (Width), does shell colour (Dark) still help explain how many satellites a female has?

The outcome (Satellites) is a count, so we fit a Poisson GLM:

  • Reduced (null) model: Satellites ~ Width

  • Full model: Satellites ~ Width + Dark

  • Null hypothesis (\(H_0\)): after adjusting for female size (Width), shell colour (Dark) has no association with the expected number of satellite males. In other words, adding Dark does not improve the model.

  • Alternative hypothesis (\(H_1\)): after adjusting for Width, shell colour (Dark) is associated with the expected number of satellite males (so adding Dark improves the model).

In model terms, this is the same as testing whether the Dark coefficient is zero (for the appropriate coding of Dark), typically written as \(H_0: \beta_{Dark} = 0\) versus \(H_1: \beta_{Dark} \ne 0\).

5.1 Classical LRT (chi-square reference)

fit_full <- glm(
  Satellites ~ Width + Dark,
  family = poisson(link = "log"),
  data = crabs
)

fit_red <- glm(
  Satellites ~ Width,
  family = poisson(link = "log"),
  data = crabs
)

lrt_table <- anova(fit_red, fit_full, test = "Chisq") |> tidy()
T_obs <- lrt_table$deviance[2]
p_lrt_classical <- lrt_table$p.value[2]
lrt_table |>
  kable(
    format = "html",
    booktabs = TRUE,
    align = "lccccc",
    escape = FALSE,
    col.names = c("Model", "Resid. Df", "Resid. Dev.", "Df", "$\\chi^2$", "$p$-value")
  ) |>
  kable_styling(latex_options = c("scale_down", "hold_position"),
                font_size = 9, full_width = TRUE) |>
  column_spec(1, width = "5cm")
Table 4: Classical likelihood ratio test (chi-square reference).
Model                       Resid. Df   Resid. Dev.   Df   $\chi^2$    $p$-value
Satellites ~ Width             171        567.8786    NA      NA          NA
Satellites ~ Width + Dark      170        560.9576     1    6.92101    0.0085189
  • The LRT statistic is \(T = 2(\ell_{full} - \ell_{red})\), reported here as \(\chi^2\). For these data, \(T_{obs} =\) 6.921.
  • The chi-square p-value (0.00852) is based on a large-sample reference distribution (\(T \overset{approx}{\sim} \chi^2_{df}\)).
  • If this p-value is small, it suggests the full model fits meaningfully better, i.e., Dark adds information beyond Width.
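As a sanity check, the numbers in Table 4 can be reproduced by hand from the residual deviances:

```r
# LRT statistic = drop in residual deviance; compare to chi-square with 1 df
T_obs   <- 567.8786 - 560.9576                       # 6.921
p_chisq <- pchisq(T_obs, df = 1, lower.tail = FALSE) # about 0.0085
```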

5.2 Parametric bootstrap LRT p-value (null simulation)

Generate bootstrap datasets under \(H_0\) by simulating new counts from the reduced model.

Assumptions under \(H_0\) (parametric bootstrap):

  • The reduced Poisson GLM is correctly specified under the null: conditional on Width, satellite counts follow a Poisson distribution with mean \(\mu_i\) and log link.
  • Observations are independent (each crab is an independent sampling unit).
  • The covariates (Width, Dark) are treated as fixed (or we condition on them). Under \(H_0\) the distribution of \(Y\) depends on Width but not on Dark.

This is not a permutation/exchangeability argument: we are not shuffling Dark labels. We are simulating \(Y\) from an assumed model \(Y \mid X \sim \text{Poisson}(\hat\mu_0(X))\).

n_crabs <- nrow(crabs)
mu_hat0 <- fitted(fit_red)

T_boot_parametric <- numeric(B)
for (b in seq_len(B)) {
  y_star <- rpois(n_crabs, lambda = mu_hat0)
  dat_star <- crabs
  dat_star$Satellites <- y_star

  fit_red_b <- glm(Satellites ~ Width, family = poisson(link = "log"), data = dat_star)
  fit_full_b <- glm(Satellites ~ Width + Dark, family = poisson(link = "log"), data = dat_star)

  T_boot_parametric[b] <- as.numeric(2 * (logLik(fit_full_b) - logLik(fit_red_b)))
}

p_lrt_parametric <- bootstrap_p_value(T_boot_parametric, T_obs, two_sided = FALSE)
p_lrt_parametric
[1] 0.009495252
  • This p-value estimates \(\Pr(T^* \ge T_{obs} \mid H_0)\) where \(T^*\) is generated by simulating new outcomes from the reduced (null) Poisson GLM.
  • Here, \(\hat p \approx\) 0.0095. If it’s close to the chi-square p-value, that suggests the asymptotic reference is adequate.
  • If it differs noticeably, the bootstrap is warning you that the chi-square approximation may be a bit off for this sample/model.

5.3 Permutation test (shuffling Dark, keeping Y and Width fixed)

If you want a fully nonparametric way to get a null distribution for the Dark effect, a natural option is a permutation test.

Idea: under \(H_0\) (Dark has no association with satellites after adjusting for width), the Dark labels should be exchangeable given Width. To make “given Width” concrete, we use a stratified permutation: bin Width and shuffle Dark only within bins.

set.seed(123)

width_bin <- ntile(crabs$Width, 5)

T_boot_perm <- numeric(B)

for (b in seq_len(B)) {
  dat_perm <- crabs
  dark_perm <- dat_perm$Dark

  for (g in sort(unique(width_bin))) {
    idx_g <- which(width_bin == g)
    dark_perm[idx_g] <- sample(dark_perm[idx_g], size = length(idx_g), replace = FALSE)
  }

  dat_perm$Dark <- dark_perm

  fit_red_b <- glm(Satellites ~ Width, family = poisson(link = "log"), data = dat_perm)
  fit_full_b <- glm(Satellites ~ Width + Dark, family = poisson(link = "log"), data = dat_perm)

  T_boot_perm[b] <- as.numeric(2 * (logLik(fit_full_b) - logLik(fit_red_b)))
}

p_perm <- bootstrap_p_value(T_boot_perm, T_obs, two_sided = FALSE)
p_perm
[1] 0.133933
  • This p-value answers a different “null world” than the parametric bootstrap: it treats the outcome model as unknown and instead tests whether the Dark labels are (approximately) exchangeable within Width bins under \(H_0\).
  • Here, \(\hat p \approx\) 0.134, which is not small: shuffling Dark (while preserving the Width structure) produces an LRT statistic as large as the observed one fairly often, so this test finds little evidence against exchangeability.
  • If the permutation and parametric-bootstrap p-values disagree, that’s a signal to think carefully about which assumptions are more defensible for this dataset.

df_plot <- tibble(
  T = c(T_boot_parametric, T_boot_perm),
  method = rep(c("Parametric null", "Permutation null"), each = B)
)

ggplot(df_plot, aes(x = T)) +
  geom_histogram(bins = 40, color = "white") +
  geom_vline(xintercept = T_obs, linewidth = 1) +
  facet_wrap(~ method, ncol = 1, scales = "free_y") +
  labs(x = "LRT statistic", y = "Count") +
  theme_minimal()
Figure 1: Null distributions for the LRT statistic (parametric-bootstrap null and stratified permutation null), with the observed LRT marked.

6. Prediction intervals with bootstrap

Confidence intervals quantify uncertainty for an unknown parameter (or mean prediction). Prediction intervals are wider because they must also include the randomness of a new observation.

We’ll build a bootstrap prediction interval for the number of satellites for a new crab with a chosen width and shell color.

  • CI for the mean: uncertainty in \(\mu(x)\).
  • Prediction interval: uncertainty in a future \(Y\) at that \(x\) (mean uncertainty + Poisson noise).

new_crab <- tibble(Width = 26, Dark = factor("yes", levels = levels(crabs$Dark)))

pred_mean_boot <- numeric(B)
pred_y_boot <- numeric(B)

for (b in seq_len(B)) {
  idx <- sample.int(n_crabs, n_crabs, replace = TRUE)
  dat_b <- crabs[idx, ]

  fit_b <- glm(Satellites ~ Width + Dark, family = poisson(link = "log"), data = dat_b)
  mu_b <- as.numeric(predict(fit_b, newdata = new_crab, type = "response"))

  pred_mean_boot[b] <- mu_b
  pred_y_boot[b] <- rpois(1, lambda = mu_b)
}

ci_mean <- percentile_ci(pred_mean_boot)
pi_y <- percentile_ci(pred_y_boot)

list(
  mean_CI_95 = ci_mean,
  prediction_interval_95 = pi_y
)
$mean_CI_95
[1] "[1.553, 2.942]"

$prediction_interval_95
[1] "[0, 5]"
  • The 95% CI for the mean (mean_CI_95) is for the expected number of satellites for a crab with Width = 26 and Dark = yes.
  • The 95% prediction interval (prediction_interval_95) is for a single future crab’s observed count at that same \((Width, Dark)\). It is wider because it includes both parameter uncertainty and Poisson outcome noise.
  • If you care about the average response, use the CI; if you care about what an individual new observation might look like, use the prediction interval.

7. Conclusion

  • Bootstrap is a general-purpose tool for uncertainty when analytic formulas are fragile or inconvenient.
  • Nonparametric bootstrap resamples the observed data; parametric bootstrap simulates from a fitted model.
  • For confidence intervals, percentile bootstrap is a simple baseline; for hypothesis tests, bootstrap samples must reflect \(H_0\).
  • For GLMs and likelihood ratios, parametric (null) bootstrap is a natural way to avoid relying only on asymptotic chi-square approximations.
  • Prediction intervals via bootstrap are often the most directly useful output because they live on the outcome scale.