Bootstrap Confidence Intervals

Author

Oliver d’Pug

Bootstrapping.QMD and its output contain information on how to compute/generate bootstrap samples and compute the bootstrap distribution of a statistic. The various approaches that are presented allow the computation of bias and standard error estimates. In what follows, we make use of the boot package to fit the bootstrap distributions, and from the distribution a number of different confidence intervals for the parameter of interest.

The boot Package’s Bootstrap Distribution

The boot package can be obtained from CRAN. Once loaded, it is easy to use the boot function to create bootstrap distributions for a statistic that we have defined. Below we create functions that boot can call.;

Estimate \theta = \mu for N\left(\mu,\, \sigma_0^2\right)

In Rao-Blackwell and Lehmann-Scheffe we saw that it is possible to update a statistic T to make it unbiased. Consider a n random variables X_i \stackrel{iid}{\sim} N(\mu,\, \sigma_0^2) and a statistic T=\sum_{i=1}^n X_i. Since E(T) = n \mu \ne \mu we wish to find an unbiased estimator.

If we let U = X_1 then E(U) = \mu then Rao-Blackwell shows that V = \overline{X} is an unbiased estimator of \mu. To confirm this we can look at the bootstrap distributions of T, U, and V. We start by defining a function that computes the statistics within the boot function.

Code

  normal_mystat = function(d, i){
                      n = length(i)    ### Not used and a waste of time
                      
                      T = sum(d[i])    ### Sum_{i=1}^n X_i
                      U = d[i[1]]      ### X_1
                      V = mean(d[i])   ### \overline{X}
                           
                      c(T, U, V)       ### Return the list
                  }

We generate a number of observations from a normal distribution of our choice.

Code

  ### Set the seed to 47 to replicate output.  
  set.seed(47)   ### Comment this line out to use the clock.

  n = 100
  mu = 3
  s = 5
  
  normal_data = rnorm(n, mu, s)
  
  write.csv(normal_data, "normal_data.csv")

We can now use the normal_mystat function to get the bootstrap distributions.

Code

  ### Use the boot function to run the bootstrap
  normal_boot = boot(normal_data, normal_mystat, R=9999)

  ### Check the behavior of the statistics
  normal_boot


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = normal_data, statistic = normal_mystat, R = 9999)


Bootstrap Statistics :
      original       bias    std. error
t1* 325.744525 -0.423393654  48.9468515
t2*  12.973482 -9.701413166   4.8954405
t3*   3.257445 -0.004233937   0.4894685

Code

  summary(normal_boot)


Number of bootstrap replications R = 9999 
  original   bootBias   bootSE  bootMed
1 325.7445 -0.4233937 48.94685 325.7732
2  12.9735 -9.7014132  4.89544   3.1980
3   3.2574 -0.0042339  0.48947   3.2577

Code

  plot(normal_boot)

Code

  hist(normal_boot$t[,1])

Code

  qqPlot(normal_boot$t[,1], distribution="norm")

[1] 4670 7966

Code

  hist(normal_boot$t[,2])

Code

  qqPlot(normal_boot$t[,2], distribution="norm")

[1] 12 26

Code

  hist(normal_boot$t[,3])

Code

  qqPlot(normal_boot$t[,3], distribution="norm")

[1] 4670 7966

Note that the distributions of the sum and mean are both clearly normal. On the other hand, while normal by definition, a single observation appears to be less normal.

Code

  quantile(normal_boot$t[,3], c(0.025, 0.975))

    2.5%    97.5% 
2.292393 4.205230

Code

  args(boot.ci)

function (boot.out, conf = 0.95, type = "all", index = 1L:min(2L, 
    length(boot.out$t0)), var.t0 = NULL, var.t = NULL, t0 = NULL, 
    t = NULL, L = NULL, h = function(t) t, hdot = function(t) rep(1, 
        length(t)), hinv = function(t) t, ...) 
NULL

Code

  boot.ci(normal_boot, type="all", index=3)

Warning in boot.ci(normal_boot, type = "all", index = 3): bootstrap variances
needed for studentized intervals

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 9999 bootstrap replicates

CALL : 
boot.ci(boot.out = normal_boot, type = "all", index = 3)

Intervals : 
Level      Normal              Basic         
95%   ( 2.302,  4.221 )   ( 2.310,  4.223 )  

Level     Percentile            BCa          
95%   ( 2.292,  4.205 )   ( 2.287,  4.200 )  
Calculations and Intervals on Original Scale

Code

  x = as.data.frame(normal_boot$t)
  colnames(x) = c("T", "U", "V")
  x |>
      pivot_longer(cols =c("T","U","V"), 
                   names_to = "statistic",
                   values_to = "mu_hat", 
                   values_drop_na = TRUE) |>
      ggplot(aes(y = statistic, x = mu_hat)) +
        geom_boxplot() +
        geom_vline(xintercept=c(mu, n*mu), color="maroon", lty=4) +
        geom_vline(xintercept=c(normal_boot$t0), color="green", lty=5)

We can check that the theoretical and empirical bias and standard errors are similar.

Code

  normal_bias = as.data.frame(cbind(summary(normal_boot)$original - c(n*mu, mu, mu), 
                                    summary(normal_boot)$bootBias, 
                                    c(0,0,0)))
  colnames(normal_bias) = c("Original", "Bootstrap", "Theoretical")
  rownames(normal_bias) = c("Sum","X_1","Xbar")
  normal_bias

       Original    Bootstrap Theoretical
Sum  25.7445253 -0.423393654           0
X_1   9.9734817 -9.701413166           0
Xbar  0.2574453 -0.004233937           0

Code

  normal_ses = as.data.frame(cbind(summary(normal_boot)$bootSE, c(sqrt(n)*s, s, s/sqrt(n))))
  colnames(normal_ses) = c("Bootstrap", "Theoretical")
  rownames(normal_ses) = c("Sum","X_1","Xbar")
  normal_ses

      Bootstrap Theoretical
Sum  48.9468515        50.0
X_1   4.8954405         5.0
Xbar  0.4894685         0.5

Note that boot.ci uses its own quantile function rather than relying upon the base quantile function. Minor interpolation differences between interval estimates are common.

Estimate \theta for U(0,\theta)

As we saw earlier, Lehmann-Scheffe II can be used to get an estimator for \theta when we have n random variables X_i \stackrel{iid}{\sim} U(0,\, \theta) and a statistic T=X_{[n]}. Since E(T) = n \theta/(n+1) \ne \theta is based upon a minimal sufficient statistic and T can be shown to be complete, we need only to find a constant that makes our new statistic have expectation \theta. We note that V = (n+1)T/n has expectation E(V) = E\left(\frac{n+1}{n}T\right) = \frac{n+1}{n}\frac{n}{n+1} \theta = \theta So, V is UMVUE for \theta.

We can check the behavior of the unbiased estimators U = 2 \overline{X} and V = (n+1)X_{[n]}/n using bootstrapping.

Code

  uniform_mystat = function(d, i){
                      n = length(i)               ### Require for V
                      
                      T = max(d[i])               ### X_([n]) is biased
                      U = 2 * mean(d[i])          ### 2 * \overline{X}
                      V = (n+1) * T / n           ### (n+1) X_{[n]} / n
                           
                      c(T, U, V)                  ### Return the list
                  }

We generate a number of observations from a U(0,\, \theta) distribution of our choice.

Code

  ### Set the seed to 47 to replicate output.  
  set.seed(47)   ### Comment this line out to use the clock.

  n = 100
  theta = 5
  
  uniform_data = runif(n, 0, theta)
  
  write.csv(uniform_data, "uniform_data.csv")

We can now use the uniform_mystat function to get the bootstrap distributions.

Code

  ### Use the boot function to run the bootstrap
  uniform_boot = boot(uniform_data, uniform_mystat, R=9999)

  ### Check the behavior of the statistics
  uniform_boot


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = uniform_data, statistic = uniform_mystat, R = 9999)


Bootstrap Statistics :
    original        bias    std. error
t1* 4.980499 -0.0409126659  0.06218675
t2* 4.730612  0.0007536833  0.28421486
t3* 5.030304 -0.0413217926  0.06280862

Code

  summary(uniform_boot)


Number of bootstrap replications R = 9999 
  original    bootBias   bootSE bootMed
1   4.9805 -0.04091267 0.062187  4.9805
2   4.7306  0.00075368 0.284215  4.7297
3   5.0303 -0.04132179 0.062809  5.0303

Code

  plot(uniform_boot)

Code

  hist(uniform_boot$t[,1])

Code

  hist(uniform_boot$t[,2])

Code

  hist(uniform_boot$t[,3])

Note that the distributions of the sum and mean are both clearly normal. On the other hand, while normal by definition, a single observation appears to be less normal.

Code

  quantile(uniform_boot$t[,3], c(0.025, 0.975))

    2.5%    97.5% 
4.863812 5.030304

Code

  args(boot.ci)

function (boot.out, conf = 0.95, type = "all", index = 1L:min(2L, 
    length(boot.out$t0)), var.t0 = NULL, var.t = NULL, t0 = NULL, 
    t = NULL, L = NULL, h = function(t) t, hdot = function(t) rep(1, 
        length(t)), hinv = function(t) t, ...) 
NULL

Code

  uniform_boot.ci = boot.ci(uniform_boot, index=3)

Warning in boot.ci(uniform_boot, index = 3): bootstrap variances needed for
studentized intervals

Code

  uniform_boot.ci

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 9999 bootstrap replicates

CALL : 
boot.ci(boot.out = uniform_boot, index = 3)

Intervals : 
Level      Normal              Basic         
95%   ( 4.949,  5.195 )   ( 5.030,  5.197 )  

Level     Percentile            BCa          
95%   ( 4.864,  5.030 )   ( 4.864,  5.030 )  
Calculations and Intervals on Original Scale

Code

  x = as.data.frame(uniform_boot$t)
  colnames(x) = c("T", "U", "V")
  x |>
         pivot_longer(cols =c("T","U","V"), 
                      names_to = "statistic",
                      values_to = "theta_hat", 
                      values_drop_na = TRUE) |>
         ggplot(aes(y = statistic, x = theta_hat)) +
            geom_boxplot() +
            geom_vline(xintercept=theta, color="maroon", lty=2)

Other Formats

The boot Package’s Bootstrap Distribution

Estimate \theta = \mu for N\left(\mu,\, \sigma_0^2\right)

Estimate \theta for U(0,\theta)