Nonparametric bootstrap: Assumptions and number of bootstrap samples?

In summary: bootstrapping can be a useful tool for estimating population parameters, but it is important to consider the assumptions and limitations of the method, as well as the interpretation of confidence intervals. A bootstrapped statistic may still be a reasonably good estimate even when the bootstrap confidence intervals do not converge to the true value. The number of bootstrap samples to take depends on the accuracy desired and the properties of the original sample, and assumptions such as finite variance should hold for the procedure to work effectively. There is no guarantee that the results will converge to the true value of the parameter, so caution should be exercised in interpreting them.
  • #1
madilyn
I've been figuring out the use of the nonparametric bootstrap and if I understand correctly, this is the procedure:

1. Take an original sample, a vector x = (x1, ..., xn)

2. Generate k vectors, each called a 'bootstrap sample', of the same length as x by random sampling (with replacement) of the original vector, x, e.g. I have b1 = (x3, xn, x1, ..., x13), b2 = (x2, x5, ..., xn) etc.

3. Now I calculate any statistic [itex] \hat{\theta} = f(\mathbf{x})[/itex] on each bootstrap sample; my bootstrapped statistic is the mean of [itex] \hat{\theta} [/itex] across the bootstrap distribution, and confidence intervals on that bootstrapped statistic can be found using the inverse CDF of a normal distribution.
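As a concrete illustration, steps 1-3 above can be sketched in Python (a minimal sketch only; the toy sample data, the choice of the median as the statistic, and k = 2000 are assumptions made for the example):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def bootstrap(x, stat, k=2000, alpha=0.05):
    """Steps 2-3: resample x with replacement k times, apply stat to each
    resample, and summarize the bootstrap distribution with its mean and a
    normal-approximation confidence interval (via the inverse normal CDF)."""
    n = len(x)
    thetas = np.array([stat(rng.choice(x, size=n, replace=True))
                       for _ in range(k)])
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. about 1.96 for alpha = 0.05
    m, se = thetas.mean(), thetas.std(ddof=1)
    return m, (m - z * se, m + z * se)

x = rng.normal(loc=5.0, scale=2.0, size=100)   # step 1: the original sample
est, (lo, hi) = bootstrap(x, np.median)
```

Here `est` is the mean of the bootstrapped medians and `(lo, hi)` the normal-approximation interval around it, exactly as in step 3.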

If everything is correct above, I have two questions:

i. How do I determine the number of bootstrap samples to take, k? Is there a principled way to determine this? Without this, I would just have to keep repeating the same procedure with increasing k until there's some kind of convergence on the mean [itex] \bar{\hat{\theta}} [/itex]? But this seems computationally taxing.

ii. What assumptions must be correct for this procedure to work? I'm guessing that [itex] \hat{\theta} [/itex] must have finite variance? What else?
 
  • #2
I think your description of the procedure is correct.


madilyn said:
ii. What assumptions must be correct for this procedure to work? I'm guessing that [itex] \hat{\theta} [/itex] must have finite variance? What else?

That's a good question. A major task is to define what it means to say it "works". Do you have a sophisticated understanding of the meaning of a confidence interval? In particular, do you understand that the usual sort of confidence interval does NOT let you make statements about a population parameter being in a specific numerical interval? (For example, you can't conclude things like "There is a 90% chance that the population mean is in the interval [2.3 - 0.62, 2.3 + 0.62].")

If you mean [itex] \bar{\hat{\theta}} [/itex] to be an estimate of a parameter of the distribution from which the sample is taken, I don't think there is any guarantee that the results of the bootstrap "converge" to the value of that parameter as you generate more bootstrap samples.
 
  • #3
Stephen Tashi said:
I think your description of the procedure is correct.

That's a good question. A major task is to define what it means to say it "works". Do you have a sophisticated understanding of the meaning of a confidence interval? In particular, do you understand that the usual sort of confidence interval does NOT let you make statements about a population parameter being in a specific numerical interval? (For example, you can't conclude things like "There is a 90% chance that the population mean is in the interval [2.3 - 0.62, 2.3 + 0.62].")

If you mean [itex] \bar{\hat{\theta}} [/itex] to be an estimate of a parameter of the distribution from which the sample is taken, I don't think there is any guarantee that the results of the bootstrap "converge" to the value of that parameter as you generate more bootstrap samples.

Stephen, thanks for your prompt answers as always!

1. Unfortunately no, I don't have a very sophisticated understanding of the meaning of a confidence interval (I wouldn't be able to write a philosophical debate about it). But I do have a basic grasp of the pitfalls. What's one school of thought I could practice on "what works" without going too deep into the foundations?

2. Hm, that's problematic. How would I know what's a good bootstrap approximation if it doesn't converge?

I'm sorry if I sound like I'm just looking for more clues but I don't have a strong intuition on how to attack this.
 
  • #4
A confidence interval isn't the same as a "credible interval" (http://en.wikipedia.org/wiki/Credible_interval).

Suppose we are trying to estimate a property of a population by bootstrapping. We have a large batch of samples and from it we repeatedly select smaller batches. How close our estimate is to the actual value of the population parameter depends on 1) how well the large batch of samples matches the population distribution and 2) how we estimate the parameter from the bootstrap samples.

Once you have the large batch of samples, you can usually produce smaller (frequentist) confidence intervals by doing more bootstrap sampling in 2). More bootstrap samples can't improve the mis-estimation that may be introduced in 1). The total confidence interval size depends on both 1) and 2).
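A small numeric sketch of this distinction (hypothetical data; the exponential population, sample size, and resample counts are assumptions for the example): more resamples in 2) stabilize the interval, but both intervals stay centered on the sample mean, so any mismatch introduced in 1) persists no matter how many resamples are drawn.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=10.0, size=30)   # 1): one batch drawn from the population

def percentile_ci(x, k):
    """2): a 95% percentile interval for the mean from k bootstrap resamples."""
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(k)])
    return np.percentile(means, [2.5, 97.5])

lo_few, hi_few = percentile_ci(x, 200)       # few resamples: a noisier interval
lo_many, hi_many = percentile_ci(x, 20000)   # many resamples: a stable interval
# Both intervals straddle x.mean(); neither is guaranteed to cover 10.0, the
# actual population mean, if the original batch in 1) happened to be unlucky.
```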

It would be easier to discuss bootstrapping if we discuss estimating a specific thing.
 
  • #5


To address the original questions directly:

1. Yes, your understanding of the procedure for nonparametric bootstrap is correct. It involves generating multiple bootstrap samples from the original sample, calculating a statistic of interest on each sample, and using the distribution of these statistics to estimate the population parameter and construct confidence intervals.

2. The number of bootstrap samples, k, can be determined in a few different ways. A common rule of thumb suggests using at least 1000 bootstrap samples, but the number can also depend on the complexity of the data and the desired level of precision. In general, more bootstrap samples reduce the simulation (Monte Carlo) error of your estimates, though not the error inherited from the original sample; the cost is computation. Another approach is a "bootstrapping stopping rule", where you continue generating bootstrap samples until the results stabilize or converge. This can save computational time while still providing accurate estimates.
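One way such a stopping rule could look (a hypothetical sketch; the batch size, tolerance, and toy data are assumptions, not a standard recipe): generate resamples in batches and stop once the running mean of the bootstrapped statistic stops moving.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)   # toy original sample

def bootstrap_until_stable(x, stat, batch=500, tol=0.01, max_k=50_000):
    """Add bootstrap resamples in batches until the running mean of the
    statistic changes by less than tol between batches (or max_k is hit)."""
    thetas, prev = [], float("inf")
    while len(thetas) < max_k:
        thetas.extend(stat(rng.choice(x, size=len(x), replace=True))
                      for _ in range(batch))
        cur = float(np.mean(thetas))
        if abs(cur - prev) < tol:
            break
        prev = cur
    return cur, len(thetas)

est, k_used = bootstrap_until_stable(x, np.mean)
```

Note this only controls the Monte Carlo noise from resampling; it says nothing about how well the original sample represents the population.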

3. The nonparametric bootstrap method does not rely on any specific assumptions about the underlying distribution of the data. However, there are a few assumptions that must be met for the procedure to work effectively. These include:
- The original sample is representative of the population.
- The observations in the original sample are independent.
- The statistic of interest, [itex] \hat{\theta} [/itex], has a finite variance.
- The sampling process is consistent (i.e. the same results would be obtained if we were to take multiple samples from the same population).
- The original sample is large enough to provide reliable estimates.

In summary, the nonparametric bootstrap method is a powerful tool for estimating population parameters and constructing confidence intervals without making assumptions about the underlying distribution of the data. However, it is important to carefully consider the number of bootstrap samples and ensure that the necessary assumptions are met for accurate results.
 

FAQ: Nonparametric bootstrap: Assumptions and number of bootstrap samples?

What are the assumptions for using nonparametric bootstrap?

The main assumption for using nonparametric bootstrap is that the sample data is representative of the population. Additionally, the data should be independent and identically distributed (iid). This means that each data point is independent of the others and comes from the same underlying distribution.

Can nonparametric bootstrap be used with small sample sizes?

Yes, nonparametric bootstrap can be used with small sample sizes. In fact, it is often used when the sample size is small and the underlying distribution is unknown or not easily modeled by a parametric distribution. However, it is important to have a sufficient number of bootstrap samples to accurately represent the population distribution.

What is the recommended number of bootstrap samples to use?

The recommended number of bootstrap samples to use depends on the complexity of the data and the desired level of accuracy. Generally, a minimum of 1000 bootstrap samples is recommended, but for more complex data, a higher number (e.g. 5000 or more) may be necessary to accurately represent the population distribution. It is also recommended to test the stability of the results by increasing the number of bootstrap samples and comparing the results.

Can nonparametric bootstrap be used with any type of data?

Yes, nonparametric bootstrap can be used with any type of data. It is commonly used with numerical data, but it can also be used with categorical or ordinal data. It is important to ensure that the assumptions of independence and identical distribution are met for the data being bootstrapped.

How can nonparametric bootstrap be used for hypothesis testing?

Nonparametric bootstrap can be used for hypothesis testing by generating bootstrap samples under the assumption that the null hypothesis is true and comparing the observed statistic to the resulting null distribution. If the observed statistic falls well within the range of values from the bootstrap samples, we fail to reject the null hypothesis; if it is extreme relative to the bootstrap distribution, the null hypothesis is rejected.
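One common way to generate bootstrap samples under the null is to recenter the sample so the null hypothesis holds and resample from that (a minimal sketch; the toy data, effect size, and null value are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=1.0, size=50)   # toy data with a real mean shift
mu0 = 0.0                                      # H0: the population mean is mu0

x_null = x - x.mean() + mu0        # recenter the sample so H0 is true
obs = abs(x.mean() - mu0)          # observed (two-sided) discrepancy
null_means = np.array([rng.choice(x_null, size=len(x_null), replace=True).mean()
                       for _ in range(5000)])
# p-value: fraction of null resamples at least as extreme as the observed data
p = float(np.mean(np.abs(null_means - mu0) >= obs))
```

A small p means the observed mean would be an extreme outlier if the null held, leading to rejection; a large p means we fail to reject.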
