Bootstrap in Monte Carlo and the number of samples

In summary, the thread discusses using the bootstrap to estimate the standard deviation of a quantity computed from Monte Carlo simulation data. The original poster finds very little variation between bootstrap iterations and suspects either the small number of iterations (50) or the large number of data points (5000). Replies ask which random variable the standard deviation refers to, whether the data points are independent, and whether the simulated event is rare. The poster later reports that the bootstrap error bars are unrealistically small and that the data are correlated.
  • #1
diegzumillo
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation. What I'm finding, however, is very, very little variation with each iteration of bootstrap, and I don't know why.

Two reasons come to mind. Given the complexity of the analysis, I cannot do more than 50 iterations, which sounds like too few according to sources out there, so maybe I just need more? Also, I have 5000 data points, and resampling barely makes a dent in the histogram, so I'm not surprised it's not changing the statistics much either.

Anyone with experience with bootstrap analysis have any idea?

PS: I have no idea what 'level' this question is.
 
  • #2
diegzumillo said:
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation.

The standard deviation of what?

"Standard deviation" is associated with the distribution of a random variable. What random variable are you talking about?
 
  • #3
Oh, I mean the standard deviation of the thing I measured after the whole analysis. I didn't go into detail because it's not a simple quantity to describe. Schematically it's something like data > process data into a single quantity. Then bootstrapping resampled data > process into single quantity. Then I take all the generated processed quantities of each bootstrap and calculate the standard deviation of the whole set.
 
  • #4
diegzumillo said:
Schematically it's something like data > process data into a single quantity. Then bootstrapping resampled data > process into single quantity. Then I take all the generated processed quantities of each bootstrap and calculate the standard deviation of the whole set.

It's unclear what you mean and what you are doing.

If you have N independent samples of a random variable, there are estimators of its standard deviation (e.g. the sample standard deviation) that use all N samples. It isn't clear why you are bootstrapping. How does what you are doing differ from having N independent samples of the random variable of interest?
 
  • #5
I can try explaining what I'm trying to do a little better. I have a set of data which can be used to calculate a quantity I'm interested in. For the sake of example, say we want to calculate the skewness. I take the original data and calculate the skewness. But the data I have is itself a small sample of the entire sample space, and I can't run simulations forever, so the small sample I have will have to do. I'll use bootstrap and resample it again and again. Each time I resample I calculate the skewness, which will surprisingly be different each time. Then I calculate the standard deviation of all the skewnesses (this word might not exist) obtained with each iteration, as a way of saying how confident I am that the skewness I calculated originally is representative of the larger sample data (the one I don't have).

Sorry for being vague earlier. The thing is I don't know much about this stuff, and whenever I don't know much about something I tend to assume I'm the only dummy who doesn't know about it, therefore everyone else would recognize the problem without a lot of explanation. i.e. I was lazy and presumptuous.
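
The procedure described in this post (compute the statistic, resample with replacement, recompute, take the standard deviation of the results) can be sketched as follows. This is only a minimal illustration under stated assumptions: the exponential draws are a hypothetical stand-in for the simulation output, the data points are treated as independent, and the helper names `skewness` and `bootstrap_std_error` are made up for this example.

```python
import random
import statistics


def skewness(xs):
    """Sample skewness: third central moment divided by the variance to the 3/2 power."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def bootstrap_std_error(data, statistic, n_resamples, rng):
    """Standard deviation of `statistic` over resamples drawn with replacement."""
    n = len(data)
    replicates = [
        statistic(rng.choices(data, k=n))  # one resample, same size as the data
        for _ in range(n_resamples)
    ]
    return statistics.stdev(replicates)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical stand-in for the simulation output: 5000 independent draws
data = [rng.expovariate(1.0) for _ in range(5000)]

estimate = skewness(data)
error = bootstrap_std_error(data, skewness, n_resamples=50, rng=rng)
print(f"skewness = {estimate:.3f} +/- {error:.3f}")
```

With 5000 independent points the spread between resamples is expected to be small compared with the statistic itself, since each resample is drawn from nearly the same empirical distribution.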
 
  • #6
diegzumillo said:
I have a set of data which can be used to calculate a quantity I'm interested in. For the sake of example, say we want to calculate the skewness.

It's important to know whether you want to estimate a property of a random variable versus a property of N samples of that random variable. For example, the standard deviation of a random variable is a different number than the standard deviation of the mean of 20 independent samples of that random variable. It isn't clear whether you are trying to estimate a parameter of a finite set of outcomes of a random variable or whether you are trying to estimate a parameter associated with the distribution of a single outcome of that random variable.

It also isn't clear whether your 5000 data points are independent samples of the same random variable or whether they are generated by a process that introduces a dependence in their values, such as a Markov chain or an ARIMA process.
 
  • #7
1) Are you changing the random number seed from one run to the next?
2) Is the situation being simulated such that the random part is small compared to the whole?
3) Is the significant random part a rare event?
 
  • #8
diegzumillo said:
Two reasons come to mind. Given the complexity of the analysis, I cannot do more than 50 iterations, which sounds like too few according to sources out there, so maybe I just need more? Also, I have 5000 data points, and resampling barely makes a dent in the histogram, so I'm not surprised it's not changing the statistics much either.

What is the meaning of the "5000" and of the "50" (as in 50 iterations)? Are you ultimately getting 50 samples of your quantity of interest, or are you getting 5000?

A sample of size 50 is somewhat "small", but people often need to deal with samples that small in applications. If the data are roughly normally distributed, you can get confidence intervals on the variance by using the chi-squared distribution.
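
For reference, the textbook interval for the variance can be written out as below. It assumes N independent, normally distributed samples with sample variance s^2; the symbols N, s^2, and alpha are notation introduced here, not taken from the thread.

```latex
\frac{(N-1)\,s^2}{\chi^2_{1-\alpha/2,\;N-1}}
\;\le\; \sigma^2 \;\le\;
\frac{(N-1)\,s^2}{\chi^2_{\alpha/2,\;N-1}}
```

Here \chi^2_{p,\,\nu} denotes the p-quantile of the chi-squared distribution with \nu degrees of freedom, and the interval has confidence level 1 - \alpha. It does not apply directly if the samples are correlated.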

A sample of size 5000 is really quite good, relative to what people often need to deal with in applications. An inference based on a sample of that size ought to be more "meaningful" than one based on a sample of size 50.
 
  • #9
Oh shoot. I thought this conversation had died after my last comment because I didn't get any notification (probably overlooked it).

This is still a problem for me, by the way. Bootstrapping still gives unrealistically small error bars.

The problem I'm working on is a Monte Carlo simulation on a lattice (think Ising model), where for each temperature I calculate observables like magnetization for about 5000 different configurations. Then I calculate a density of states using the Ferrenberg-Swendsen algorithm.

I'm not very confident about bootstrapping this system because the data is correlated, as is usually the case in Monte Carlo methods. The Ferrenberg-Swendsen algorithm takes autocorrelation into account, so that's fine, but then bootstrapping? Shouldn't the data be uncorrelated?
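
One common way to handle the correlation raised here is a block bootstrap, which resamples contiguous blocks instead of single points. The sketch below is not the poster's analysis: it uses a synthetic AR(1) series as a hypothetical stand-in for correlated lattice measurements, takes the mean as the statistic, and picks a block length of 100 purely for illustration (in practice it should be several autocorrelation times long).

```python
import random
import statistics


def ar1_series(n, phi, rng):
    """Generate a correlated AR(1) series x[t] = phi * x[t-1] + noise."""
    x, series = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series


def naive_resample(data, rng):
    """Ordinary bootstrap: draw single points with replacement (assumes independence)."""
    return rng.choices(data, k=len(data))


def block_resample(data, block_len, rng):
    """Moving-block bootstrap: draw contiguous blocks so short-range correlation survives."""
    n = len(data)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(data[start:start + block_len])
    return out[:n]


def bootstrap_se(data, resample, n_resamples=200):
    """Standard deviation of the mean over bootstrap resamples."""
    means = [statistics.fmean(resample(data)) for _ in range(n_resamples)]
    return statistics.stdev(means)


rng = random.Random(1)
# Hypothetical stand-in for correlated simulation output
data = ar1_series(5000, phi=0.9, rng=rng)

naive_se = bootstrap_se(data, lambda d: naive_resample(d, rng))
block_se = bootstrap_se(data, lambda d: block_resample(d, block_len=100, rng=rng))
print(f"naive bootstrap SE: {naive_se:.4f}")
print(f"block bootstrap SE: {block_se:.4f}")
```

On positively correlated data like this, the naive bootstrap treats all 5000 points as independent and so reports a noticeably smaller error than the block version, which is one possible explanation for error bars that look unrealistically small.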
 

FAQ: Bootstrap in Monte Carlo and the number of samples

What is Bootstrap in Monte Carlo?

Bootstrap in Monte Carlo is a statistical resampling technique used to estimate the sampling distribution of a statistic by creating new samples from the original dataset through random sampling with replacement.

Why is Bootstrap useful?

Bootstrap is useful because it allows us to estimate the variability and uncertainty of a statistic without assuming a particular parametric form for the underlying population distribution. This is particularly useful when the statistic has no simple analytic error formula or when the data does not follow a normal distribution.

How many samples should be used in Bootstrap?

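The number of bootstrap resamples can vary, but it is generally recommended to use at least 1000 to get reliable results, especially for confidence intervals. The number needed also depends on the level of accuracy desired. Note that more resamples only reduce the noise of the resampling procedure itself; they do not reduce the uncertainty that comes from the size of the original dataset.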
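
As a rough guide to how much the number of resamples matters, the noise in a bootstrap standard error computed from B resamples can be estimated as below. This is a rule of thumb that assumes the bootstrap replicates are approximately normally distributed; the symbol B and the notation \widehat{\mathrm{SE}}_B are introduced here for illustration.

```latex
\frac{\operatorname{sd}\!\left(\widehat{\mathrm{SE}}_B\right)}{\mathrm{SE}}
\;\approx\; \frac{1}{\sqrt{2\,(B-1)}},
\qquad
B = 50:\ \approx 0.10,
\qquad
B = 1000:\ \approx 0.022
```

Under that assumption, 50 resamples already pin down the standard error to roughly 10%, so a small number of resamples makes the error estimate noisier rather than systematically smaller.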

What is the role of confidence intervals in Bootstrap?

Bootstrap can be used to calculate confidence intervals for a statistic by taking percentiles of the resampled values as the lower and upper bounds of the interval, for example the 2.5th and 97.5th percentiles for a 95% interval. This allows us to estimate the range of values in which the true population parameter is likely to fall.
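
A minimal sketch of such a percentile interval is shown below. It assumes independent data; the normal sample and the function name `percentile_interval` are hypothetical and chosen only for this example.

```python
import random
import statistics


def percentile_interval(data, statistic, n_resamples=1000, level=0.95, rng=None):
    """Percentile bootstrap interval: quantiles of the resampled statistic."""
    if rng is None:
        rng = random.Random()
    n = len(data)
    replicates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = replicates[int(alpha * n_resamples)]
    upper = replicates[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


rng = random.Random(2)
# Hypothetical independent sample
data = [rng.gauss(10.0, 2.0) for _ in range(500)]
low, high = percentile_interval(data, statistics.fmean, rng=rng)
print(f"95% interval for the mean: [{low:.2f}, {high:.2f}]")
```

The percentile method is the simplest variant; other bootstrap intervals exist and may behave better for skewed statistics.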

Can Bootstrap be used for any type of data?

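Bootstrap can be applied to many types of data, including continuous and categorical data. It is a non-parametric method, so it does not assume a particular distribution for the data, but the standard bootstrap does assume that the observations are independent and identically distributed. Correlated data, such as consecutive configurations from a Markov chain Monte Carlo simulation, call for a variant such as the block bootstrap.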
