Central Limit Theorem: How does sample size affect the sampling distribution?

  • #1
Agent Smith
TL;DR Summary
How does sample size affect the sampling distribution
In a course I took, it says that the larger the sample size, the more likely the sampling distribution (of the sample means, I'm guessing) is to be normal. This, they say, is the Central Limit Theorem. How does this work? How does someone taking a large sample affect the sampling distribution (of the sample means)?

I can see how taking a large number of samples (not sample size) can lead to the sampling distribution (of the sample means) being a normal distribution (sample means will cluster around and/or include the population mean) centered on the population mean.

To reiterate my question: How does someone taking a large sample affect the sampling distribution (of the sample means)?
If there's any clarification that seems to be in order regarding my (mis)understanding of what's in the second paragraph (from top), kindly issue one.

Arigatou gozaimasu (thank you)
 
  • #2
If you add together a large number of small independent factors, then you get (approximately) a normal distribution. The only things that never fall into this paradigm are distributions where, on rare occasions, you get a number freakishly far from the mean.

There has been much study of how relaxed the criteria can be. As long as you don't get those sporadic outliers, the rule of thumb is that a sample size of thirty is about enough. There are tests of normality, but I've never taken a look at them.

Agent Smith said:
To reiterate my question: How does someone taking a large sample affect the sampling distribution (of the sample means)?

The larger the sample, the less the variance of the sample means. Double the sample size, halve the variance of the sample means.
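To see that scaling directly, here is a minimal Perl sketch (the uniform population, the sample sizes, and the trial count are arbitrary choices for illustration): it simulates many sample means at each sample size and prints their variance, which should roughly halve each time the sample size doubles.
Variance of the Sample Means:
use strict;
use warnings;

my $trials = 100000;                 # number of simulated sample means per sample size (arbitrary)
foreach my $n (25, 50, 100) {        # doubling n should halve the variance of the sample means
   my ($sum, $sumsq) = (0, 0);
   foreach my $t (1..$trials) {
      my $total = 0;
      $total += rand() foreach (1..$n);   # one sample of size n from Uniform[0,1)
      my $mean = $total/$n;
      $sum   += $mean;
      $sumsq += $mean**2;
   }
   my $var = $sumsq/$trials - ($sum/$trials)**2;   # variance of the simulated sample means
   printf "n = %3d: variance of sample means = %.6f (theory sigma^2/n = %.6f)\n",
      $n, $var, (1/12)/$n;           # Uniform[0,1) has variance 1/12
}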
 
  • Like
Likes Agent Smith
  • #3
Agent Smith said:
TL;DR Summary: How does sample size affect the sampling distribution

How does someone taking a large sample affect the sampling distribution (of the sample means)?
When I run into questions like this, my first approach is to do a quick Monte Carlo simulation and see. Here I used a population distribution ##X \sim Bernoulli(0.5)##, which has mean ##\mu=0.5## and variance ##\sigma^2=0.25##. According to the CVT the sampling distribution of ##\bar X## in the limit of ##N\rightarrow \infty## approaches ##\mathcal N (\mu, \sigma/\sqrt{N})##.

I did a quick Monte Carlo simulation for ##N=2## and got
[Histogram of the simulated sample means for ##N=2##, compared with the matching normal distribution]


I then did the same simulation for ##N=20## to get
[Histogram of the simulated sample means for ##N=20##, compared with the matching normal distribution]

Note that for small ##N## the distribution is not approximately normal. For ##N=2## the distribution can only take the values 0, 0.5, or 1, which is not true of the matching normal distribution. As ##N## increases, even rather modestly, the approximation becomes much better.
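For anyone who wants to reproduce this without Mathematica, here is a rough Perl sketch of the same experiment (the trial count and the tallying of the means are arbitrary choices):
Bernoulli Sample Means:
use strict;
use warnings;

my $trials = 100000;   # number of simulated sample means (arbitrary)
my $N      = 20;       # sample size; try N = 2 versus N = 20 as in the plots above
my %counts;
foreach my $t (1..$trials) {
   my $heads = 0;
   foreach my $i (1..$N) { $heads++ if rand() < 0.5; }   # N draws from Bernoulli(0.5)
   $counts{ sprintf("%.3f", $heads/$N) }++;              # tally the sample mean
}
# crude text histogram of the sampling distribution of the sample mean
foreach my $m (sort { $a <=> $b } keys %counts) {
   printf "%s  %s\n", $m, '#' x int(100*$counts{$m}/$trials);
}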
 
  • Like
Likes Klystron, WWGD and Agent Smith
  • #4
Agent Smith said:
TL;DR Summary: How does sample size affect the sampling distribution

In a course I took, it says that the larger the sample size, the more likely the sampling distribution (of the sample means, I'm guessing)
That is correct.
Agent Smith said:
is to be normal. This, they say, is the Central Limit Theorem. How does this work? How does someone taking a large sample affect the sampling distribution (of the sample means)?
This is talking about the random variable ##\frac{\sum_{i=1}^N X_i}{N}##. Every time the sample size, ##N##, changes, that is a different random variable. So it has a different distribution.
 
  • Like
Likes Agent Smith
  • #5
Note there are several types of convergence in probability theory. Here, the convergence is convergence in distribution.
 
  • Like
Likes Agent Smith
  • #6
Hornbein said:
The larger the sample, the less the variance of the sample means. Double the sample size, halve the variance of the sample means.
Are you talking about ##\sigma_{\overline x} = \frac{\sigma}{\sqrt n}## and ##\sigma_{\hat p} = \sqrt{\frac{p(1-p)}{n}}##? Here, ##\sigma_{\overline x}## and ##\sigma_{\hat p}## are the standard deviations of the sampling distributions of the sample means and the sample proportions, oui?

@Dale the Monte Carlo simulation that you did shows that a larger sample size (N) better approximates the population's distribution. I was led to believe that the best-case scenario is for the population to have a normal distribution; then even if the sample size is small the sampling distribution of the sample means will be normal.

FactChecker said:
Every time the sample size, N, changes, that is a different random variable. So it has a different distribution.
This is an interesting remark. So the random variable ##X## = the mean of a sample, and ##E(X)## = the mean of the sample means (for a particular sample size). Is there anything else I should note down?


Thanks to all
 
  • #7
WWGD said:
Note there are several types of convergence in probability theory. Here, the convergence is convergence in distribution.
As in the mean of the sample means is converging on the population mean?
 
  • #8
Agent Smith said:
@Dale the Monte Carlo simulation that you did shows that a larger sample size (N) better approximates the population's distribution
No, a larger sample size N makes the distribution of the sample mean better approximate a normal distribution. The population’s distribution was a Bernoulli distribution.

Agent Smith said:
I was led to believe that the best-case scenario is for the population to have a normal distribution; then even if the sample size is small the sampling distribution of the sample means will be normal
Yes. That is precisely why I picked a Bernoulli distribution for the population. It is the most non-normal distribution I could think of.
 
  • Wow
Likes Agent Smith
  • #9
Dale said:
Yes. That is precisely why I picked a Bernoulli distribution for the population. It is the most non-normal distribution I could think of.
Hah, I can beat that. Let ##X## and ##Y## be uniform on (0,1). Then the sample mean of ##X/Y## never converges to a normal distribution. It doesn't even have a mean. Fortunately such a thing also never seems to show up in real life.

[Note: the open interval (0,1) doesn't include zero.]
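Here is a quick Perl sketch of that misbehavior (the draw count, the checkpoints, and the tiny guard against rand() returning an exact 0 are my choices): the running mean of ##X/Y## keeps getting yanked around by occasional huge values instead of settling down.
Running Mean of X/Y:
use strict;
use warnings;

my $sum = 0;
foreach my $i (1..1000000) {
   my $y = rand() || 1e-12;   # rand() is in [0,1); guard against an exact 0
   $sum += rand()/$y;         # one draw of X/Y
   printf "running mean after %7d draws: %g\n", $i, $sum/$i
      if $i == 10 or $i == 100 or $i == 1000 or $i == 10000 or $i == 100000 or $i == 1000000;
}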
 
  • #10
Agent Smith said:
Are you talking about ##\sigma_{\overline x} = \frac{\sigma}{\sqrt n}## and ##\sigma_{\hat p} = \sqrt{\frac{p(1-p)}{n}}##? Here, ##\sigma_{\overline x}## and ##\sigma_{\hat p}## are the standard deviations of the sampling distributions of the sample means and the sample proportions, oui?
Right. Same thing.
 
  • #11
Hornbein said:
It doesn't even have a mean
The CLT requires the distribution to have both a mean and a variance. And I was also just choosing among the distributions already programmed in Mathematica. Anyway, Bernoulli is sufficiently non-normal to make the point.
 
  • Like
Likes Hornbein
  • #13
Dale said:
No, a larger sample size N makes the distribution of the sample mean better approximate a normal distribution. The population’s distribution was a Bernoulli distribution.
So if I take a large sample what happens is the distribution of samples of that size approaches a normal distribution. Gotcha!
 
  • #14
Agent Smith said:
So if I take a large sample what happens is the distribution of samples of that size approaches a normal distribution. Gotcha!
Yes. Specifically the distribution of the sample means.
 
  • Like
Likes Hornbein and Agent Smith
  • #15
Agent Smith said:
So if I take a large sample what happens is the distribution of samples of that size approaches a normal distribution. Gotcha!
You need to be more careful. The distribution of the sample total (the number of successes) is binomial. The distribution of the average of the sample approaches a normal distribution.
 
  • Like
Likes Hornbein and Agent Smith
  • #16
FactChecker said:
You need to be more careful. The distribution of the sample total (the number of successes) is binomial. The distribution of the average of the sample approaches a normal distribution.
I learned that when making inferences from samples, the following conditions need to be met:
1. Random condition (The sampling must be random)
2. Independence condition (Selection with replacement or 10% rule)
3. Normal condition (The sampling distribution of the sample means has to be normal).

I have a fair idea of 1 and 2, but I'm a bit confused about 3. Do the requirements that ...
1. The sample size >= 30
2. The parent population has a normal distribution
3. The number of successes >= 10 and the number of failures >= 10
4. The sample has to have a "uniform distribution" with no outliers.
give us some kind of assurance that the sampling distribution of the sample means is normal?

@Dale and others ☝️
 
  • Like
Likes FactChecker
  • #17
Maybe I should have said that "You need to be more careful about your statement."
Those are all good "rules of thumb". But where you say "is normal", you should really say "is close enough to normal for your use".
 
  • Like
Likes Agent Smith
  • #18
Dale said:
According to the CVT the sampling distribution of ##\bar X## in the limit of ##N\rightarrow \infty## approaches ##\mathcal N (\mu, \sigma/\sqrt{N})##.
Do you mean CLT (instead of CVT), for Central Limit Theorem?

Regarding "in the limit of ##N\rightarrow \infty## approaches ##\mathcal N (\mu, \sigma/\sqrt{N})##": how can ##N \to \infty##? The maximum the sample size can get is the population size (finite)? 🤔

And is ##\mathcal N (\mu, \sigma/\sqrt{N})## to be interpreted as a normal distribution with mean = ##\mu## (the population mean) and standard deviation = ##\sigma_{\overline x} = \sigma/\sqrt{N}##, where ##\sigma## is the population standard deviation and ##N## = the sample size? I guess we're referring to the sampling distribution of the sample means.
 
  • #19
Agent Smith said:
Do you mean CLT (instead of CVT), for Central Limit Theorem?
Oops, yes.

Agent Smith said:
Regarding "in the limit of ##N\rightarrow \infty## approaches ##\mathcal N (\mu, \sigma/\sqrt{N})##": how can ##N \to \infty##? The maximum the sample size can get is the population size (finite)? 🤔
Sample from an infinitely large population.

Agent Smith said:
And is ##\mathcal N (\mu, \sigma/\sqrt{N})## to be interpreted as a normal distribution with mean = ##\mu## (the population mean) and standard deviation = ##\sigma_{\overline x} = \sigma/\sqrt{N}##, where ##\sigma## is the population standard deviation and ##N## = the sample size? I guess we're referring to the sampling distribution of the sample means.
Yes.
 
  • Like
Likes Agent Smith
  • #20
I would like to ask a question about The Law of Large Numbers.

A video I watched states that, given a population, if we take ##n## samples (size not specified) and compute the mean ##\overline x## of each sample, then ##\displaystyle \lim_{n \to \infty} \frac{1}{n} \sum \overline x = \mu##, where ##\mu## is the population mean. Is this correct?
 
  • #21
Dale said:
Sample from an infinitely large population.
🤔 Don't mean to offend but someone laughed at me once because I "found" a mathematical pattern in the first 100 numbers (1 to 100). Then someone else asked, making the same point I suppose, "what is ##\frac{100}{\infty}##?"
 
  • #22
Agent Smith said:
🤔 Don't mean to offend but someone laughed at me once because I "found" a mathematical pattern in the first 100 numbers (1 to 100). Then someone else asked, making the same point I suppose, "what is ##\frac{100}{\infty}##?"
No offense taken. It isn’t my idea, it is the standard meaning. It is also problematic for related reasons.

Agent Smith said:
I would like to ask a question about The Law of Large Numbers.

A video I watched states that, given a population and we take n samples (size not specified) and compute the mean of each sample ##\overline x##, what you'll find is that ##\displaystyle \lim_{n \to \infty} \frac{1}{n} \sum \overline x = \mu##, where ##\mu## is the population mean. Is this correct?
One of the biggest problems is that the limit that you wrote down is not a limit in the usual calculus sense. In calculus, a limit means that for any small ##\epsilon## there exists some ##N## such that for all ##n>N## the difference between the sequence and the limit is less than ##\epsilon##. But with random variables there is no such guarantee. The most you can say is that you can find an ##N## such that the difference between the sequence and the limit is less than ##\epsilon## with any desired probability.

So it is indeed a little odd. Correctly written limits about probabilities give probabilities, and those limits are usually expressed in terms of infinite populations or infinite sets of samples from a finite population.
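To make that concrete, here is a Perl sketch (a Bernoulli(0.5) population, with the tolerance ##\epsilon = 0.05## and the trial counts as arbitrary choices) estimating ##P(|\bar X_n - \mu| > \epsilon)## for growing ##n##: the probability shrinks toward zero, but no finite ##n## makes it exactly zero.
Convergence in Probability:
use strict;
use warnings;

my $eps    = 0.05;    # arbitrary tolerance
my $trials = 20000;   # Monte Carlo repetitions per sample size (arbitrary)
foreach my $n (10, 100, 1000) {
   my $exceed = 0;
   foreach my $t (1..$trials) {
      my $heads = 0;
      foreach my $i (1..$n) { $heads++ if rand() < 0.5; }
      $exceed++ if abs($heads/$n - 0.5) > $eps;   # sample mean missed mu by more than eps
   }
   printf "n = %4d: estimated P(|mean - 0.5| > %.2f) = %.4f\n", $n, $eps, $exceed/$trials;
}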
 
  • #23
Should I have written ##\displaystyle \lim_{n \to \infty} \frac{1}{n} \sum_{i = 1} ^n \overline x_i = \mu##? In words, the mean of the sample means approaches the true mean of the population as the number of samples approaches infinity. Not sure if that's the actual statement or not. How would you write down the correct expression? @Dale

What about my question regarding the sample size? Why do we assume the population is infinite?
 
  • #24
So, technically the correct limit is a limit of a probability distribution: $$\lim_{n\rightarrow \infty} P\left( \sqrt{n}(\bar X_n - \mu)\le {z} \right) = \Phi \left( \frac{z}{\sigma} \right)$$ where ##\bar X_n## is the sample mean of ##n## samples from a distribution with mean ##\mu## and variance ##\sigma^2##. In other words, regardless of the distribution of ##X##, as ##n## approaches infinity the distribution of ##\sqrt{n}(\bar X_n-\mu)## approaches ##\mathcal{N}(0,\sigma^2)##.

The technical problem with what you wrote is that it essentially claims that there is a guarantee that for large ##n## your sample mean ##\bar X_n## will be arbitrarily close to the population mean ##\mu##. But this is probability, so there are no guarantees. All that you can say is that arbitrarily large deviations of ##\bar X_n## from ##\mu## are arbitrarily unlikely. That is why it has to be expressed in terms of probability distributions rather than numbers.
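That limiting statement can also be checked numerically. Here is a Perl sketch (again a Bernoulli(0.5) population, with ##z = 0.5## chosen arbitrarily so that ##z/\sigma = 1## and ##\Phi(1) \approx 0.8413##, a standard normal table value):
CLT Distributional Check:
use strict;
use warnings;

# For Bernoulli(0.5): mu = 0.5 and sigma = 0.5. With z = 0.5 the claimed
# limit of P( sqrt(n)(Xbar - mu) <= z ) is Phi(z/sigma) = Phi(1), about 0.8413.
my ($z, $mu, $trials) = (0.5, 0.5, 50000);
foreach my $n (10, 100, 1000) {
   my $below = 0;
   foreach my $t (1..$trials) {
      my $heads = 0;
      foreach my $i (1..$n) { $heads++ if rand() < 0.5; }
      $below++ if sqrt($n)*($heads/$n - $mu) <= $z;   # the standardized statistic
   }
   printf "n = %4d: P( sqrt(n)(Xbar - mu) <= %.1f ) ~ %.4f (limit 0.8413)\n",
      $n, $z, $below/$trials;
}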

Agent Smith said:
Why do we assume the population is infinite?
Because we are taking a limit as ##n## goes to ##\infty##.
 
  • #25
@Dale where can I do a Monte Carlo simulation? Do you have a place I can go to? I remember reading about The Law of Large Numbers (in probability) as the empirical probability ##\to## the theoretical probability as the number of trials ##\to \infty##. I once did a simulation with coin flips and the number of heads did approach 50% of the total flips as I increased the number of flips/trials. As you can see I'm a bit off the mark as to the concept's meaning.
 
  • #26
Dale said:
Because we are taking a limit as n goes to ∞.
We can take an infinite number of samples from a finite population? 🤔
 
  • #27
I use Mathematica for most of my Monte Carlo simulations, but that is just because I have been using it for about 30 years.

You can do Monte Carlo simulations in Matlab, Python, R, and probably most other languages. I would probably recommend just using whatever language is already your favorite.
 
  • Like
Likes FactChecker
  • #28
Agent Smith said:
We can take an infinite number of samples from a finite population? 🤔
What we can or cannot actually do is not particularly relevant. This is just math. We often hypothesize infinite data sets or populations when doing so makes a problem mathematically tractable or convenient. We can then take our idealized math and see how well it matches real systems where things are not infinite or independent.
 
  • #29
Agent Smith said:
We can take an infinite number of samples from a finite population? 🤔
A coin has only two sides, but you can flip a single coin a million times. A lot depends on random selection "with replacement" versus "without replacement".
 
  • #30
Agent Smith said:
@Dale where can I do a Monte Carlo simulation?
Any computer language that has a random number generator function (rand, randu, etc.) can be used to do Monte Carlo simulations. Some languages have functions that allow you to generate a selection of well-known distributions.
Here is Perl code for a simple Monte Carlo simulation of coin flips:
Monte Carlo Simulation of Coin Flips:
use strict;
use warnings;

my $numberOfFlips = 1000000;     # set the number of coin flips to do
my ($numberOfHeads, $numberOfTails) = (0, 0);
print "Simulating $numberOfFlips coin flips\n";
foreach my $i (1..$numberOfFlips){ # do each coin flip
   my $flip = rand();            # generate a random number from a uniform distribution in [0,1)
   if( $flip > 0.5 ){            # numeric > (not the string comparison gt); count it as a head
      $numberOfHeads++;
   }else{
      $numberOfTails++;          # if the random number is <= 0.5, count it as a tail
   }
}
# print results
my $fractionHeads = $numberOfHeads/$numberOfFlips;
my $fractionTails = $numberOfTails/$numberOfFlips;
print "The fraction of heads is $fractionHeads\n";
print "The fraction of tails is $fractionTails\n";
When I ran it, it produced this result:
Result:
Simulating 1000000 coin flips
The fraction of heads is 0.499483
The fraction of tails is 0.500517
 
  • Like
Likes Dale
  • #31
Agent Smith said:
@Dale where can I do a Monte Carlo simulation?
I should tell you that there are computer languages and systems that are designed specifically to do Monte Carlo simulations. If you are going to do large simulations, you should look into those.

I would also say that simple problems with analytical solutions are very easy to modify so that the analytical solutions become a nightmare. In those cases, Monte Carlo estimates are often much easier to get and to be confident of. Even if analytical solutions are still possible, the Monte Carlo estimate can provide a good "sanity check" for the analysis.
For instance, suppose in the coin toss example we added the requirement that a Head will not count if 3 of the prior 5 tosses were Heads. That would be trivial to add to the Monte Carlo simulation, but the analytical solution would be more difficult. Although this example seems artificial, the real world often gets complicated like that.
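For concreteness, here is one way that modified rule might look in Perl (reading "3 of the prior 5 tosses" as "at least 3" is my assumption):
Modified Coin Toss Rule:
use strict;
use warnings;

my $numberOfFlips = 1000000;
my (@last5, $counted);
$counted = 0;
foreach my $i (1..$numberOfFlips){
   my $head = rand() > 0.5 ? 1 : 0;       # the raw toss
   my $recent = 0;
   $recent += $_ foreach @last5;          # Heads among the previous five raw tosses
   $counted++ if $head and $recent < 3;   # a Head counts only if the rule allows it
   push @last5, $head;
   shift @last5 if @last5 > 5;            # keep only the five most recent tosses
}
printf "Fraction of counted heads: %.6f\n", $counted/$numberOfFlips;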
 
Last edited:
  • Like
Likes Dale
  • #32
Agent Smith said:
Should I have written ##\displaystyle \lim_{n \to \infty} \frac{1}{n} \sum_{i = 1} ^n \overline x_i = \mu##? In words, the mean of the sample means approaches the true mean of the population as the number of samples approaches infinity. Not sure if that's the actual statement or not. How would you write down the correct expression? @Dale

What about my question regarding the sample size? Why do we assume the population is infinite?
We say that the sample mean converges in probability to the population mean: that is, given any ##\epsilon > 0##, we have ##\displaystyle \lim_{n \to \infty} P(|\bar X_n - \mu| > \epsilon) = 0## (or, equivalently, ##\displaystyle \lim_{n \to \infty} P(|\bar X_n - \mu| \le \epsilon) = 1##).


It's only when the sequence is standardized as done in other posts that the normal distribution comes into play.
 
  • Like
Likes Agent Smith