Why Is Sample Variance Calculated with n-1 Instead of n?

In summary, a book explains that the sample variance is divided by n-1 instead of n because the variance computed from a sample tends to be lower than the population variance, so dividing by n-1 makes the estimate larger. However, the t-distribution has fat tails for small n and thinner tails for larger n, so it seems that small n goes with a larger variance.
  • #1
leslieg
I am new to statistics. I came across the sample variance calculation in a book, and it explains that the denominator is n-1 instead of n because the variance in samples is likely to be lower than the population variance, so we divide by n-1 to make the variance larger.

However, when I studied the t-distribution, I saw that with small n the distribution has fat tails, and with larger n the tails become thinner. So it seems that with small n it has a larger variance. If I treat small n as the sampling case above, the two statements about the variance seem to contradict each other (the first says it would be smaller and the second says it would be larger). Could someone help me with this?

Thanks.
 
  • #2
leslieg said:
Could someone help me with this?

The first thing to do is to get your terminology straightened out. When you talk about "the variance" you aren't making any distinction between the variance of a population and estimates of that variance computed from a sample. And you are making statements about different populations without saying what they are.

One of the reasons that statistics gets conceptually complicated is that the typical scenario involves at least two populations. The first population is usually something simple like the population of people's weights. This population will usually have a distribution (such as a lognormal distribution) that is defined by a set of parameters (such as the mean and variance).

In a problem where we are attempting to estimate these parameters, we usually do a computation based on the values of N independent samples from the first population. The result of this computation is a "statistic". Since the sample values are random, a "statistic" is a random variable. This contradicts the layman's idea that a "statistic" is a single numerical value. It is the population parameters, such as the mean weight of the population, that can be thought of as single numerical values. One may also think of the sample mean from one particular sample as a single numerical value. But a "statistic" is a random variable. The statistic has its own population of possible values. This population has a probability distribution that is usually defined by its own set of parameters (mean, variance, etc.). So we have a second population involved.
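A minimal sketch of the "statistic is a random variable" point, assuming a made-up lognormal population of weights (the parameters 4.3 and 0.2, the seed, and the function name are illustrative choices, not from this thread). Each new sample of N values gives a different value of the sample mean, and those values form the second population:

```python
import random
import statistics

def sample_means(num_samples, n, seed=0):
    """Draw `num_samples` independent samples of size n from a hypothetical
    lognormal population and return the sample mean of each one."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.lognormvariate(4.3, 0.2) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

means = sample_means(num_samples=5, n=10)
print(means)                       # five different values of the same statistic
print(statistics.variance(means))  # spread of the statistic's own population
```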

Try to express what is bothering you using the correct terminology and see if there is really any contradiction involved.
 
  • #3
When one calculates the mean (mathematical expectation) of a sample variance, the factor n-1 is needed so that the mean of the sample variance equals the population variance.
 
  • #4
mathman said:
When one calculates the mean (mathematical expectation) of a sample variance, the factor n-1 is needed so that the mean of the sample variance equals the population variance.

This. If the mean is known, you would compute the variance as follows:

[tex]E[\frac{\Sigma(X_i - \mu)^2}{n}] = \frac{\Sigma E[(X_i - \mu)^2]}{n} = \frac{\Sigma \sigma^2}{n} = \frac{n * \sigma^2}{n} = \sigma^2[/tex]

If the mean is unknown, you have to estimate it with the sample mean, x-bar, and estimate the variance using the sample variance, which has a different mean as you will see:

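[tex]E[\Sigma(X_{i} - \bar{x})^2] = E[\Sigma((X_i - \mu) - (\bar{x} - \mu))^2] = E[\Sigma(X_i - \mu)^2 - 2(\bar{x} - \mu)\Sigma(X_i - \mu) + \Sigma(\bar{x} - \mu)^2] = E[\Sigma(X_i - \mu )^2 - \Sigma(\bar{x} - \mu)^2] = \Sigma E[(X_i - \mu )^2] - \Sigma E[(\bar{x} - \mu)^2][/tex]

(The middle step uses [tex]\Sigma(X_i - \mu) = n(\bar{x} - \mu)[/tex], so the cross term equals [tex]-2\Sigma(\bar{x} - \mu)^2[/tex].)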

so,

[tex]E[\Sigma(X_{i} - \bar{x})^2] = n * Var(X_i) - n * Var(\bar{x}) = n * \sigma ^2 - n * \frac{\sigma ^2}{n} = (n-1) \sigma ^2[/tex]

To cancel the factor n-1 and make the estimator unbiased, we use the sample variance with n-1 in the denominator, as you see in your textbook:

[tex]s^2 = \frac{\Sigma (X_i - \bar{x})^2}{n - 1}[/tex]
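The result can be checked exactly on a small example. This is only a sketch under my own assumptions: the population is a fair six-sided die, samples of size n = 3 are drawn with replacement, and the helper names are hypothetical. It enumerates every possible sample and uses exact fractions, so no simulation noise is involved:

```python
from fractions import Fraction
from itertools import product

def expected_sum_sq_dev(values, n):
    """Exact E[sum (X_i - xbar)^2] over all equally likely samples of size n
    drawn with replacement from `values`."""
    total = Fraction(0)
    count = 0
    for sample in product(values, repeat=n):
        xbar = Fraction(sum(sample), n)
        total += sum((x - xbar) ** 2 for x in sample)
        count += 1
    return total / count

def population_variance(values):
    """Variance of the population itself (divide by the population size)."""
    mu = Fraction(sum(values), len(values))
    return sum((v - mu) ** 2 for v in values) / len(values)

die = [1, 2, 3, 4, 5, 6]
n = 3
sigma2 = population_variance(die)    # variance of one die roll
e_ss = expected_sum_sq_dev(die, n)   # expected sum of squared deviations

print(sigma2)            # population variance
print(e_ss / (n - 1))    # mean of the n-1 estimator: equals sigma2
print(e_ss / n)          # mean of the n estimator: smaller than sigma2
```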
 
  • #5
Thanks for all the replies.

The reason I think there is a contradiction is this:

Take the case of the t-distribution:

1.) If the sample size n is small, it has fatter tails and a larger variance. This is like taking a sample out of a large population to estimate the variance.
2.) If the sample size n is very, very large, it has thinner tails and a smaller variance. I think it should be very close to the variance of the population because of the very large sample size.

If I compare case 1 and case 2, case 1 has a larger variance than case 2, which translates to sample size variance is larger than the population variance. This contradicts dividing by n-1 in the denominator when calculating the sample variance.

I think I must be missing some important point here, but I cannot figure out what is wrong with this thinking process.

Thanks
 
  • #6
leslieg,

You are still not being specific about what population you are talking about. If we have a population P1 and we use a Student's t statistic on samples from P1 then (as I mentioned previously) this introduces a second population P2, namely the possible values of the Student's t statistic.

Your statements 1) and 2) do not imply "sample size variance is larger than the population variance". It isn't even clear what "sample size variance" means. It isn't clear what population you are talking about.

You would be correct to say that the variance of the population P2 of values of the Student's t statistic decreases as the sample size increases. The sample size has no effect whatsoever on the variance of the population P1.

The purpose of using n-1 in the quantity S that is part of the Student's t statistic is to make the average value of the variance estimates equal to the actual variance of P1. If you used n instead, you wouldn't change the variance of the population P1. You would change the variance of population P2 as well as change the average value of P2.
 
  • #7
I think the apparent contradiction arises because you're interpreting the Student's t distribution as the distribution of the sample variance, which it isn't. Really it's a way of incorporating the sample variance into the central limit theorem, so
[tex]T=\frac{(\bar{X}_n-\mu)\sqrt{n}}{S_n}[/tex]
should converge to a standard normal random variable as n tends to infinity. If you want to compare something with the sample variance you should look at the Chi-square distribution instead.
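A rough simulation sketch of this, assuming normally distributed data with mean 0 and standard deviation 1 (the sample sizes 6 and 50, the seed, the repetition count, and the function names are arbitrary choices for illustration). The variance of T is larger for small n and approaches 1 as n grows, while the population being sampled never changes:

```python
import math
import random
import statistics

def t_statistic(sample, mu):
    """T = (xbar - mu) * sqrt(n) / S, where S uses n - 1 in its denominator."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    return (xbar - mu) * math.sqrt(n) / s

def variance_of_t(n, reps=20000, seed=1):
    """Estimate the variance of T over many samples of size n from N(0, 1)."""
    rng = random.Random(seed)
    ts = [t_statistic([rng.gauss(0.0, 1.0) for _ in range(n)], 0.0)
          for _ in range(reps)]
    return statistics.pvariance(ts)

# For normal data T has n - 1 degrees of freedom and variance (n-1)/(n-3).
var_small = variance_of_t(6)    # theory: 5/3
var_large = variance_of_t(50)   # theory: 49/47, close to 1
print(var_small, var_large)
```

The estimates fluctuate from run to run, so only the ordering and rough size should be read from the output.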
 
  • #8
In both case 1 and case 2 you would use the sample variance if the mean is unknown and divide by n-1. If the mean is known, you would divide by n. It does not matter what the actual distribution is or how many samples you have, only whether you know the true value of the mean.
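A simulation sketch of the known-mean versus unknown-mean rule, under assumptions of my own: normal data with mean 10 and standard deviation 2 (so the true variance is 4), samples of size 5, and a hypothetical helper name. It averages three estimators over many samples:

```python
import random

def average_estimates(n, mu=10.0, sigma=2.0, reps=50000, seed=2):
    """Average three variance estimators over `reps` samples of size n."""
    rng = random.Random(seed)
    sum_known = sum_xbar_n = sum_xbar_n1 = 0.0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        ss_mu = sum((x - mu) ** 2 for x in xs)      # deviations from true mean
        ss_xbar = sum((x - xbar) ** 2 for x in xs)  # deviations from sample mean
        sum_known += ss_mu / n
        sum_xbar_n += ss_xbar / n
        sum_xbar_n1 += ss_xbar / (n - 1)
    return sum_known / reps, sum_xbar_n / reps, sum_xbar_n1 / reps

known, biased, unbiased = average_estimates(5)
print(known)     # mean known, divide by n: close to 4
print(biased)    # mean estimated, divide by n: close to 4 * 4/5 = 3.2
print(unbiased)  # mean estimated, divide by n - 1: close to 4
```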
 
  • #9
leslieg,

Another way that you are getting tangled up in words is that you are not making a distinction between "changing an expression in a formula" and "changing the number of samples". If we use (n-1) instead of n in a formula for computing a statistic, this does not mean that we changed the number of samples.

If we are precise about the apparent contradiction you raise, I think it can be phrased this way:

In computing the sample variance for a sample of n things from population P1, with a statistic S, we use (n-1) in the denominator for S because using (n) instead would cause the population P2 of the values of S to have a mean less than the variance of P1. However, if we increase the number of samples n then we decrease the variance of the population P2.

I see no contradictions in the above statements.
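Both statements can be seen side by side in a small simulation. This is a sketch only, assuming P1 is a standard normal population (variance 1), with arbitrary sample sizes, seed, and names of my own choosing. The mean of P2 stays near the variance of P1 for every n, while the variance of P2 shrinks as n grows:

```python
import random
import statistics

def s2_population(n, sigma=1.0, reps=20000, seed=3):
    """Simulate P2: many values of the sample variance (n - 1 denominator)."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        values.append(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return values

summary = {}
for n in (5, 20, 80):
    vals = s2_population(n)
    mean_s2 = statistics.fmean(vals)
    summary[n] = (mean_s2, statistics.pvariance(vals))
    # The last column rescales to show what dividing by n would average to.
    print(n, summary[n][0], summary[n][1], mean_s2 * (n - 1) / n)
```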
 

Related to Why Is Sample Variance Calculated with n-1 Instead of n?

What is sample variance?

Sample variance is a statistical measure that quantifies the spread or variability of a set of numerical data. It is calculated by summing the squared differences of each data point from the sample mean and dividing by one less than the number of data points.

Why is sample variance important?

Sample variance is important because it helps to understand the distribution and variability of a data set. It is used to make inferences about the population from which the sample was taken and to compare different data sets.

How do you calculate sample variance?

Sample variance is calculated by taking the sum of the squared differences of each data point from the mean and dividing it by the total number of data points minus one. (Taking the square root of the result gives the sample standard deviation, not the variance.)

What is the formula for sample variance?

The formula for sample variance is:
s^2 = Σ(x - x̄)^2 / (n - 1)
where s^2 is the sample variance, x is each data point, x̄ is the mean of the data set, and n is the total number of data points.
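As a small worked example of the formula (the data values are made up for illustration), the sketch below computes s^2 by hand and compares it with Python's standard library, where `statistics.variance` divides by n - 1 and `statistics.pvariance` divides by n:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values only
n = len(data)
xbar = sum(data) / n                          # sample mean
sum_sq = sum((x - xbar) ** 2 for x in data)   # sum of squared deviations

s2 = sum_sq / (n - 1)                 # sample variance, n - 1 denominator
print(s2)
print(statistics.variance(data))      # same n - 1 formula
print(statistics.pvariance(data))     # population formula, divides by n
```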

What is the difference between sample variance and population variance?

The main difference between sample variance and population variance is that sample variance is calculated using a subset of a larger population, while population variance is calculated using the entire population. Sample variance is used to estimate the population variance. With n - 1 in the denominator its expected value equals the population variance; dividing by n instead would give an estimate that is, on average, slightly too small.
