Class Project: Comparing 2 Population Means & Std Devs

  • #1
sf0
For our class project we are to collect some data from two populations, summarize it in table and graphical form, and conduct at least two different hypothesis tests that use different procedures and construct two different confidence intervals. I wanted to summarize my plan here and ask for feedback and suggestions.

I will compare temperature data from 60 random counties in the USA for the years 1964 and 2014. My two populations are the 1964 average temperatures for the 60 random counties, and the 2014 average temperatures for the same 60 random counties. I wish to compare the population means and population standard deviations. Since the data are matched, these are dependent samples. The textbook says that when planning an observational study, dependent samples with paired data are generally better. However, in the section on comparing standard deviations, it says the two populations must be "independent". So my main question is: does this mean I should not be using matched pairs? Any other feedback or suggestions are also appreciated.
 
  • #2
Hi sf0, (Wave)

Welcome to MHB!

This is a good question. When you are doing a paired sample and testing for the mean difference, you will only have one standard deviation to consider. Each pair will be combined into a single value through subtraction, so from your 120 observations you'll have 60 differences, one per county. This is a good test to run for the mean difference, but as for comparing variances, I'm not so sure. Usually we use the F-test to do that, and as you said this requires independent populations.

I think doing it paired and unpaired is smart. When you run the unpaired test you can also check for the equality of variance through the F-test, with some necessary assumptions.
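
To make the two calculations concrete, here is a minimal Python sketch using only the standard library. The temperatures are made-up numbers, not the poster's data, and `paired_t` and `f_ratio` are hypothetical helper names. Note that the variance ratio is computed on matched data purely to show the arithmetic; as discussed above, treating it as an F statistic assumes independent samples from roughly normal populations.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for the mean paired difference."""
    diffs = [b - a for b, a in zip(before, after)]  # one difference per pair
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

def f_ratio(sample_a, sample_b):
    """Return the larger sample variance divided by the smaller one.

    Only meaningful as an F statistic if the two samples are independent
    and come from roughly normal populations.
    """
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    return max(var_a, var_b) / min(var_a, var_b)

# Made-up annual mean temperatures (degrees F) for six matched locations.
temps_1964 = [50.1, 61.3, 45.2, 70.4, 55.0, 48.7]
temps_2014 = [51.0, 62.5, 45.9, 71.8, 55.2, 50.1]

t_stat, df = paired_t(temps_1964, temps_2014)
print(f"paired t = {t_stat:.3f} with {df} degrees of freedom")
print(f"variance ratio = {f_ratio(temps_1964, temps_2014):.3f}")
```

The t statistic would then be compared against a t table (or software) with the reported degrees of freedom to get a p-value.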
 
  • #3
Thanks for your help!

I had another question, about interpretation. If global warming makes weather more extreme, then we would expect the variance for the 2014 data to be higher, right? But if global warming is more pronounced the closer you get to the poles, then we would expect the variance for the 2014 data to be lower, right? So which is true? Or is it both?
 
  • #4
I'd have to look at the data and think about it, but let's say that the temperatures in 2014 are all 3 degrees higher than in 1964. Remember that $\text{Var}(x+a) = \text{Var}(x)$, so this wouldn't change the variance at all. However, if temperatures increase everywhere by 5%, then $\text{Var}(1.05x) = 1.1025\,\text{Var}(x)$, since $\text{Var}(ax) = a^2\,\text{Var}(x)$. If global warming affects the data unevenly, the variance can go either way: if the values move apart (warm places warming the most, say), the spread and the variance increase, but if the coldest places warm the most, the values move closer together and the variance decreases. To show that warming is correlated with location, though, you'd want to do some regression analysis. So basically "it depends". I'd begin by plotting the data for the two years and looking for differences to investigate.
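
A quick numerical check of these variance rules, as a sketch with invented temperatures (the list `temps` and both sets of "uneven" changes are assumptions chosen only for illustration):

```python
import math
import statistics

temps = [30.0, 45.0, 50.0, 62.0, 75.0]  # hypothetical 1964 values, degrees F

base = statistics.pvariance(temps)
shifted = statistics.pvariance([t + 3 for t in temps])    # every value 3 degrees higher
scaled = statistics.pvariance([1.05 * t for t in temps])  # every value 5% higher

# Uneven change, case 1: the coldest locations warm the most.
cold_warms_most = [5.0, 4.0, 3.0, 2.0, 1.0]
narrowed = statistics.pvariance([t + d for t, d in zip(temps, cold_warms_most)])

# Uneven change, case 2: cold locations cool and warm locations warm.
extremes_grow = [-2.0, -1.0, 0.0, 1.0, 2.0]
widened = statistics.pvariance([t + d for t, d in zip(temps, extremes_grow)])

print(math.isclose(shifted, base))            # adding a constant leaves variance unchanged
print(math.isclose(scaled, 1.1025 * base))    # scaling by 1.05 multiplies variance by 1.05**2
print(narrowed < base < widened)              # uneven change can go either way
```

All three checks print `True` for these numbers, which matches the "it depends" conclusion: the direction of the variance change depends on how the change relates to the starting temperature.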
 
  • #5
Thanks again! I have another question ...

What is the difference between comparing two dependent samples of the same 60 counties from 1964 and 2014, and comparing two independent samples of 30 counties from 1964 and 30 different counties from 2014? Are they two ways of measuring the same thing? Or do they measure two different things?
 
  • #6
We often match pairs when doing a clinical trial to get a "before and after" snapshot of each individual. Doing this can greatly reduce the confounding of the apparent results with some other characteristic of a particular patient. In your example, if a cluster of counties started out with really hot or really cold temps, that could skew the data when using unmatched groups. With two random samples, though, once the sample size gets large enough, the random effects of extreme cases should start to decrease. So basically, with a large enough sample, the two designs should approach similar conclusions. I'm very curious whether you reach the same result with 60 counties in each group. :)
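
One way to see this, as a sketch: write $X$ for a county's 1964 temperature and $Y$ for its 2014 temperature, and assume (notation introduced here, not in the thread) that they have standard deviations $\sigma_X$, $\sigma_Y$ and correlation $\rho$. Both designs estimate the same quantity, $\mu_X - \mu_Y$, but with different precision:

$$\text{Var}(\bar{X} - \bar{Y})_{\text{paired}} = \frac{\sigma_X^2 + \sigma_Y^2 - 2\rho\,\sigma_X\sigma_Y}{n}, \qquad \text{Var}(\bar{X} - \bar{Y})_{\text{independent}} = \frac{\sigma_X^2}{n_1} + \frac{\sigma_Y^2}{n_2}$$

When the same location tends to be warm (or cold) in both years, $\rho$ is positive and the paired estimate has the smaller variance for the same number of locations.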
 
  • #7
I have written my presentation, is anyone willing to read it for accuracy? Is it possible to attach a Word document? Or I can email it.
 
  • #8
I think you can attach a Word document directly here. I'll be glad to look it over. Make sure to remove any identifying info about yourself please. :)
 
  • #9
Sorry, the file size is too big. Can I email it to you? Or I'll just copy-paste the text here.

I decided to learn statistics because of my interest in Climate Change. In a 2010 BBC interview, a prominent climate scientist admitted that the warming trend from 1995 to “the present” was “not statistically significant”. This led to headlines saying that warming since 1995 was “insignificant”. Now that you have taken this class you should know that in statistics, the term “statistically significant” has a special meaning, and “not statistically significant” does not mean “insignificant”. It means the evidence does not meet the 95% confidence level. The headlines were false. Furthermore, collecting additional evidence may increase our confidence level, and in fact one year later the trend reached the 95% confidence level, and even the correctly stated claim was no longer true.

For this project I wanted to collect some temperature data and see what I could find. I wanted a sample that represented the whole US. There are 3144 counties and county equivalents in the US. My initial plan was to randomly pick counties from this list and collect the average temperature for each in the years 1964 and 2014. There are two things you should notice immediately about this plan. First, it only includes the US, which will not tell us much about the rest of the world. Second, it only compares two years, and to identify a long term trend we would want to compare data over a period of many years.

It turns out that you can’t look up average annual temperature by county. Each county has multiple weather stations, and furthermore weather stations come and go. Because I wanted to compare the data for the same location from 1964 and 2014, I needed dependent samples. So I ended up looking for one weather station in each county that was present in both 1964 and 2014. This is a form of stratified sampling.

Here are the locations:

[slide]

I want to point out that there are fewer dots in the Western US. This is because counties are larger in the West. Since I only picked one station in each county, the stations are spread out more in the West. I will come back to this point later.

Now that I have my data, the first thing I want to do is use the differences from the two dependent samples to test a claim about the mean of the population of all such differences.

[slide]

So the p-value is far too high to reject the null hypothesis. Actually, the sample mean for population 1 (average temperature from 1964) is higher than the sample mean for population 2 (average temperature from 2014). So I should be testing for the opposite.

[slide]

So this is a surprise. It is backwards. Why is that?

Remember that I have fewer dots in the Western US. Maybe this is affecting the data. Also remember that I am only looking at 1964 and 2014. Maybe there is something special about one of those years.

[slide]

Well it turns out there is something special about 2014. In 2014 the contiguous United States experienced extremes of both hot and cold. The West had record heat, while the states in blue were exceptionally cold. Since I have few dots in the West, most of my data is from counties in areas that were exceptionally cold.

This suggests that whether the average temperature at a location went up or down depends on whether it is in the West or not. The formal way to test for independence is a chi-square test on a contingency table.
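
As a rough sketch of that calculation, assuming the test meant here is the usual chi-square test of independence: the counts below are invented for illustration (they are not the counts on the slide), and `chi_square_statistic` is a hypothetical helper name.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if region and direction of change were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Invented counts: rows are West / Rest, columns are warmer / cooler in 2014.
observed = [[12, 3],
            [15, 30]]

CRITICAL_VALUE = 3.841  # chi-square critical value for 1 degree of freedom at alpha = 0.05

stat = chi_square_statistic(observed)
print(f"chi-square = {stat:.3f}")
print("reject independence" if stat > CRITICAL_VALUE else "fail to reject independence")
```

A 2×2 table has (2 − 1)(2 − 1) = 1 degree of freedom, which is why the statistic is compared with 3.841.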

[slide]

Now I know that I should be looking at the Western states separately from the rest of the US, and I need to collect more data from the West.

I selected some counties in the West, and this time I included all the weather stations in them that were present in both 1964 and 2014, although I threw out some that were too close together. I split my data into Western states and the rest of the US. Here are the locations:

[slide]

Now I test the claim that, in the West, average temperatures from 1964 are less than average temperatures from 2014:

[slide]

We have 95% confidence that the true value of the mean of the differences in temperature is between −3.319 and −1.758 degrees F.

And I test the claim that, in the Rest, average temperatures from 1964 are greater than average temperatures from 2014:

[slide]

We have 95% confidence that the true value of the mean of the differences in temperature is between 1.403 and 3.418 degrees F.
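
For anyone who wants to reproduce this kind of interval, here is a minimal sketch of the calculation. The differences are invented (ten hypothetical stations, not my data), and the critical value is looked up in a t table rather than computed; `paired_mean_ci` is a hypothetical helper name.

```python
import math
import statistics

def paired_mean_ci(diffs, t_critical):
    """Confidence interval for the mean of paired differences."""
    n = len(diffs)
    margin = t_critical * statistics.stdev(diffs) / math.sqrt(n)
    center = statistics.mean(diffs)
    return center - margin, center + margin

# Invented differences (1964 minus 2014, degrees F) for ten stations.
diffs = [-2.1, -3.0, -1.8, -2.6, -3.4, -2.2, -1.5, -2.9, -2.4, -3.1]

T_CRITICAL_DF9 = 2.262  # two-sided 95% critical value for 9 degrees of freedom (t table)

low, high = paired_mean_ci(diffs, T_CRITICAL_DF9)
print(f"95% CI for the mean difference: ({low:.3f}, {high:.3f})")
```

If the whole interval lies below zero, as it does for these invented numbers, that is consistent with 1964 being cooler than 2014 at those stations.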

It is claimed that Climate Change will lead to more extreme weather, higher highs and lower lows. The map of extreme temperatures seems to show that. We measure the relative spread between highs and lows with the standard deviation. Here are the histograms:

[slide]

Actually the standard deviation for the whole country did not change much between 1964 and 2014. Why is that? If you take a closer look at the data, you will see that the mean temperature of the Western states is lower than the mean temperature of the whole country, and the mean temperature of the rest of the country is higher than the mean temperature of the whole country. Since the Western states got warmer and the rest of the country got colder, the extremes canceled out. But if you look at the standard deviation within the Western states and the standard deviation within the rest of the country separately, you can see that they both went up.
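
This cancellation can be written down with the usual decomposition of a variance into within-group and between-group parts. As a sketch, let $w_W$ and $w_R$ be the fractions of stations in the West and the Rest (so $w_W + w_R = 1$), with region means $\mu_W$, $\mu_R$ and region variances $\sigma_W^2$, $\sigma_R^2$; these symbols are introduced here for illustration. Then the variance of the whole country is

$$\sigma^2 = \underbrace{w_W\,\sigma_W^2 + w_R\,\sigma_R^2}_{\text{within regions}} + \underbrace{w_W(\mu_W - \mu)^2 + w_R(\mu_R - \mu)^2}_{\text{between regions}}, \qquad \mu = w_W\,\mu_W + w_R\,\mu_R$$

In my data the within-region part went up, while the between-region part went down because the two region means moved toward each other, so the total changed little.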

It is also claimed that Climate Change is warming the Arctic faster than anywhere else. I interpret this to mean that colder regions will warm faster. So we can see that both claims are true: within smaller regions there will be more extremes, but they will average out over larger regions.

I wanted to do more precise analysis of the variance of the data, but the tools we have require normal distributions, and you can see in the histograms the data is far from normal.

I enjoyed this project because it let me discover something I didn’t know before. And now that I have taken this class I have some new tools for checking the facts behind the headlines.
 
  • #10
Can you maybe upload this to Dropbox or Google Drive? It would be easier to read as a PowerPoint than in this formatting. I appreciate the effort though. Let me know!
 
  • #11
Does this work?

https://drive.google.com/file/d/0B6RgyyDT3THOeFJkNkVIQklTbTg/view?usp=sharing
 
  • #12
Yes!

OK, I briefly read it over and my first impression is that it's very good. I think your thought process from start to finish is very coherent and the remarks you made are spot on. You started off with the correct alternative hypothesis ($\mu_1-\mu_2<0$) and then nicely commented on how the data actually point the opposite way. I really like how you accounted for the way sampling in the West affected the overall data and then redid the analysis. (Yes)

The only thing I might include is a small screenshot of your data. I like to briefly show a table of what it looks like in Excel or R, just to demonstrate I'm actually working with raw data and not just copying some other results.

Is this for a high school or college course? Is it a stats class or another sort of class just out of curiosity?
 
  • #13
This is a course at the local community college, it is a stats class. BTW, when I was in high school we didn't have stats classes, when did this start?
 
  • #14
It's usually AP Stats I think if it's offered in high school, so it's still not part of the core curriculum. I was just curious. The way you wrote the report definitely indicates a college level. :) Again, I think it's well done.
 

FAQ: Class Project: Comparing 2 Population Means & Std Devs

What is the purpose of comparing two population means and standard deviations in a class project?

The purpose of this class project is to analyze and compare two populations to determine if there is a significant difference between their means and standard deviations. This allows researchers to make inferences about the populations and draw conclusions about any potential relationships between them.

How do you calculate the mean and standard deviation of a population?

The mean of a population is calculated by adding all the values in the population and dividing by the total number of values. The standard deviation is calculated by finding the difference between each value and the mean, squaring those differences, finding the average of those squared differences, and then taking the square root of that average.
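
A short sketch of these calculations with Python's standard `statistics` module, using a small made-up data set. It also shows the sample standard deviation, which divides by n − 1 instead of n and is the one normally used when the data are a sample rather than the whole population.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mean = statistics.mean(values)
population_sd = statistics.pstdev(values)  # divides the squared deviations by n
sample_sd = statistics.stdev(values)       # divides the squared deviations by n - 1

print(f"mean = {mean}")
print(f"population standard deviation = {population_sd:.3f}")
print(f"sample standard deviation = {sample_sd:.3f}")
```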

What is the difference between a sample and a population?

A population refers to the entire group or set of individuals or objects that are of interest in a study. A sample, on the other hand, is a smaller subset of that population that is selected for analysis. In statistical analysis, samples are used to make inferences about the larger population.

How do you determine if the difference in means and standard deviations between two populations is statistically significant?

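To determine whether the difference in means between two populations is statistically significant, you can perform a t-test (a paired test if the samples are matched, a two-sample test if they are independent). Standard deviations are compared with a separate procedure such as the F-test, which requires independent samples from normally distributed populations. Each test produces a p-value: the probability of obtaining results at least as extreme as those observed if there were really no difference. If the p-value is below a chosen threshold, usually 0.05 or 0.01, the difference is considered statistically significant.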

What are some potential limitations of comparing two population means and standard deviations?

One potential limitation is that the samples used may not be representative of the entire population. This could lead to inaccurate conclusions about the populations. Additionally, the t-test assumes that the populations have a normal distribution, which may not always be the case. Finally, there may be other factors at play that could affect the means and standard deviations, so it is important to consider other variables and potential confounding factors before drawing conclusions.
