Interpreting Chi Squared .... backward

In summary: although the observed ("O") distribution looks quite different from the expected ("E") one, the chi-squared test gives a p-value of around 0.96, which means no statistically significant difference between "O" and "E" can be claimed.
  • #1
ezfzx
OK, so, I've forgotten more statistics than my students will ever know, and I'm not too proud to ask for help, because I'm just blanking out on this. I would appreciate it if someone could patiently follow along and let me know what I've got right or wrong please.

My understanding of the chi-sqr is this:
I have two data sets: "observed" (O) and "expected" (E).
In this case, I'm comparing "O" data which is a sub-set of a much larger "E" population.
I'm going to make this assumption, which I will call the "null" hypothesis: If the "O" data was a clean unbiased sub-sample from "E", the distribution in the "O" data set should match the "E" distribution. Yes?

For example, if I'm grabbing a sample of people from the general population, my sample should have the same percentage of each age group, racial group, education level, etc. of the general population. If my sample DOES NOT have this distribution, there may be a bias somewhere.

So, I take the difference between each O & E item pair, square it, divide by the expected (E) value, sum (Σ) up all the results, and we have the chi-squared (χ²) value. Use this and the degrees of freedom (D) to get a p-value.

The alpha level (α) is the traditionally acceptable cutoff point in doctoral research, which we set to 0.05 (or 5%).

We can say this: "The probability of randomly drawing a sample that produces a chi-squared value of χ² or larger, with D degrees of freedom, is the p-value."

We compare the p-value to the α.
IF p-value > α, THEN we cannot claim a statistically significant difference between "O" and "E".
IF p-value < α, THEN we CAN claim a statistically significant difference between "O" and "E".
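(For concreteness, here's the whole recipe as a minimal Python sketch — the counts are hypothetical and scipy is assumed:)
Code:
# Minimal sketch of the chi-squared recipe above -- hypothetical counts,
# scipy assumed. Observed and expected are COUNTS, not percentages.
from scipy.stats import chi2

observed = [22, 17, 31, 30]          # hypothetical bin counts O
expected = [25, 25, 25, 25]          # hypothetical expected counts E

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1              # D = N - 1
p_value = chi2.sf(chi_sq, dof)       # upper-tail probability

alpha = 0.05
if p_value < alpha:
    print("statistically significant difference between O and E")
else:
    print("no statistically significant difference detected")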

So, here's MY "simple" interpretation problem.

If I want to ensure that I'm getting a clean unbiased sub-sample from the general population, I should WANT to see close agreement between "O" and "E".
Does that mean I WANT p-value > α?
Or maybe I'm misusing the chi-square?
Seems to me I do want p-value > α, but I got kinda turned around in my own box, so a confirmation or correction would be GREAT.

The reason I'm questioning myself is that I'm seeing a distribution in "O" that looks fairly different from "E", but I'm getting a p-value around 0.96, like the math's sayin' "Don't worry, dude. It's cool." ... but it doesn't LOOK cool. So how do I explain to someone else that what looks "off" is really "cool" when I'm not convinced myself? My faith in math is shaken!

Please phrase all answers using small words and talk to me like I'm stupid, because I haven't had my coffee yet. :)
 
  • #2
If you have a 96% chance to get a worse agreement from a future unbiased sample, then you can't say: "the distribution from my current sample deviates significantly". And that's really all you have.

Give details if you want to have comments on your
ezfzx said:
what looks "off" is really "cool"
 
  • #3
Well, here's the heart of it. Before I even get into what I'm measuring about a sample of people, I want to see if my sample of people is really a representative sample.
For example, I want to see if a particular teaching style was effective with a particular group. Before I begin looking at the teaching methodology, I want to take a look at the group at the beginning of the class and make sure they represent the larger college population.
If they don't (if my sample's SAT scores are unusually high or low) that may play a role in how the teaching style was received.

So, I've got the demographics on a larger population (100,000+), and I have demographics on my sample (150 or so). I pick a characteristic, like SAT scores or something. I'm hoping the distribution of the SAT scores for my sample is NOT significantly different from that of the larger population.

I do a chi-sqr analysis of the two sets, with the scores from the larger group representing my "expected" values.

My interpretation is that I'm hoping for p-value > α, which is what I got.
In fact, when I look over the numbers, they do look like they are somewhat proportional. But then I get a few that spike up and look very out of place.
For example (not SAT scores):
Sample: "7.9", "2.4", "21.4", "3.2", "7.9", "3.2", "13.5", "1.6"
Population: "3.3", "6.2", "2.7", "4.1", "6.9", "4.6", "8.4", "4.4"
Excel's CHITEST function gives me a p-value of 0.984. I didn't believe it, so I calculated it by hand ... same answer.
Just feels like the value should be lower.

I played with the values to see what it would take to get a p-value below 5%, and the numbers had to get WAY off.
Seems like that is a LOT of latitude for randomness, and there could conceivably BE a bias in there, lost in the noise.

So, am I allowed to say, "the distribution of the sample did not vary significantly from the larger population" ... ?
I reluctantly believe so, but I'd like a validation from a second opinion.

Maybe there is a different statistical tool I should be using?
 
  • #4
not what I meant with
BvU said:
Give details if you want to have comments on your
because (must be my ignorance) I can't reproduce what you did step by step to get this.

I don't particularly care what these data do NOT represent (SAT data, apparently): What do these data represent ? How come sample adds up to 61 and pop to 41 ? Can't compare them or are the other bins spread out so wide they contain 1 or 0 ?

By the way -- the population data don't really show a form of distribution at all. Why is that ?

How do you calculate that p value?

The way I learned it in 1971 was that you have a histogram of observed values ##n_o(i)## and for each bin ##i## in the histogram you have an expected number ##n_e(i)##. The standard deviation in that expected number is ##\sqrt{n_e}##, so you combine bins until each has at least ##n_e \ge 10##.

Then sum over the bins: $$\chi^2 = \sum_i \left ( {n_o(i) - n_e(i)\;\over \sigma(i) }\right )^2$$and look up p in a table (nowadays on a calculator or -- if you know what you are doing :rolleyes: -- with an excel function ).

From the Excel CHISQ.TEST help:
Use of CHISQ.TEST is most appropriate when the ##E_{ij}## values are not too small. Some statisticians suggest that each ##E_{ij}## should be greater than or equal to 5.
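(A minimal Python sketch of that recipe, with hypothetical counts and the ##n_e \ge 10## merging rule as stated:)
Code:
# A sketch of the recipe above (hypothetical counts): merge adjacent
# bins until each expected count is at least 10, then sum the squared
# weighted residuals with sigma(i) = sqrt(n_e(i)).
import numpy as np
from scipy.stats import chi2

def merge_bins(n_o, n_e, min_e=10.0):
    """Combine adjacent bins until every expected count >= min_e."""
    obs, exp = [], []
    acc_o = acc_e = 0.0
    for o, e in zip(n_o, n_e):
        acc_o += o
        acc_e += e
        if acc_e >= min_e:
            obs.append(acc_o)
            exp.append(acc_e)
            acc_o = acc_e = 0.0
    if acc_e > 0 and exp:           # fold any leftover into the last bin
        obs[-1] += acc_o
        exp[-1] += acc_e
    return np.array(obs), np.array(exp)

n_o = [3, 8, 15, 20, 14, 7, 2, 1]   # hypothetical observed histogram
n_e = [4, 9, 14, 18, 15, 6, 3, 1]   # hypothetical expected counts
obs, exp = merge_bins(n_o, n_e)
chi_sq = np.sum(((obs - exp) / np.sqrt(exp)) ** 2)
print(chi_sq, chi2.sf(chi_sq, len(obs) - 1))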
 
  • #5
If the samples are small enough, chi test will tell you that there is no way that you can conclude that they are significantly different. That's all it means.
 
  • #6
( To have people check your implementation of chi-square, it would be best to present a simple numerical example. Edit: Ok, I see you've done that. )

Suppose I'm evaluating the effectiveness of two different sections (e.g. 8 AM and 3 PM) of Psychology 101 by using the mean score of each section of students on a common final exam. It may be correct that students were assigned to the two sections without any bias. However, it is likely that just by chance there will still be differences between the two sections of students. If the mean final exam score of students in the 3 PM section is higher than the mean score of the students in the 8 AM section, we still have the question of whether the differences in the scores can be explained mostly by the differences in the two groups of students. This question arises even if students were randomly assigned to the two sections. So the important consideration isn't how the students are assigned to the two sections. The important consideration is what happened to the two sections of students after assignments were made.

Of course, it would be convenient to do a two step statistical analysis such as:
1) Argue students were assigned to the two sections without any bias
2) Argue the difference in mean final exam scores is (or is not) significant using statistical tests where the null hypothesis is that the students are assigned to sections without bias.

In this type of analysis, step 2 ignores the fact that the process of assigning students has produced two definite sections of students. Even if the method of assigning students was unbiased, there can be important differences in the two sections of students that one particular execution of the method created.

Applying statistics to real world problems is a subjective process. Custom and culture play a role. It may be that in your field, academic papers are commonly written using the above two step analysis.
 
  • #7
Thank you all for your quick and considered replies.

BvU said:
How do you calculate that p value?
For the example data given, using your terminology, assume each value in the list is a value for a bin.
Take the difference between each in turn [no(i) - ne(i)], square it, divide by ne(i), add all these up for the chi squared value. (This was how I was taught to do it.)
While I'm sure there's a more satisfyingly mechanical method, the p-value comes from using Excel's CHIDIST function with the χ² value and n-1 = 7 degrees of freedom. Excel's CHITEST function produces the same value.
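(A sketch of the same steps in Python, for anyone without Excel — chi2.sf plays the role of CHIDIST:)
Code:
# The Excel steps quoted above, redone in Python: CHIDIST(x, dof) is
# the upper-tail chi-squared probability, chi2.sf here. The chi-squared
# value is the one from the percentage table posted further down (#13).
from scipy.stats import chi2

chi_sq = 1.44469             # sum of (O-E)^2 / E over the 8 bins
dof = 7                      # n - 1 with n = 8
print(chi2.sf(chi_sq, dof))  # ~0.984, matching CHITEST/CHIDIST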

I like Serena said:
If the samples are small enough
I don't suppose I could just multiply all the values by 100? Seems like all the ratios would still be the same.
I'm not seeing why that matters. What am I missing?
Should I be using a different test tool?

Stephen Tashi said:
it would be convenient to do a two step statistical analysis
Yes, that's basically what I'm doing. One research question relates to who is in the class. Another relates to the effect of the teaching style. So, it behooves me to make some kind of comment about deviations from the "normal" distribution, regardless of whether or not it affects the results.
 
  • #8
ezfzx said:
I don't suppose I could just multiply all the values by 100? Seems like all the ratios would still be the same.
I'm not seeing why that matters. What am I missing?
Should I be using a different test tool?

Let's compare it with a t-test since a t-test is easier to understand than a chi-square test.
We want to know if 2 samples are different.
Suppose we have the tiny samples {2, 6} (mean 4) and {1,9} (mean 5).
To tell if they are significantly different the difference between the sample means must be significantly larger than the standard error.
But with so few values, the standard error is very large (and the number of degrees of freedom is very low), meaning we have insufficient data to tell that they are significantly different (p=0.85).
Even worse, since both means fall well within the standard error, it appears as if they are really from the same population.
However, with bigger samples the standard error might still fall below 0.5, and with more degrees of freedom we will have a significant difference after all.

So we don't need a different test tool, we need more independent measurements to reduce the noise.
And still, it won't be conclusive.
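(For anyone who wants to reproduce this, here's the same comparison as a sketch with scipy's two-sample t-test:)
Code:
# The tiny-sample t-test described above, with scipy's two-sample
# t-test (equal variances assumed, its default).
from scipy.stats import ttest_ind

t_stat, p = ttest_ind([2, 6], [1, 9])
print(round(p, 2))  # ~0.84: far too little data to call the means different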
 
  • #9
I see. (Well, I'll play with it and then I'll see.)
And I'm not expecting conclusive ... just confident. :)

Unfortunately, I can't get more data at this stage. And the nature of the data is largely nominal, a little bit ordinal, so I got medians and quartiles, rather than means and standard deviation/error.

I just want to know, when I'm looking at something shifted way off to the left of center, whether it's significant or not.
 
  • #10
ezfzx said:
Take the difference between each in turn [no(i) - ne(i)], square it, divide by ne(i), add all these up for the chi squared value. (This was how I was taught to do it.)
Good, we agree on that ! However:
(By now, I must be the one asking for assistance:)

I get a ##\chi^2## of 144, mainly due to the 130 from ## \ {(21.4-2.7)^2\over 2.7}\ ##, so you must be using a different number of ##n## ?

And: why 7 degrees of freedom if there are 8 bins ?

What do these data represent ? How come sample adds up to 61 and pop to 41 ?
Do you perhaps normalize to the 150 and 100000+ ?
 
  • #11
Oh yeah ... forgot to mention all the values are percentages ... so divide by 100.
It's my understanding degrees of freedom = N - 1, so if N = 8 ... ?
Each "bin" is the number of people that answered the survey a certain way.
I'm not expecting a distribution, since the responses fall on a nominal scale.
 
  • #12
ezfzx said:
Oh yeah ... forgot to mention all the values are percentages
Oh, even when I specifically asked... Not good. Not good at all.
ezfzx said:
so divide by 100
meaning that you then get a fraction. But for the standard deviation you want the numbers.

With the recipe in post #7:

For bin 1 you have 3.3 percent of the folks in the population giving the same answer as 12 folks in the sample. In a sample of 150 you therefore expect that answer to be given ##3.3 \times {150\over 100} = 4.95## times with a standard deviation of ##\sqrt {4.95}##.
In the actual sample, you get that specific answer 7.9/100 * 150 = 12 times. The contribution to ##\chi^2## of that bin becomes ##\displaystyle {{(12 - 4.95)^2\over 4.95 } = 10}##. Continuing like this ##\chi^2 = ## 214, which, with a number of degrees of freedom of 8, gives a p-value of zero.

What ##\chi^2## do you find ? How ?
How do you find a p-value of 0.96 ? Please explain in small steps :rolleyes:
 
  • #13
\begin{array}{|c|c|c|c|c|}
\hline O & E & |O-E| & |O-E|^2 & |O-E|^2/E \\
\hline 0.079 & 0.033 & 0.04637 & 0.00215 & 0.06514 \\
\hline 0.024 & 0.062 & 0.03819 & 0.00146 & 0.02352 \\
\hline 0.214 & 0.027 & 0.18729 & 0.03508 & 1.29911 \\
\hline 0.032 & 0.041 & 0.00925 & 0.00009 & 0.00209 \\
\hline 0.079 & 0.069 & 0.01037 & 0.00011 & 0.00156 \\
\hline 0.032 & 0.046 & 0.01425 & 0.00020 & 0.00442 \\
\hline 0.135 & 0.084 & 0.05092 & 0.00259 & 0.03087 \\
\hline 0.016 & 0.044 & 0.02813 & 0.00079 & 0.01798 \\
\hline & & & sum = & 1.44469 \\
\hline
\end{array}

##\chi^2_{obtained}## = 1.44469
deg of freedom = 7
α = 0.05
##\chi^2_{critical}## = 14.0671
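(Those two numbers are easy to verify, e.g. with this sketch assuming scipy:)
Code:
# Checking the critical value quoted above: the inverse CDF of the
# chi-squared distribution at 1 - alpha, with 7 degrees of freedom.
from scipy.stats import chi2

print(chi2.ppf(1 - 0.05, 7))  # 14.0671..., as stated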

My understanding:
If ##\chi^2_{obtained}## < ##\chi^2_{critical}## (which it is), then we fail to reject the null hypothesis, i.e. "We cannot predict O using E because there is no obvious relationship."
But here's where I believe I'm using the wrong tool.
I want to be able to say that the sample is a good representation of the population, i.e. there SHOULD be a relationship.
But even if ##\chi^2_{obtained}## > ##\chi^2_{critical}##, and we REJECT the null hypothesis and assume a relationship, nothing is telling us WHAT the relationship is.
It DOES NOT say that the sample is a good sub-set of the population. It only says the sample varies in some (unknown) distinctive way that could conceivably allow us to anticipate one using the other. For all we know, the correlation is negative.
So, I'm becoming less enchanted with using chi_sqr for this purpose.

So .. maybe the t-test? or just a regular correlation?
did I mention stats isn't my thing? (stop laughing!)
 
  • #14
You divide the results from the population poll by 100000 and then take the square root to use as the standard deviation. Look at bin 3: worst case 2700 of 100000 folks did what 32 out of 150 in the sample did.

The relative error in 2700 is ##\sqrt {2700}/2700 ## = 2%.

convert 2700 ##\pm## 52 out of 100000 to a sample size of 150 and you get 4.05 ##\pm## 0.078. The actual 32 in the sample is therefore 360 sigma away and its contribution to chi square is a whopping 130000 and not the 1.3 you find (a factor 100000 -- recognize it ?)
 
  • #15
Ah, that helps. Good explanation.
OK ... stupid question #42 ... and it's not like I didn't Google this or peruse my various books first ... but ...
Why are we allowed to take the square root of our one and only data point and call it a standard deviation?
 
  • #16
And ... stupid question #43 ... how do I express that in English?
"The sample deviates from the larger population because ... ?"
 
  • #17
Answer #42 :smile: the underlying explanation is that there is a certain probability ##\alpha(i)## for a bin i to receive a 'hit', and the probability distribution for one single bin is a Poisson distribution (not a Gaussian, because counts are ##\ge## 0). For a Poisson distribution the variance equals the mean, so ##\sigma_i^2 = \mu(i)##, the expected count in bin i.
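(A quick numerical sketch of this, assuming numpy: simulate Poisson counts and check that variance ≈ mean.)
Code:
# Numerical illustration of answer #42: simulated Poisson counts for
# one bin have variance ~= mean, so sigma = sqrt(expected count).
# The 4.95 is the bin-1 expected count from post #12.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=4.95, size=100_000)
print(counts.mean(), counts.var())  # both ~4.95
print(counts.std())                 # ~2.2, the sigma used in post #12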

And #43: because (*) the p-value (i.e. the probability to get a sample with greater deviation from the population distribution than the deviation actually found) is as good as zero.

(*) here you implicitly insert:
under the hypothesis that the distribution found is an unbiased sample from the given population,
 
  • #18
#42: got it. :woot:
#43: uh ... huh? Did I get a p-value? ?:)
So: p-value = odds of picking only the craziest people out of a larger population.

Let me see if I understand this:
I go through each of the bins, comparing the sample value in the bin to the population value in the bin.
These values, in fact, represent % of each of their respective populations which qualify for this bin.
Ex: for the sample, 0.079 is 7.9% of a sample pop of 150 ≈ 12.
Ex: for the population, 0.033 is 3.3% of the larger pop of 100,000 ≈ 3300.
I take the pop_value (3300), square root it and call that the "error" (and/or "standard deviation") ≈ 57
So, I'm now allowed to say: pop_value = 3300 ± 57
I have the option to divide this by the pop_value and get a relative error (a.k.a. "% error") = 57/3300 = 0.017 = 1.7%
And then I can say, "Well, hey ... what if this same % applied to the sample?"
And I can take 3.3% of the sample pop of 150 = 4.95
And the %error of THAT = (1.7%)(4.95) = 0.084
And, I'm now allowed to say, "If the population statistic were applied to the sample in this bin, it would be: 4.95 ± 0.08."
Right about now, I'm assuming that I'm free to assume that any sample value NOT within 0.08 of 4.95 is out of bounds.
But the actual sample in that bin was 12, which deviates from 4.95 by about 7, which is 87× bigger than the error of 0.08.
So that's WAY WAY out of bounds.

So, I proceed through each of the bins and find that a LOT of my sample values are WAY WAY out of bounds.
In FACT, NONE of them fall within 10 sigmas of "normal". And my gut says: That can't be right.
Now I'm thinking, "Dang! I think I DID accidentally pick all the craziest people in the population for my sample! How did THAT happen?"
But before I go off yelling about this statistical phenomenon, maybe I should look at the ODDS of picking THIS particular group of 150 deviants out of a pool of 100,000.

:eek:... this is where I get lost.
 
  • #19
ezfzx said:
#43: uh ... huh? Did I get a p-value? ?:)
Yes you did. At least four times:
ezfzx said:
I'm getting a p-value around 0.96
ezfzx said:
Excel's CHITEST functions gives me a p-value of 0.984
ezfzx said:
While I'm sure there's a more satisfyingly mechanical method, the p-value comes from using Excel's CHIDIST function with the χ2 and n-1 = 7 degrees of freedom. Excel's CHITEST function produces the same value.

ezfzx said:
any sample value NOT within 0.08 of 4.95 is out of bounds
No. It just isn't within one standard deviation from the expected value.
ezfzx said:
So that's WAY WAY out of bounds
Probability is very small, but not zero, though. You can look it up in a table or with an excel function.

---

Now it's time to correct my own writing: in post #14 I'm doing wrong what I did right in post #12. Sorry ...
BvU said:
convert 2700 ± 52 out of 100000 to a sample size of 150 and you get 4.05 ± 0.078.
So in that bin a hit has a probability of 4/150 for a sample of 150. The corresponding standard deviation is ##\sqrt{4} = 2## and not 0.078. The observed 32 hits is still a hefty 14 stdev away.

---

You politely follow my post #14 error, but protest
ezfzx said:
And my gut says: That can't be right
Hurray for your intuition.
Code:
sample                 12     4     32     5     12     5     20    2
expected (from pop)    4.95   9.3   4.05   6.15  10.35  6.9   12.6  6.6
sigma = sqrt(expected) 2.2    3.0   2.0    2.5   3.2    2.6   3.5   2.6
weighted residual      3.2    1.7   13.9   0.5   0.5    0.7   2.1   1.8
wr^2                   10.0   3.0   192.9  0.2   0.3    0.5   4.3   3.2
Sum wr^2 = chisquare   214.5
with p-value around ##10^{-42}## if n = 7 ...
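(For completeness, a sketch reproducing this table in Python — numpy/scipy assumed; scipy's ready-made chisquare() refuses mismatched totals, so the sum is done by hand:)
Code:
# Reproducing the table above: percentages -> counts for a sample of
# 150, expected counts from the population percentages, chi-squared
# summed by hand.
import numpy as np
from scipy.stats import chi2

sample_pct = np.array([7.9, 2.4, 21.4, 3.2, 7.9, 3.2, 13.5, 1.6])
pop_pct    = np.array([3.3, 6.2, 2.7, 4.1, 6.9, 4.6, 8.4, 4.4])

observed = np.rint(sample_pct / 100 * 150)  # 12, 4, 32, 5, 12, 5, 20, 2
expected = pop_pct / 100 * 150              # 4.95, 9.3, 4.05, ...

chi_sq = np.sum((observed - expected) ** 2 / expected)
print(chi_sq)              # ~214
print(chi2.sf(chi_sq, 7))  # ~1e-42, nothing like the 0.96 from percentages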
 
  • #20
Well, my assumption was that those p-values were related to the chi squared test and not related to this other new thing we were doing.
BvU said:
The observed 32 hits is still a hefty 14 stdev away.
Well, I guess I feel a little better ... but not really.
I'm still struggling with how to describe it.
"The odds of these values being selected randomly are remote, therefore the odds are good that some influencing factor is involved."
Certainly a p-value of ##10^{-42}## (being about a billion times smaller than quantum foam) is less than 0.05.
So, maybe it's OK to accept the alternate hypothesis.
"These values of the sample vary significantly from the general population."

(Weighted residual? Where do the values you're using come from? Those weren't mine ...)
 
  • #21
Wait ... I just re-read your last post ...
When we did the
"convert 2700 ± 52 out of 100000 to a sample size of 150 and you get 4.05 ± 0.078."
Then you said "No wait ... it's 4.05 ± 2.01 because 2.01 is the square root of 4.05."
But ... I thought the goal was to find the margin of error in the larger pop (2700 ± 52), and then replicate the RELATIVE ERROR in the sample?
52/2700 = 0.078/4.05 = 1.9% = relative error.
See, THAT made sense.
2.01/4.05 = 50% error!

I'm trying to look at it like this:

100000 people fall into one of eight bins. (Not literally.)
If it's unbiased, each bin should have 12500 people.
THAT'S the null hypothesis.

Step 1: See if the general pop fits the null hypothesis.

Answer: It would seem not when I have some bins with only 2700 people.
The sqrt(12500) = 112, and so 2700 is (12500-2700)/112 = 87.5 stdev's away.

So, for the gen pop, the alternative hypothesis is supported: there is a built-in bias for the gen pop.
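(A quick sketch checking that Step 1 arithmetic:)
Code:
# Checking the Step 1 arithmetic: a uniform spread would put
# 12500 of the 100000 people in each of the 8 bins.
import math

sigma = math.sqrt(12500)       # ~111.8
print((12500 - 2700) / sigma)  # ~87.7 standard deviations, as above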

Step 2: Does the sample MATCH that bias?

The null hypothesis now becomes, "Yes, the sample distribution matches the population in that it has a matching bias."

For example, if 35% of my population is under 40 yrs old, I'm expecting 35% of my sample to also be under 40 yrs old.

So, we look at the population bias: 2700 ± 52 out of 100000
Anything more than a few stdev's away will be suspicious/unlikely.
That same RELATIVE error in the sample translates to 4.050 ± 0.078 out of 150.
Which would imply that pretty much ANY value other than "4" in that sample bin would be suspicious.

Even if we go with your revision and say: 4 ± 2, there's not a lot of wiggle room there.

I guess when you pull just 150 out of 100000, there are SO MANY other variables at play on such a tiny sample that getting a representative distribution is virtually impossible.
 

FAQ: Interpreting Chi Squared .... backward

What is Chi Squared?

Chi Squared is a statistical test used to determine if there is a significant relationship between two categorical variables. It is often used in hypothesis testing to determine if there is a difference between observed and expected frequencies.

What is the purpose of interpreting Chi Squared backward?

Interpreting Chi Squared backward allows researchers to determine the expected frequencies based on the observed frequencies and the null hypothesis. This can help to verify if the observed frequencies are significantly different from what would be expected by chance.

How do you interpret Chi Squared backward?

To interpret Chi Squared backward, you must compare the expected frequencies to the observed frequencies. If the expected frequencies are close to the observed frequencies, then there is no significant difference between the variables. However, if the expected frequencies are significantly different from the observed frequencies, then there may be a relationship between the variables.
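For example, a minimal sketch in Python (hypothetical counts whose totals match):
Code:
# A minimal observed-vs-expected comparison with hypothetical counts
# whose totals match, so scipy's ready-made test applies directly.
from scipy.stats import chisquare

result = chisquare(f_obs=[18, 22, 31, 29], f_exp=[25, 25, 25, 25])
print(result.statistic, result.pvalue)  # 4.4, ~0.22 -> no significant difference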

What are the limitations of interpreting Chi Squared backward?

Interpreting Chi Squared backward can only determine if there is a significant relationship between two categorical variables; it cannot determine the direction or strength of the relationship. It also assumes that the sample is representative of the population and that the variables are independent.

How can Chi Squared backward be used in research?

Chi Squared backward can be used in research to analyze data and determine if there is a significant relationship between two categorical variables. It can also be used to test hypotheses and make conclusions about the relationship between variables. Additionally, it can be used to identify any potential confounding variables that may affect the relationship between the variables being studied.
