Need statistical test for contingency table with very small and big counts

In summary: The thread asks how to test for dependence in a 2x2 contingency table whose counts range from single digits to billions, and whether the Phi coefficient or another correlation measure is better suited than Pearson's chi-squared test. No conclusion is reached.
  • #1
Sane
I have a 2x2 contingency table, and I want to discover how likely it is that the two events are dependent.

The top-left cell is usually in the range of 1-10. The bottom-right cell can be over 3 billion. The other two cells are in the hundreds or millions (exclusively).

I have tried Pearson's chi-squared test, and Yates correction for continuity, with varying degrees of success due to the large differences in magnitude of the entries. This also has the null-hypothesis reversed; I want to use the p-value to definitively 'find' dependence (potentially missing out on some for which there is no evidence), as opposed to finding independence and potentially misclassifying dependent events.
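For reference, here is a minimal sketch of the Yates-corrected chi-squared statistic for a 2x2 table, using the standard shortcut formula. The function name `chi2_2x2` and the sample counts are made up for illustration. It also returns the expected count of the top-left cell under independence, since the chi-squared approximation is usually considered unreliable when an expected count is small.

```python
import math

def chi2_2x2(a, b, c, d, yates=True):
    """Pearson chi-squared for the 2x2 table [[a, b], [c, d]] (1 degree of freedom).

    Returns (statistic, p_value, expected_top_left).
    Python integers do not overflow, so counts in the billions are fine.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates's continuity correction, clipped at zero.
        diff = max(0.0, diff - n / 2)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * diff ** 2 / denom
    # Survival function of chi-squared with 1 dof: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    expected_top_left = (a + b) * (a + c) / n
    return chi2, p_value, expected_top_left

# Hypothetical counts in the ranges described above.
stat, p, expected = chi2_2x2(5, 300, 2_000_000, 3_000_000_000)
print(stat, p, expected)
```

With counts like these the expected top-left count is well below 1, which is one plausible reason the chi-squared results behave erratically.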

Fisher's exact and Barnard's exact tests look promising, but calculating the factorial of a number in the billions is just "impossible".
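One way around the factorials is to work with their logarithms through the log-gamma function, since ln(n!) = lgamma(n + 1). Below is a minimal sketch of a one-sided Fisher exact test done that way. The name `fisher_upper_tail` and the sample counts are assumptions for illustration, and the result is only as precise as double-precision `lgamma` allows for arguments in the billions.

```python
import math

def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Computes P(top-left >= a) under independence with both margins fixed,
    i.e. the upper tail of a hypergeometric distribution.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    log_denom = log_binom(n, col1)
    k_max = min(row1, col1)
    # Log-probability of each table with top-left count k, for k = a .. k_max.
    log_terms = [
        log_binom(row1, k) + log_binom(n - row1, col1 - k) - log_denom
        for k in range(a, k_max + 1)
    ]
    # Sum in a numerically stable way (log-sum-exp).
    m = max(log_terms)
    total = math.exp(m) * sum(math.exp(t - m) for t in log_terms)
    return min(1.0, total)

# Hypothetical counts in the ranges described above.
print(fisher_upper_tail(5, 300, 2_000_000, 3_000_000_000))
```

The loop runs from the observed top-left count up to the smaller of the first row and first column totals, so it stays short whenever one of those margins is only in the hundreds.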

---

In case more overview is needed:

I have two binary events (let's say X and Y) for which I track occurrences throughout various "documents". Sometimes they occur together (A), sometimes one occurs without the other (B, C), and sometimes neither occurs (D). This creates my 2x2 contingency table, with A in the top-left and D in the bottom-right, and D being incredibly huge. There are N = A+B+C+D documents.

From this I want some measure of how likely it is that X is related to Y. I want to be able to know when X and Y are almost surely dependent.

I'd rather err on the side of caution and say that X and Y are not likely related when they actually are, as opposed to saying that they are likely related when they are not. In other words, a "score" of 0 might indicate no evidence of dependence, and 1 indicates overwhelming evidence that they are dependent.

Hope this makes sense.
 
  • #2
Sane said:
From this I want some measure of how likely it is that X is related to Y. I want to be able to know when X and Y are almost surely dependent.

I'd rather err on the side of caution and say that X and Y are not likely related when they actually are, as opposed to saying that they are likely related when they are not. In other words, a "score" of 0 might indicate no evidence of dependence, and 1 indicates overwhelming evidence that they are dependent.

Hope this makes sense.

You might already realize this, but the type of statistics you are talking about ("frequentist") does not quantify the probability of any idea about the data, such as whether two things are independent. The numbers it computes are essentially "the probability of the data given the assumption of some ideas", not "the probability of some ideas given the data". In frequentist statistics, if the probability of the data given the hypothesis is "small" (which is a subjective judgment), the procedure is to "reject" the assumed ideas. If the probability of the data is, say, 0.05 given that a certain hypothesis is assumed, this does not imply that the probability that the hypothesis is false is 0.95.

Example: I saw a man dumping a body in a lake. Is he a murderer?
Null hypothesis: The man is a murderer.
Computation: The chances that a murderer will dispose of a body by dumping it in a lake are only .02.
Conclusion: We reject the null hypothesis. The man is not a murderer.
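To make the gap between "probability of the data given the hypothesis" and "probability of the hypothesis given the data" concrete, here is a small sketch using Bayes' theorem. Only the 0.02 comes from the example above; the prior and the probability that a non-murderer dumps a body in a lake are invented purely for illustration.

```python
def posterior(prior_h, p_data_given_h, p_data_given_not_h):
    """P(H | data) from Bayes' theorem for a hypothesis H and its complement."""
    numerator = p_data_given_h * prior_h
    evidence = numerator + p_data_given_not_h * (1.0 - prior_h)
    return numerator / evidence

# P(dumps body | murderer) = 0.02 is taken from the example.
# The prior (1 in 10,000) and P(dumps body | not a murderer) = 1e-6
# are made-up numbers, not estimates of anything real.
print(posterior(prior_h=1e-4, p_data_given_h=0.02, p_data_given_not_h=1e-6))
```

With these invented inputs the posterior comes out well above one half, even though the probability of the data under the hypothesis is only 0.02, so rejecting the hypothesis on that basis alone would be misleading.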
 
  • #3
Thanks for the reply. I was hoping I could have the null hypothesis set up such that I would be rejecting the null-hypothesis of dependence. This should permit me to know when there is evidence to suggest that the two variables are dependent, right? If that is the case, then my earlier comment about Pearson's being the opposite of what I want was incorrect.

Is the Phi Coefficient of the Yates corrected statistic better suited for my problem? What about some of the other correlation measures?
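For reference, the phi coefficient can be computed directly from the four cells, and its square equals the uncorrected Pearson chi-squared statistic divided by N. A minimal sketch follows, where the name `phi_coefficient` and the sample counts are assumptions for illustration. Note that phi measures the strength of the association rather than the evidence for it.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient of the 2x2 table [[a, b], [c, d]].

    phi = (a*d - b*c) / sqrt((a+b)(c+d)(a+c)(b+d)), and phi**2 = chi2 / N
    for the uncorrected Pearson chi-squared statistic.
    """
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (a * d - b * c) / math.sqrt(denom)

# Hypothetical counts in the ranges described in the first post.
print(phi_coefficient(5, 300, 2_000_000, 3_000_000_000))
```

For a table this lopsided phi stays very close to 0 (on the order of 1e-4 for these made-up counts), so it may not work well as a 0-to-1 "score" without further rescaling.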
 
