- #1
Sane
- 221
- 0
I have a 2x2 contingency table, and I want to discover how likely it is that the two events are dependent.
The top-left cell is usually in the range of 1-10. The bottom-right cell can be over 3 billion. The other two cells are in the hundreds or millions (exclusively).
I have tried Pearson's chi-squared test and Yates's correction for continuity, with varying degrees of success because of the large differences in magnitude between the entries. These tests also have the null hypothesis the wrong way round for my purposes: I want to use the p-value to definitively 'find' dependence (potentially missing some pairs for which there is no evidence), as opposed to finding independence and potentially misclassifying dependent events.
Fisher's exact and Barnard's exact tests look promising, but calculating the factorial of a number in the billions is just "impossible".
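As a side note on the factorial problem: the factorials never have to be evaluated directly, because the hypergeometric probabilities behind Fisher's exact test can be handled as logarithms via the log-gamma function. Below is a minimal sketch of that idea, not a vetted implementation. The function names `log_choose` and `fisher_upper_tail_log_p` are hypothetical, and it assumes a one-sided alternative (more co-occurrences than independence would predict) with both margins held fixed.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_upper_tail_log_p(a, b, c, d):
    """Log of the one-sided Fisher p-value P(top-left cell >= a).

    Assumes the table [[a, b], [c, d]] with both margins fixed, so the
    top-left cell follows a hypergeometric distribution under independence.
    """
    row1 = a + b          # documents containing the first event
    col1 = a + c          # documents containing the second event
    n = a + b + c + d     # all documents
    k_max = min(row1, col1)

    log_denominator = log_choose(n, row1)
    # One log-probability per possible top-left value k = a, ..., k_max.
    log_terms = [
        log_choose(col1, k) + log_choose(n - col1, row1 - k) - log_denominator
        for k in range(a, k_max + 1)
    ]

    # Log-sum-exp: factor out the largest term so the sum cannot underflow.
    largest = max(log_terms)
    total = sum(math.exp(t - largest) for t in log_terms)
    return min(0.0, largest + math.log(total))


# Small check against a hand-computed value: for [[3, 1], [1, 3]] the
# upper tail is (C(4,3)*C(4,1) + C(4,4)*C(4,0)) / C(8,4) = 17/70.
log_p = fisher_upper_tail_log_p(3, 1, 1, 3)
print(math.exp(log_p))
```

The result is returned as a log p-value because `math.exp` of a very negative number underflows to 0.0, so thresholds are easier to compare in log space. The loop runs over at most min(A+B, A+C) terms, which may be millions here but never billions.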
---
In case more overview is needed:
I have two binary events (let's say X and Y) for which I track occurrences throughout various "documents". Sometimes they occur together (A), sometimes one occurs without the other (B, C), and sometimes neither occurs (D). This creates my 2x2 contingency table, with A in the top-left and D in the bottom-right, and D being incredibly huge. There are N = A+B+C+D documents.
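To make the layout concrete, here is a small sketch of how the four cells could be derived from document IDs. The function name `contingency_counts` and the set-based inputs are hypothetical, and it assumes B counts documents with X but not Y and C counts documents with Y but not X, which the description above leaves open.

```python
def contingency_counts(docs_with_x, docs_with_y, n_documents):
    """Return (A, B, C, D) for two sets of document IDs.

    Assumption: B = X without Y, C = Y without X. D is obtained by
    subtraction, so the (huge) set of documents containing neither
    event never has to be enumerated.
    """
    a = len(docs_with_x & docs_with_y)   # both events occur
    b = len(docs_with_x - docs_with_y)   # X only
    c = len(docs_with_y - docs_with_x)   # Y only
    d = n_documents - a - b - c          # neither event occurs
    return a, b, c, d


# Toy example: 10 documents, X in documents 1-3, Y in documents 3-4.
print(contingency_counts({1, 2, 3}, {3, 4}, 10))
```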
From this I want some measure of how likely it is that X is related to Y. I want to be able to know when X and Y are almost surely dependent.
I'd rather err on the side of caution and say that X and Y are not likely related when they actually are, as opposed to saying that they are likely related when they are not. In other words, a "score" of 0 might indicate no evidence of dependence, and 1 indicates overwhelming evidence that they are dependent.
Hope this makes sense.