Probabilistic Factors Involved in Disease and Virus Testing
Introduction
This Insight looks at the various probabilistic factors and related terminology involved in disease and virus testing.
As we all know, tests are rarely 100% reliable. The frequency of false positives and false negatives, however, depends not only on the test itself but also on the prevalence of the disease or virus within the population. To see this, imagine the two extremes where a) no one has the virus, and b) everyone has the virus. In the first case, all positives must be false and, in the second, all negatives must be false.
This motivates a proper analysis of the probabilities involved, to see more precisely what can be concluded from a test result given all the available data.
Note that this Insight provides a simple probabilistic analysis. In many practical cases, some or all of the data are unknown, which leads to the more advanced techniques of hypothesis testing.
We assume throughout that we have a single test for a virus.
Terminology
The relevant terminology cannot be avoided:
Prevalence (##D##): the proportion of the population (or the subgroup being tested) who have the virus. There are two possible scenarios here. First, random testing of the population or group, where the prevalence is some generic likelihood that someone in that group has the virus (and doesn’t suspect it). Second, testing within a group who have come forward because of some suspicion that they may have the virus.
In general, the prevalence will be higher in the second case, so it’s important to distinguish between these two cases and use the best estimate in each case.
In this Insight, we will use ##D## to denote the prevalence within the relevant population.
Positive Predictive Value (PPV) (##x##): the probability of having the virus given a positive test. Note that as explained in the introduction this is not a fixed value, but depends on the prevalence, which itself may depend on the particular group or individual being tested.
In this Insight, we will use ##x## to denote the PPV.
Negative Predictive Value (NPV) (##y##): the probability of not having the virus given a negative test. As with PPV, this depends on the prevalence.
In this Insight, we will use ##y## to denote the NPV.
Sensitivity (##p##): the probability of a positive test given the subject has the virus. This probability is fixed for a given test and doesn’t depend on the prevalence.
Specificity (##q##): the probability of a negative test given the subject does not have the virus. This also is independent of the prevalence.
With that standard terminology out of the way, we can begin to analyze how these quantities are related.
Analysis Based on Prevalence
The group to be tested will have a (possibly unknown) proportion ##D## who have the virus, and a proportion ##1-D## who do not have the virus. In each case two test results are possible, based on the sensitivity and specificity, which results in four categories in the following proportions:
##Dp##: those who have the virus and tested positive (these are true positives)
##D(1-p)##: those who have the virus and tested negative (these are the false negatives)
##(1-D)q##: those who do not have the virus and tested negative (true negatives)
##(1-D)(1-q)##: those who do not have the virus and tested positive (false positives)
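As a quick sanity check on these four categories, the following sketch computes them with exact fractions and converts them to head counts. The values of ##D##, ##p## and ##q## and the group size of 2000 are illustrative assumptions, chosen only so that the counts come out as whole numbers.

```python
from fractions import Fraction

def four_categories(D, p, q):
    """Proportions of the tested group in each of the four categories."""
    return {
        "true_positive": D * p,
        "false_negative": D * (1 - p),
        "true_negative": (1 - D) * q,
        "false_positive": (1 - D) * (1 - q),
    }

# Illustrative values only (assumptions, not data from a real test).
D, p, q = Fraction(1, 10), Fraction(9, 10), Fraction(19, 20)
proportions = four_categories(D, p, q)

# The four categories cover everyone exactly once, so they sum to 1.
assert sum(proportions.values()) == 1

cohort = 2000  # hypothetical group size
counts = {name: int(share * cohort) for name, share in proportions.items()}
print(counts)
```

Working with `Fraction` rather than floats keeps the check that the proportions sum to 1 exact.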
For simplicity, we introduce a further variable here, which is the proportion of positive tests ##T##:
$$T = Dp + (1-D)(1-q)$$
We can now express the PPV and NPV by reading off the data above (this is equivalent to using Bayes’ Theorem):
To calculate the PPV we take the proportion of positive tests (##T##) and the proportion who both test positive and have the virus, which is ##Dp##. The PPV (##x##) is the conditional probability of having the virus given a positive test, which is:
$$x = \frac{Dp}{T}$$
We may also read off the NPV, which is the conditional probability of not having the virus given a negative test:
$$y = \frac{(1-D)q}{1-T}$$
Note that $$1 - T = D(1-p) + (1-D)q$$
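The formulas for ##T##, ##x## and ##y## translate directly into a few lines of Python. This is a minimal sketch; the function name `predictive_values` is hypothetical, and the values ##p = 0.9##, ##q = 0.95## are assumed purely for illustration. It is applied here to the two extremes from the introduction.

```python
def predictive_values(D, p, q):
    """Return (T, x, y): proportion of positive tests, PPV and NPV.

    D is the prevalence, p the sensitivity and q the specificity,
    all given as probabilities between 0 and 1.
    """
    T = D * p + (1 - D) * (1 - q)      # all positives: true + false
    x = D * p / T                      # PPV
    y = (1 - D) * q / (1 - T)          # NPV
    return T, x, y

# Assumed test characteristics, for illustration only.
p, q = 0.9, 0.95

# No one has the virus: every positive is false, so the PPV is 0.
print(predictive_values(0.0, p, q))
# Everyone has the virus: every negative is false, so the NPV is 0.
print(predictive_values(1.0, p, q))
```

The sketch does not guard against ##T = 0## or ##T = 1##, where one of the two conditional probabilities is undefined because no tests of that kind occur.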
Applying this Analysis
To do something useful with the above analysis (perhaps in the context of a new test), we first need a group who we know have the virus and a group who we know do not. By applying the test to each group we can estimate the sensitivity ##p## and specificity ##q## for that particular test.
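A minimal sketch of this step follows. The counts are invented for illustration, and the function name is hypothetical. With real data the results are only estimates, whose accuracy depends on the size of the two groups.

```python
def estimate_sensitivity_specificity(tp, fn, tn, fp):
    """Estimate p and q from counts in two groups of known status.

    tp, fn: positive and negative results among those known to have the virus.
    tn, fp: negative and positive results among those known not to have it.
    """
    p = tp / (tp + fn)   # sensitivity
    q = tn / (tn + fp)   # specificity
    return p, q

# Hypothetical validation counts, invented for illustration.
p, q = estimate_sensitivity_specificity(tp=180, fn=20, tn=380, fp=20)
print(f"sensitivity p = {p:.2f}, specificity q = {q:.2f}")
```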
In addition, if we know (or can reasonably well estimate) the prevalence of the virus (##D##), then we can interpret the result of an individual test as a probability of that person having or not having the virus. These are just the PPV and NPV as above. For those who return a positive test, we have:
$$x = \frac{Dp}{T} = \frac{Dp}{Dp + (1-D)(1-q)}$$ is the probability they have the virus. And, of course, ##1-x## is the probability they do not.
And, for those who return a negative test, we have:
$$y = \frac{(1-D)q}{1-T} = \frac{(1-D)q}{(1-D)q + D(1-p)}$$ is the probability they do not have the virus. And, ##1-y## is the probability they do.
To take an example, suppose ##p = 0.9##, ##q = 0.95## and ##D = 0.1## is an estimated prevalence. Then:
##x = \frac{Dp}{Dp + (1-D)(1-q)} = 0.667##
##y = \frac{(1-D)q}{(1-D)q + D(1-p)} = 0.988##
We can see that someone with a negative test almost certainly does not have the virus, whereas someone who tested positive has a probability of only ##2/3## of actually having the virus.
We can now see the effect of changing the prevalence by taking ##D = 0.5##. This might represent the scenario where a group of people with certain symptoms are being tested and are more likely to have the virus than those in a random sample of the population. Then:
##x = 0.947##
##y = 0.905##
We see that in this case, the positive test has become more conclusive (nearly 95% likelihood), while the negative test result is now less conclusive (there is still nearly a 10% chance of having the virus). This illustrates the importance of prior suspicion of the virus, as the conclusion depends heavily on the estimated prevalence.
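To see the dependence on prevalence more fully, the sketch below tabulates the PPV and NPV over a range of values of ##D##, keeping ##p = 0.9## and ##q = 0.95## from the example. The particular prevalence values in the sweep are an arbitrary choice.

```python
def ppv_npv(D, p, q):
    """PPV and NPV for prevalence D, sensitivity p and specificity q."""
    T = D * p + (1 - D) * (1 - q)   # proportion of positive tests
    return D * p / T, (1 - D) * q / (1 - T)

p, q = 0.9, 0.95  # the sensitivity and specificity used in the example
rows = [(D, *ppv_npv(D, p, q)) for D in (0.01, 0.05, 0.1, 0.25, 0.5, 0.75)]

print(" D      PPV    NPV")
for D, x, y in rows:
    print(f"{D:5.2f}  {x:5.3f}  {y:5.3f}")
```

The rows for ##D = 0.1## and ##D = 0.5## reproduce the figures above. At a prevalence of 1%, the same test gives a PPV of only about 15%, so most positives in that setting would be false.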
Analysis Based on Test Results
We may also analyze the relationship between these quantities based on the outcome of test results. We can look at the proportion who tested positive (##T##) and negative (##1-T##), and subdivide these based on PPV (##x##) and NPV (##y##). This again gives four categories:
##Tx##: Those who have a positive test and the virus (true positives)
##T(1-x)##: Those who have a positive test but do not have the virus (false positives)
##(1-T)y##: Those who have a negative test and do not have the virus (true negatives)
##(1-T)(1-y)##: Those who have a negative test but do have the virus (false negatives)
We can then express the prevalence, sensitivity, and specificity in terms of these:
$$D = Tx + (1-T)(1-y)$$
$$p = \frac{Tx}{D} = \frac{Tx}{Tx + (1-T)(1-y)}$$
$$q = \frac{(1-T)y}{1-D} = \frac{(1-T)y}{(1-T)y + T(1-x)}$$
These equations may, of course, be derived directly from the previous set by some algebra. It’s nice, however, to see how easily they are extracted from a simple probabilistic analysis.
In truth, I’m not sure how useful these reciprocal formulas may be, but there they are.
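One use for them is as a consistency check. The sketch below goes from ##(D, p, q)## to ##(T, x, y)## and back again, and should recover the starting values up to floating-point rounding. The function names are hypothetical and the input values are illustrative.

```python
def from_prevalence(D, p, q):
    """Forward direction: (D, p, q) -> (T, x, y)."""
    T = D * p + (1 - D) * (1 - q)
    return T, D * p / T, (1 - D) * q / (1 - T)

def from_test_results(T, x, y):
    """Reverse direction: (T, x, y) -> (D, p, q)."""
    D = T * x + (1 - T) * (1 - y)          # true positives + false negatives
    return D, T * x / D, (1 - T) * y / (1 - D)

# Illustrative starting values.
T, x, y = from_prevalence(0.1, 0.9, 0.95)
D, p, q = from_test_results(T, x, y)
print(round(D, 6), round(p, 6), round(q, 6))
```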
Formulas for False Positives and Negatives
By equating the proportions of true and false positives and negatives from each analysis above, we get four more formulas with no additional effort:
$$D(1-p) = (1-T)(1-y) \ \ \ [\text{false negatives}]$$
$$(1-D)(1-q) = T(1-x) \ \ \ [\text{false positives}]$$
$$Dp = Tx \ \ \ \ [\text{true positives}]$$
$$(1-D)q = (1-T)y \ \ \ [\text{true negatives}]$$
Conclusion
What we have derived here, with relative ease and no significant algebra or calculations, is a general set of formulas that relate all the relevant quantities in such a way that any particular problem can be solved using them. Whatever data are given (PPV, NPV, sensitivity, specificity, prevalence, or proportion of positive tests), the remaining quantities may be calculated simply and directly from these formulas.
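As one illustration, suppose the sensitivity, specificity and observed proportion of positive tests are known, and the prevalence is wanted. Rearranging ##T = Dp + (1-D)(1-q)## gives ##D = (T + q - 1)/(p + q - 1)##. This rearrangement is not written out above and assumes ##p + q \neq 1##. A minimal sketch, with a hypothetical function name and illustrative inputs:

```python
def prevalence_from_positive_rate(T, p, q):
    """Solve T = D*p + (1 - D)*(1 - q) for the prevalence D."""
    if p + q == 1:
        # Then T = p whatever D is, so T says nothing about D.
        raise ValueError("p + q = 1: the positive rate carries no information about D")
    return (T + q - 1) / (p + q - 1)

# Illustrative inputs: T = 0.135 is the positive rate for D = 0.1, p = 0.9, q = 0.95.
D = prevalence_from_positive_rate(T=0.135, p=0.9, q=0.95)
print(round(D, 6))
```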
Post-Script: Bayes’ Theorem
Bayes’ Theorem is implicitly the basis for reading off the conditional probabilities in the above analysis. Bayes’ Theorem is:
$$P(B)P(A|B) = P(A)P(B|A) \ \ (1)$$
An easy proof is simply to note that both sides of equation ##(1)## equal ##P(A \cap B)##, which is the probability of having both ##A## and ##B##.
The more familiar form is, of course:
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$
To see how this relates to our terminology, note that in Bayes’ notation, the PPV (##x##) is:
$$x = P(virus|+ test) = \frac{P(+ test|virus)P(virus)}{P(+test)}$$
Where ##P(+ test|virus) = p##, the sensitivity; ##P(virus) = D##, the prevalence; and, ##P(+test) = T##, the proportion of positive tests.
It’s possible, therefore, to generate all the formulas above using the algebraic form of Bayes’ Theorem. And, indeed, this is generally the way the subject is taught – even though there seems much less scope for going wrong using our “probability tree” approach.
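For comparison, here is a short sketch of the algebraic route: a generic `bayes` helper (a hypothetical name) applied once for the PPV and once for the NPV, using the values from the worked example. For the NPV, the roles are ##A## = "does not have the virus" and ##B## = "negative test".

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

D, p, q = 0.1, 0.9, 0.95           # values from the worked example
T = D * p + (1 - D) * (1 - q)      # P(+test)

# PPV: A = "has the virus", B = "positive test".
x = bayes(p_b_given_a=p, p_a=D, p_b=T)
# NPV: A = "does not have the virus", B = "negative test".
y = bayes(p_b_given_a=q, p_a=1 - D, p_b=1 - T)

print(f"PPV = {x:.3f}, NPV = {y:.3f}")  # PPV = 0.667, NPV = 0.988
```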
BSc in pure mathematics (1984). Retired from a career in Information Technology in 2014. I divide my time between studying physics when I’m home in London and mountaineering.
Favourite area of physics is Quantum Mechanics.