Exploring Frequentist Probability vs Bayesian Probability
Confessions of a moderate Bayesian, part 2
Read Part 1: Confessions of a moderate Bayesian, part 1
Bayesian statistics by and for non-statisticians
Background
One of the ongoing and occasionally contentious debates surrounding Bayesian statistics is the interpretation of probability. For anyone familiar with my posts on this forum, I am not generally a big fan of interpretation debates. This one is no exception. So I am going to present both interpretations as factually as I can, and then conclude with my take on the issue and my approach.
Probability axioms
Probability is a mathematical concept that is applied to various domains. I think that it is worthwhile to point out the mathematical underpinnings in at least a brief and non-rigorous form.
The axioms of probability that are typically used were formulated by Kolmogorov. He started with a complete set of “events” forming a sample space and a measure on that sample space, called the probability of an event. Probability is then defined by the following axioms:
- The probability of any event in the sample space is a non-negative real number.
- The probability of the whole sample space is 1.
- The probability of the union of several mutually exclusive events is equal to the sum of the probabilities of the individual events.
Anything that behaves according to these axioms can be treated as a probability. I have glossed over some of the technical details of setting up the sample space and the events, and also it is worth noting that the third axiom can be written in terms of a countably infinite union or a finite union.
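The three axioms can be checked directly for any finite example. The following sketch uses a hypothetical fair six-sided die (my choice of example, not from the article) to verify non-negativity, total probability one, and finite additivity:

```python
# Hypothetical discrete sample space: a fair six-sided die, with each
# elementary event assigned probability 1/6.
sample_space = {1, 2, 3, 4, 5, 6}
P = {outcome: 1 / 6 for outcome in sample_space}

def prob(event):
    """Probability of an event, i.e. a subset of the sample space."""
    return sum(P[outcome] for outcome in event)

# Axiom 1: every probability is a non-negative real number.
assert all(p >= 0 for p in P.values())

# Axiom 2: the whole sample space has probability 1.
assert abs(prob(sample_space) - 1.0) < 1e-12

# Axiom 3 (finite additivity): for mutually exclusive events,
# the probability of the union is the sum of the probabilities.
evens, odds = {2, 4, 6}, {1, 3, 5}
assert abs(prob(evens | odds) - (prob(evens) + prob(odds))) < 1e-12
```

Any assignment of numbers to events that passes checks like these qualifies as a probability, whatever the events represent.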
Randomness
It is important to recognize that nothing in the axioms of probability requires randomness. That is, the mathematical concept of probability is used to analyze randomness, but that is an application of probability, not probability itself.
Similarly, vectors are used to represent the outcome of a measurement of some quantity like velocity, but nothing in the mathematical definition of a vector requires velocity. Velocity is an application of vectors just as randomness is an application of probability.
Frequentist probabilities
In typical introductory classes, the concept of probability is introduced together with the notion of a random variable that can be repeatedly sampled. A good example is the outcome of flipping a coin. It doesn’t matter too much if we consider a coin-flipping system to be inherently random or simply random due to ignorance of the details of the initial conditions on which the outcome depends. Either way, we can perform the physical experiment of flipping a coin and we can observe that the result of the experiment is either a head or a tail.
Now, to apply the axioms of probability to this we need to construct a sample space. That is rather easy: our sample space can be ##\{H,T\}## where ##H## is the event of getting heads on a single flip and ##T## is the event of getting tails on a single flip.
Now, we need a way to determine the measure ##P(H)##. For frequentist probabilities, the way to determine ##P(H)## is to repeat the experiment a large number of times and calculate the frequency that the event ##H## happens. In other words, if you do ##N## trials and get ##n_H## heads then $$P(H) \approx \frac{n_H}{N}$$ for large ##N## with equality for a hypothetical infinite ##N##. So a frequentist probability is simply the “long-run” frequency of some event.
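The long-run convergence can be illustrated with a quick simulation. This is a sketch under assumed conditions (a simulated fair coin with true ##P(H)=0.5##; the function name and trial counts are my own), showing the frequency ##n_H/N## approaching the true value as ##N## grows:

```python
import random

random.seed(0)  # fixed seed so this illustrative run is reproducible

def estimate_p_heads(n_trials):
    """Frequentist estimate of P(H): observed frequency of heads in n_trials flips."""
    heads = sum(random.random() < 0.5 for _ in range(n_trials))
    return heads / n_trials

# The estimate n_H / N wanders for small N and settles near 0.5 for large N.
for n in (10, 1_000, 100_000):
    print(n, estimate_p_heads(n))
```

With ##N = 10## the estimate can easily be off by 0.2 or more; with ##N = 100{,}000## it is typically within a few thousandths of 0.5, which is the “long-run” behavior the frequentist definition appeals to.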
This has some nice features. First, it is objective; anyone with access to the same infinite set of data will get the same number for ##P(H)##. Second, it follows the axioms above, so you can either use ##P(H)## and the axioms to calculate ##P(T)## or you can use your data set to get the long-run frequency of tails ##n_T/N##.
It also has some problematic features, the worst of which is the reliance on the long-run frequency. It is not realistic to get an infinite set of data even for something as inexpensive as flipping a coin, let alone for more expensive experiments where a single data point may cost thousands of dollars and take years to collect. The best you can do is get an approximation to ##P(H)##, and sometimes that approximation can be quite bad.
Bayesian probabilities
The Bayesian concept of probability is more about uncertainty than about randomness. Remember, randomness is an important application of probability, not probability itself. Of course, if something is random, then we will be uncertain about it, but we can be uncertain about things that we don’t consider to be random.
Consider, for example, the value of the gravitational constant ##G## in SI units. We wouldn’t generally think of that as being random, but we also do not know it with certainty. We can therefore treat our uncertain knowledge of ##G## as a Bayesian probability. Some of the terminology carries over from frequentist usage, so we may even call ##G## a random variable, although a purist (which I am not) may insist on calling it a parameter.
Bayesian probabilities obey the standard axioms of probability, so they are full-fledged probabilities, regardless of whether they describe true randomness or other uncertainty. Often they are described in terms of subjective beliefs, however “belief” in this sense is formalized in a way that requires “beliefs” to follow the axioms of probability. This is not how the psychological phenomenon of belief always works.
Bayes’ theorem
From the axioms of probability, it is relatively straightforward to derive Bayes’ theorem, from which Bayesian probability gets its name and its most important procedure: $$P(A|B)=\frac{P(B|A) \ P(A)}{P(B)}$$
For science we usually choose ##A=\text{hypothesis}## and ##B=\text{data}## so that $$P(\text{hypothesis}|\text{data}) = \frac{P(\text{data}|\text{hypothesis}) \ P(\text{hypothesis})} {P(\text{data})}$$ This gives us a way of expressing our uncertainty about scientific hypotheses, something that doesn’t make sense in terms of frequentist probability. As importantly, it tells us how to update our scientific beliefs in the face of new evidence.
In this equation ##P(\text{hypothesis})## is the probability that describes our uncertainty in the hypothesis before seeing the data, called the “prior”. ##P(\text{hypothesis}|\text{data})## is our uncertainty in the hypothesis after seeing the data, called the “posterior”. Both are probabilities, so each has an associated probability distribution and everything that comes with one.
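A minimal worked update makes the prior-to-posterior step concrete. The scenario here is entirely made up for illustration: a coin is either fair (##P(H)=0.5##) or biased toward heads (##P(H)=0.8##), we start with equal prior probability on each hypothesis, and we observe 8 heads in 10 flips:

```python
from math import comb

def binomial_likelihood(heads, flips, p_heads):
    """P(data | hypothesis): probability of `heads` heads in `flips` flips."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads)**(flips - heads)

prior = {"fair": 0.5, "biased": 0.5}  # P(hypothesis), before seeing the data

data = (8, 10)  # 8 heads observed in 10 flips
likelihood = {
    "fair": binomial_likelihood(*data, 0.5),    # P(data | fair)
    "biased": binomial_likelihood(*data, 0.8),  # P(data | biased)
}

# P(data): total probability of the data across all hypotheses
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Bayes' theorem: P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)
```

After seeing 8 heads in 10 flips, most of the probability shifts to the biased hypothesis, which is exactly the “updating our beliefs in the face of new evidence” that Bayes’ theorem prescribes.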
The frequentist vs Bayesian conflict
For some reason, the whole difference between frequentist and Bayesian probability seems far more contentious than it should be, in my opinion. I think some of it may be due to the mistaken idea that probability is synonymous with randomness. The Bayesian use of probability seems fundamentally wrong to someone who equates the two. But since both types of probability follow the same axioms, mathematically they are both valid and theorems that apply for one apply for the other. In particular, Bayesians don’t have some sort of exclusive rights to Bayes’ theorem.
In principle, the Bayesian uncertainty should coincide with the long-run frequency once you have accumulated that infinite amount of data. And since you never have that infinite amount of data, some uncertainty always remains. So the two types of probability are also complementary to each other. Furthermore, as we have seen, Bayesian methods give us ##P(\text{hypothesis}|\text{data})## while frequentist methods focus on ##P(\text{data}|\text{hypothesis})##, which are also complementary.
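This convergence can be sketched with a standard conjugate-prior example (my choice, not from the article): under a Beta prior on ##P(H)##, the posterior mean after ##N## flips is ##(a + n_H)/(a + b + N)##, which approaches the frequentist estimate ##n_H/N## as ##N## grows and the prior’s influence washes out:

```python
def posterior_mean(n_heads, n_flips, a=1.0, b=1.0):
    """Posterior mean of P(H) under a Beta(a, b) prior with binomial data.

    Beta(1, 1) is the uniform prior: no preference for any value of P(H).
    """
    return (a + n_heads) / (a + b + n_flips)

# Small sample: the prior still pulls the estimate toward 0.5.
print(posterior_mean(6, 10))            # 7/12 ≈ 0.583, vs. frequency 0.6

# Large sample: posterior mean and long-run frequency essentially agree.
print(posterior_mean(600_000, 1_000_000))
```

With ten flips the Bayesian and frequentist answers differ noticeably; with a million flips they agree to several decimal places, which is the sense in which the two notions of probability meet in the infinite-data limit.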
Summary
Just as I am not a fan of rigid adherence to scientific interpretations, I am also not a fan of rigid adherence to interpretations of probability. In both cases, I think that it is far more beneficial to learn multiple interpretations and switch between them as needed. When one is particularly suited to a given problem, use that; when the other is more suitable, switch. Just as different scientific interpretations predict the same experimental results and so can be used interchangeably, the different interpretations of probability follow the same axioms and can likewise be used largely interchangeably. They are equivalent in that sense.
I hope this overview has given you a basic understanding of the differences between Bayesian and frequentist probabilities and perhaps a better understanding of the distinction between probability and randomness. Perhaps the odd contention between adherents of these two interpretations can eventually be dismissed as more people become familiar with both and use each when appropriate.
Continue to part 3: How Bayesian Inference Works in the Context of Science