Probability calculation involving very large numbers

  • #1
Matt2
Hi, I'm trying to figure out how to compute probability related to a problem I am tackling for work, and I think I have a handle on how to do it with smaller numbers, but no idea how to approach it for larger numbers. (And I need to explain the answers to a judge in simple terms). So here is the problem:

Imagine a company that maintains data about 200,000,000 Americans. Each month, this company takes a completely random sample of 5% of these reports to analyze. However, we are concerned only with 60,000 specific people out of this group of 200,000,000.

So we can visualize this as 60,000 "red marbles" and 199,940,000 "white marbles". Assuming that these combined 200,000,000 marbles are placed into a very large container and that 10,000,000 are selected randomly. I am trying to calculate:

1) The chance that none of the 10,000,000 marbles will be red;

2) The chance that 40 or more of the 10,000,000 marbles will be red;

3) The chance that all of the possible 60,000 red marbles will be included in the 10,000,000 selected.

Of these, the answer to question #2 is the most important, followed by #1, then #3.

Does anyone have an idea on where to start? I thought maybe it would make it easier to simply remove 4 zeroes from each number so that we are working with 20,000 / 6 / 1,000 but it seems that this skews the results. Would appreciate a pointer in the right direction.

Thanks!
Matt
 
  • #2
Matt said:
So we can visualize this as 60,000 "red marbles" and 199,940,000 "white marbles". Assuming that these combined 200,000,000 marbles are placed into a very large container and that 10,000,000 are selected randomly. I am trying to calculate:

1) The chance that none of the 10,000,000 marbles will be red;

Hi Matt! Welcome to MHB! :)

Suppose you pick 1 marble.
What is the chance that it won't be red?
What if you pick 2 marbles?
Or 10?
Or 10,000,000?
 
  • #3
I like Serena said:
Hi Matt! Welcome to MHB! :)

Suppose you pick 1 marble.
What is the chance that it won't be red?
What if you pick 2 marbles?
Or 10?
Or 10,000,000?
the chance of 1 marble being white is 199,940,000 / 200,000,000
the chance of the 2nd marble being white would seem to be 199,939,999 / 199,999,999.

No idea after that. I became a lawyer because I wasn't good at math. :) And have tried a couple online math tutors who have told me that the answer to my questions cannot be calculated given the size of the numbers. Appreciate any help!
 
  • #4
Matt said:
the chance of 1 marble being white is 199,940,000 / 200,000,000
the chance of the 2nd marble being white would seem to be 199,939,999 / 199,999,999.

No idea after that. I became a lawyer because I wasn't good at math. :) And have tried a couple online math tutors who have told me that the answer to my questions cannot be calculated given the size of the numbers. Appreciate any help!

Let's give those numbers a name.
Let's define $N=200,000,000$, $n=199,940,000$, and $k=10,000,000$.

So the chance of 1 marble being white is:
$$P(1\text{ white}) = \frac n N$$
For 2 marbles we get:
$$P(2\text{ white}) = \frac n N \frac{n-1}{N-1} = \frac {n(n-1)}{N(N-1)}$$
For $k$ marbles we get:
$$P(k\text{ white}) = \underbrace{\frac {n(n-1)\cdots(n-k+1)}{N(N-1)\cdots(N-k+1)}}_{k\text{ factors}}$$

Alternatively, we can use the general formula:
$$P = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}$$

The number of favorable outcomes is the number of ways we can choose $k$ white marbles from the $n$ white marbles.
This is $n \choose k$.

The total number of outcomes is the number of ways we can choose $k$ marbles from the total of $N$ marbles.
This is $N \choose k$.

So:
$$P(k\text{ white}) = \frac{n \choose k}{N \choose k}$$
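To see that the two forms agree, here is a minimal sketch in Python using the scaled-down numbers Matt mentioned (20,000 marbles, 6 red, 1,000 drawn), where exact arithmetic is still cheap. The function names are made up for this illustration. Note that the scaled-down answer is nowhere near the answer to the full-size problem, which is why dropping zeroes skews the result.

```python
from fractions import Fraction
from math import comb

def p_all_white_product(N, n, k):
    """k-factor product n/N * (n-1)/(N-1) * ... as an exact fraction."""
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(n - i, N - i)
    return p

def p_all_white_binomial(N, n, k):
    """Ratio of binomial coefficients C(n, k) / C(N, k) as an exact fraction."""
    return Fraction(comb(n, k), comb(N, k))

# Scaled-down problem (four zeroes removed): N marbles, n white, k drawn.
N, n, k = 20_000, 19_994, 1_000

p_prod = p_all_white_product(N, n, k)
p_binom = p_all_white_binomial(N, n, k)

print("forms agree:", p_prod == p_binom)
print("P(no red), scaled-down problem:", float(p_prod))
```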
 
  • #5
Matt said:
Hi, I'm trying to figure out how to compute probability related to a problem I am tackling for work, and I think I have a handle on how to do it with smaller numbers, but no idea how to approach it for larger numbers. (And I need to explain the answers to a judge in simple terms). So here is the problem:

Imagine a company that maintains data about 200,000,000 Americans. Each month, this company takes a completely random sample of 5% of these reports to analyze. However, we are concerned only with 60,000 specific people out of this group of 200,000,000.

So we can visualize this as 60,000 "red marbles" and 199,940,000 "white marbles". Assuming that these combined 200,000,000 marbles are placed into a very large container and that 10,000,000 are selected randomly. I am trying to calculate:

1) The chance that none of the 10,000,000 marbles will be red;

2) The chance that 40 or more of the 10,000,000 marbles will be red;

3) The chance that all of the possible 60,000 red marbles will be included in the 10,000,000 selected.

[snip]
The answers to your questions are
1) Effectively zero (less than $10^{-1336}$)
2) Effectively one (more than $1 - 40 \cdot 10^{-1246}$)
3) Effectively zero (less than $10^{-78136}$)

Let's start by assigning names to some of your numbers:
N = 200,000,000
K = 60,000
n = 10,000,000
If p(x) is the probability that the sample of n marbles contains exactly x red marbles, then
$$p(x) = \frac{\binom{K}{x} \binom{N-K}{n-x}}{\binom{N}{n}}$$
where $\binom{j}{k} = \frac{j!}{k! \; (j-k)!}$ is the number of ways to choose k items out of j, also known as a "binomial coefficient". See the Wikipedia article on the Hypergeometric distribution.
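As a sanity check on the formula for p(x), here is a small sketch that evaluates it exactly on a scaled-down problem (20,000 marbles, 6 red, 1,000 drawn; these small numbers are my own choice so that exact integer arithmetic stays cheap). The probabilities should sum to exactly 1, and the average number of red marbles should be nK/N.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, K, n):
    """Exact P(exactly x red) when n marbles are drawn from N, of which K are red."""
    return Fraction(comb(K, x) * comb(N - K, n - x), comb(N, n))

# Scaled-down illustration, not the real problem size.
N, K, n = 20_000, 6, 1_000

pmf = [hypergeom_pmf(x, N, K, n) for x in range(K + 1)]
total = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))

print("sum of p(x):", total)
print("mean red marbles:", mean, "vs n*K/N =", Fraction(n * K, N))
```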

The trouble is, as you have already pointed out, that it's not practical to calculate p(x) in this form with numbers as large as yours, so we must resort to some tricks. We will calculate the logarithm of p(x) instead of p(x) itself. This lets us work with much smaller numbers in the intermediate calculations, so they can be done in double-precision floating point, for example in Excel, which uses double precision. So the first trick is to take the natural logarithm (base e ≈ 2.71828) of the equation for p(x):
$$\ln(p(x)) = \ln\binom{K}{x} + \ln \binom{N-K}{n-x} - \ln\binom{N}{n}$$
The second trick is to use a mathematical function available in Excel for the evaluation of the logarithms of the binomial coefficients. We have
$$\ln \binom{j}{k} = \ln \left( \frac{j!}{k! \; (j-k)!} \right) = \ln(j!) - \ln(k!) - \ln((j-k)!)$$
so we need a convenient way to evaluate $\ln(t!)$ for large values of $t$. Fortunately, Excel provides a function GAMMALN, defined by $\text{GAMMALN}(t) = \ln( \Gamma (t))$, where $\Gamma(t)$ is the Gamma function. (See the Wikipedia article on the Gamma function.) Since $t! = \Gamma(t+1)$ for a positive integer $t$, we have $\text{GAMMALN}(t+1) = \ln(t!)$.
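The same trick works outside Excel. As a rough sketch, Python's standard `math.lgamma` plays the role of GAMMALN; the helper name `log_binom` is my own. The check against an exact binomial coefficient uses a moderate size where the exact value is easy to compute.

```python
from math import comb, lgamma, log

def log_binom(j, k):
    """ln C(j, k) via ln(t!) = lgamma(t + 1), mirroring GAMMALN(t + 1)."""
    return lgamma(j + 1) - lgamma(k + 1) - lgamma(j - k + 1)

# Compare with the exact coefficient at a size where that is still cheap.
exact = log(comb(1000, 400))
approx = log_binom(1000, 400)
print("exact ln C(1000, 400): ", exact)
print("lgamma-based estimate: ", approx)

# The same call stays cheap at the size of the real problem.
print("ln C(200000000, 10000000):", log_binom(200_000_000, 10_000_000))
```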

If we put this all together and evaluate ln(p(0)) in an Excel spreadsheet, we find $\ln(p(0)) = -3078.07$, so
$$\log_{10}(p(0)) = \frac{\ln(p(0))}{\ln(10)} = -1336.79$$
This shows that $p(0) < 10^{-1336}$, the answer to your first question.
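For anyone who wants to reproduce this without a spreadsheet, here is a sketch of the same calculation in Python, with `math.lgamma` standing in for GAMMALN. The helper names are my own. The cross-check at the end is an addition of mine, not part of the spreadsheet method: p(0) can also be written as the product of (N - n - i)/(N - i) for i = 0, ..., K-1, so its logarithm is a sum of K small terms.

```python
from math import lgamma, log, log1p

# Numbers from the post: population, red marbles, sample size.
N, K, n = 200_000_000, 60_000, 10_000_000

def log_binom(j, k):
    """ln C(j, k) using ln(t!) = lgamma(t + 1)."""
    return lgamma(j + 1) - lgamma(k + 1) - lgamma(j - k + 1)

def log_p(x):
    """Natural log of the probability of exactly x red marbles in the sample."""
    return log_binom(K, x) + log_binom(N - K, n - x) - log_binom(N, n)

ln_p0 = log_p(0)
log10_p0 = ln_p0 / log(10)

# Independent cross-check: sum the logs of the K factors (N - n - i) / (N - i),
# each written as 1 - n / (N - i) so log1p can be used.
ln_p0_check = sum(log1p(-n / (N - i)) for i in range(K))

print(f"ln p(0)    = {ln_p0:.2f} (cross-check {ln_p0_check:.2f})")
print(f"log10 p(0) = {log10_p0:.2f}")
```

Both routes should land close to the spreadsheet values quoted above, up to floating-point rounding.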

For question 2), note that the probability that the sample of n marbles will not contain at least 40 red marbles is
$$\sum_{x=0}^{39} p(x)$$
If we go through the same steps as above to evaluate $\log_{10}(p(x))$ for $x = 0, 1, 2, \dots , 39$, we find the largest number in the sequence is $\log_{10}(p(39)) = -1246.62$. This shows $p(x) < 10^{-1246}$ for $x = 0, 1, 2, \dots , 39$, so
$$\sum_{x=0}^{39} p(x) < 40 \cdot 10^{-1246}$$
Since this is the probability that the sample does not contain at least 40 red marbles, the probability that the sample does contain at least 40 red marbles is greater than $1 - 40 \cdot 10^{-1246}$.
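Here is a sketch of the question 2 calculation in Python, again assuming `math.lgamma` in place of GAMMALN and using helper names of my own. Besides locating the largest term, it adds the 40 terms in log space (factoring out the largest one so nothing underflows), which gives a slightly tighter figure than the simple bound of 40 times the largest term.

```python
from math import lgamma, log, log10

# Numbers from the post: population, red marbles, sample size.
N, K, n = 200_000_000, 60_000, 10_000_000

def log_binom(j, k):
    """ln C(j, k) using ln(t!) = lgamma(t + 1)."""
    return lgamma(j + 1) - lgamma(k + 1) - lgamma(j - k + 1)

def log10_p(x):
    """log10 of the probability of exactly x red marbles in the sample."""
    ln_p = log_binom(K, x) + log_binom(N - K, n - x) - log_binom(N, n)
    return ln_p / log(10)

terms = [log10_p(x) for x in range(40)]
largest = max(terms)
largest_x = terms.index(largest)

# Sum in log space: factor out the largest term, add the rescaled terms.
log10_sum = largest + log10(sum(10 ** (t - largest) for t in terms))

print(f"largest term: x = {largest_x}, log10 p(x) = {largest:.2f}")
print(f"log10 of P(fewer than 40 red) = {log10_sum:.2f}")
```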

For question 3), we use the same method to evaluate the logarithm of p(60,000), and we find $\log_{10}(p(60,000)) = -78136.22$, so $p(60,000) < 10 ^{-78136}$, which is a small number indeed.
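The question 3 figure can be reproduced the same way. This is a sketch with `math.lgamma` standing in for GAMMALN and helper names of my own; the only change from the p(0) case is that x is set to K = 60,000.

```python
from math import lgamma, log

# Numbers from the post: population, red marbles, sample size.
N, K, n = 200_000_000, 60_000, 10_000_000

def log_binom(j, k):
    """ln C(j, k) using ln(t!) = lgamma(t + 1)."""
    return lgamma(j + 1) - lgamma(k + 1) - lgamma(j - k + 1)

def log10_p(x):
    """log10 of the probability of exactly x red marbles in the sample."""
    ln_p = log_binom(K, x) + log_binom(N - K, n - x) - log_binom(N, n)
    return ln_p / log(10)

# All K red marbles in the sample means x = K.
log10_p_all = log10_p(K)
print(f"log10 p({K}) = {log10_p_all:.2f}")
```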

There are other ways to approach the problem. For example, we could approximate the Hypergeometric distribution with a Normal distribution. But I don't know how to bound the error of the approximation, so that method might be less convincing in court.
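Even without trusting the Normal approximation for the actual probabilities, its two ingredients (the mean and standard deviation of the hypergeometric distribution) are useful for explaining the result in plain terms. The sketch below only computes how far 40 sits from the expected number of red marbles; it makes no claim about the accuracy of a Normal tail probability that far out. The variable names are my own.

```python
from math import sqrt

# Numbers from the post: population, red marbles, sample size.
N, K, n = 200_000_000, 60_000, 10_000_000

# Standard mean and variance of the hypergeometric distribution.
mean = n * K / N
variance = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
sd = sqrt(variance)

# Distance of "39 or fewer" from the mean, in standard deviations
# (39.5 is the usual continuity correction).
z = (39.5 - mean) / sd

print(f"expected red marbles in the sample: {mean:.0f}")
print(f"standard deviation: {sd:.1f}")
print(f"39.5 is {abs(z):.1f} standard deviations below the mean")
```

In plain terms: the sample is expected to contain about 3,000 of the 60,000 people, so seeing fewer than 40 would be dozens of standard deviations away from what is typical.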

[edit] I changed some of the variable names, because in some cases I had used the same name for two different purposes in the original post. I hope this version is less confusing.[/edit]
 

