Calculating probability that random subset of population contains duplicates

In summary: assuming each of the s samples is drawn independently and uniformly at random, with replacement, from a population of size p, the expected number of distinct individuals seen is p(1 - (1 - 1/p)^s), so the expected number of duplicate draws is s - p(1 - (1 - 1/p)^s). In the example below, taking 3 million samples from a population of 3 million gives roughly 1.1 million duplicate draws, while taking 300,000 samples gives only about 14,500.
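A minimal R sketch of that calculation (assuming each of the s draws is made independently and uniformly, with replacement, from the p individuals; the function name expected_duplicates is just for illustration):

# Expected number of duplicate draws when sampling s times with replacement
# from a population of size p: E[duplicates] = s - p*(1 - (1 - 1/p)^s)
expected_duplicates <- function(s, p) {
  s - p * (1 - (1 - 1 / p)^s)
}

p <- 3e6                       # population of 3 million
expected_duplicates(3e6, p)    # about 1.1 million duplicate draws
expected_duplicates(3e5, p)    # about 14,500 duplicate draws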
  • #1
mads
Hi,

Apologies that this is a basic question but I have to start somewhere! (-:

The problem is succinctly stated in the message title, but in greater detail: I'm working with some biological data from which samples have been taken. The sampling should have been done at random, and the samples include duplicates. What I need to know is how to calculate the expected number of duplicates in a sample of a given size drawn from a population of a given size.

For example, if I have a population size, p, of 3 million and take 3 million samples, s, then the extent of duplication within the samples would be expected to be greater than if I take 300,000 samples.

But how do I calculate the expected rate given various values of p and s?
I have access to R & should be able to find my way to any libraries which might be helpful in answering this. Thanks

m
 
  • #2
mads said:
But how do I calculate the expected rate given various values of p and s?
If I understand the problem correctly, then I think you should take a look at the hypergeometric distribution (use your preferred search engine).
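If that turns out to be the right model (drawing without replacement and counting how many members of some subgroup land in the sample), base R already provides it through dhyper/phyper. A small sketch; the split of the population into 1,000 "marked" and 2,999,000 "unmarked" individuals is purely illustrative:

m <- 1000        # marked individuals in the population (made-up number)
n <- 3e6 - m     # unmarked individuals
k <- 3e5         # sample size, drawn without replacement

dhyper(2, m, n, k)       # P(exactly 2 marked individuals appear in the sample)
1 - phyper(1, m, n, k)   # P(at least 2 appear, i.e. the subgroup shows up more than once)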
 
  • #3
Hi Mads,

What do you mean by a "duplicate"? Do you mean it's like you caught a fish, threw it back into the lake, and then caught the same fish again? Or is it like catching another fish of the same species? And to pursue the fishing analogy further, do you return the fish to the lake ("sampling with replacement"), or do you keep it ("sampling without replacement")?
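Either reading of "duplicate" is easy to explore by simulation. A rough R sketch, where a duplicate means catching the same individual fish twice, and the sizes are the original poster's numbers scaled down so it runs quickly:

set.seed(1)
p <- 1e5   # population size (scaled down from 3 million for speed)
s <- 1e4   # number of draws

# Sampling WITH replacement: the same individual can be caught more than once
draws <- sample(p, s, replace = TRUE)
sum(duplicated(draws))   # number of repeat catches in this simulated sample

# Sampling WITHOUT replacement: repeats are impossible by construction
draws2 <- sample(p, s, replace = FALSE)
sum(duplicated(draws2))  # always 0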
 

FAQ: Calculating probability that random subset of population contains duplicates

What is the formula for calculating the probability of a random subset containing duplicates?

If the subset of size k is formed by k independent uniform draws (that is, sampling with replacement) from a population of size n, the probability that it contains at least one duplicate is P(duplicates) = 1 - n! / (n^k * (n - k)!). If the subset is instead drawn without replacement, duplicates cannot occur at all.

Can you provide an example of how to use the formula to calculate the probability?

For example, if we have a population of 10 people and we draw a subset of 5 with replacement, the probability of that subset containing at least one duplicate is P(duplicates) = 1 - 10! / (10^5 * (10 - 5)!) = 1 - 0.3024 ≈ 0.698.
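The same number can be checked in R, both from the formula and by brute-force simulation (a sketch using only base functions):

n <- 10   # population size
k <- 5    # subset size, drawn with replacement

# Exact: P(at least one duplicate) = 1 - n! / (n^k * (n - k)!)
1 - factorial(n) / (n^k * factorial(n - k))   # 0.6976

# Monte Carlo check
mean(replicate(1e5, any(duplicated(sample(n, k, replace = TRUE)))))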

How does the size of the population and subset affect the probability of duplicates?

For a fixed subset size, a larger population lowers the probability of duplicates, because there are more distinct items to choose from. For a fixed population, a larger subset raises the probability, because there are more chances of drawing the same item twice.
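Both effects can be seen directly from the with-replacement formula above; a sketch (the helper birthday_prob and the example sizes are just for illustration):

# Log-scale version of 1 - n!/(n^k * (n-k)!) to avoid overflow for large n
birthday_prob <- function(n, k) 1 - exp(lfactorial(n) - k * log(n) - lfactorial(n - k))

# Growing the subset k with the population fixed at n = 1000 raises the probability...
sapply(c(5, 10, 20, 50), function(k) birthday_prob(1000, k))

# ...while growing the population n with the subset fixed at k = 10 lowers it
sapply(c(100, 1000, 10000), function(n) birthday_prob(n, 10))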

Is there a way to decrease the probability of duplicates in a random subset?

Yes. You can decrease the size of the subset, increase the size of the population (so there are more unique options to choose from), or, where the setting allows it, sample without replacement, which rules out duplicates entirely.

Are there any real-world applications for calculating the probability of duplicates in a random subset?

Yes, this concept is often used in statistical analysis and data mining in order to determine the likelihood of duplicate data points in a sample. This can be helpful in identifying errors or anomalies in a dataset, or in predicting the accuracy of a sample in representing the larger population.
