[University Introductory Statistics] DNA crime scene

  • #1
eskimotaro
Hello everyone. I have been given a problem in my Introductory Mathematical Statistics class. Been thinking about this one for a while and I am simply stuck.

1. Homework Statement

"There has been found a DNA of type S on a crime scene. We will assume a total population of N = 5000000 that are potential contributors to the lead. Next assume there is a DNA-database consisting of n = 30000 individuals. Also assume that there are M = 50 individuals in the whole population that have a DNA of type S."

There are six sub-questions (a)-(f), and I am stuck on (d)-(f). I will simply explain what questions (a)-(c) are, and then write up questions (d)-(f).

2. The attempt at a solution [part 1]

In (a) we let X = the number of individuals with type S in the database. Here I am to find the probability distribution of X. I think that the sample space of X must be {0, 1, 2, ..., 50}. To calculate the distribution of X I have used MATLAB and a hypergeometric distribution formula. That was no problem.

In (b) I am to use a binomial distribution formula instead to calculate the probability distribution of X; that was also not much of a problem.

For (c) I am just asked to calculate P(X = 1), which was just to take the relevant calculation from (a) or (b). P(X = 1) is approximately 0.22.
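As a cross-check of the value in (c), here is a minimal Python sketch. The thread's own calculation was done in MATLAB and is not shown, so the helper names `hypergeom_pmf` and `binom_pmf` are illustrative assumptions. It evaluates P(X = 1) exactly with the hypergeometric formula and approximately with a binomial using n = 30000 draws and success probability M/N; which binomial the course intends in (b) is also an assumption here.

```python
from fractions import Fraction
from math import comb

N, n, M = 5_000_000, 30_000, 50  # population, database size, carriers of type S

def hypergeom_pmf(k, pop, marked, draws):
    # Exact P(k of the `marked` individuals fall in a random subset of size `draws`).
    # Uses the symmetric form C(draws,k)*C(pop-draws,marked-k)/C(pop,marked), which
    # equals the textbook C(marked,k)*C(pop-marked,draws-k)/C(pop,draws) but keeps
    # the binomial coefficients small when `marked` is small.
    return Fraction(comb(draws, k) * comb(pop - draws, marked - k), comb(pop, marked))

def binom_pmf(k, trials, p):
    # Binomial probability of exactly k successes in `trials` independent draws.
    return comb(trials, k) * p**k * (1 - p)**(trials - k)

p_hyper = float(hypergeom_pmf(1, N, M, n))
p_binom = binom_pmf(1, n, M / N)
print(round(p_hyper, 4), round(p_binom, 4))
```

Both values round to 0.22, in line with the figure quoted for (c).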

3. Sub-questions

Here are the sub-questions (d)-(f) which I am stuck on:

"(d) Assume that every individual in the population have the same likelihood of being a contributor. Let A be the event that the contributor is one of the individuals in the database. Calculate P(A).

(e) Find P(X = 1 | A).

Hint: When we know that the contributor is in the database, there are M - 1 = 49 left for whom we do not know whether they are in the database or not. Argue that we are then interested in the probability that none of these are in the database.

(f) Find P(A | X = 1). Argue that this corresponds to the probability that the individual with the matching DNA profile in the database is the culprit."

4. The attempt at a solution [part 2]

I have just not been able to get past these questions. For (d) I think that P(A) might be 1/30000, because that's simply how I interpret the question.

So I would be forever grateful if anyone could give me tips on how to solve this. Excuse my language if anything is unclear; English is my second language.
 
  • #2
For (d), you can ignore the DNA completely. Your contributor is one random individual out of 5 million. What is the probability that this individual is within the database of 30000?

For (e), you have 4999999 individuals that have to be arranged somehow. The problem is similar to (c).
 
  • #3
Thank you for your reply. :)

For (d) I guess the answer must be P(A) = 30000/5000000 = 3/500?

Not sure about (e) though. P(X = 1 | A) means the randomly selected person has the DNA, given that he is a contributor? Does it mean the answer is 29999/4999999? The hint sort of confuses me.
 
  • #4
eskimotaro said:
For (d) I guess the answer must be P(A) = 30000/5000000 = 3/500?
I think so.
eskimotaro said:
Not sure about (e) though. P(X = 1 | A) means the randomly selected person has the DNA, given that he is a contributor?
No, (A) already includes that the contributor (who has the DNA by definition) is in the database. If that 3/500 event happens, how likely is it that no one else in the database has the same DNA signature?
 
  • #5
Hmm. So if no one else in the database is to have the same DNA, then they cannot be among the 50 that have S. But we have already included one. So there are 49 persons with S that need to be outside the database?

That's 29999 left in the database, and we need to exclude 49? But the rest of the population also need to be arranged. Meaning (29999 - 49) / 4999999?
 
  • #6
eskimotaro said:
So there are 49 persons with S that need to be outside the database?
Sure.
eskimotaro said:
That's 29999 left in the database, and we need to exclude 49? But the rest of the population also need to be arranged. Meaning (29999 - 49) / 4999999?
I don't understand where that term comes from.

If you have a population of 4999999 where 49 people have the specific DNA signature, what is the probability that none of them are in a (randomly chosen) subset of 29999 of the 4999999 people?
 
  • #7
mfb said:
If you have a population of 4999999 where 49 people have the specific DNA signature, what is the probability that none of them are in a (randomly chosen) subset of 29999 of the 4999999 people?

Gotta admit I'm sort of lost here. So if we have 49 persons with DNA S in a population of 4999999. What's the probability that none of them are among the 29999 left in the database? Should that be (4999999 - 29999) / 4999999?

The hint says that we are somehow interested in the probability of the 49 not being in the database, so I'm thinking that number needs to be incorporated somehow.

Truly appreciate your help here!
 
  • #8
eskimotaro said:
What's the probability that none of them are among the 29999 left in the database? Should that be (4999999 - 29999) / 4999999?
No.
You solved the same problem in (a) and (c) already, just with slightly different numbers.
 
  • #9
mfb said:
You solved the same problem in (a) and (c) already, just with slightly different numbers.

Oh, do you suggest I use the hypergeometric or binomial formula again? So if I use a binomial formula with p = 49/4999999 and n = 29999 I get the following:

[itex]29999*(\frac{49}{4999999})^{1}*\left(1-\frac{49}{4999999}\right)^{29999-1}[/itex]

Which equals about 0.22 again.
 
  • #10
Sure. But you need the probability that no one is in the sample, instead of 1 (what you calculated).
The X=1 comes from your criminal already.
 
  • #11
Then it doesn't seem like there's much of a difference between P(X = 1 | A) and P(X = 1). Would you say that's correct? Scratch that; I wrote it before your edit.

mfb said:
But you need the probability that no one is in the sample, instead of 1 (what you calculated).
The X=1 comes from your criminal already.

Ah, of course. So instead I should look for the probability of there being 0, but with the numbers excluding the 1 criminal?

[itex]\binom{29999}{0}*(\frac{49}{4999999})^{0}*\left(1-\frac{49}{4999999}\right)^{29999}[/itex]

Which is 0.7453.
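The two binomial values quoted in this exchange can be reproduced with a short Python sketch. It assumes the same parameters as the post (29999 trials, success probability 49/4999999); `binom_pmf` is an illustrative helper, not something from the thread. Note that the prefactor is the binomial coefficient, which equals 1 for zero successes and 29999 for one success.

```python
from math import comb

def binom_pmf(k, trials, p):
    # Binomial probability of exactly k successes in `trials` independent draws.
    return comb(trials, k) * p**k * (1 - p)**(trials - k)

trials, p = 29_999, 49 / 4_999_999  # numbers used in the post above

p_none = binom_pmf(0, trials, p)  # prefactor C(29999, 0) = 1
p_one = binom_pmf(1, trials, p)   # prefactor C(29999, 1) = 29999
print(round(p_none, 4), round(p_one, 4))
```

The first value is the one needed here (about 0.7453); the second is the earlier "about 0.22" figure.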

(f) I might be able to figure out using Bayes' theorem I think.
 
  • #12
(f) Probably, but there is also a shorter direct approach. You can even do both to cross-check earlier results.
 
  • #13
I think I have thought too much about this problem lately, I'm not sure if what I did above even is correct. But it does make sense I think.

mfb said:
(f) Probably, but there is also a shorter direct approach. You can even do both to cross-check earlier results.

I'm thinking

[itex]P( A \mid X = 1) = \frac{P(X = 1 \mid A)*P(A)}{P(X = 1 \mid A)*P(A) + P(X = 1 \mid A^{c})*P(A^{c})}[/itex]

Then I need to find

[itex]P(X = 1 \mid A^{c})[/itex]

And that seems a bit tricky.
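One way to get the missing term, sketched in Python under the assumption that the database is a uniformly random subset of the population: if the contributor is not in the database, all 30000 entries come from the other 4999999 people, 49 of whom have type S, so X given A^c is again hypergeometric. The helper name `hypergeom_pmf` is illustrative, and exact fractions are used so that rounding does not blur the result.

```python
from fractions import Fraction
from math import comb

N, n, M = 5_000_000, 30_000, 50  # population, database size, carriers of type S

def hypergeom_pmf(k, pop, marked, draws):
    # Exact P(k of the `marked` individuals fall in a random subset of size `draws`),
    # written in the symmetric form that keeps the binomial coefficients small.
    return Fraction(comb(draws, k) * comb(pop - draws, marked - k), comb(pop, marked))

p_A = Fraction(n, N)

# Given A: the contributor fills one entry; none of the other 49 carriers
# may be among the remaining n - 1 entries drawn from N - 1 people.
p_X1_given_A = hypergeom_pmf(0, N - 1, M - 1, n - 1)

# Given A^c: all n entries come from the other N - 1 people, and exactly
# one of the 49 remaining carriers must be among them.
p_X1_given_Ac = hypergeom_pmf(1, N - 1, M - 1, n)

numerator = p_X1_given_A * p_A
posterior = numerator / (numerator + p_X1_given_Ac * (1 - p_A))
print(posterior)  # 1/50
```

With these inputs Bayes' theorem gives exactly 1/50.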
 
  • #14
There are 50 people with the right DNA. One of them is in the database. What is the probability that this one is the criminal?
 
  • #15
mfb said:
There are 50 people with the right DNA. One of them is in the database. What is the probability that this one is the criminal?

Is this for the (e) or (f) question? I'm sorry the language confuses me sometimes.

EDIT:

Should I try to figure out the probability that 49 samples in the database are not type S? Can I use:

[itex]P(S^{c}) = 1 - \frac{50}{5000000}[/itex]

somehow?
 
  • #16
eskimotaro said:
Is this for the (e) or (f) question? I'm sorry the language confuses me sometimes.
For (f). It should be a simple question.

I have 50 apples, 49 of them are green and one of them is red. I give you a random apple. What is the probability that the apple is red?

eskimotaro said:
Should I try to figure out the probability that 49 samples in the database are not type S? Can I use:

[itex]P(S^{c}) = 1 - \frac{50}{5000000}[/itex]

somehow?
I don't understand why you combine numbers like that.
Out of the 30000 samples, most samples (at least 29950...) are not of type S.
 
  • #17
mfb said:
I have 50 apples, 49 of them are green and one of them is red. I give you a random apple. What is the probability that the apple is red?

This one is simply [itex]\frac{1}{50}[/itex]. But I'm not sure what to do with that. This presupposes that I am already choosing among the ones I know have DNA type S. Or is that exactly the meaning of P(A | X = 1)?

EDIT: P(A | X = 1) is the probability that the criminal is chosen, given that we are already choosing among the individuals with DNA type S. So yes, then P(A | X = 1) must be [itex]\frac{1}{50}[/itex], or am I understanding it wrong? I thought I was supposed to be using Bayes' on (f).

EDIT 2: Question (f) has a note which reads: "Here your answer may differ, depending on if you have used numerical values for P(A), P(X = 1 | A), and P(X = 1), or if you have done the calculation algebraically expressed by N, M and n."

Which makes me think that P(A | X = 1) can not be just [itex]\frac{1}{50}[/itex]?

EDIT3: I've been thinking more about (e) now.

P(X = 1 | A) is the probability of there being exactly one person with type S in the database, given that the culprit is in the database. Doesn't that mean that P(X = 1 | A) = P(X = 1)?
 
  • #18
eskimotaro said:
This presupposes that I am already choosing among the ones I know have DNA type S. Or is that exactly the meaning of P(A | X = 1)?
Out of 50 people with the DNA signature, your DNA database has exactly one person (X=1). What is the probability that your criminal (one out of 50 with the DNA signature) is this person?
Yes, it is exactly what you need.

eskimotaro said:
P(X = 1 | A) is the probability of there being exactly one person with type S in the database, given that the culprit is in the database. Doesn't that mean that P(X = 1 | A) = P(X = 1)?
No. The knowledge that the criminal is in the database does influence the distribution.
As a more striking example, compare P(X = 0 | A) and P(X = 0).
 
  • #19
mfb said:
Out of 50 people with the DNA signature, your DNA database has exactly one person (X=1). What is the probability that your criminal (one out of 50 with the DNA signature) is this person?
Yes, it is exactly what you need.

If there is exactly one person with DNA type S in the database, then the probability of picking that person is [itex]\frac{1}{30000}[/itex]. That's the answer then I believe? Is that also what you are hinting at?

[itex]P(A | X = 1) = \frac{1}{30000}[/itex]

mfb said:
No. The knowledge that the criminal is in the database does influence the distribution.
As a more striking example, compare P(X = 0 | A) and P(X = 0).

Will have to think more about this one.
 
  • #20
eskimotaro said:
If there is exactly one person with DNA type S in the database, then the probability of picking that person is [itex]\frac{1}{30000}[/itex]. That's the answer then I believe? Is that also what you are hinting at?
That would be the answer to "if you pick a random person out of the database, what would be the probability to pick a specific one". That is not the problem statement.
 
  • #21
mfb said:
That would be the answer to "if you pick a random person out of the database, what would be the probability to pick a specific one". That is not the problem statement.

But you're saying that the answer isn't [itex]\frac{1}{50}[/itex] either? Given that there is exactly one person with DNA type S in the database, and knowing that there are 50 individuals with type S, the probability that the individual with matching DNA is the culprit is [itex]\frac{1}{50}[/itex]?

EDIT: Or do you say that I have to use the result from (d) [itex]P(A) = \frac{3}{500}[/itex] somehow?

EDIT2:
mfb said:
No. The knowledge that the criminal is in the database does influence the distribution.
As a more striking example, compare P(X = 0 | A) and P(X = 0).

About that. Given that the contributor is in the database, there could also be further samples of S-type DNA in the database, right? The question is asking for the probability of there being only one sample of S-type DNA in the database (which must be from the contributor since that is given). The result P(X = 1 | A) = P(X = 1) obviously cannot be applied to the case X = 0, given A, the reason being that event A is defined as 'the contributor is one of the individuals in the database'. The knowledge that the criminal is in the database influences the probability distribution of X only to the extent that X = 0 is no longer part of the distribution.
 
  • #22
eskimotaro said:
But you're saying that the answer isn't [itex]\frac{1}{50}[/itex] either?
It is 1/50.

eskimotaro said:
About that. Given that the contributor is in the database, there could also be further samples of S-type DNA in the database, right?
In general, yes.

eskimotaro said:
The question is asking for the probability of there being only one sample of S-type DNA in the database (which must be from the contributor since that is given). The result P(X = 1 | A) = P(X = 1) obviously cannot be applied to the case X = 0, given A, the reason being that event A is defined as 'the contributor is one of the individuals in the database'. The knowledge that the criminal is in the database influences the probability distribution of X only to the extent that X = 0 is no longer part of the distribution.
That result is wrong. It is an approximation, but not exact for any X.
Imagine it would be exact: We know that P(X = 0| A) + P(X = 1| A) + ... + P(X = 50| A) = 1 (because one of the cases has to be true), but we also know that P(X=0) + P(X=1) + ... + P(X=50) = 1. If one of them is not equal (as we established for X=0), then something else has to differ as well. And all of them differ.
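To see the difference numerically, a small Python sketch can tabulate both distributions. It assumes that, given A, the contributor occupies one database entry while the other 49 carriers are spread over the remaining 4999999 people and 29999 entries; `hypergeom_pmf` is an illustrative helper, not something from the thread.

```python
from fractions import Fraction
from math import comb

N, n, M = 5_000_000, 30_000, 50  # population, database size, carriers of type S

def hypergeom_pmf(k, pop, marked, draws):
    # Exact P(k of the `marked` individuals fall in a random subset of size `draws`).
    return Fraction(comb(draws, k) * comb(pop - draws, marked - k), comb(pop, marked))

# Unconditional distribution P(X = k), k = 0..50.
uncond = [hypergeom_pmf(k, N, M, n) for k in range(M + 1)]

# Conditional distribution P(X = k | A): X = 0 is impossible, and for k >= 1
# exactly k - 1 of the other 49 carriers sit in the remaining n - 1 entries.
cond = [Fraction(0)] + [
    hypergeom_pmf(k - 1, N - 1, M - 1, n - 1) for k in range(1, M + 1)
]

for k in range(4):
    print(k, round(float(uncond[k]), 4), round(float(cond[k]), 4))
```

Both lists sum to 1, yet no entry of one equals the matching entry of the other.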
 
  • #23
mfb said:
That result is wrong. It is an approximation, but not exact for any X.
Imagine it would be exact: We know that P(X = 0| A) + P(X = 1| A) + ... + P(X = 50| A) = 1 (because one of the cases has to be true), but we also know that P(X=0) + P(X=1) + ... + P(X=50) = 1. If one of them is not equal (as we established for X=0), then something else has to differ as well. And all of them differ.

Alright, I see. So I thought more about (e) based on what you said earlier and the hint. Using a hypergeometric formula, I calculated this:

[itex]\frac{\binom{49}{0}\binom{4999950}{29999}}{\binom{4999999}{29999}}[/itex]

Which [itex]\approx 0.7446[/itex]
That's the probability that 0 of the 49 left are in the database.
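The same number can be checked without large binomial coefficients. A minimal Python sketch, assuming the 49 remaining carriers are placed one at a time among the 4999999 other people, multiplies the chances that each one lands outside the 29999 remaining database entries. The variable names are illustrative.

```python
from math import prod

N, n, M = 5_000_000, 30_000, 50
others, slots, carriers = N - 1, n - 1, M - 1  # 4999999 people, 29999 entries, 49 carriers

# When carrier number i is placed, `others - i` positions are still free and
# `others - slots - i` of them lie outside the database.
p_none = prod((others - slots - i) / (others - i) for i in range(carriers))
print(round(p_none, 4))  # about 0.7446
```

This product is algebraically the same as the hypergeometric expression with 0 successes, just arranged so that every factor is an ordinary floating-point number.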

EDIT:

Sorry but I have another question regarding (f). Since A is the event that the contributor is one of the individuals in the database, and given that there is only one individual with DNA type S in the database. Does that not mean that the contributor is indeed in the database, so that:

[itex]P(A\ \mid \ X = 1) = 1[/itex]?
 
  • #24
Your text would fit to ##P(A\ \mid \ X = 1 \land A) = 1## which is true and trivial.
##P(A\ \mid \ X = 1)## is not one because you could have someone else with the DNA signature in the database, and your criminal not in the database.
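A rough Monte Carlo sketch in Python illustrates this point. It assumes people are labelled 0 to N - 1, that labels below 30000 form the database, and that the first sampled carrier is the criminal; these labelling choices, the seed, and the trial count are illustrative only.

```python
import random

N, n, M = 5_000_000, 30_000, 50  # population, database size, carriers of type S
rng = random.Random(0)           # fixed seed so the run is repeatable

trials = 40_000
x_is_one = 0        # runs in which exactly one carrier is in the database
culprit_in_db = 0   # of those, runs in which that carrier is the criminal

for _ in range(trials):
    carriers = rng.sample(range(N), M)  # 50 distinct carriers; carriers[0] is the criminal
    in_db = [c < n for c in carriers]   # labels 0..n-1 play the role of the database
    if sum(in_db) == 1:
        x_is_one += 1
        culprit_in_db += in_db[0]

estimate = culprit_in_db / x_is_one     # estimate of P(A | X = 1)
print(x_is_one, estimate)
```

The estimate comes out far below 1, because in most runs with X = 1 the single match in the database is one of the other 49 carriers rather than the criminal.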
 
  • #25
mfb said:
Your text would fit to ##P(A\ \mid \ X = 1 \land A) = 1## which is true and trivial.
##P(A\ \mid \ X = 1)## is not one because you could have someone else with the DNA signature in the database, and your criminal not in the database.

Ah, that makes a lot of sense! Guess I misunderstood it.

What do you think about my calculation for (e)? That's the probability that 0 of the 49 are in the database, which is what the hint says we're interested in. Is that what I am looking for?
 
  • #27
What I mean is, the probability that 0 of 49 people with DNA type S not being in the database should be the same as what I am looking for? The probability that there is exactly one person with DNA type S, given that the contributor is in the database. Since I have already accounted for the 1 person not being outside by calculating what I did above.
 
  • #28
eskimotaro said:
What I mean is, the probability that 0 of 49 people with DNA type S not being in the database should be the same as what I am looking for?
There is a duplicate negation, but I guess you mean the right thing.
eskimotaro said:
Since I have already accounted for the 1 person not being outside by calculating what I did above.
Right.
 
  • #29
Fantastic! Then I guess I am done. Thank you so much for your help. Truly appreciate it.

The Physics forums seems like a very interesting place, I think I will stick around and lurk. :)
 
