- #1
geo101
I’m working on ways to assess whether data selection methods can isolate accurate results (from a control data set) better than random selection does, and to compare the relative performance of different methods.
In the data set we have multiple specimens, and each specimen yields multiple results. For random selection, suppose that randomly selecting 1 result from specimen #1 has a probability P1 of giving an accurate result. If we randomly select 1 result from each specimen (with probabilities P1, P2, P3, … Pn, for n specimens), what is the probability (Pf) of having an accurate result in this final data set?
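To pin down what Pf could mean, here is a minimal sketch. It assumes the specimens are independent of each other, the probabilities are made-up illustration values, and the function names are hypothetical. Which of these three quantities is the "right" Pf is part of what I'm asking.

```python
import math


def expected_accurate_fraction(probs):
    """Expected fraction of accurate results when 1 result is drawn per specimen.

    This is also the chance that a result picked at random from the
    final data set is accurate.
    """
    return sum(probs) / len(probs)


def prob_all_accurate(probs):
    """Probability that every drawn result is accurate (assumes independence)."""
    return math.prod(probs)


def prob_at_least_one_accurate(probs):
    """Probability that at least one drawn result is accurate (assumes independence)."""
    return 1.0 - math.prod(1.0 - p for p in probs)


# Hypothetical per-specimen probabilities P1..Pn, for illustration only.
specimen_probs = [0.9, 0.5, 0.7, 0.6]

print(expected_accurate_fraction(specimen_probs))
print(prob_all_accurate(specimen_probs))
print(prob_at_least_one_accurate(specimen_probs))
```

The same functions apply unchanged to the selected set P1′ … Pm′, which would give the corresponding versions of Pf′.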
Now suppose we select the data in some way (that we hope/think/pray will reject inaccurate results). At the specimen level the probabilities become P1′, P2′, P3′, … Pm′, where m <= n. This means that some specimens may yield no acceptable results. The probability of having an accurate result in this final data set is now Pf′.
How would we assess if our selection process is increasing our chances of obtaining an accurate result? What is the best way to compare Pf and Pf′ (or Pf′ and Pf′′ obtained from two different selection processes), and what factors should we consider?
This is where I get into a philosophical debate with one of my colleagues (neither of us is a statistician). His argument is that as long as Pf′ > Pf the data selection is an improvement. My argument is that the significance of the difference between Pf and Pf′ depends on m, and that the smaller the difference, the larger m must be for it to be meaningful.
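To illustrate why I think m matters, here is a rough sketch using the textbook standard error for the difference between two proportions. It treats Pf and Pf′ as proportions observed over n and m independent results, each with a common probability, which is a simplification of our setup since the per-specimen probabilities differ. The function name and the numbers are hypothetical.

```python
import math


def z_score_for_difference(pf, n, pf_selected, m):
    """Approximate z-score for the difference between two proportions.

    pf is observed over n results, pf_selected over m results.
    Assumes independent results and a normal approximation, so it is
    only a rough guide when n or m is small.
    """
    standard_error = math.sqrt(
        pf * (1.0 - pf) / n + pf_selected * (1.0 - pf_selected) / m
    )
    return (pf_selected - pf) / standard_error


# Same apparent improvement (0.6 -> 0.8), but different numbers of
# surviving specimens m. All values are made up for illustration.
for m in (5, 20, 50):
    print(m, z_score_for_difference(pf=0.6, n=50, pf_selected=0.8, m=m))
```

With the same Pf and Pf′, the z-score grows as m grows, which is the sense in which I think a given improvement is more convincing when more specimens survive the selection.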
His view is that it doesn't matter what m is, as long as the final result is accurate. My opinion is that this is only possible if Pf′ = 1 (i.e., we can reject all inaccurate results), and even then only if this can be demonstrated to be universally true (I’m pretty sure that is impossible).
Also, I think a balance needs to be struck: if m is too small, the uncertainty of the final result (the average of the selected results) becomes so large that we cannot do anything meaningful with it.
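As a concrete version of that worry, here is a minimal sketch of the standard error of the mean, which shrinks as 1/sqrt(m). It assumes the selected results are independent and share a common scatter; the function name and the values are hypothetical.

```python
import math
import statistics


def mean_and_standard_error(results):
    """Return (mean, standard error of the mean) for the selected results.

    Uses the sample standard deviation divided by sqrt(m), so it needs
    at least 2 results.
    """
    m = len(results)
    if m < 2:
        raise ValueError("need at least 2 results to estimate the scatter")
    return statistics.mean(results), statistics.stdev(results) / math.sqrt(m)


# Made-up results surviving two hypothetical selection processes.
strict_selection = [41.0, 47.5, 44.0]
loose_selection = [41.0, 47.5, 44.0, 45.5, 42.0, 46.0, 43.5, 44.5, 45.0]

print(mean_and_standard_error(strict_selection))
print(mean_and_standard_error(loose_selection))
```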
As I mentioned, neither of us is a statistician, so some help and advice would be very welcome.
Cheers,
geo101