MHB Corresponding character matching probability

  • Thread starter: vivek1
  • Tags: Probability
AI Thread Summary
The discussion focuses on calculating the probability of matching k-mers in a protein dataset of 10,000 sequences. It highlights that the probability of an amino acid's occurrence is based on its frequency within the dataset. The key question is determining the likelihood that a k-mer "b" matches k-mer "a" in at least "r" positions out of "k". The conclusion emphasizes that without specific numerical data, an exact probability cannot be established. This analysis is crucial for understanding sequence similarities in protein research.
vivek1
I have a protein dataset consisting of 10000 sequences, where sequence i has length S_i, 1 <= i <= 10000. I extracted a k-mer "a" from the first sequence. The probability of occurrence of each amino acid (a character of a protein sequence) is given by its frequency in the dataset. If I choose a k-mer "b" from another sequence, what is the probability that "b" matches "a" in at least r of the k positions?
 
I believe that would be the probability that k-mer a appears in the remaining 9999 sequences. Without numerical data we can't give an exact value.
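For what it's worth, here is a minimal sketch of one way to put a number on the "at least r of k" question once the frequencies are known. It is not taken from the thread, and it relies on a strong assumption: each position of "b" is treated as an independent draw from the dataset-wide amino acid frequencies. Under that assumption position j matches with probability f(a_j), the frequency of the j-th character of "a", and the number of matches is a sum of independent Bernoulli trials with unequal probabilities. The function names and the toy sequences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def residue_frequencies(sequences):
    """Estimate each amino acid's probability as its relative frequency
    over all characters in the dataset."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def prob_at_least_r_matches(kmer_a, freqs, r):
    """P(b matches a in at least r positions), ASSUMING every position of b
    is an independent draw from freqs (an assumption, not a given)."""
    # dist[m] = probability of exactly m matches among the positions seen so far
    dist = [1.0]
    for aa in kmer_a:
        p = freqs.get(aa, 0.0)  # chance that b has this same residue here
        new = [0.0] * (len(dist) + 1)
        for m, q in enumerate(dist):
            new[m] += q * (1.0 - p)  # mismatch at this position
            new[m + 1] += q * p      # match at this position
        dist = new
    return sum(dist[max(r, 0):])


# Toy data standing in for the real 10000 sequences (hypothetical).
toy_sequences = ["ACDA", "CCAD", "DACA"]
freqs = residue_frequencies(toy_sequences)
print(prob_at_least_r_matches("ACD", freqs, r=2))
```

If every residue were equally likely, all the per-position probabilities would be the same value p and this would reduce to the ordinary binomial tail, the sum over m from r to k of C(k, m) p^m (1 - p)^(k - m). Real protein sequences have correlated neighbouring positions, so the result is best read as a rough baseline.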
 