Empirical tests on DNA sequence

  • #36
But if the sequence is random, you shouldn't. So under the null hypothesis (random DNA), the trick isn't "lossy."
 
  • #37
Your last statement is very interesting (to me). I am going to apply runs tests with various cycling orders and see what I get :)
 
  • #38
What about some "pattern matching"-based tests that I could use?
Maybe the "Poker" test?
 
  • #39
Sure, you can apply any of these tests. (ANOVA is a more rigorous version of the frequency test.)
 
  • #40
If you don't mind, I'll return to the run test and quote from a book:
The Art of Computer Programming, Vol. 2, Donald E. Knuth

Run test. A sequence may also be tested for "runs up" and "runs down."
This means we examine the length of monotone subsequences of the original
sequence, i.e., segments that are increasing or decreasing.
As an example of the precise definition of a run, consider the sequence of ten
numbers "1298536704"; putting a vertical line at the left and right and between
Xj and Xj+1 whenever Xj > Xj+1, we obtain | 1 2 9 | 8 | 5 | 3 6 7 | 0 4 |, which displays the "runs up": there is a run of length 3, followed by two runs of length 1, followed by another run of length 3, followed by a run of length 2.
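
The run splitting in the quoted example can be reproduced with a short sketch. This is only an illustration of the definition above, not code from Knuth or from this thread; the function name runs_up is hypothetical. Ties (equal neighbours) continue the current run, because a bar is placed only where Xj > Xj+1.

```python
def runs_up(seq):
    """Split seq into maximal "runs up": a new run starts wherever X[j] > X[j+1]."""
    if not seq:
        return []
    runs, current = [], [seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        if prev > nxt:          # descent: close the current run
            runs.append(current)
            current = [nxt]
        else:                   # still non-decreasing: extend the run
            current.append(nxt)
    runs.append(current)
    return runs

digits = [int(c) for c in "1298536704"]
runs = runs_up(digits)
lengths = [len(r) for r in runs]
print(runs)     # [[1, 2, 9], [8], [5], [3, 6, 7], [0, 4]]
print(lengths)  # [3, 1, 1, 3, 2]
```

To apply the same function to DNA, the letters would first have to be mapped to numbers under some chosen ordering (for example A < C < G < T), which is an assumption the test result then depends on.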

In this test he doesn't use circular ordering. Why does he use only "runs up"?
 
  • #41
I suspect he is not using circularity because two endpoints (0 and 9) out of 10 are relatively few; as opposed to two in four. And since he is looking at run ups only, that's actually one out of 10.

A runs test can be set up to test the number of run ups, the # of run downs, the # of both ups and downs, the # of runs above/below the mean or the median, the # of runs of a predetermined length, the maximum run length, and I suppose many more. Each of these tests is based on a different way of constructing the random variable "a run" (except the last, where the r.v. is the "run length"). A random variable can be thought of as another name for a probability distribution. You can construct any test if you know the distribution that applies to that test.
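
As one concrete case, here is a minimal sketch of the "runs above/below the median" variant (the Wald-Wolfowitz runs test). The function name and the example cut point are hypothetical choices for illustration; the mean and variance of the number of runs are the standard ones for a two-category sequence, and the z statistic relies on a normal approximation that is only reasonable for fairly long sequences.

```python
import math

def runs_above_below(values, cut):
    """Return (observed runs, expected runs, z) for runs above/below `cut`."""
    signs = [v > cut for v in values if v != cut]   # values equal to the cut are dropped
    n1 = sum(signs)                 # number of values above the cut
    n2 = len(signs) - n1            # number of values below the cut
    n = n1 + n2
    if n1 == 0 or n2 == 0:
        raise ValueError("need values on both sides of the cut")
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    return runs, mean, (runs - mean) / math.sqrt(var)

digits = [int(c) for c in "1298536704"]
runs, expected, z = runs_above_below(digits, cut=4.5)
print(runs, expected, round(z, 3))  # 5 6.0 -0.671
```

A large |z| (say above 1.96 for a two-sided 5% test) would suggest too few or too many runs for a random sequence. Ten values are far too few for the approximation to be trusted; the example only shows the mechanics.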
 
  • #42
Please disregard #29 above.

Problems:
1. The regression equation will not measure what it is intended to measure.
2. Although for very long random sequences (e.g., more than 1,000 letters) each letter should appear about a quarter of the time, in a relatively short random sequence there can be significantly more of one letter than another.

At the very least, this needs more thought on my part.
 
  • #43
OK, very well.
 
  • #44
Now I am trying to apply the Frequency within a block test that was suggested by Prof.

For that I am using the chi-square test as I described on the first page:
count the frequency of every letter (A, C, G, T) in the sequence,
then apply a chi-square test with 4 categories and p = 1/4, using the number of observations that fall into each category and n as the sequence length.
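
The chi-square step described above, applied block by block, could look roughly like this. It is a sketch under stated assumptions, not code from the thread: the function names, the block length of 20, and the example sequence are hypothetical, the sequence is assumed to contain only A, C, G, T, and 7.815 is the upper 5% point of the chi-square distribution with 3 degrees of freedom (4 categories minus 1).

```python
from collections import Counter

CHI2_CRIT_3DF_5PCT = 7.815  # upper 5% point, chi-square with 3 d.f.

def chi_square_uniform(block, alphabet="ACGT"):
    """Pearson chi-square statistic for H0: each letter has probability 1/len(alphabet)."""
    counts = Counter(block)
    expected = len(block) / len(alphabet)
    return sum((counts[ch] - expected) ** 2 / expected for ch in alphabet)

def blockwise_chi_square(seq, block_len):
    """Yield (start index, statistic) for every full block of length block_len."""
    for start in range(0, len(seq) - block_len + 1, block_len):
        yield start, chi_square_uniform(seq[start:start + block_len])

seq = "ACGT" * 25  # hypothetical 100-letter input with perfectly balanced letters
for start, stat in blockwise_chi_square(seq, 20):
    verdict = "reject" if stat > CHI2_CRIT_3DF_5PCT else "do not reject"
    print(start, round(stat, 3), verdict)
```

A statistic above 7.815 rejects equal letter frequencies in that block at the 5% level. Note that the example input is perfectly periodic yet passes every block, which is one reason a frequency test alone cannot establish randomness.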

My question is: what should the block size be (the input is 350-400 letters long)? And what results should I expect from each block test in order to conclude that the sequence is random?

thanks.
 
  • #45
From http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test:
The approximation to the chi-square distribution breaks down if expected frequencies are too low. It will normally be acceptable so long as no more than 10% of the events have expected frequencies below 5. Where there is only 1 degree of freedom, the approximation is not reliable if expected frequencies are below 10. In this case, a better approximation can be had by reducing the absolute value of each difference between observed and expected frequencies by 0.5 before squaring; this is called Yates' correction.
See also: http://www.statisticssolutions.com/Chi_square_test.htm
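
A small worked sketch of what these rules imply for block size. The helper names and the example counts are hypothetical. With 4 equally likely letters, the expected count per letter is block length / 4, so requiring at least 5 per letter gives blocks of at least 20 letters. Yates' correction is shown for a 1-degree-of-freedom case, here assumed to be "C versus not C" within one block.

```python
def min_block_len(min_expected=5, categories=4):
    """Smallest block length whose expected count per category reaches min_expected."""
    return min_expected * categories

def yates_chi_square(observed, expected):
    """Chi-square with Yates' correction: shrink each |O - E| by 0.5 before squaring.
    The max(..., 0) guard avoids inflating cells where |O - E| is already below 0.5."""
    return sum(max(abs(o - e) - 0.5, 0.0) ** 2 / e for o, e in zip(observed, expected))

block_len = min_block_len()
print(block_len, 350 // block_len, 400 // block_len)  # 20 17 20

# Hypothetical block of 40 letters containing 14 C's; expected 10 C's and 30 non-C's.
stat = yates_chi_square([14, 26], [10.0, 30.0])
print(round(stat, 4))  # 1.6333
```

So for an input of 350-400 letters, blocks of 20 give roughly 17 to 20 blocks, and larger blocks make the approximation safer at the cost of fewer blocks.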

You can also use regression analysis (ANOVA) to test:
1. whether the freq. of a letter within any block is equal to the freq. of the same letter in any other block,
2. whether the difference between the freqs. of two letters is significant within each block,
3. whether the difference between freqs. is different across blocks.

See attached Excel printout with 2 blocks. (These are t tests, so the sample size should be at least 20 letters per block.) The models are:

D1 = b2 + d1 block1 + u
D1 = b1 + d2 block2 + u
D1 = b1 block1 + b2 block2 + u (Constant [the intercept] is Zero)

D1-D2 = b2* + d1* block1 + u
D1-D2 = b1* + d2* block2 + u
D1-D2 = b1* block1 + b2* block2 + u (Constant [the intercept] is Zero)

which show d1 = d2 = d1* = d2* = 0 statistically (t-stat too low or p-value too high), so there is no difference between the blocks. (However, b2* is statistically significant, which implies a statistically significant difference between the freqs. of C and G within block 2.)

If you had 5 blocks, you could run:

D1 = b1 + d2 block2 + d3 block3 + d4 block4 + d5 block5 + u

where the estimated b1 coefficient is the freq. of C in the 1st block and the estimated dj coefficient (j > 1) is the freq. of C in the j'th block minus the freq. of C in the first block.
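
A minimal sketch of this dummy-variable regression, using only the standard library. Assumptions that are not stated in the thread: D1 is taken to be a 0/1 indicator that a position holds the letter C, the blocks are equal-length consecutive slices, and the helper names (ols, design) and the example sequence are hypothetical. The point it illustrates is that the intercept recovers the block-1 frequency and each dummy coefficient recovers a block frequency minus the block-1 frequency.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def design(seq, n_blocks, letter="C"):
    """Build y (indicator of `letter`) and X (intercept plus dummies for blocks 2..n)."""
    block_len = len(seq) // n_blocks
    X, y = [], []
    for pos in range(block_len * n_blocks):
        block = pos // block_len
        X.append([1.0] + [1.0 if block == j else 0.0 for j in range(1, n_blocks)])
        y.append(1.0 if seq[pos] == letter else 0.0)
    return X, y

seq = "CCAG" * 5 + "CATG" * 5   # hypothetical: C is 50% of block 1, 25% of block 2
X, y = design(seq, n_blocks=2)
b1, d2 = ols(X, y)
print(round(b1, 3), round(d2, 3))  # 0.5 -0.25
```

With 5 blocks the same call with n_blocks=5 returns b1 followed by d2..d5.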

To test whether b1 = 1/4, the "Y" variable was redefined as D1 - 1/4 as in Attachment 2:

D1-0.25 = b1** block1 + b2** block2 + u (Constant [the intercept] is Zero)

which shows that the freq. of C in either block is not statistically different from 0.25 at the 5% level of significance. (Note: bj** = bj - 0.25 where bj is the expected freq. of C in the j'th block, as in the D1 models above.)
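
The t statistics for this zero-intercept model can be sketched directly, because with block dummies each bj** is just the block frequency minus 0.25 and its standard error is sqrt(s^2 / block length), where s^2 is the pooled residual variance. As before, this assumes D1 is a 0/1 indicator for C and equal-length blocks; the function name and the example sequence are hypothetical and are not the data in the attachments. The errors of a 0/1 regression are not really homoskedastic, so the plain OLS t statistic is only approximate.

```python
import math

def block_t_stats(seq, n_blocks, letter="C", p0=0.25):
    """t statistics for H0: freq. of `letter` in block j equals p0 (pooled OLS variance)."""
    block_len = len(seq) // n_blocks
    blocks = [seq[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]
    freqs = [b.count(letter) / block_len for b in blocks]
    # Residual sum of squares of the 0/1 indicator around each block's own frequency.
    ssr = sum(block_len * f * (1 - f) for f in freqs)
    dof = block_len * n_blocks - n_blocks       # observations minus coefficients
    se = math.sqrt(ssr / dof / block_len)       # same for every block (equal lengths)
    return [(f - p0) / se for f in freqs]

seq = "CCAG" * 5 + "CATG" * 5   # hypothetical: C is 50% of block 1, 25% of block 2
t_stats = block_t_stats(seq, n_blocks=2)
print([round(t, 2) for t in t_stats])
```

Each |t| is compared with the two-sided 5% critical value for the residual degrees of freedom (about 2.02 for 38 d.f.). The function fails with a division by zero if no block contains a mix of C and non-C letters.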
 

Attachments

  • ANOVA_regression.pdf
  • ANOVA regression 2.pdf
  • ANOVA regression 3.pdf
  • #46
For frequency within a block test I prefer to use Chi-square (Pearson's), like I did in "standard" frequency test.
I am still analyzing one or all of these 3 suggestions of yours.
 
  • #47
yevi said:
For frequency within a block test I prefer to use Chi-square (Pearson's), like I did in "standard" frequency test.
I understand. Chi-sq. is nonparametric, which some people take as an advantage. OTOH, the parametric regression/ANOVA approach lets you test many hypotheses simultaneously (jointly), including "difference-in-differences." In those respects the regression/ANOVA approach can be nested to an arbitrary depth.
 
  • #48
So what you are saying is that chi-square is not suitable for my specific needs?
 
  • #49
That is not at all what I am saying. On the contrary, a nonparametric test can be seen as an advantage. Having said that, I am pointing you toward a complementary approach (ANOVA). It is not an either/or situation. You can apply both types of tests.
 
  • #50
Got it. Thanks for clearing it up :)
 
  • #51
Go for any algorithm (local or global) rather than jumping to a conclusion. That will help your research.
 
