Empirical tests on DNA sequence

  • #1
yevi
I am doing some biology research and need to check a DNA sequence for complexity and repetitiveness, using empirical tests like the frequency test, the serial test, etc.
For that I need to represent or convert the sequence into a scalar.

My question is basically, how to apply those tests on DNA sequence?
 
  • #2
See http://www.bcu.gub.uy/autoriza/peiees/jor/2006/iees03j3030806.pdf .

A test like runs test is applicable to binary sequences. But DNA sequences are multiple valued (e.g. GATTACA). As a first approach, you could separate the 4 codes into 2 binary sets, e.g. "G or A" versus "T or C" and apply the runs test. Then regroup into "G or T" versus "A or C" and re-run. Finally, run "G or C" versus "T or A." If all of these groupings are consistent with randomness, then you could "conclude randomness." But this test would be relatively strong in that you could not conclude randomness unless all of the groupings are consistent with randomness.
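For illustration, here is a minimal Python sketch of that grouping idea, using the standard Wald-Wolfowitz runs statistic (the grouping labels, the toy sequence, and the normal z approximation are my own choices, not from any particular package):

Code:
import math

def runs_test(bits):
    """Wald-Wolfowitz runs test on a 0/1 sequence.
    Returns (number of runs, z statistic); |z| > ~2 suggests non-randomness at the 5% level."""
    n1 = sum(bits)                 # number of 1s
    n0 = len(bits) - n1            # number of 0s
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    if n0 == 0 or n1 == 0:
        return runs, float("nan")  # degenerate: only one symbol present
    mu = 2.0 * n0 * n1 / (n0 + n1) + 1.0
    var = 2.0 * n0 * n1 * (2.0 * n0 * n1 - n0 - n1) / ((n0 + n1) ** 2 * (n0 + n1 - 1))
    return runs, (runs - mu) / math.sqrt(var)

seq = "GATTACAGATTACAGGTTCA"

# The three ways of splitting {A, C, G, T} into two binary groups described above
groupings = {"GA vs TC": set("GA"), "GT vs AC": set("GT"), "GC vs TA": set("GC")}
for name, group in groupings.items():
    bits = [1 if ch in group else 0 for ch in seq]
    print(name, runs_test(bits))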

Fortunately there is a more useful test. I can order the letters lexicographically (alphabetically) so that A > C > G > T. Then I can convert a sequence like GATTACA into G < A > T > T < A > C < A and re-write it as 011010 by writing "0" for "<" and "1" for ">." Since this is a binary sequence, I can apply the runs test. If there is any other ordering that you can think of as a biologist (which I am not), say G > A > T > C (perhaps on some luminal frequency scale), then by all means use that rule to convert the letters into 0's and 1's.
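And a small sketch of that up/down conversion (note that ties such as TT are simply dropped here, which is one possible convention; whatever rule you pick for ties should be applied consistently):

Code:
# Rank the letters so that A is greatest, per the A > C > G > T ordering suggested above
rank = {"A": 3, "C": 2, "G": 1, "T": 0}

def updown_bits(seq):
    """Convert a DNA string into 0/1 symbols: 0 where the sequence goes up (x < next),
    1 where it goes down (x > next). Ties (equal adjacent letters) are dropped."""
    bits = []
    for a, b in zip(seq, seq[1:]):
        if rank[a] < rank[b]:
            bits.append(0)
        elif rank[a] > rank[b]:
            bits.append(1)
    return bits

print(updown_bits("GATTACA"))   # adjacent comparisons: G<A -> 0, A>T -> 1, ...

The resulting 0/1 list can be fed straight into a runs test such as the one sketched above.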
 
Last edited by a moderator:
  • #3
Thanks EnumaElish.

About frequency test,
Is this going to be a right approach for this test:

Count the frequency of every letter (A, C, G, T) in the sequence.
Then apply a chi-square test with 4 categories, expected proportion p = 1/4 for each category, the observed counts in each category, and n equal to the sequence length.

??
 
  • #4
You are right; a Chi-square test can be used, with 3 degrees of freedom. The null hypothesis is a uniform distribution with expected frequency E(i) = p(i) = 1/4 for i = A, C, G, T.
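For what it's worth, this is a one-liner in most statistics packages; a minimal sketch assuming scipy is available (equal expected counts of n/4 are the default):

Code:
from collections import Counter
from scipy.stats import chisquare

seq = "GATTACAGATTACAGGTTCA"
counts = [Counter(seq)[b] for b in "ACGT"]   # observed frequency of each letter

# Null hypothesis: uniform letters, expected count n/4 each; 4 - 1 = 3 degrees of freedom
stat, p_value = chisquare(counts)            # equal expected frequencies by default
print(stat, p_value)                         # small p-value -> reject uniformity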
 
  • #5
EnumaElish,

What other tests (there are many of them) can I apply to DNA? Tests that don't need a binary sequence as input.

And if all the others, or the best of the tests, do indeed work on bits, can you suggest more ways to represent the sequence as bits?
 
  • #6
I've found this site http://www.bioinfo.de/isb/2006/06/0024/
where they describe a program that generates random DNA sequences.
They also mention some tests that they have tried.

This is a quote from site:

Since the output of the tests is many pages long, we show the output of only one test. The outputs of the other tests are available at the web page www.introni.it/en/software/ with the software and the help files.

The ENT test evaluates a random sequence by means of five parameters:

1. Entropy: this is the information density of the sequence, expressed as number of bits per character. In general, the higher this value the better is the randomness. The sequences we used for the tests were in ASCII format, that is, 8 bits per character so the entropy should be as near as possible to 8.
2. The chi-square test gives an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated. We interpret the percentage as the degree to which the sequence tested is suspected of being non-random.
3. The Arithmetic Mean is the result of summing all the bytes in the sequence and dividing by the sequence length. If the data are close to random, this should be about 127.5.
4. Monte Carlo Value for π: each successive sequence of six bytes is used as 24 bit X and Y co-ordinates within a square. If the distance of the randomly-generated point is less than the radius of a circle inscribed within the square, the six-byte sequence is considered a "hit". The percentage of hits can be used to calculate the value of π. For very large streams the value will approach the correct value of π if the sequence is close to random.
5. Serial Correlation Coefficient: this quantity measures the extent to which each byte in the file depends upon the previous byte. For random sequences this value will be close to zero.

As I understand it, they simply convert every letter to its 8-bit ASCII value? Am I right?

They have more test examples in this PDF:

http://www.introni.it/en/software/RANDNA.pdf
 
Last edited by a moderator:
  • #7
These all seem to be valid procedures for determining randomness. And yes, my understanding is they are converting everything into 8-bit ASCII, which makes it eminently suitable for applying the "runs test" that I described previously. Generally, see:

http://en.wikipedia.org/wiki/Algorithmic_information_theory
http://en.wikipedia.org/wiki/Algorithmically_random_sequence
http://en.wikipedia.org/wiki/Normal_number

Two caveats:
1. The tests you've referenced may not be independent statistically.
2. The term "test" which they use in the RANDNA description pages may or may not be synonymous with the term "test" in statistics. E.g., take the "pi" test (which uses the property that almost all real numbers are normal numbers [infinite random sequences]): it is not clear which statistical test they are using to test the computed pi against the true value of pi.
 
Last edited:
  • #8
So if converting a character to 8-bit ASCII is fine, is converting the 4 DNA nucleotides as
A=00, C=01, G=10, T=11 legitimate as well?
I'm asking whether a 2-bit representation can affect tests like "runs", "serial"...
 
  • #9
I don't see why 8 bits are necessary. Two bits will work as well. You can even use one bit to test for "up" vs. "down."
 
  • #10
I am afraid that the wrong bit representation will result in a lossy code and affect the randomness.

i.e. sequence 1: AC = "0001" and sequence 2: AT = "0011" should have the same randomness, but applying a test to them will return different results.
Am I right?
 
  • #11
My guess is they are using 8 bits to take advantage of existing (pre-programmed) "data mining" (or data analysis) software procedures. Otherwise, why stop at 8? Why not use a 16, 32, or 64, or 256 bit representation? In my humble opinion, the choice of bits is itself a random rule, so it shouldn't matter. But, to be "safe," you may code your data in several alternative ways (e.g. 2-bit vs. 8-bit) then apply the same test. If the DNA segment you are using is sufficiently long, my guess is that the coding will not make a qualitative difference. Actually, somewhere there may be a definition of "necessary and sufficient bit representation," in that representation of 4 categories in 8 bits is a waste of space, whereas 2 bits are necessary and sufficient.
 
  • #12
My main concern at this point is creating a lossless bit representation.
Anyway, I'll try the tests using 2-bit.

Meanwhile, are there more interesting (and good) tests that you can recommend?
 
  • #13
yevi said:
i.e. sequence 1: AC = "0001" and sequence 2: AT = "0011" should have the same randomness, but applying a test to them will return different results.
Am I right?
Actually a runs test will produce the same result for both (2 runs in each, assuming the endpoints count as runs -- you should check this). But AC = 0001 (2 runs) and AG = 0010 (3 runs), so I see your point. In 8-bit ASCII, A = "01000001", C = "01000011" and T = "01010100", so AC = "0100000101000011" and AT = "0100000101010100"; the runs test will return 8 for the first and 11 for the second.
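A quick sketch for checking counts like these under either coding (the helper names and pairs are mine):

Code:
def count_runs(bits: str) -> int:
    """Number of runs (maximal blocks of identical symbols) in a 0/1 string."""
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)

two_bit = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode2(seq):
    return "".join(two_bit[ch] for ch in seq)

def encode8(seq):   # 8-bit ASCII codes of the letters
    return "".join(format(ord(ch), "08b") for ch in seq)

for pair in ("AC", "AT", "AG"):
    print(pair, count_runs(encode2(pair)), count_runs(encode8(pair)))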
 
Last edited:
  • #14
You could calculate the regression equation x_t = a + b x_(t-1) + u_t, where a and b are the regression coefficients and u is a stochastic disturbance with expected value zero (E[u] = 0), and the subscript t denotes the t'th place (bit) in the sequence (e.g. X = 0000101010111...). Your dependent variable x_t is binary, so technically you should use a limited dependent variable model such as Probit or Logit; although the linear model is always a good starting point. If you cannot reject the null hypothesis b = 0, then you can conclude that x_t is linearly uncorrelated with x_(t-1) -- which is similar to ENT's correlation test.

Generalizing, you can run the regression x_t = a + b_1 x_(t-1) + ... + b_k x_(t-k) + u_t (ideally using one of the limited-dependent-variable techniques), then execute the joint test b_1 = ... = b_k = 0. This is a generalization of ENT's correlation test, in effect.

You can also use, for example, a multinomial logit regression package to calculate the statistical relation between Z_t coded as, say, {A,C,G,T} = {1,2,3,4} and Z_(t-j) coded similarly (for j = 1, ..., k); so you don't have to convert letters to bits.
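A minimal sketch of the linear-regression version, assuming statsmodels is available (this is the plain OLS "starting point" described above, not the technically correct probit/logit; the toy data are random bits):

Code:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=400).astype(float)   # 0/1-coded sequence (replace with yours)

k = 5                                            # number of lags to test jointly
Y = x[k:]                                        # x_t
X = np.column_stack([x[k - j:-j] for j in range(1, k + 1)])   # x_(t-1), ..., x_(t-k)

model = sm.OLS(Y, sm.add_constant(X)).fit()
print(model.params)        # a, b_1, ..., b_k
print(model.f_pvalue)      # joint test of b_1 = ... = b_k = 0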
 
Last edited:
  • #15
Small technical edit to previous post:
You can also use, for example, a multinomial logit regression package to calculate the statistical relation between Z_s coded as, say, {A,C,G,T} = {1,2,3,4} and Z_(s-j) coded similarly (for j = 1, ..., m); so you don't have to convert letters to bits.
The index for Z (data coded as letters) is different from the index for x (data coded as bits).

A further idea is to make use of the well-known normality tests by using a probability theorem which says that if R is a random variable uniformly distributed on [0, 1] and F is a probability distribution function, then the random variable Y = F^(-1)(R) has distribution F. In effect, you can use a DNA sequence as a uniform pseudorandom number generator by applying to it the inverse of the normal distribution, then test whether the resulting data (the Y's) are indeed normal. Alternative approaches to generating Y are described here.

Again, since your uniform distribution is discrete, you may have to use a piecewise randomization algorithm that would enable you to expand from the 4 DNA elements to the [0, 1] continuum. For example, you could assign A = 1/8, C = 3/8, G = 5/8, T = 7/8. When you observe an A, you use software (e.g. Excel) to generate a random number (R) between 0 and 2/8. For a C, generate a random variable between 2/8 and 4/8. For G, between 4/8 and 6/8; for T, between 6/8 and 1. Then apply the inverse-F transformation, then test the new random variable Y for normality. Even better (but more time consuming) is to design a fractal structure (interval tree) whereby you map the first 4 letters in a DNA sequence into 5 equal subintervals of [0, 1], then map the next 16 letters into 25 equal sub-subintervals, and so on, in a way that is graphically reminiscent of constructing a Cantor set (except in your case you will not be deleting any of the subintervals, but further dividing each subinterval into smaller and smaller segments). By dividing the [0, 1] interval into consecutively smaller subintervals, you can "fill in" the [0, 1] continuum to an arbitrary density by using progressively expanding blocks of letters in a DNA sequence, then run the inverse-F normality test.
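A small sketch of the piecewise-randomization idea, assuming scipy; the particular normality tests applied at the end (Shapiro-Wilk, D'Agostino-Pearson) are my own choices:

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
seq = "GATTACA" * 60

# Piecewise randomization: each letter owns a quarter of [0, 1];
# draw a uniform value inside that letter's subinterval
low = {"A": 0.0, "C": 0.25, "G": 0.5, "T": 0.75}
u = np.array([rng.uniform(low[ch], low[ch] + 0.25) for ch in seq])

# Inverse-CDF ("inverse-F") transform: if u is U(0, 1), norm.ppf(u) is standard normal
y = stats.norm.ppf(u)

# Any standard normality test can now be applied to y
print(stats.shapiro(y))        # Shapiro-Wilk
print(stats.normaltest(y))     # D'Agostino-Pearson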
 
Last edited:
  • #16
Another idea is to use the "equality of distributions" (e.o.d.) tests (Run [or Runs] test [apparently a.k.a. Wald-Wolfowitz test], Median test [may be a.k.a. Mann-Whitney U test], Rank-sum test) (Mood, Graybill, Boes. Intro. to the Theory of Stat., 3rd Ed.) by testing the hypothesis that the distribution of a DNA sequence is equal to a uniform distribution produced by a random number generator. First, you need to generate a random sequence of 0's and 1's (or alternatively A's, C's, G's and T's) by using a standard random number generator. This will be your benchmark sequence. Then you can use any of the e.o.d. tests to test the hypothesis that the distribution of an actual DNA sequence is not different from the benchmark distribution. The advantage of this approach is that as long as you can find a way to generate random letters A, C, G, T, you do not need to convert your DNA into bits (although you may have to rank them, e.g. A=1, C=2, G=3, T=4).
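A minimal sketch of the benchmark-comparison idea, assuming scipy's two-sample tests (the ranking A=1, ..., T=4 is the one suggested above; note the KS test formally assumes continuous data, so treat its p-value as approximate here):

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
rank = {"A": 1, "C": 2, "G": 3, "T": 4}

dna = np.array([rank[ch] for ch in "GATTACA" * 50])    # observed sequence, ranked
benchmark = rng.integers(1, 5, size=dna.size)          # uniform benchmark from an RNG

print(stats.mannwhitneyu(dna, benchmark))              # Mann-Whitney U / rank-sum test
print(stats.ks_2samp(dna, benchmark))                  # two-sample Kolmogorov-Smirnov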
 
Last edited:
  • #18
Thank you.
 
  • #19
Any DNA sequence is composed of four characters, so it can be thought of as a sequence of base-4 numbers. Let A=0, C=1, G=2, T=3. Then GATTAC = 203301. In base 10, this is equal to 2289: http://members.aol.com/SciRealm/SciCalculator.html But you could also express this as {GAT, TAC} = {203, 301} = {35, 49} or as {GA, TT, AC} = {20, 33, 01} = {8, 15, 1}.
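In Python this conversion is essentially built in (a small sketch; the digit mapping is the one above):

Code:
digit = {"A": "0", "C": "1", "G": "2", "T": "3"}

def dna_to_int(seq: str) -> int:
    """Read a DNA string as a base-4 number (A=0, C=1, G=2, T=3)."""
    return int("".join(digit[ch] for ch in seq), 4)

print(dna_to_int("GATTAC"))                        # 2289
print([dna_to_int(b) for b in ("GAT", "TAC")])     # [35, 49]
print([dna_to_int(b) for b in ("GA", "TT", "AC")]) # [8, 15, 1]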

I am now thinking that the way I have advised to construct the "inverse-F" normality test may introduce randomness to a potentially nonrandom sequence, which will bias the test result toward "false randomness."
 
Last edited by a moderator:
  • #20
Fine, but this kind of representation (base-4 or base-10) can help me with only specific tests.

Still I don't know what I should choose for my problem from your great suggestions and the other tests I've read about.
 
  • #21
p.s.
The Diehard suite that you mentioned accepts a binary sequence as input as well.
 
  • #22
Since you prefer a non-binary representation, you should definitely apply the chi-square test.

Then go back to post #16 above and see which of the tests I mentioned there require binary coding, and which do not. You can use those that do not require a binary coding, for example the Wilcoxon signed-rank test.

You could use a regression model to estimate the j'th order autocorrelations and test that each is zero (individually and jointly) using the "base 4" coding. To repeat, you can calculate the statistical relation between Z_s coded as, say, {A,C,G,T} = {0,1,2,3} and Z_(s-j) coded similarly (for j = 1, ..., m): Z_s = a + b_1 Z_(s-1) + ... + b_m Z_(s-m) + u_s, then test b_1 = ... = b_m = 0. You should start with linear regression (which works best with a continuous dependent [left-hand-side] variable); there is ready-made statistical software specifically written for "detecting j'th order autocorrelation," which is what you will be estimating. Later, you can use a multinomial logit package for "technically correct" estimation (in MNL, the dependent variable is multi-valued discrete).
 
Last edited:
  • #23
Also, I think you can use the "inverse-F" normality tests, or a uniformity test.

The order in which you group the letters to form base-4 numbers is not important. In the previous example, {v1, v2, v3} = {GA, TT, AC} = {20, 33, 01} and {w1, w2} = {GAT, TAC} = {203, 301}, all in base 4. In base 4, w1 = v1 * 10 + Integer(v2/10) and w2 = Decimal(v2/10) * 10 * 100 + v3. So each w is a deterministic transformation of the v's. If the v's are random, so are the w's, and vice versa. Which implies that if the v's are nonrandom then the w's are nonrandom, too.

Let's say you decided on a length-6 string to represent a base-4 number, e.g. GATTAC. The largest base-4 number with 6 digits is 333333, which is 4095 in base 10 (4095 = 4^6 - 1, which isn't a coincidence). That means you can divide the [0, 1] interval into 4096 = 2^12 subintervals, each 1/4096 = 0.000244140625 long. For the string "GATTAC" = 203301 (base 4) = 2289, you'd mark 2289/4096 = 0.558837890625 (check this), near the midpoint of the [0, 1] interval. To use the normality tests, you should apply the inverse normal transformation and solve P(normal random variable < y) = 0.558837890625 for y = 0.148024. That is, StandardNormal(y) = (1 + Erf(y/sqrt(2)))/2 = 0.558837890625 has root y = 0.148024.

If the DNA is 240 letters long, then you will have 240/6 = 40 base-4 numbers Z^(4) = (Z_1^(4), ..., Z_40^(4)). Then:
1. convert Z^(4) into base-10 numbers Z^(10) = (Z_1^(10), ..., Z_40^(10)),
2. convert Z^(10) into uniform random variables U = (U_1, ..., U_40) = (Z_1^(10)/4096, ..., Z_40^(10)/4096),
3. transform U into normal random variables Y = (Y_1, ..., Y_40) by solving StandardNormal(Y) = U, or Y = StandardNormal^(-1)(U),
4. apply these normality tests to Y.

You can also directly test for uniformity of U using this test.
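A sketch of these four steps in Python, assuming scipy (the toy sequence is random; adding 0.5 before dividing by 4096 is my own tweak to keep U strictly inside (0, 1) so the inverse normal stays finite):

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
digit = {"A": "0", "C": "1", "G": "2", "T": "3"}
dna = "".join(rng.choice(list("ACGT"), size=240))      # toy 240-letter sequence

blocks = [dna[i:i + 6] for i in range(0, len(dna), 6)]                   # 40 six-letter blocks
z10 = np.array([int("".join(digit[c] for c in b), 4) for b in blocks])   # step 1: base-4 -> base-10
u = (z10 + 0.5) / 4096.0                               # step 2: map to (0, 1)
y = stats.norm.ppf(u)                                  # step 3: inverse standard normal
print(stats.shapiro(y))                                # step 4: normality test on Y
print(stats.kstest(u, "uniform"))                      # or test the U's for uniformity directly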
 
Last edited:
  • #24
Well, today I talked with an expert in this field (a professor in bioinformatics).
First, he told me that it's not a good idea to convert DNA letters to a binary representation, because tests like the runs test won't work...

He recommended the following tests: Frequency, Frequency within Blocks, Runs test, Coupon test, some template-based tests, and Autocorrelation.

I would appreciate it if you could explain how to apply the last method and normalize it.

p.s. Keep in mind that our "test subject" is relatively short, about 300-400 letters.
 
  • #25
Most "statistical" software (including Excel) have a correlation function that returns the Pearson partial correlation coefficient between two vectors of data. Suppose you organize your DNA sequence as a column of numbers on an Excel spreadsheet:

2
0
3
3
0
1
etc.

Let's say you entered the entire DNA sequence in column A, rows 1 through 240 (range A1:A240 in Excel notation).

Then you add another column. On row r of the second column (column B), copy (point to) the value on row r-1 of the first column (column A). Thus the first row on column B will be blank (not zero). The second value on column B will be equal to the first value on column A. The third value on column B will be equal to the second value on column A. And so on. The easiest way to do this is to go to the second row on Column B (cell B2) and write the expression:

=A1

then hit the "Enter" key. Then copy cell B2 and paste it over the cell range B3 through B240.

Then use the CORREL function like this: Go to a new blank cell and write the following expression:

=CORREL(A2:A240,B2:B240)

and hit "enter". This will produce the correlation coefficient with the DNA sequence and its once-lagged value, which is technically known as the first-order autocorrelation coefficient.

A shortcut is to have a single column (column A), then use:

=CORREL(A2:A240,A1:A239)

You can use this setup to calculate any arbitrary order of autocorrelation. For the 10th-order AC, the shortcut is

=CORREL(A11:A240,A1:A230)

There are two downsides to the Excel approach:
1. CORREL function doesn't test for statistical significance,
2. You can only calculate partial correlations (one at a time), rather than an autocorrelation structure (a number of lags jointly, a.k.a. "distributed lags").

Excel's Analysis ToolPak Add-In will let you calculate partial correlations between pairs of an arbitrary number of variables (columns or ranges); but each of the two downsides would still apply. (Note: to use the Analysis ToolPak Add-In, you have to specify [or input] each lag as a separate column -- you cannot use the shortcut described above).

You can use a "serious" statistical software like SAS to compute partial correlation coefficients and test the statistical significance of each (individually). But downside #2 would still apply. The most efficient way to eliminate both downsides is to use multiple regression analysis, which you can do in Excel using the Analysis ToolPak Add-In.

See #22 above for an explanation of how to estimate the first through the m'th order autocorrelations jointly using the multiple regression technique. Again, the main advantage of this approach is that it lets you calculate and test a number of autocorrelations (lags) jointly, rather than individually.

To estimate the first-order AC, all you need is to estimate the regression equation Z_s = a + b_1 Z_(s-1) + u_s. To estimate the first- and second-order ACs jointly, you need to estimate Z_s = a + b_1 Z_(s-1) + b_2 Z_(s-2) + u_s. Etc. The data structure should be "column A, column B, etc.," where column A holds the original sequence and each subsequent column holds the once-lagged value of the previous column. Thus, column K will hold the 10-times-lagged sequence.

To test each lag individually, look at the t-statistic of the b parameter corresponding to the lag (e.g. the t-stat of b_j for the j'th lag). To test all lags jointly, look at the "regression F" statistic in the ANOVA table in the Excel output. If you have a single independent (right-hand side) variable (as the x in y = a + b x), then Regression F = (t-stat of b)^2.

See attached Excel printout (into PDF). A random "DNA" sequence was regressed on distributed lags 1 through 10; none of the coefficients is statistically significant (t-stat > 2 approximately, or P-value < 0.05); neither are they jointly significant ("Significance F" is nowhere near 0.05). (I am using a 5% probability threshold for statistical significance.) Page 1 is the regression output; pages 2-5 are the data (column A is the dependent variable or the "Y variable", columns B-K are the indep. vars. or the "X variables." Rows 1-11 are not included in the regression). https://www.physicsforums.com/attachments/8791
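For completeness, the same autocorrelations can be computed outside Excel in a few lines (a sketch; the rough +/- 2/sqrt(n) band for a random sequence is the usual large-sample guide, my addition rather than a formal test):

Code:
import numpy as np

rng = np.random.default_rng(4)
z = rng.integers(0, 4, size=240).astype(float)        # DNA coded 0..3 (toy data)

def autocorr(x, lag):
    """Pearson correlation between the series and its lag-shifted copy
    (the same thing as =CORREL(A{lag+1}:A{n}, A1:A{n-lag}) in the post above)."""
    return np.corrcoef(x[lag:], x[:-lag])[0, 1]

for lag in (1, 2, 5, 10):
    print(lag, autocorr(z, lag))

# Rough guide: for a random sequence of length n, the autocorrelations should
# mostly stay within about +/- 2/sqrt(n)
print(2 / np.sqrt(len(z)))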
 

Attachments

  • AC regression.pdf
    26.4 KB · Views: 412
Last edited:
  • #26
I am also surprised that the Prof. did not mention the Chi-square test or the uniformity test.
 
  • #27
What do you think, can we modify the runs test to count runs of the same letter,
e.g. A|CC|G|TT|CCC|G|AA|TTT?
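Counting such letter-runs is straightforward; as a rough reference point, for an iid sequence with equal letter probabilities the expected number of runs is 1 + (n-1)*(3/4) (this formula is my own addition, not from the thread, and by itself it is not yet a full test):

Code:
def letter_runs(seq: str) -> int:
    """Count maximal blocks of identical letters, e.g. A|CC|G|TT|CCC|G|AA|TTT -> 8."""
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

seq = "ACCGTTCCCGAATTT"
n = len(seq)
observed = letter_runs(seq)
# Under iid letters with equal probabilities 1/4, a new run starts whenever two
# neighbours differ (probability 3/4), so E[runs] = 1 + (n - 1) * 3/4
expected = 1 + (n - 1) * 0.75
print(observed, expected)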
 
  • #28
All examples of the R.T. that I have seen (or those I can remember, anyway) had two categories of data. Applying a 2-category test to 4 categories would bias the result toward "false randomness." I think the Chi-sq. test which you were the first to mention is more suitable.

In my mind, the R.T. can be applied as an up/down test, as long as you have some circular ordering, e.g. A > C > G > T > A. Circularity is important for the test to treat the endpoints (A and T) similar to the interior points (C and G).
 
  • #29
You can also try an ANOVA approach. You can do this by linear regression. Your "Y" variable is the DNA sequence coded as "Y in {0, 1, 2, 3}." Any permutation or affine transformation will do as well:

1. {A, C, G, T} = {0, 1, 2, 3} will not produce a statistically different result than {A, C, G, T} = {1, 3, 2, 0}
2. Y* = c_0 + c_1 Y (where the two c's are constants) will not produce a statistically different result than Y.

The regression is Y(s) = b_0 + b_1 D_1(s) + b_2 D_2(s) + b_3 D_3(s) + u(s)

where each "dummy variable" D is defined as: Di(s) = 1 if DNA(s) = i and Di(s) = 0 otherwise. For example, in sequence GAT... = 203..., DNA(1) = 2, DNA(2) = 0, DNA(3) = 3. So, for the 1st observation D1(1) = D3(1) = 0, D2(1) = 1. Then for the 2nd observation D1(2) = D2(2) = D3(2) = 0 (there is no explicit D0, implicitly that's the intercept). For the 3rd obs., D1(3) = D2(3) = 0, D3(3) = 1. Etc.

If the DNA is random, then you'd expect:
1. each b = 0 individually, except b_0 = 1/4,
2. b_1 = b_2 = b_3 = 0 jointly.


You can definitely test both 1 and 2 in Excel easily. For individual tests, look at the P-values. For the joint test, look at the "Significance F."
 
Last edited:
  • #30
I am not sure that a circular ordering like this is suitable and can be used for the runs test.
 
  • #31
Hmm, ANOVA seems interesting.
 
  • #32
yevi said:
I am not sure that a circular ordering like this is suitable and can be used for the runs test.
You can use any ordering you like as long as you "trick" the test into treating the endpoints similarly with the interior points.
 
  • #33
yevi said:
Hmm, ANOVA seems interesting.
Excel Analysis ToolPak Add-In (which, if installed, will show up as an item "Data Analysis" under the "Tools" menu) also has built-in ANOVA packages -- which I have never used because (for me) regression is more intuitive and practical.
 
  • #34
See edits to the tests described in #29
 
  • #35
This I understand; I need a "trick" which is not lossy.
If I first run with:
A > C > G > T > A
and then:
A > T > C > G > A
on the same sequence and get different results, that means the trick is lossy.
 
