'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve
* Corresponding author: Eran Elhaik eelhaik@gmail.com
1 McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
2 Department of Biology & Biochemistry, University of Houston, Houston, TX 77204-5001, USA
3 Department of Mathematics, University of Houston, Houston, TX 77204-3008, USA
Biology Direct 2010, 5:10 doi:10.1186/1745-6150-5-10
Published: 17 February 2010
Abstract
Background
The Z-curve is a three-dimensional representation of DNA sequences that was proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, defined as $S = a^2 + c^2 + t^2 + g^2$, where a, c, t, and g are the nucleotide frequencies of A, C, T, and G, respectively. This index was found to be smaller than 1/3 for almost all tested genomes, which was taken as support for the existence of a constraint on genome composition. A geometric explanation for this constraint has been suggested: each genome was represented by a point P whose distances from the four faces of a regular tetrahedron were given by the frequencies a, c, t, and g. It was claimed that an inscribed sphere of radius $r = 1/\sqrt{3}$ contains almost all points corresponding to various genomes, implying that $S < r^2$. The distribution of the points P obtained by S was studied using the Z-curve.
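As a concrete illustration of the definition above, the following minimal sketch computes the index S from a raw sequence. The function names and the toy sequence are hypothetical and chosen only for this example; ignoring non-ACGT symbols is likewise an assumption made here, not something the article specifies.

```python
from collections import Counter


def nucleotide_frequencies(seq):
    """Return relative frequencies of A, C, G, T in seq.

    Assumption for this sketch: symbols other than A/C/G/T
    (e.g. N) are ignored, and case is not significant.
    """
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {base: counts[base] / total for base in "ACGT"}


def genome_order_index(seq):
    """S = a^2 + c^2 + t^2 + g^2, the sum of squared frequencies."""
    freqs = nucleotide_frequencies(seq)
    return sum(f * f for f in freqs.values())


# Made-up toy sequence, not real genomic data.
toy = "ATGCGCGTATATTAGC"
s = genome_order_index(toy)
print(f"S = {s:.4f}, below 1/3: {s < 1 / 3}")
```

Because the four frequencies sum to 1, S can only take values between 1/4 (all four nucleotides equally frequent) and 1 (a single nucleotide), so the threshold of 1/3 sits close to the lower end of the possible range.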
Results
In this work, we studied the basic properties of the Z-curve using the "genome order index" as a case study. We show that (1) the calculation of the radius of the inscribed sphere of a regular tetrahedron is incorrect, (2) the S index is narrowly distributed, (3) based on the second parity rule, the S index can be derived directly from the Shannon entropy and is, therefore, redundant, and (4) the Z-curve suffers from over-dimensionality, and the dimension that stands for GC content alone suffices to represent any given genome.
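Points (3) and (4) can be made concrete with a short worked derivation. This is a sketch, not necessarily the authors' own derivation: it assumes the second parity rule holds exactly (a = t and c = g), and the symbol x for GC content and the base-2 logarithm are choices made here for illustration.

```latex
\begin{align*}
x &= c + g, \qquad a = t = \frac{1-x}{2}, \qquad c = g = \frac{x}{2} \\
S &= a^2 + c^2 + t^2 + g^2 = \frac{(1-x)^2 + x^2}{2} = \frac{1}{4} + \left(x - \frac{1}{2}\right)^2 \\
H &= -\sum_{p \in \{a,c,t,g\}} p \log_2 p = 1 - x \log_2 x - (1-x) \log_2 (1-x)
\end{align*}
```

Under this assumption both S and H depend on the GC content x alone, and S < 1/3 is equivalent to |x - 1/2| < 1/(2√3), that is, a GC content between roughly 21% and 79%.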
Conclusion
The "genome order index" S does not represent a constraint on nucleotide composition. Moreover, S can be easily computed from the Gini-Simpson index and derived directly from entropy and is, therefore, redundant. Overall, the Z-curve and S are over-complicated measures of GC content and the Shannon H index, respectively.
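The relation to the Gini-Simpson index can be checked numerically. The sketch below uses the standard definition of the Gini-Simpson index (one minus the sum of squared frequencies), so that S = 1 - GS; the function names and the example compositions are hypothetical and serve only as an illustration.

```python
import math


def order_index(freqs):
    """S: sum of squared nucleotide frequencies."""
    return sum(p * p for p in freqs)


def gini_simpson(freqs):
    """Gini-Simpson index: 1 minus the sum of squared frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def shannon_entropy(freqs):
    """Shannon entropy H in bits; zero frequencies contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


# Hypothetical (a, c, g, t) compositions, not taken from real genomes.
compositions = [
    (0.25, 0.25, 0.25, 0.25),
    (0.30, 0.20, 0.20, 0.30),
    (0.40, 0.10, 0.10, 0.40),
]

for freqs in compositions:
    s = order_index(freqs)
    gs = gini_simpson(freqs)
    h = shannon_entropy(freqs)
    print(f"{freqs}: S = {s:.4f}, 1 - GS = {1 - gs:.4f}, H = {h:.4f} bits")
```

In these examples S rises as H falls, which is the sense in which the two indices carry the same information about how uneven the composition is.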
Reviewers
This article was reviewed by Claus Wilke, Joel Bader, Marek Kimmel and Uladzislau Hryshkevich (nominated by Itai Yanai).