skip to main content
article

Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison

Published: 01 January 2012 Publication History

Abstract

The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity when compared with multiple sequence alignment methods, the method has been widely discussed lately; and some formulas based on probabilistic models, like Hao's and Yu's formulas, have been proposed. In this paper, we improve these formulas by using the entropy principle which can quantify the nonrandomness occurrence of patterns in the sequences. More precisely, existing formulas are used to generate a set of possible formulas from which we choose the one that maximizes the entropy. We give the closed-form solution to the resulting optimization problem. Hence, from any given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's formula is itself maximizing the entropy and we derive a new entropy-maximizing formula from Yu's formula. We illustrate the accuracy of our new formula by using both simulated and experimental data sets. For the simulated data sets, our new formula gives the best consensus and significant values for three different kinds of evolution models. For the data set of tetrapod 18S rRNA sequences, our new formula groups the clades of bird and reptile together correctly, where Hao's and Yu's formulas failed. Using real data sets with different sizes, we show that our formula is more accurate than Hao's and Yu's formulas even for small data sets.

References

[1]
D.R. Arahal, F.E. Dewhirst, B.J. Paster, B.E. Volcani, and A. Ventosa, "Phylogenetic Analyses of Some Extremely Halophilic Archaea Isolated from Dead Sea Water, Determined on the Basis of Their 16S rRNA Sequences," Applied and Environmental Microbiology, vol. 62, pp. 3779-3786, 1996.
[2]
M.W. Berry, Z. Drmac, and E.R. Jessup, "Matrices, Vector Spaces, and Information Retrieval," SIAM Rev., vol. 41, pp. 335-362, 1999.
[3]
D.P. Bertsekas, "Nonlinear Programming," Athena Scientific, 1999.
[4]
V. Brendel, J.S. Beckmann, and E.N. Trifonov, "Linguistics of Nucleotide Sequences: Morphology and Comparison of Vocabularies," J. Biomolecular Structure and Dynamics, vol. 4, pp. 11-20, 1986.
[5]
K.H. Chu, J. Qi, Z.G. Yu, and V. Anh, "Origin and Phylogeny of Chloroplasts: A Simple Correlation Analysis of Complete Genomes," Molecular Biology and Evolution, vol. 21, pp. 200-206, 2004.
[6]
L. Gao and J. Qi, "Whole Genome Molecular Phylogeny of Large dsDNA Viruses Using Composition Vector Method," BMC Evolutionary Biology, vol 7, pp. 1-7, 2007.
[7]
L. Gao, J. Qi, H. Wei, Y. Sun, and B.L. Hao, "Molecular Phylogeny of Coronaviruses Including Human SARS-CoV," Chinese Science Bull., vol. 48, pp. 1170-1174, 2003.
[8]
Z. Gu, X. Zhao, N. Li, and C. Wu, "Complete Sequence of the Yak (Bos Grunniens) Mitochondrial Genome and its Evolutionary Relationship with Other Ruminants," Molecular Phylogenetics and Evolution, vol. 42, pp. 248-255, 2007.
[9]
B.L. Hao, J. Qi, and B. Wang, "Prokaryotic Phylogeny Based on Complete Genomes without Sequence Alignment," Modern Physics Letters B, vol. 2, pp. 1-4, 2003.
[10]
S.B. Hedges, K.D. Moberg, and L.R. Maxson, "Tetrapod Phylogeny Inferred from 18S and 28S Ribosomal RNA Sequences and a Review of the Evidence for Amniote Relationships," Molecular Biology and Evolution, vol. 7, pp. 607-633, 1990.
[11]
R. Hu and B. Wang, "Statistically Significant Strings are Related to Regulatory Elements in the Promoter Regions of Saccharomyces Cerevisiae," Physica A: Statistical Mechanics and Its Applications, vol. 290, pp. 464-474, 2001.
[12]
G. Lu, S. Zhang, and X. Fang, "An Improved String Composition Method for Sequence Comparison," BMC Bioinformatics, vol. 9, Suppl 6, p. S15, 2008.
[13]
T. Margush and F.R. McMorris, "Consensus n-Trees," Bull. of Math. Biology, vol. 43, pp. 239-244, 1981.
[14]
S.B. Needleman and C.D. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," J. Molecular Biology, vol. 48, pp. 443-453, 1970.
[15]
P.A.S. Nuin, Z. Wang, and E.R.M. Tillier, "The Accuracy of Several Multiple Sequence Alignment Programs for Proteins," BMC Bioinformatics, vol 7, pp. 1-18, 2006.
[16]
G.J. Phillips, J. Arnold, and R. Ivarie, "Mono-Through Hexanucleotide Composition of the Escherichia Coli Genome: A Markov Chain Analysis," Nucleic Acids Research, vol. 15, pp. 2611-2626, 1987.
[17]
J. Qi, B. Wang, and B.L. Hao, "Whole Proteome Prokaryote Phylogeny without Sequence Alignment: A k-String Composition Approach," J. Molecular Evolution, vol. 58, pp. 1-11, 2004.
[18]
M.S. Rosenberg, "MySSP: Non-Stationary Evolutionary Sequence Simulation, Including Indels," Evolutionary Bioinformatics Online, vol. 1, pp. 51-53, 2005.
[19]
N. Saitou and M. Nei, "The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees," Molecular Biology and Evolution, vol. 4, pp. 406-425, 1987.
[20]
A.E. Shannon, "A Mathematical Theory of Communication," Bell System Technical J., vol. 27, pp. 379-423, 1948.
[21]
D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, third ed. CRC Press, 2004.
[22]
T.T. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," J. Molecular Biology, vol. 147, pp. 195- 197, 1981.
[23]
S.J. Sogin, M.L. Sogin, and C.R. Woese, "Phylogenetic Measurement in Procaryotes by Primary Structural Characterization," J. Molecular Evolution, vol. 1, pp. 173-184, 1972.
[24]
G.W. Stuart and M.W. Berry, "An SVD-Based Comparison of Nine Whole Eukaryotic Genomes Supports a Coelomate Rather than Ecdysozoan Lineage," BMC Bioinformatics, vol 5, p. 204, 2004.
[25]
G.W. Stuart, K. Moffett, and J.J. Leader, "A Comprehensive Vertebrate Phylogeny Using Vector Representations of Protein Sequences from Whole Genomes," Molecular Biology and Evolution, vol. 19, pp. 554-562, 2002.
[26]
J.A. Studier and K.J. Keppler, "A Note of the Neighbor-Joining Algorithm of Saitou and Nei," Molecular Biology and Evolution, vol. 5, pp. 729-731, 1988.
[27]
K. Tamura, J. Dudley, M. Nei, and S. Kumar, "MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) Software Version 4.0," Molecular Biology and Evolution, vol. 24, pp. 1596-1599, 2007.
[28]
S. Vinga and J. Almeida, "Alignment Free Sequence Comparison-a Review," Bioinformatics, vol. 19, pp. 513-523, 2003.
[29]
C.R. Woese, "Bacterial Evolution," Microbiology Rev., vol. 51, pp. 221-271, 1987.
[30]
X. Wu, X.F. Wan, G. Wu, D. Xu, and G. Lin, "Phylogenetic Analysis Using Complete Signature Information of Whole Genomes and Clustered Neighbor-Joining Method," Int'l J. Bioinformatics Research and Applications, vol. 2, pp. 219-248, 2006.
[31]
X. Xia, Z. Xie, and K.M. Kjer, "18S Ribosomal RNA and Tetrapod Phylogeny," Systematic Biology, vol. 52, pp. 283-295, 2003.
[32]
H.M. Xie, Grammatical Complexity and One-Dimensional Dynamical Systems. World Scientific, 1996.
[33]
Z.G. Yu, X.W. Zhan, G.S. Han, R.W. Wang, V. Anh, and K.H. Chu, "Proper Distance Metrics for Phylogenetic Analysis Using Complete Genomes without Sequence Alignment," Int'l J. Molecular Sciences, vol. 11, pp. 1141-1154, 2010.
[34]
Z.G. Yu, L.Q. Zhou, V. Anh, K.H. Chu, S.C. Long, and J.Q. Deng, "Phylogeny of Prokaryotes and Chloroplasts Revealed by a Simple Composition Approach on All Protein Sequences from Whole Genome without Sequence Alignment," J. Molecular Evolution, vol. 60, pp. 538-545, 2005.

Cited By

View all
  • (2023)A Precise Bare Simulation Approach to the Minimization of Some Distances. I. FoundationsIEEE Transactions on Information Theory10.1109/TIT.2022.321549669:5(3062-3120)Online publication date: 1-May-2023
  • (2020)PEERInformation Sciences: an International Journal10.1016/j.ins.2019.12.072517:C(393-414)Online publication date: 1-May-2020
  1. Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics
    IEEE/ACM Transactions on Computational Biology and Bioinformatics  Volume 9, Issue 1
    January 2012
    341 pages

    Publisher

    IEEE Computer Society Press

    Washington, DC, United States

    Publication History

    Published: 01 January 2012
    Published in TCBB Volume 9, Issue 1

    Author Tags

    1. Composition vector method
    2. alignment-free sequence comparison
    3. maximum entropy principle
    4. optimization model
    5. phylogenetics.

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)3
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 13 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)A Precise Bare Simulation Approach to the Minimization of Some Distances. I. FoundationsIEEE Transactions on Information Theory10.1109/TIT.2022.321549669:5(3062-3120)Online publication date: 1-May-2023
    • (2020)PEERInformation Sciences: an International Journal10.1016/j.ins.2019.12.072517:C(393-414)Online publication date: 1-May-2020

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media