skip to main content
article

On Subset Seeds for Protein Alignment

Published: 01 July 2009 Publication History

Abstract

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds versus Blastp.

References

[1]
G. Kucherov, L. Noé, and M. Roytberg, "A Unifying Framework for Seed Sensitivity and Its Application to Subset Seeds," J. Bioinformatics and Computational Biology, vol. 4, no. 2, pp. 553- 570, Apr. 2006 (preliminary version in WABI '05).
[2]
S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, "Basic Local Alignment Search Tool," J. Molecular Biology, vol. 215, pp. 403-410, 1990.
[3]
S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[4]
D. Brown, "Optimizing Multiple Seed for Protein Homology Search," IEEE/ACM Trans. Computational Biology and Bioinformatics , vol. 2, no. 1, pp. 29-38, Jan. 2005 (early version appeared in WABI '04).
[5]
B. Ma, J. Tromp, and M. Li, "PatternHunter: Faster and More Sensitive Homology Search," Bioinformatics, vol. 18, no. 3, pp. 440- 445, 2002.
[6]
W.J. Kent, "BLAT-The BLAST-Like Alignment Tool," Genome Research, vol. 12, pp. 656-664, 2002.
[7]
M. Li, B. Ma, D. Kisman, and J. Tromp, "PatternHunter II: Highly Sensitive and Fast Homology Search," J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-439, 2004 (earlier version in GIW '03).
[8]
L. Noé and G. Kucherov, "YASS: Enhancing the Sensitivity of DNA Similarity Search," Nucleic Acid Research, vol. 33, pp. W540- W543, 2005.
[9]
D. Mak, Y. Gelfand, and G. Benson, "Indel Seeds for Homology Search," Bioinformatics, vol. 22, no. 14, pp. e341-e349, 2006.
[10]
M. Csürös and B. Ma, "Rapid Homology Search with Neighbor Seeds," Algorithmica, vol. 48, no. 2, pp. 187-202, June 2007.
[11]
B. Brejova, D. Brown, and T. Vinar, "Vector Seeds: An Extension to Spaced Seeds," J. Computer and System Sciences, vol. 70, no. 3, pp. 364-380, 2005.
[12]
Y. Sun and J. Buhler, "Designing Multiple Simultaneous Seeds for DNA Similarity Search," Proc. Eighth Ann. Int'l Conf. Computational Molecular Biology (RECOMB '04), Mar. 2004.
[13]
G. Kucherov, L. Noé, and M. Roytberg, "Multi-Seed Lossless Filtration," Proc. 15th Ann. Combinatorial Pattern Matching Symp. (CPM '04), pp. 297-310, July 2004.
[14]
I.-H. Yang, S.-H. Wang, Y.-H. Chen, P.-H. Huang, L. Ye, X. Huang, and K.-M. Chao, "Efficient Methods for Generating Optimal Single and Multiple Spaced Seeds," Proc. IEEE Fourth Symp. Bioinformatics and Bioeng. (BIBE '04), pp. 411-416, 2004.
[15]
J. Xu, D. Brown, M. Li, and B. Ma, "Optimizing Multiple Spaced Seeds for Homology Search," Proc. 15th Symp. Combinatorial Pattern Matching, pp. 47-58, July 2004.
[16]
D. Kisman, M. Li, B. Ma, and L. Wang, "tPatternHunter: Gapped, Fast and Sensitive Translated Homology Search," Bioinformatics, vol. 21, no. 4, pp. 542-544, 2005.
[17]
P. Peterlongo, L. Noé, D. Lavenier, G. Georges, J. Jacques, G. Kucherov, and M. Giraud, "Protein Similarity Search with Subset Seeds on a Dedicated Reconfigurable Hardware," Proc. Second Workshop Parallel Computational Biology, 2007.
[18]
V.H. Nguyen and D. Lavenier, "Speeding Up Subset Seed Algorithm for Intensive Protein Sequence Comparison," Proc. Sixth IEEE Int'l Conf. Research, Innovation & Vision for the Future (RIVF '08), pp. 57-63, 2008.
[19]
L. Noé and G. Kucherov, "Improved Hit Criteria for DNA Local Alignment," BMC Bioinformatics, vol. 5, no. 149, Oct. 2004.
[20]
L. Zhou, J. Stanton, and L. Florea, "Universal Seeds for cDNA-to-Genome Comparison," BMC Bioinformatics, vol. 9, no. 36, 2008.
[21]
B. Ma and H. Yao, "Seed Optimization is No Easier Than Optimal Golomb Ruler Design," Proc. Sixth Asia Pacific Bioinformatics Conf. (APBC '08), pp. 133-144, Jan. 2008.
[22]
U. Keich, M. Li, B. Ma, and J. Tromp, "On Spaced Seeds for Similarity Search," Discrete Applied Math., vol. 138, no. 3, pp. 253- 263, 2004 (preliminary version in 2002).
[23]
T. Li, K. Fan, J. Wang, and W. Wang, "Reduction of Protein Sequence Complexity by Residue Grouping," J. Protein Eng., vol. 16, pp. 323-330, 2003.
[24]
L. Murphy, A. Wallqvist, and R. Levy, "Simplified Amino Acid Alphabets for Protein Fold Recognition and Implications for Folding," J. Protein Eng., vol. 13, pp. 149-152, 2000.
[25]
S. Cheng and Y.-F. Xu, "Constrained Independence System and Triangulations of Planar Point Sets," Proc. Computing and Combinatorics, pp. 41-50, 1995.
[26]
S. Henikoff and J. Henikoff, "Amino Acid Substitution Matrices from Protein Blocks," Proc. Nat'l Academy of Sciences USA, vol. 89, pp. 10915-10919, 1992.
[27]
S. Henikoff and J. Henikoff, "Automated Assembly of Protein Blocks for Database Searching," Nucleic Acids Research, vol. 19, no. 23, pp. 6565-6572, 1991.
[28]
J. Buhler, U. Keich, and Y. Sun, "Designing Seeds for Similarity Search in Genomic DNA," Proc. Seventh Ann. Int'l Conf. Computational Molecular Biology (RECOMB '03), pp. 67-75, Apr. 2003.
[29]
L. Ilie and S. Ilie, "Long Spaced Seeds for Finding Similarities Between Biological Sequences," Proc. Second Int'l Conf. Bioinformatics & Computational Biology (BIOCOMP '07), pp. 3-8, 2007.
[30]
A. Bahr, J. Thompson, J. Thierry, and O. Poch, "BAliBASE (Benchmark Alignment dataBASE): Enhancements for Repeats, Transmembrane Sequences and Circular Permutations," Nucleic Acids Research, vol. 29, no. 1, pp. 323-326, 2001.
[31]
A. Stebbings and K. Mizuguchi, "HOMSTRAD: Recent Developments of the Homologous Protein Structure Alignment Database," Nucleic Acids Research, vol. 32, pp. D203-D207, 2004.
[32]
A.R. Subramanian, J. Weyer-Menkhoff, M. Kaufmann, and B. Morgenstern, "DIALIGN-T: An Improved Algorithm for Segment-Based Multiple Sequence Alignment," BMC Bioinformatics, vol. 6, no. 66, 2005.
[33]
G. Raghava, S. Searle, P. Audley, J. Barber, and G. Barton, "OXBench: A Benchmark for Evaluation of Protein Multiple Sequence Alignment Accuracy," BMC Bioinformatics, vol. 4, no. 47, 2003.
[34]
R. Finn, J. Mistry, B. Schuster-Bckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S. Eddy, E. Sonnhammer, and A. Bateman, "PFAM: Clans, Web Tools and Services," Nucleic Acids Research, vol. 34, pp. D247-D251, 2006.
[35]
R.C. Edgar, "MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput," Nucleic Acids Research, vol. 32, no. 5, pp. 1792-1797, 2004.
[36]
I. Letunic, R. Copley, B. Pils, S. Pinkert, J. Schultz, and P. Bork, "SMART 5: Domains in the Context of Genomes and Networks," Nucleic Acids Research, vol. 34, no. 1, pp. D257-D260, 2006.
[37]
R. Nunez Miguel, J. Shi, and K. Mizuguchi, "Protein Fold Recognition and Comparative Modeling using HOMSTRAD, JOY and FUGUE," Protein Structure Prediction: Bioinformatic Approach, pp. 143-169, Int'l University Line Publishers, 2001.
[38]
S. Shiryev, J. Papadopoulos, A. Schäffer, and R. Agarwala, "Improved BLAST Searches Using Longer Words for Protein Seeding," Bioinformatics, vol. 23, no. 21, pp. 2949-2951, 2007.
[39]
P. Peterlongo, L. Noé, D. Lavenier, N.V.H., G. Kucherov, and M. Giraud, "Optimal Neighborhood Indexing for Protein Similarity Search," BMC Bioinformatics, vol. 9, no. 534, 2008.

Cited By

View all
  • (2021)Improved DNA-versus-Protein Homology Search for Protein FossilsAlgorithms for Computational Biology10.1007/978-3-030-74432-8_11(146-158)Online publication date: 7-Jun-2021
  • (2018)A Simplified Description of Child Tables for Sequence Similarity SearchIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2018.279606415:6(2067-2073)Online publication date: 1-Nov-2018
  • (2015)Parallel implementation of MAFFT on CUDA-enabled graphics hardwareIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2014.235180112:1(205-218)Online publication date: 1-Jan-2015

Recommendations

Comments

Information & Contributors

Information

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics  Volume 6, Issue 3
July 2009
159 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 July 2009
Published in TCBB Volume 6, Issue 3

Author Tags

  1. Protein sequences
  2. local alignment
  3. multiple seeds
  4. protein databases
  5. seed alphabet
  6. seeds
  7. selectivity.
  8. sensitivity
  9. similarity search
  10. subset seeds

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)2
Reflects downloads up to 13 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2021)Improved DNA-versus-Protein Homology Search for Protein FossilsAlgorithms for Computational Biology10.1007/978-3-030-74432-8_11(146-158)Online publication date: 7-Jun-2021
  • (2018)A Simplified Description of Child Tables for Sequence Similarity SearchIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2018.279606415:6(2067-2073)Online publication date: 1-Nov-2018
  • (2015)Parallel implementation of MAFFT on CUDA-enabled graphics hardwareIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2014.235180112:1(205-218)Online publication date: 1-Jan-2015

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media