Extracting Protein Interactions from Text with the Unified AkaneRE Event Extraction System

Published: 01 July 2010

Abstract

Currently, relation extraction (RE) and event extraction (EE) are the two main streams of biological information extraction. In 2009, most RE and EE research centered on the BioCreative II.5 Protein-Protein Interaction (PPI) challenge and the “BioNLP event extraction shared task.” Although these challenges took somewhat different approaches, they share the same ultimate goal of extracting bio-knowledge from the literature. This paper compares the two challenge task definitions and presents a unified system that was successfully applied in both of these and in several other PPI extraction task settings. The AkaneRE system has three parts: a core engine for RE, a pool of modules for specific solutions, and a configuration language to adapt the system to different tasks. The core engine is based on machine learning, using either Support Vector Machines or Statistical Classifiers with features extracted from the given training data. The specific modules solve tasks such as sentence boundary detection, tokenization, stemming, part-of-speech tagging, parsing, named entity recognition, generation of potential relations, generation of machine learning features for each relation, and finally, assignment of confidence scores and ranking of candidate relations. With these components, the AkaneRE system produces state-of-the-art results. The system is freely available for academic purposes at http://www-tsujii.is.s.u-tokyo.ac.jp/satre/akane/.
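The last three module steps named in the abstract (generating potential relations, generating features for each relation, and assigning confidence scores to rank the candidates) can be illustrated with a minimal sketch. This is not the AkaneRE implementation: the function names, the whitespace tokenizer, the words-between feature set, the hand-set weights (standing in for a trained classifier), and the placeholder protein names ProtA, ProtB, and ProtC are all assumptions made for illustration.

```python
from itertools import combinations
import math

# Hypothetical hand-set weights standing in for a model learned from
# training data; a real system would train an SVM or similar classifier.
WEIGHTS = {"between=interacts": 2.0, "between=binds": 1.5, "between=not": -2.5}
BIAS = -1.0


def tokenize(sentence):
    """Whitespace tokenizer that strips surrounding punctuation (a simplification)."""
    return [tok.strip(".,;") for tok in sentence.split() if tok.strip(".,;")]


def candidate_pairs(tokens, proteins):
    """Generate every unordered pair of protein mentions as a potential relation."""
    positions = [i for i, tok in enumerate(tokens) if tok in proteins]
    return list(combinations(positions, 2))


def features(tokens, pair):
    """Bag-of-words features over the tokens between the two mentions."""
    left, right = pair
    return {f"between={tok.lower()}" for tok in tokens[left + 1:right]}


def confidence(feats):
    """Squash a linear score through the logistic function into (0, 1)."""
    score = BIAS + sum(WEIGHTS.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def extract(sentence, proteins):
    """Run the toy pipeline and return candidate relations ranked by confidence."""
    tokens = tokenize(sentence)
    scored = [
        (tokens[a], tokens[b], confidence(features(tokens, (a, b))))
        for a, b in candidate_pairs(tokens, proteins)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # Made-up sentence with placeholder protein names.
    ranked = extract(
        "ProtA interacts with ProtB but not with ProtC.",
        {"ProtA", "ProtB", "ProtC"},
    )
    for first, second, conf in ranked:
        print(f"{first}\t{second}\t{conf:.3f}")
```

In this sketch every pair of mentions in a sentence becomes a candidate, so the classifier decides which pairs interact. The ranked output lets a downstream consumer apply its own confidence cutoff.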

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 7, Issue 3
July 2010
192 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Author Tags

  1. Text mining
  2. Bioinformatics (genome or protein) databases
  3. Language parsing and understanding
  4. Machine learning
