research-article

Efficient keyword extraction for meaningful document perception

Authors: Sebastian Rönnau, Uwe M. Borghoff

DocEng '11: Proceedings of the 11th ACM symposium on Document engineering

Pages 185 - 194

https://doi.org/10.1145/2034691.2034732

Published: 19 September 2011

Abstract

Keyword extraction is a common technique in the domain of information retrieval. Keywords serve as a minimalistic summary for single documents or document collections, enabling the reader to quickly perceive the main contents of a text. However, they are often not readily available for the documents of interest.

Common keyword extraction techniques demand either a large data collection, a learning process, or access to extensive amounts of reference data. By relying on additional linguistic features (e.g. stop word removal), most approaches are language-restricted. Moreover, the extracted keywords usually pertain to the entire document, rather than only to the portion that is of interest to the reader.

In this paper, we present an efficient and flexible approach to summarizing selections of text within a document. Our solution is based on a keyword extraction algorithm that is applicable to a variety of documents, regardless of language or context. This algorithm relies on the Helmholtz principle and extends a recently presented approach. Our extension covers the features of a weighting algorithm while providing a self-regulation capability that allows for more meaningful results. Furthermore, our approach takes the document structure into account in order to enhance purely statistical summarization. We evaluate the efficiency of our approach and present results with meaningful examples. In addition, we outline further applications of our approach that allow for enhanced document perception as well as for meaningful document indexing and retrieval.
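
To make the Helmholtz principle concrete, the following is a minimal sketch of one common formulation of the underlying measure, the "number of false alarms" (NFA) used in the earlier Helmholtz-based work by Balinsky, Balinsky, and Simske (DocEng '10) that the abstract refers to. It is not the extension presented in this paper. It assumes a word occurs m times in the selected text and K times in the whole document, that the document is N times as long as the selection, and that NFA = C(K, m) / N^(m-1); a word counts as meaningful when NFA < epsilon. The function names, the tokenizer, and the default epsilon are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letter characters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def meaningful_keywords(selection, document, epsilon=1.0):
    """Rank words of `selection` by a Helmholtz-style meaning score.

    Assumed formulation: NFA = C(K, m) / N**(m - 1), where m counts the word
    in the selection, K counts it in the whole document, and
    N = len(document) / len(selection). A word is kept when NFA < epsilon,
    and its score is -log(NFA) / m.
    """
    sel_tokens = tokenize(selection)
    doc_tokens = tokenize(document)
    if not sel_tokens or not doc_tokens:
        return []

    # Number of selection-sized parts in the document (at least one).
    n_parts = max(len(doc_tokens) / len(sel_tokens), 1.0)
    sel_counts = Counter(sel_tokens)
    doc_counts = Counter(doc_tokens)

    scored = []
    for word, m in sel_counts.items():
        # Guard for the case where the selection is not part of the document.
        k = max(doc_counts[word], m)
        # Work in log space to avoid overflow for frequent words.
        log_nfa = math.log(math.comb(k, m)) - (m - 1) * math.log(n_parts)
        if log_nfa < math.log(epsilon):
            scored.append((word, -log_nfa / m))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Under this formulation, no stop word list or training corpus is needed: a word that occurs only once in the selection never passes the test, and a word that is frequent everywhere in the document yields a large binomial coefficient and is rejected as well. Only words that are unexpectedly concentrated in the selection survive.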

Index Terms

Information systems → Information retrieval

Published In

DocEng '11: Proceedings of the 11th ACM symposium on Document engineering

September 2011, 296 pages

ISBN: 9781450308632

DOI: 10.1145/2034691

Conference Chair: Matthew Hardy, Adobe Systems, Inc., USA

Program Chair: Frank Wm. Tompa, University of Waterloo, Canada

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

In-Cooperation

SIGDOC: ACM Special Interest Group for Design of Communications

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

DocEng '11: ACM Symposium on Document Engineering

September 19 - 22, 2011

Mountain View, California, USA

Sponsor: SIGWEB

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Bibliometrics

Total Citations: 10

Total Downloads: 565

Downloads (Last 12 months): 13

Downloads (Last 6 weeks): 0

Reflects downloads up to 20 Feb 2025

Cited By

Borghoff U, Nitzl C (2024). Die Informatik und die Krise. Informatik Spektrum. Online publication date: 3-Jun-2024. https://doi.org/10.1007/s00287-024-01567-x

Rinaldi A, Russo C (2020). Using a multimedia semantic graph for web document visualization and summarization. Multimedia Tools and Applications. Online publication date: 24-Sep-2020. https://doi.org/10.1007/s11042-020-09761-1

Rinaldi A (2019). Web Summarization and Browsing Through Semantic Tag Clouds. International Journal of Intelligent Information Technologies 15:3 (1-23). Online publication date: 1-Jul-2019. https://doi.org/10.4018/IJIIT.2019070101

Timanta Tarigan J, Zamzami E, Jaya I, Melvani Hardi S (2019). Keyword Based System to Enhance the Efficiency of Student’s Performance Report in Computer Science Education. Journal of Physics: Conference Series 1235 (012090). Online publication date: 23-Jul-2019. https://doi.org/10.1088/1742-6596/1235/1/012090

Shrawankar U, Wankhede K (2017). News Headline Building using Hybrid Headline Generation Technique for Quick Gist. International Journal of Natural Computing Research 6:1 (36-52). Online publication date: 1-Jan-2017. https://dl.acm.org/doi/10.4018/IJNCR.2017010103

Zhang X, An J, Liu W (2017). Research and implementation of keyword extraction algorithm based on professional background knowledge. 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (1-5). Online publication date: Oct-2017. https://doi.org/10.1109/CISP-BMEI.2017.8302332

Rinaldi A (2013). Document Summarization Using Semantic Clouds. Proceedings of the 2013 IEEE Seventh International Conference on Semantic Computing (100-103). Online publication date: 16-Sep-2013. https://dl.acm.org/doi/10.1109/ICSC.2013.26

Šajgalík M, Barla M, Bieliková M (2013). From Ambiguous Words to Key-Concept Extraction. Proceedings of the 2013 24th International Workshop on Database and Expert Systems Applications (63-67). Online publication date: 26-Aug-2013. https://dl.acm.org/doi/10.1109/DEXA.2013.16

Wuehrer G, Smejkal A (2013). The knowledge domain of the academy of international business studies (AIB) conferences. Scientometrics 95:2 (541-561). Online publication date: 1-May-2013. https://dl.acm.org/doi/10.1007/s11192-012-0909-0

Navrat P, Sabo S (2012). What's going on out there right now? A beehive based machine to give snapshot of the ongoing stories on the Web. 2012 Fourth World Congress on Nature and Biologically Inspired Computing (NaBIC) (168-174). Online publication date: Nov-2012. https://doi.org/10.1109/NaBIC.2012.6402257
