Efficient keyword extraction for meaningful document perception

Published: 19 September 2011

Abstract

Keyword extraction is a common technique in the domain of information retrieval. Keywords serve as a minimalistic summary for single documents or document collections, enabling the reader to quickly perceive the main contents of a text. However, they are often not readily available for the documents of interest.
Common keyword extraction techniques demand either a large data collection, a learning process, or access to extensive amounts of reference data. By relying on additional linguistic features (e.g. stop word removal), most approaches are language-restricted. Moreover, the extracted keywords usually pertain to the entire document, rather than only to the portion that is of interest to the reader.
In this paper, we present an efficient and flexible approach to summarizing selections of text within a document. Our solution is based on a keyword extraction algorithm that is applicable to a variety of documents, regardless of language or context. This algorithm relies on the Helmholtz principle and extends a recently presented approach. Our extension covers the features of a weighting algorithm while providing a self-regulation capability to allow for more meaningful results. Furthermore, our approach takes the document structure into account in order to enhance purely statistical summarizations. We evaluate the efficiency of our approach and present results with meaningful examples. In addition, we outline further applications of our approach that allow for enhanced document perception as well as for meaningful document indexing and retrieval.
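
As an illustration of the Helmholtz principle mentioned in the abstract, the following minimal sketch scores words by a number of false alarms (NFA), a measure commonly used in Helmholtz-based text analysis: a word is treated as meaningful in a paragraph when its concentration there would be unlikely under a uniformly random placement of its occurrences. This is not the extended algorithm of this paper, whose details the abstract does not give. The formula NFA = C(K, m) / N^(m-1), the treatment of paragraphs as equally sized containers, the whitespace tokenizer, the threshold epsilon, and the function names nfa and meaningful_words are all assumptions made for illustration.

```python
import math
from collections import Counter


def nfa(K, m, N):
    """Number of false alarms for a word seen K times overall and m times
    in one of N equally sized containers (assumed formula)."""
    return math.comb(K, m) / N ** (m - 1)


def meaningful_words(paragraphs, epsilon=1.0):
    """For each paragraph, return {word: NFA} for words with NFA < epsilon."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [p.lower().split() for p in paragraphs]
    # K for every word: total occurrences across the whole document.
    total = Counter(w for tokens in tokenized for w in tokens)
    N = len(tokenized)
    result = []
    for tokens in tokenized:
        local = Counter(tokens)  # m for every word in this paragraph
        keywords = {}
        for word, m in local.items():
            score = nfa(total[word], m, N)
            if score < epsilon:
                keywords[word] = score
        result.append(keywords)
    return result


if __name__ == "__main__":
    doc = [
        "helmholtz helmholtz principle",
        "the cat sat",
        "the dog ran",
        "the bird flew",
    ]
    for paragraph, keywords in zip(doc, meaningful_words(doc)):
        print(paragraph, "->", keywords)
```

Under this formula a word that occurs only once in a paragraph has NFA = K >= 1, so it is never selected with epsilon = 1. No stop word list is needed, which is consistent with the language independence the abstract aims for.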

    Published In

    DocEng '11: Proceedings of the 11th ACM symposium on Document engineering
    September 2011
    296 pages
    ISBN:9781450308632
    DOI:10.1145/2034691
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. heuristic algorithm
    2. information retrieval
    3. keyword extraction
    4. single document
    5. text data mining

    Qualifiers

    • Research-article

    Conference

    DocEng '11: ACM Symposium on Document Engineering
    September 19 - 22, 2011
    Mountain View, California, USA

    Acceptance Rates

    Overall Acceptance Rate 194 of 564 submissions, 34%

    Article Metrics

    • Downloads (Last 12 months): 13
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 20 Feb 2025

    Cited By

    • (2024) Die Informatik und die Krise. Informatik Spektrum. DOI: 10.1007/s00287-024-01567-x. Online publication date: 3-Jun-2024
    • (2020) Using a multimedia semantic graph for web document visualization and summarization. Multimedia Tools and Applications. DOI: 10.1007/s11042-020-09761-1. Online publication date: 24-Sep-2020
    • (2019) Web Summarization and Browsing Through Semantic Tag Clouds. International Journal of Intelligent Information Technologies 15:3 (1-23). DOI: 10.4018/IJIIT.2019070101. Online publication date: 1-Jul-2019
    • (2019) Keyword Based System to Enhance the Efficiency of Student’s Performance Report in Computer Science Education. Journal of Physics: Conference Series 1235 (012090). DOI: 10.1088/1742-6596/1235/1/012090. Online publication date: 23-Jul-2019
    • (2017) News Headline Building using Hybrid Headline Generation Technique for Quick Gist. International Journal of Natural Computing Research 6:1 (36-52). DOI: 10.4018/IJNCR.2017010103. Online publication date: 1-Jan-2017
    • (2017) Research and implementation of keyword extraction algorithm based on professional background knowledge. 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (1-5). DOI: 10.1109/CISP-BMEI.2017.8302332. Online publication date: Oct-2017
    • (2013) Document Summarization Using Semantic Clouds. Proceedings of the 2013 IEEE Seventh International Conference on Semantic Computing (100-103). DOI: 10.1109/ICSC.2013.26. Online publication date: 16-Sep-2013
    • (2013) From Ambiguous Words to Key-Concept Extraction. Proceedings of the 2013 24th International Workshop on Database and Expert Systems Applications (63-67). DOI: 10.1109/DEXA.2013.16. Online publication date: 26-Aug-2013
    • (2013) The knowledge domain of the academy of international business studies (AIB) conferences. Scientometrics 95:2 (541-561). DOI: 10.1007/s11192-012-0909-0. Online publication date: 1-May-2013
    • (2012) What's going on out there right now? A beehive based machine to give snapshot of the ongoing stories on the Web. 2012 Fourth World Congress on Nature and Biologically Inspired Computing (NaBIC) (168-174). DOI: 10.1109/NaBIC.2012.6402257. Online publication date: Nov-2012
