Copyright 2023 AIT Austrian Institute of Technology GmbH
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Keyword extraction is central to document analysis and information retrieval because it gives a concise representation of the context and topics in a text. Finding relevant keywords manually across many long documents is arduous, error-prone, and time-consuming. To address this, we developed a prototype that uses KeyBERT to automate the process: it relies on BERT embeddings to identify the most relevant terms in each document.
KeyBERT is a minimal and user-friendly keyword extraction technique that utilizes BERT embeddings to generate keywords and key phrases most similar to a document.
KeyBERT is developed by Maarten Grootendorst: https://github.com/MaartenGr/KeyBERT
- candidates: Candidate keywords/keyphrases to use instead of extracting them from the document(s).
- keyphrase_ngram_range: Length, in words, of the extracted keywords/keyphrases.
- stop_words: Stopwords to remove from the document.
- top_n: Return the top n keywords/keyphrases.
- min_df: Minimum document frequency of a word across all documents.
- use_mmr: Whether to use Maximal Marginal Relevance (MMR) for the selection of keywords/keyphrases.
- diversity: The diversity of the results, between 0 and 1, if use_mmr is set to True.
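As a rough illustration of how these parameters fit together, the sketch below collects them into keyword arguments for KeyBERT's extract_keywords method. The helper names build_extract_kwargs and extract_keywords_for are hypothetical and the default values are arbitrary; this is not necessarily how the prototype calls KeyBERT. The keybert package is imported lazily, so the snippet loads without it installed and without downloading a model.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_extract_kwargs(
    candidates: Optional[List[str]] = None,
    keyphrase_ngram_range: Tuple[int, int] = (1, 1),
    stop_words: Optional[List[str]] = None,
    top_n: int = 5,
    min_df: int = 1,
    use_mmr: bool = False,
    diversity: float = 0.5,
) -> Dict[str, Any]:
    """Collect the parameters listed above into one dict (hypothetical helper)."""
    if not 0.0 <= diversity <= 1.0:
        raise ValueError("diversity must lie between 0 and 1")
    kwargs: Dict[str, Any] = {
        "candidates": candidates,
        "keyphrase_ngram_range": keyphrase_ngram_range,
        "stop_words": stop_words,
        "top_n": top_n,
        "min_df": min_df,
        "use_mmr": use_mmr,
    }
    # diversity only has an effect when MMR is switched on.
    if use_mmr:
        kwargs["diversity"] = diversity
    return kwargs


def extract_keywords_for(doc: str, **kwargs: Any):
    """Run KeyBERT on one document; requires the keybert package."""
    from keybert import KeyBERT  # imported lazily: loading the model is slow

    model = KeyBERT()
    return model.extract_keywords(doc, **kwargs)
```

For example, extract_keywords_for(text, **build_extract_kwargs(keyphrase_ngram_range=(1, 2), use_mmr=True, diversity=0.7)) would request one- and two-word phrases with MMR-based diversification.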
We've created a prototype that processes multiple documents and a candidate list of terms for keyword extraction, resulting in the following lists:
- Suggested Keywords
- Suggested Key Phrases
- Keywords extracted from candidates
- Keywords matched against several custom candidate lists
For user convenience, a cross button and a check button appear next to the terms in each list.
- Clicking the cross button adds the term to the stopword.txt file, marking it as a term to ignore in future extractions.
- Clicking the check button adds the term to the candidates.txt file for potential future use.
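The button behaviour described above could be backed by something like the following minimal sketch. It assumes both files hold one lower-cased term per line and that duplicates should be skipped; the function names append_term, reject_term and accept_term are hypothetical and are not taken from the prototype.

```python
from pathlib import Path


def append_term(path: Path, term: str) -> bool:
    """Append a term to a one-term-per-line file; return False if nothing was added."""
    term = term.strip().lower()
    if not term:
        return False
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    known = {line.strip().lower() for line in text.splitlines()}
    if term in known:
        return False
    # Start on a fresh line if the file does not already end with a newline.
    prefix = "" if not text or text.endswith("\n") else "\n"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(prefix + term + "\n")
    return True


def reject_term(term: str, stopword_file: Path = Path("stopword.txt")) -> bool:
    """Cross button: remember the term as one to ignore in future extractions."""
    return append_term(stopword_file, term)


def accept_term(term: str, candidate_file: Path = Path("candidates.txt")) -> bool:
    """Check button: remember the term as a candidate for future use."""
    return append_term(candidate_file, term)
```

Returning a boolean lets a user interface tell the user whether the term was newly stored or already present.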
- Stopword Text File: Contains terms to be ignored in future keyword searches across all documents.
- Candidate Text File: Stores relevant terms for efficient keyword retrieval and production of relevant keyword lists.
- Diversity: Controls how varied the selected keywords are when KeyBERT uses Maximal Marginal Relevance (MMR).
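To make the effect of the diversity setting concrete, here is a small pure-Python sketch of Maximal Marginal Relevance. It follows the usual formulation, in which each further pick trades relevance to the document against similarity to terms already chosen; it is an illustration on toy vectors, not KeyBERT's own code, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    doc_vec: Sequence[float],
    term_vecs: Sequence[Sequence[float]],
    terms: Sequence[str],
    top_n: int = 5,
    diversity: float = 0.5,
) -> List[str]:
    """Pick up to top_n terms, balancing relevance against redundancy."""
    relevance = [cosine(vec, doc_vec) for vec in term_vecs]
    selected: List[int] = []
    remaining = list(range(len(terms)))

    def score(i: int) -> float:
        if not selected:
            # The first pick is simply the term most similar to the document.
            return relevance[i]
        redundancy = max(cosine(term_vecs[i], term_vecs[j]) for j in selected)
        return (1 - diversity) * relevance[i] - diversity * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [terms[i] for i in selected]
```

With a diversity close to 0 the result is essentially a ranking by similarity to the document; values closer to 1 favour terms that differ from those already selected.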
The input consists of:
- A PDF document (mandatory)
- A stop word list (optional)
- Candidate keywords (mandatory)
The output will consist of:
- Suggested keywords
- Suggested key phrases
- Keywords from candidates
- Highlighted text representation of keywords in the document
- Suggested Keywords: Calculated using cosine similarity between word embeddings and document embeddings. Words with the highest similarity scores are displayed.
- Suggested Key Phrases: Calculated like the suggested keywords, but for multi-word phrases.
- Keywords from Candidates: Calculated as the cosine similarity between the embedding of each term in the candidate list and the document embedding; only terms with a score above 0.1 are displayed.
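The candidate scoring described above can be sketched in a few lines of plain Python. The embeddings here are toy vectors supplied by the caller; in the prototype they would come from a BERT model. The function name score_candidates is hypothetical, and only the 0.1 threshold is taken from the description.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_candidates(
    doc_vec: Sequence[float],
    candidate_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.1,
) -> List[Tuple[str, float]]:
    """Return (term, score) pairs scoring above the threshold, best first."""
    scored = [(term, cosine(vec, doc_vec)) for term, vec in candidate_vecs.items()]
    kept = [(term, s) for term, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Candidates whose similarity to the document is at or below the threshold are dropped, so an unrelated candidate list yields an empty result rather than weak matches.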
- KeyBERT - https://maartengr.github.io/KeyBERT/
- KeyBERT GitHub - https://github.com/MaartenGr/keyBERT
Voctractor was designed and developed in the MAIA project, which has received funding from the European Union's Horizon Europe Research and Innovation programme under grant agreement No 101056935. Its development was facilitated by numerous discussions with Kate Williamson and Sukaina Bharwani from the Stockholm Environment Institute, Andrea Geyer-Scholz from Smart Cities Consulting, and Marcelo Rita-Pias from the Federal University of Rio Grande (FURG), Brazil.
- The project is developed by Aradina Chettakattu, MSc, Austrian Institute of Technology (email: aradina.chettakattu@ait.ac.at)
- Under the supervision of Dr Denis Havlik, Austrian Institute of Technology (email: denis.havlik@ait.ac.at)