Karst Exploration: Extracting Terms and Definitions From Karst Domain Corpus
Abstract
In this paper, we present the extraction of specialized knowledge from a corpus of karstology
literature. Domain terms are extracted by comparing the domain corpus to a reference corpus,
and several heuristics to improve the extraction process are proposed (filtering based on nested
terms, stopwords and fuzzy matching). We also use a word embedding model to extend the list
of terms, and evaluate the potential of the approach from a term extraction perspective, as
well as in terms of semantic relatedness. This step is followed by an automated term alignment
and analysis of the Slovene and English karst terminology in terms of cognates. Finally, the
corpus is used for extracting domain definitions, as well as triplets, where the latter can be
considered as a potential resource for complementary knowledge-rich context extraction and
visualization.
Keywords: karstology; term extraction; term embeddings; term alignment; definition
extraction; triplets; specialized corpora
1. Introduction
The totality of means of expression in a language can be divided into general language
and specialized language. Even if there is no distinct boundary between the two, it can
be said that general language defines the sum of the means of linguistic expression
encountered by most speakers of a given language, whereas specialized language goes
beyond the general vocabulary based on the socio-linguistic or the subject-related
aspect. The latter arises as a consequence of constant development and specialization
in the fields of science, technology, and sociology (Svensen, 1993: 48-49). Similar to the
definition of technical language by Svensen, in the context of terminology, specialized
language, also called language for special purposes, is defined as a “language used in a
subject field and characterized by the use of specific linguistic means of expression”
(ISO 1087-1:2000).
Proceedings of eLex 2019
Castellví (2003: 189) claims that all terms are words by nature and notes that “we
recognize the terminological units from their meaning in a subject field, their internal
structure and their lexical meaning”. According to Myking (2007: 86), the traditional
terminology is concept-based and the new directions are lexeme-based.
Granger (2012) highlights the six most significant innovations of electronic lexicography
in comparison to the traditional methods: a) corpus integration, meaning the inclusion
of authentic texts in the dictionaries; b) more and better data, since there are no more
space limitations and one has the possibility to add multimedia data; c) efficiency of
access (quick search and different possibilities of database organization); d)
customization, meaning that the content can be adapted to the user’s needs; e)
hybridization, denoting that the limits between different types of language resources—
e.g., dictionaries, encyclopaedias, term banks, lexical databases, translation tools—are
breaking down; and f) user input, since collaborative or community-based input is
integrated. The same can be claimed for terminological work, where recent approaches in
terminology science consider knowledge (represented in texts) as conceptually dynamic
and linguistically varied (Cabré, 1999; Kageura, 2002), and where novel methods of
data acquisition, organization and representation are constantly being developed.
Knowledge can be extracted from specialized resources automatically, benefiting from
the advances in the field of natural language processing. Moreover, attempts in dynamic,
visual representation of domain knowledge have been proposed in recent years, e.g.,
EcoLexicon1 (Faber et al., 2016).
1
http://ecolexicon.ugr.es/en/index.htm
Within the TermFrame2 project, we focus on the specialized knowledge of karst science,
and plan to develop methods that allow for context- and language-dependent
investigation into a domain, relying on semi-automated tools. In this paper, we apply
some of the methods that we have previously developed to a new domain, resulting in
a repository of karst term and definition candidates in Slovene and English that
contributes to karst terminology. Next, we propose a word-embedding-based
term list extension and a triplet extraction method that can be used
for visualization. These are novel components, contributing to terminological domain
modelling.
This paper is structured as follows. After presenting the related work in automated
specialized knowledge extraction in Section 2, we present the resources used (Section
3), methods (Section 4), results (Section 5) and conclude the paper with a discussion
and plans for future work (Section 6).
2. Related work
Terminological work has undergone a significant change with the emergence of
computational approaches resulting in semi-automated extraction of terms, definitions
and other knowledge structures from raw text. Automatic terminology extraction has
been implemented for various languages, including English (e.g., Sclano & Velardi, 2007;
Frantzi & Ananiadou, 1999; Drouin, 2003) and Slovene (e.g., Vintar, 2010; Pollak et
al., 2012), which are the languages in our corpus. In the last few years, word embeddings
(Mikolov et al., 2013) have become a very popular natural language processing
technique, and several attempts have already been made to utilize word embeddings
for terminology extraction (e.g., Amjadian et al., 2016; Zhang et al., 2017). We use
word embedding techniques for extending term lists.
Numerous approaches have also been proposed in bilingual term extraction and
alignment, including Gaussier (1998), Kupiec (1993), Lefever et al. (2009), Vintar
(2010), Baisa et al. (2015), as well as Aker et al. (2013), who treat bilingual term
alignment as a binary classification task. A modified version of the latter approach,
described in Repar et al. (2018), is also used in this paper.
Automated definition extraction approaches have been developed for several languages,
including English (e.g., Navigli & Velardi, 2010), Slovene (e.g., Fišer et al., 2010) and
multilingual methods (e.g., Faralli & Navigli, 2013). In our work we use a pattern-based
definition extraction method for English and Slovene (Pollak et al., 2012).
2
http://termframe.ff.uni-lj.si/
This study presents the knowledge extraction steps within the TermFrame project,
complementing previous work in karstology modelling presented in Vintar and Grčić-
Simeunović (2017), and contributing to the emerging karstology knowledge base. The
extracted knowledge was used in the frame-based annotation approach, identifying the
semantic categories, relations and relation definitors in definitions of karst concepts, as
presented in Vintar et al. (2019), as well as in topic modelling using term co-occurrence
network presented in Miljković et al. (2019). The work is also closely related to Faber
et al. (2016), a multilingual visual thesaurus of environmental science, which was
developed following a frame-based, cognitively-oriented approach to terminology.
3. Resources
The corpus of karstology was constructed within the TermFrame project; it consists of
Slovene, Croatian and English texts. We focus on the Slovene and English parts of the
TermFrame corpus (v1.0). The English subcorpus contains cca. 1.6 M words and the
Slovene one cca. 1 M words (see Table 1 for details).
In addition, we are using a short gold standard list of Karst domain terms, called the
QUIKK term base3. The QUIKK term base consists of terms in four languages, but for
the purposes of our experiments, the Slovene and English term lists are used, containing
57 and 185 terms, respectively.
3
http://islovar.ff.uni-lj.si/karst
4. Methods
First, we present the procedure of extracting terms by comparing the words in the noun
phrases in the domain and reference corpora, and next we present a method using word
embeddings to extend the list of terms.
For extracting domain terms we use the LUIZ-CF term extractor (Pollak et al., 2012),
which is a variant of LUIZ (Vintar, 2010) refined with scoring and ranking functions.
The term extraction uses part-of-speech patterns for detecting noun phrases and
compares the frequencies of words (lemmas) in the noun phrase in the domain corpus
of karstology and a reference corpus.
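The frequency comparison at the core of this step can be illustrated with a minimal sketch. It is not the LUIZ-CF scoring function, whose details are given in Pollak et al. (2012); the log-ratio of relative lemma frequencies, the smoothing constant, the function name `termhood` and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def termhood(candidate, domain_freq, ref_freq, smoothing=1.0):
    """Average log-ratio of relative lemma frequencies (domain vs. reference).

    Illustrative score only; this is NOT the LUIZ-CF formula.
    """
    domain_total = sum(domain_freq.values())
    ref_total = sum(ref_freq.values())
    ratios = []
    for lemma in candidate.split():
        # Add-one style smoothing so unseen lemmas do not cause division by zero.
        rel_domain = (domain_freq[lemma] + smoothing) / (domain_total + smoothing)
        rel_ref = (ref_freq[lemma] + smoothing) / (ref_total + smoothing)
        ratios.append(math.log(rel_domain / rel_ref))
    return sum(ratios) / len(ratios)


# Toy lemma counts (made up): a small domain corpus and a large reference corpus.
domain = Counter({"karst": 120, "cave": 90, "the": 900, "aquifer": 40})
reference = Counter({"karst": 2, "cave": 30, "the": 60000, "aquifer": 3, "house": 500})

for cand in ["karst aquifer", "cave", "the"]:
    print(cand, round(termhood(cand, domain, reference), 2))
```

In this toy setting, lemmas that are frequent in the domain corpus but rare in the reference corpus receive high scores, while general words such as "the" score below zero, which is why general stop words rarely surface as term candidates.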
The output is a list of term candidates in Slovene and English, above a selected
frequency4 and/or termhood threshold. In addition, we applied the following filtering
and term merging procedures:
Nested term filtering: Nested terms are the terms that appear within other longer
terms and may or may not appear by themselves in the corpus (Frantzi et al.,
2000). As in Repar et al. (2019), the difference between a term and its nested
term is defined by a frequency difference threshold: if a term in a corpus appears
predominantly within a longer string, only the longer term is returned. If not (if
a shorter term appears independently of a longer term more frequently than the
set parameter), both terms are included in the final output.5
Stop word filtering: If a term candidate is found on the stop word list, the term
is excluded from the final list.6
Term merging by fuzzy matching: Frequently, we can find terms that are
extracted as separate terms but are in fact duplicates because they are written
in different variants. This can be due to spelling variations (e.g., British and
American English, using hyphenation or not), typos (which are relatively
frequent when we deal with large text collections), errors due to pdf-to-text
conversions etc. The proposed term merging is based on Levenshtein edit
distance (Levenshtein, 1966): if two terms are nearly identical (default threshold
is 95%), they will be merged and mapped to a common identifier. In addition,
a rule which handles the case when two terms have a different prefix but the
same tail and should not be recognized as duplicates can be applied.
4
We set minimum frequency to 15.
5
In our experiments, the parameter is set to 15 to match minimum frequency.
6
General stop words are not problematic, as they are frequent also in a reference corpus, and
therefore not identified as terms by LUIZ-CF. However, the words specific to the academic
discourse, are not frequent in general language and therefore often appear as extracted term
candidates. To exclude them, we use the following short stop word list: example, use,
source, method, approach, table, figure, percentage, et, al., km.
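The nested term filter and the fuzzy merging step described above can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: the way independent occurrences are counted, the normalization of the edit distance by the length of the longer string, and all function names are assumptions. With this normalization a single edit in a short term falls below 0.95, so the threshold is exposed as a parameter.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Edit distance turned into a 0..1 similarity (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def merge_near_duplicates(terms, threshold=0.95):
    """Map each term to the first earlier term it nearly matches."""
    canonical, representatives = {}, []
    for term in terms:
        match = next((r for r in representatives
                      if similarity(term, r) >= threshold), None)
        if match is None:
            representatives.append(term)
            match = term
        canonical[term] = match
    return canonical


def contains(longer, shorter):
    """True if `shorter` occurs in `longer` as a contiguous word sequence."""
    lw, sw = longer.split(), shorter.split()
    return len(lw) > len(sw) and any(
        lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))


def filter_nested(freq, min_independent=15):
    """Drop terms that occur independently fewer than `min_independent` times.

    `freq` maps each candidate to its total corpus frequency, including
    occurrences inside longer candidates (a simplifying assumption).
    """
    kept = {}
    for term, count in freq.items():
        nested_in = sum(c for other, c in freq.items() if contains(other, term))
        if count - nested_in >= min_independent:
            kept[term] = count
    return kept


variants = ["underground drainage system", "under-ground drainage system",
            "surface drainage system"]
merged = merge_near_duplicates(variants)
filtered = filter_nested({"collapse doline": 40, "collapse": 45, "doline": 200})
print(merged)
print(filtered)
```

In the toy frequency table, "collapse" occurs only five times outside "collapse doline" and is therefore dropped, while "doline" is frequent on its own and is kept alongside the longer term.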
Word embeddings are vector representations of words, where each word is assigned a
multidimensional vector of real numbers, characterizing the word based on the lexical
context in which it appears. When vectors are computed on very large corpora, and
especially with recent advances in models using neural networks, these representations
have proved highly successful in various natural language processing tasks.
The embeddings capture a certain degree of semantics, as words that are similar or
semantically related are closer together in the vector space. Diaz et al. (2016)
showed that embeddings can be successfully used for
expanding queries on topic-specific texts. In this research, we test whether word embeddings
can be used for the similar task of extending the gold standard term lists to find more
domain terms. Diaz et al. (2016) also report that embeddings
trained only on small topic-specific corpora outperform general
embeddings trained on very large general corpora for the task of query expansion, due
to strong language use variation in specialized corpora. Therefore, we follow the same
approach for extending the term list and train custom embeddings on the specialized
corpus instead of using pretrained embeddings.
7
To be exact, 50 English terms, and 47 Slovene terms, since only 47 Slovenian terms from
the QUIKK term base appear in the Slovenian corpus.
8
There are several possible multi-word term aggregation approaches, such as summation of
component word vectors, averaging of component word vectors, creating multi-word term
vectors, etc. As comparing different techniques is beyond the scope of this study, we decided
for the simple averaging technique, as previous research on this topic conducted on the
medical domain (Henry et al., 2018) found no statistically significant difference between any
multi-word term aggregation method.
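The averaging of component word vectors and the ranking of related words can be sketched in a few lines. The three-dimensional vectors below are invented for illustration (real embeddings are trained on the corpus and have far more dimensions), and the function names are hypothetical. The sketch ranks by cosine similarity, which gives the same ordering as ranking by cosine distance (1 minus similarity) from smallest to largest.

```python
import math


def term_vector(term, vectors):
    """Average the vectors of the words that make up a (multi-word) term."""
    words = [w for w in term.split() if w in vectors]
    if not words:
        raise KeyError(f"no component of {term!r} has a vector")
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_related(term, vectors, k=20):
    """Return the k vocabulary words closest to the averaged term vector."""
    query = term_vector(term, vectors)
    components = set(term.split())
    scored = [(w, cosine_similarity(query, vec))
              for w, vec in vectors.items() if w not in components]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors (made up for the example).
toy_vectors = {
    "collapse": [0.9, 0.1, 0.0],
    "doline": [0.8, 0.3, 0.1],
    "sinkhole": [0.85, 0.2, 0.05],
    "table": [0.0, 0.1, 0.9],
}
ranked = most_related("collapse doline", toy_vectors, k=2)
print(ranked)
```

Excluding the component words of the source term from the candidate list is a design choice of this sketch; without it, the components themselves would usually occupy the top ranks.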
English terms are mapped to Slovene equivalents using a data mining approach by Aker
et al. (2013), reimplemented in Repar et al. (2018). Bilingual term alignment is treated
as a binary classification task, with a support vector machine classifier trained on various
dictionary-based and cognate-based features that express correspondences between the words
(composing a term) in the target and source languages. The former take advantage of
dictionaries (Giza++) created from large parallel corpora, and the latter exploit
string-based word similarity between languages (cf. Gaizauskas et al., 2012). In addition, the
cognate-based features (see Table 2) allow users to identify cognate term pairs, which
are interesting because karst terms in different languages clearly share their origin, although
there are also well-known examples of non-equivalent cognates (e.g., Slovene “dolina” vs.
English “doline”).
Feature                               Description
Longest Common Subsequence Ratio      Measures the longest common non-consecutive sequence of characters between two strings
Longest Common Substring Ratio        Measures the longest common consecutive string (LCST) of characters that two strings have in common
Dice similarity                       2*LCST / (len(source) + len(target))
Normalized Levenshtein distance (LD)  1 - LD / max(len(source), len(target))
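The four cognate-based features can be computed with standard dynamic programming, as in the following sketch. The Dice and normalized Levenshtein formulas follow the table; dividing the two "ratio" features by the length of the longer string is an assumption, since the table does not state the normalization. Function and key names are hypothetical, and both strings are assumed to be non-empty.

```python
def lcs_length(a, b):
    """Longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcst_length(a, b):
    """Longest common substring (adjacent characters only)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def levenshtein(a, b):
    """Edit distance with unit costs for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cognate_features(source, target):
    """Compute the four string-similarity features for one word pair."""
    longest = max(len(source), len(target))  # assumes non-empty strings
    lcst = lcst_length(source, target)
    return {
        "lcs_ratio": lcs_length(source, target) / longest,
        "lcst_ratio": lcst / longest,
        "dice": 2 * lcst / (len(source) + len(target)),
        "norm_levenshtein": 1 - levenshtein(source, target) / longest,
    }


features = cognate_features("dolina", "doline")
print(features)
```

For the pair "dolina" and "doline", which differ only in the final letter, all four values are high, which is the kind of signal the classifier can use to flag cognate candidates.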
We use the pattern-based module of the definition extractor (Pollak et al., 2012), which
is available online.9 Soft pattern matching is used to extract sentences of the forms NP
is NP, NP refers to NP, NP denotes NP, etc. The parameters specify the language (EN,
SL) and, for Slovene, the position of the term: at the beginning
of the sentence, after one of a larger set of predefined start patterns (our choice), or anywhere
in the sentence.
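To make the idea of sentence-level definition patterns concrete, the sketch below matches a few "NP is NP"-style forms with a regular expression. It is a deliberately crude stand-in for the extractor of Pollak et al. (2012): there is no part-of-speech tagging or noun phrase detection, the definiendum is simply taken as the first one to four words of the sentence, and the list of definitors and all names are assumptions. The first two example sentences are invented; the third is the non-definition example quoted in the evaluation of definition candidates.

```python
import re

# Longer alternatives come first so that "is defined as" is not cut to "is".
DEFINITORS = r"(?:is defined as|refers to|denotes|is|are)"

PATTERN = re.compile(
    r"^(?:An?\s+|The\s+)?"
    r"(?P<definiendum>[\w-]+(?:\s[\w-]+){0,3}?)"
    r"\s+(?P<definitor>" + DEFINITORS + r")\s+"
    r"(?P<rest>.+)$",
    re.IGNORECASE,
)


def definition_candidates(sentences):
    """Return (definiendum, definitor, sentence) for every matching sentence."""
    hits = []
    for sentence in sentences:
        match = PATTERN.match(sentence.strip())
        if match:
            hits.append((match.group("definiendum"),
                         match.group("definitor").lower(),
                         sentence))
    return hits


sentences = [
    "A doline is a closed depression formed in karst terrain.",
    "Speleogenesis refers to the origin and development of caves.",
    "The oldest rocks are the sandstones of Permian age, which are only locally present.",
    "We sampled water from three springs in 2015.",
]
candidates = definition_candidates(sentences)
for definiendum, definitor, _ in candidates:
    print(definiendum, "|", definitor)
```

The third sentence matches the pattern even though it is not a definition, which illustrates why pattern-based candidates still need manual validation.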
As predefined definition patterns (cf. Section 4.3) were designed for extracting specific
knowledge contexts, we complement the approach by open-relation extraction (this
experiment is conducted only for English, as for Slovene the tools are not available).
9
http://clowdflows.org/workflow/8165/
We use ReVerb (Fader et al., 2011), which extracts relation phrases and their arguments
and results in triplets of the form <argument1, relation phrase, argument2>.
We believe that in the case that argument1 and argument2 match domain terms, the
triplets can be exploited as a method for extraction of knowledge-rich contexts (an
alternative to definitions). They are also a useful input for visualization of
terminological knowledge and can meet the needs of frame-based terminology, aiming
at facilitating user knowledge acquisition through different types of multimodal and
contextualized information, in order to respond to cognitive, communicative, and
linguistic needs (Gil-Berrozpe et al., 2017). Previously, triplets have been used in other
domains, e.g., in systems biology for building networks from domain literature
(Miljković et al., 2012).
We extracted 4,397 English term candidates and 2,946 Slovene term candidates. A
domain expert and a linguist specialized in terminology with high domain
understanding manually evaluated all term candidates for Slovene and the top 1,823
(above a selected threshold)10 term candidates for English. The following categories
were used:
To distinguish between karst and broader domain terms, the following criterion is used.
While karstology is in itself an interdisciplinary field, in TermFrame the focus is on
karst geomorphology entailing surface and underground landforms, and karst hydrology
with its typical forms and processes. Terms from neighbouring domains (geography,
biology, geochemistry, etc.) which are not exclusive to karst are considered broader
domain terms. In case of disagreement, the two annotators reached a consensus on the
final category. As presented in Table 3, the resulting list of terms contains 351 karst
terms for English and 158 for Slovene. The newly extracted karst terms, such as cave,
uvala, doline, denudation describing landforms, processes, environment, etc., can serve
for the extension of the manual QUIKK karstology term base, while for example the
term candidate karst region is not considered a term because it is too generic and
compositional, denoting a different underlying semantic relation (a region which
contains karst).
10
The reason for the discrepancy in the number of evaluated terms is that the evaluation
yielded a much lower number of terms (categories 1 or 2) for Slovene than for
English. Since we need a large number of terms for additional steps, i.e. term alignment, we
instructed the evaluators to process the full list of term candidates for Slovene. If we take
the same number of top terms for Slovene as for English (top 1,823), we get the following
results (cf. Table 3): Not a term: 1,187, Karst term: 140, Domain term: 174, Named entity:
220, Precision: 0.293.
The precision of term extraction is 0.516 for English and 0.235 for Slovene. For examples
of terms in each category, see Table 4, while top terms sorted by termhood score for
English and Slovene are presented in Tables 5 and 6, respectively.
Table 3: Term extraction results. Precision is calculated as the sum of all three positive
categories (1, 2, 3) divided by the number of evaluated terms.
In addition, we evaluate our filtering methods. All nested terms (306 for English, 105
for Slovene) removed by the nested term filtering are correctly eliminated, the stop
word filter did not detect any terms which should not be removed, and all near
duplicates (11 for English, 22 for Slovene) detected with the fuzzy match filter are also
correct (e.g., “ground-water” was detected as a duplicate of “ground water”).
Lang     Not a term        Karst term         Broader domain term    Named entity
Slovene  dinarska smer     slepa dolina       naplavna ravnica       Planinsko polje
         ilovnat material  udornica           ravnovesna meja        Podgorski kras
         kataster jam      kalcijev karbonat  mehansko preperevanje  Gorski kotar
Table 5: Top 20 English karst term candidates with frequencies and categorization to karst
terminology (1), broader domain terminology (2), named entity (3) or non-term (0).
The method was tested on 50 English and 47 Slovene source terms (i.e. the terms from
the gold standard list). For each source term, four of the 20 most related words (according to the
cosine distance between the source term and the related word) were selected for
evaluation (the first, second, tenth and twentieth ranked words),
resulting in 200 term-word pairs for English and 188 for Slovene.11 Examples of ranked
related words for five English and five Slovene terms are presented in Table 7.
11
In this section, we intentionally name related words as words and not as terms, to contrast
them to the gold standard list of terms to which they are compared. As shown in the
evaluation, they can be evaluated as terms or not in the next step.
The two human evaluators evaluated the related words according to two criteria:
The first criterion is measured on a scale with four nominal classes (see Section 5.1.1),
while the second criterion uses a numerical scale from zero to ten, following the
evaluation procedure of Finkelstein et al. (2002), where zero suggests no semantic
similarity and ten suggests very close semantic relation (fractional scores were also
allowed). The inter-annotator agreement between the two evaluators (according to the
Cohen’s kappa coefficient) is 0.689 for the first criterion and 0.513 for the second
criterion for English, and 0.594 for the first criterion and 0.389 for the second criterion
for the Slovene evaluation.
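For readers unfamiliar with the agreement measure, Cohen's kappa for two annotators and nominal labels can be computed as in the sketch below. The label sequences are invented and do not come from the evaluation; how kappa was applied to the numerical similarity scale is not specified here, so the sketch covers only the nominal case.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same class independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical class throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight related words, using four nominal classes (0-3).
annotator_1 = [1, 1, 0, 2, 0, 1, 0, 3]
annotator_2 = [1, 2, 0, 2, 0, 1, 1, 3]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy example the annotators agree on six of eight items (0.75 observed agreement), but kappa is lower because part of that agreement is expected by chance.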
Table 8 presents the results for the evaluation of embeddings-based term extension.
Out of 200 English term-word pairs, 112 were manually labelled as term-term pairs by
at least one evaluator, which suggests that, at least for English, embeddings can be
used for extending the term list. Out of these 112 related terms, 52 were labelled as
karst specific terms by at least one evaluator. For Slovenian, the results are worse, since
out of 188 term-word pairs only 69 were labelled as term-term pairs, and out of these
only 36 are karst specific.
Out of 112 English term-term pairs, 62 were ranked first or second and 50 were ranked
tenth or twentieth according to the cosine distance. Out of 69 Slovenian
term-term pairs, 39 were ranked first or second and 30 were ranked tenth or twentieth.
This suggests that words whose embeddings are most similar to those of the terms according to the
cosine distance (ranks 1 and 2) are also more likely to be terms themselves than
words with less similar embeddings (ranks 10 and 20). Similar reasoning
applies to karst specific term-term pairs, where for English 30 were ranked first or
second and 22 were ranked tenth or twentieth. For Slovenian, 24 out of 36 were ranked
first or second and 12 were ranked tenth or twentieth.
When it comes to semantic similarity, better ranked related words were, unsurprisingly,
manually evaluated as semantically more similar. For example, the first ranked (most
similar to terms according to the cosine distance) English related words got an average
semantic similarity score12 of 4.040 out of ten, and the first ranked Slovenian related
words got an average semantic similarity score of 4.872 (cf. Table 8). These are larger than the
semantic similarity score averages of 2.610 and 3.064 for English and Slovenian related
words ranked as twentieth, respectively. Another interesting observation is that
the average semantic similarity score is the highest for English karst specific term-term
pairs (5.702) and much lower if all the term-word pairs are considered (3.325). If we
consider all term-term pairs, the average semantic similarity score is 4.710. The same
applies for Slovenian term-word pairs, with the semantic similarity score average rising
from 3.859 when all term-word pairs are considered, to 5.536 when only term-term
pairs are considered, and up to 6.722 when only karst specific term-term pairs are
considered.
12
The semantic similarity score for each related word is calculated as an average between the
two semantic similarity scores given by two evaluators.
We also measure the correlation between cosine distances and the semantic similarity
scores for term-word pairs using Pearson and Spearman correlation coefficients. The
correlation is generally low, the highest being measured for Slovenian Karst specific
term-term pairs where the Pearson correlation reached the value of 0.341 and Spearman
the value of 0.208. There was no correlation measured on Slovene term-term pairs and
surprisingly, a small negative Pearson correlation was measured on Slovenian karst
specific term-term pairs and a small negative Spearman correlation was measured on
English pairs which were labelled as terms.
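The two correlation coefficients can be computed as sketched below: Pearson on the raw values, and Spearman as Pearson on the ranks (with tied values sharing their average rank). The input lists are toy values rather than the evaluation data, the function names are hypothetical, and constant inputs (zero variance) are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: a monotone but non-linear relation.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
print(round(pearson(x, y), 4), round(spearman(x, y), 4))
```

The toy data show why the two coefficients can differ: a monotone but non-linear relation yields a perfect Spearman correlation and a Pearson correlation below one.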
We evaluate the approach first on the QUIKK gold standard, where 100% precision
and recall above 40% were obtained. Next, we also add to the QUIKK gold standard
the terms extracted using the statistical method and term embeddings that were
positively evaluated. The total list of 908 English terms and 391 Slovene terms was
input to the term alignment algorithm. The resulting list of 93 aligned term pairs was
manually evaluated. In this experiment, the precision was 77.42% (72 term alignments
out of 93 were correct), while the recall could not be calculated, as the gold standard
alignment was not available.
                  English                          Slovene
Rank              1       2       10      20       1       2       10      20
Distribution      50      50      50      50       47      47      47      47
Avg. sem. score   4.040   3.540   3.110   2.610    4.872   4.468   3.032   3.064
Terms             112                              69
Distribution      32      30      29      21       17      22      15      15
Karst terms       52                               36
Distribution      16      14      15      7        12      12      5       7
Not Terms         88                               119
Distribution      18      20      21      29       30      25      32      32
Table 8: English and Slovenian embeddings evaluation according to two criteria described in
Section 4.1.2. Avg. sem. score stands for the average of manually prescribed semantic
similarity scores for each term-word pair, Avg. cos. dist stands for the average cosine
distance, Pearson corr. is a Pearson correlation coefficient between the semantic similarity
score and cosine distance values and Spearman corr. is a Spearman correlation coefficient
between the semantic similarity score and cosine distance values.
As described in Section 4.2, karst terminology contains a considerable number of
cognates. See Table 9 for cognate values for Longest Common Substring Ratio, Longest
Common Subsequence Ratio, Dice Similarity, and Normalized Levenshtein Distance.
In total, 1,320 definition candidates were extracted for English, and 1,218 for Slovene.
Definition candidates were manually validated by domain experts following two criteria:
whether the sentence defines the concept, and whether the concept belongs to the
domain of karstology. To distinguish between definitions and non-definitions the
experts checked whether the sentence explains what the concept is, either by specifying
its hypernym and a set of distinguishing features (analytical), or by listing its hyponyms
(extensional), or by using another explanatory strategy (e.g., functional definitions).
The definition candidates were then assigned one of the following three categories:
Non-definitions (Example: The oldest rocks are the sandstones of Permian age,
which are only locally present.)
Table 9: Cognate scores for a sample of Slovene and English term pairs
As presented in Table 10, for English, out of 1,320 definition candidates 218 were
evaluated as karst definitions, and an additional 187 as broader domain definitions.
The precision of the definition extraction on karst domain is thus 0.16 for strictly karst
domain definitions, and 0.31 for broader domain definitions (incl. karst definitions).
For Slovene, there are 1,218 definition candidates, out of which 260 are karst definitions
and 166 are from broader domain. The precision for definition extraction for Slovene is
thus 0.21 for strictly karst domain, and 0.35 for karst and broader domain.
The karst definitions were then used by domain experts and linguists in the scope of
the TermFrame project for a fine-grained annotation process, following frame-based
terminology principles (Faber, 2015). The annotation principles and results are
presented in Vintar et al. (2019), where several annotation layers are proposed:
definition element layers (definiendum, definitor and genus); semantic categories (top
level concepts are landforms, processes, geomes, entities, instruments/methods) and
relations (16 relations, such as has_form, has_cause).
The English subcorpus yielded 80,564 triplets. Below we list selected examples of
relevant triplets that are closely related to the karst domain:
<Sinkholes located miles away from rivers, can flood, homes and businesses>
<Some collapse sinkholes, develop, where collapse of the cave roof reaches the
surface of the Earth>
The extracted triplets are analysed according to the most common relation patterns,
to estimate their potential for extending predefined definition patterns. From the
relation phrase part of the triplet, the verb is identified, showing the most frequent
verb structures. We remove all stopwords from the relation phrase using a general list
of 174 English stopwords. Table 11 lists the 20 most frequent verb structures found in the
24 processed documents. The results show that many karst-specific relations can be
detected (e.g., verbs related to different geological processes, such as occur, develop and
form) but still many general verbs are also frequent. The frequent relations from triplets
will be discussed in relation to the predefined set of relations used in definition frames
annotation (cf. Vintar et al., 2019).
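The counting of verb structures can be illustrated with a short sketch. The stopword set below is a small hypothetical stand-in for the 174-word list, no lemmatization is performed, and the last two triplets are invented to make the counts non-trivial; only the first two come from the examples above.

```python
from collections import Counter

# Hypothetical mini stopword list (the experiments use a list of 174 words).
STOPWORDS = {"a", "an", "the", "is", "are", "be", "been", "can", "has", "have",
             "in", "of", "to", "by", "as", "at"}


def relation_key(relation_phrase):
    """Lower-case the relation phrase and drop stopwords."""
    tokens = [t for t in relation_phrase.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)


def frequent_relations(triplets, top_n=20):
    """Count the stopword-free relation phrases and return the most common."""
    counts = Counter(relation_key(relation) for _, relation, _ in triplets)
    counts.pop("", None)  # relation phrases consisting only of stopwords
    return counts.most_common(top_n)


triplets = [
    ("Sinkholes located miles away from rivers", "can flood", "homes and businesses"),
    ("Some collapse sinkholes", "develop",
     "where collapse of the cave roof reaches the surface of the Earth"),
    ("caves", "are formed in", "limestone"),      # invented example
    ("dolines", "can develop in", "soluble rock"),  # invented example
]
top = frequent_relations(triplets, top_n=3)
print(top)
```

Because the sketch does not lemmatize, inflected forms such as "formed" and "form" would be counted separately; a lemmatizer would be needed to merge them.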
Figure 1: Visualization of a part of the triplet network. Prior to the visualization, relation
phrases were lemmatized and the triplets were filtered according to the short gold standard
list of Karst domain extended with an additional evaluated list of terms.
For visualization, after filtering the triplets by keeping only the ones where in a triplet
<argument1, relation phrase, argument2> the two arguments are karst terms13, we
construct a network where arguments are used as nodes and relation phrases as arcs.
A visualization of a part of the triplet network obtained using Biomine network
visualization tool (Eronen & Toivonen, 2012) is shown in Figure 1.
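The filtering and network construction step can be sketched as follows. Matching arguments to terms by exact, case-insensitive string comparison is an assumption of this sketch, as are the function name and the toy triplets and term list; the output is a plain adjacency structure rather than input for any particular visualization tool.

```python
from collections import defaultdict


def build_network(triplets, terms):
    """Keep triplets whose two arguments are both known terms.

    Returns {argument1: [(relation phrase, argument2), ...]}, i.e. arguments
    are nodes and relation phrases label the arcs between them.
    """
    vocabulary = {t.lower() for t in terms}
    network = defaultdict(list)
    for arg1, relation, arg2 in triplets:
        source, target = arg1.strip().lower(), arg2.strip().lower()
        if source in vocabulary and target in vocabulary:
            network[source].append((relation, target))
    return dict(network)


# Invented triplets and term list for illustration.
toy_triplets = [
    ("Dolines", "form in", "limestone"),
    ("Caves", "contain", "speleothems"),
    ("Researchers", "visited", "caves"),
]
toy_terms = ["dolines", "limestone", "caves", "speleothems"]
network = build_network(toy_triplets, toy_terms)
print(network)
```

The third toy triplet is discarded because "researchers" is not in the term list, which mirrors the requirement that both arguments be karst terms.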
We apply the proposed pipeline to a corpus of karst specialized texts. The main value
of the evaluation steps of term and definition extraction is to obtain new gold standard
karst knowledge resources that will be used in the scope of the TermFrame project for
fine grained analysis and novel visual representation corresponding to the cognitive
shifts in recent terminology science approaches. On the other hand, we believe that the
evaluation of word embeddings opens new perspectives for e-lexicography and
terminography, as it shows that popular techniques from natural language processing
are relatively successful for automatically extending the gold standard term lists (cca.
half of the English and one third of the Slovene related words being valid terms). The evaluation also
shows that the semantic similarity score is higher for the closest matching words
(considering cosine similarity between embeddings) than for the lower ranked words,
which suggests that embeddings do in fact manage to capture some semantic relations
despite a relatively small training corpus. On the other hand, the correlation between
cosine similarity and manual similarity score is weak, which might indicate high
variance in cosine similarity for related words for different terms. We believe that
semantic information has a huge potential for contributing to the organization of term
bases and visually interesting knowledge maps. In the same line, we illustrate how
triplet extraction in combination with term matching can serve as a knowledge
representation module used for visualization.
13
QUIKK terms and manually evaluated terms from Section 5.1.1.
In future work, we will consider extending the corpus by using web-crawling techniques.
Next, our aim is to integrate the pipeline into a set of services that support users in the
knowledge extraction process, in populating term bases, and in knowledge
visualization. We believe that such tools will contribute to a better understanding of
similarities and differences in terminological expression between languages, and support
representations reflecting dynamic, culture- and language-specific knowledge.
7. Acknowledgements
The work was supported by the Slovenian Research Agency through the core research
programme (P2-0103) and research project Terminology and knowledge frames across
languages (J6-9372). This work was supported also by the EU Horizon 2020 research
and innovation programme, Grant No. 825153, EMBEDDIA (Cross-Lingual
Embeddings for Less-Represented Languages in European News Media). The results of
this publication reflect only the authors’ views and the EC is not responsible for any
use that may be made of the information it contains. We would also like to thank Š.
Vintar, U. Stepišnik, D. Miljković and other members of the TermFrame project for
their collaboration.
8. References
Aker, A., Paramita, M. & Gaizauskas, R. (2013). Extracting bilingual terminologies
from comparable corpora. In Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers),
pp. 402–411.
Amjadian, E., Inkpen, D., Paribakht, T. & Faez, F. (2016). Local-Global Vectors to
Improve Unigram Terminology Extraction. In Proceedings of the 5th International
Workshop on Computational Terminology (Computerm2016), pp. 2–11.
Baisa, V., Ulipová, B. & Cukr, M. (2015). Bilingual terminology extraction in Sketch
Engine. In 9th Workshop on Recent Advances in Slavonic Natural Language
Processing, pp. 61–67.
Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. (2017). Enriching Word Vectors
with Subword Information. Transactions of the Association for Computational
Linguistics, 5, pp. 135–146.
Cabré, M.T. (1999). Terminology: Theory, Methods, and Application. Amsterdam, The
Netherlands and Philadelphia, USA: John Benjamins Publishing.
Cabré Castellví, M. T. (2003). Theories of Terminology: Their Description, Prescription
and Explanation. Terminology, 9(2), pp. 163–199.
Ciaramita, M., Gangemi, A., Ratsch, E., Šaric, J. & Rojas, I. (2005). Unsupervised
Learning of Semantic Relations Between Concepts of a Molecular Biology
Ontology. In Proceedings of the Nineteenth International Joint Conference on
Artificial Intelligence (IJCAI’05), pp. 659–664.
Davidov, D. & Rappoport, A. (2006). Efficient Unsupervised Discovery of Word
Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). Efficient Estimation of Word
Representations in Vector Space. In Proceedings of the International Conference
on Learning Representations 2013.
Miljković, D., Kralj, J., Stepišnik, U. & Pollak, S. (2019). Communities of related terms
in Karst terminology co-occurrence network. In I. Kosem et al. (eds.) Proceedings
of eLex 2019, pp. 357–373.
Miljković, D., Stare, T., Mozetič, I., Podpečan, V., Petek, M., Witek, K., Dermastia,
M., Lavrač, N. & Gruden, K. (2012). Signalling Network Construction for
Modelling Plant Defence Response. PLOS ONE, 7(12), pp. 1–18.
https://doi.org/10.1371/journal.pone.0051822.
Myking, J. (2007). No Fixed Boundaries. In A. Bassey (ed.) Indeterminacy in
Terminology and LSP: Studies in Honour of Heribert Picht, chapter 6.
Amsterdam, The Netherlands and Philadelphia, USA: John Benjamins
Publishing, pp. 73–91.
Nastase, V., Nakov, P., Séaghdha, D. Ó. & Szpakowicz, S. (2013). Semantic Relations
Between Nominals. In G. Hirst (ed.) Synthesis Lectures on Human Language
Technologies. London: Morgan & Claypool Publishers, pp. 1–119.
Navigli, R. & Velardi, P. (2010). Learning Word-Class Lattices for Definition and
Hypernym Extraction. In Proceedings of the Forty-Eighth Annual Meeting of the
Association for Computational Linguistics. Uppsala, Sweden, pp. 1318–1327.
Pearson, J. (1998). Terms in Context. In E. Tognini-Bonelli & W. Teubert (eds.) SCL
Series, Vol. 1. Amsterdam, The Netherlands and Philadelphia, USA: John
Benjamins Publishing.
Pollak, S., Vavpetič, A., Kranjc, J., Lavrač, N. & Vintar, Š. (2012). NLP workflow
for on-line definition extraction from English and Slovene text corpora. In J.
Jancsary (ed.) Proceedings of KONVENS 2012. ÖGAI, pp. 53–60. Main track:
oral presentations.
Repar, A., Martinc, M. & Pollak, S. (2018). Machine Learning Approach to Bilingual
Terminology Alignment: Reimplementation and Adaptation. In A. Branco, N.
Calzolari & K. Choukri (eds.) Proceedings of the Eleventh International
Conference on Language Resources and Evaluation (LREC 2018). Paris, France:
European Language Resources Association (ELRA).
Repar, A., Podpečan, V., Vavpetič, A., Lavrač, N. & Pollak, S. (2019). TermEnsembler:
An ensemble learning approach to bilingual term extraction and alignment.
Terminology, 25(1).
Roller, S., Kiela, D. & Nickel, M. (2018). Hearst Patterns Revisited: Automatic
Hypernym Detection from Large Text Corpora. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers). Melbourne, Australia: Association for Computational Linguistics, pp.
358–363. URL https://www.aclweb.org/anthology/P18-2057.
Sclano, F. & Velardi, P. (2007). TermExtractor: a Web Application to Learn the
Common Terminology of Interest Groups and Research Communities. In
Proceedings of the 9th Conf on Terminology and Artificial Intelligence TIA 2007,
pp. 8–9.
Svensen, B. (1993). Practical Lexicography: Principles and Methods of Dictionary
Making. Oxford University Press.
Vintar, Š. (2010). Bilingual term recognition revisited: The bag-of-equivalents term
alignment approach and its evaluation. Terminology, 16(2), pp. 141–158.
Vintar, Š. & Grčić-Simeunović, L. (2017). Definition frames as language-dependent
models of knowledge transfer. Fachsprache: internationale Zeitschrift für
Fachsprachenforschung, -didaktik und Terminologie, 39(1/2), pp. 43–58.
Vintar, Š., Saksida, A., Stepišnik, U. & Vrtovec, K. (2019). Knowledge frames in
karstology: the TermFrame approach to extract knowledge structures from
definitions. In I. Kosem et al. (eds.) Proceedings of eLex 2019, pp. 305–318.
Zhang, Z., Gao, J. & Ciravegna, F. (2017). SemRe-Rank: Incorporating Semantic
Relatedness to Improve Automatic Term Extraction Using Personalized
PageRank. arXiv preprint arXiv:1711.03373.
This work is licensed under the Creative Commons Attribution ShareAlike 4.0
International License.
http://creativecommons.org/licenses/by-sa/4.0/