Cyber2 Namedentity
fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.2984582, IEEE Access
ABSTRACT Cybersecurity named entity recognition is an important part of threat information extraction
from large-scale unstructured text collections in many cybersecurity applications. Most existing security
entity recognition studies and systems use regular-expression matching strategies or machine learning
algorithms. Because security named entities are peculiar and complex, these models ignore the
characteristics of security data and the correlation between entities. Therefore, through an in-depth study
of the characteristics of security entities, we propose a novel security named entity recognition model based
on regular expressions and a known-entity dictionary as well as conditional random fields (CRF) combined
with four feature templates. This model is named RDF-CRF. The rule-based expressions can match security
entities with good accuracy in simpler situations, the known-entity dictionary can extract common and
specific security entities, and the CRF-based extractor leverages the entities identified by the rule-based
and dictionary-based extractors to further improve recognition performance. To demonstrate the
effectiveness of the proposed model, extensive experiments are performed on a security text dataset
collected from public security websites. The experimental results show that RDF-CRF can achieve better
performance than state-of-the-art methods.
INDEX TERMS Cybersecurity, named entity recognition, regular expression, known-entity dictionary,
conditional random fields.
VOLUME 4, 2020 1
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
also a variety of efforts studying different methods for the task, which can be divided into two classes: rule-based and machine learning-based.
The rule-based methods can extract named entities with good accuracy in a simple manner when the to-be-extracted information follows regular patterns, such as email addresses, host IPs, and Common Vulnerabilities and Exposures (CVE) identifiers [17], [18]. However, these methods are not suitable for complex situations in which the to-be-extracted entity has many variations or comes from irregularly structured text, which is more in line with the actual situation on the network. Meanwhile, these methods have difficulty identifying new named entities. Moreover, designing rule-based systems is very time-consuming and requires expert domain knowledge. Therefore, the rule-based methods lead to unsatisfactory results for cybersecurity named entity identification in complex situations. Taking into consideration the good performance and simplicity of rule-based methods and the regular patterns of some security entities such as IP and CVE, in this paper we also introduce rule-based templates to extract cybersecurity named entities.
In these more complex situations, machine learning-based methods outperform rule-based ones by tuning general algorithms with existing data. Meanwhile, they can identify new entities from the training corpus and are suitable for widespread applications. In recent years, many approaches for security-relevant named entity recognition (NER) from unstructured text documents have been proposed from different perspectives, including conditional random fields (CRF) [19], [20], support vector machines (SVM) [16], expectation regularization [14], bootstrapping algorithms [21], maximum entropy models (ME) [22], and long short-term memory (LSTM) [23], [24]. However, all of the above machine learning methods fail to yield satisfactory results for identifying cybersecurity-related concepts and entities from unstructured cybersecurity text collections. Through analyzing these texts, we find that existing entity recognition techniques are not suitable for the task. Although named entity recognition technology has gradually matured in the general domain, when it is directly applied to a professional domain, it usually fails to produce satisfactory results. For example, in the field of biomedicine, Dongliang et al. [25] illustrate that, although the traditional method is easy to use, the assumptions it relies on do not fully reflect the actual situation of a large number of complex biological texts, so the accuracy is relatively poor. The same problem also occurs in the field of cybersecurity. This is because cybersecurity texts contain a lot of security vocabulary, such as file names, hash values, and even attack tools. On the other hand, these models need to manually explore a wide range of features and ignore the correlation of entities, which is not amenable to large-scale applications. The rules and dictionaries constructed in this paper, as well as the features extracted for training the model, are obtained through observation of and training on a corpus in the security field, so they are generally applicable to tasks in this field. The experimental results of the article indicate that with the expansion and improvement of the corpus in follow-up work, the accuracy of the recognized professional vocabulary will also be significantly improved.
In this paper, we propose a novel security entity recognition model, named RDF-CRF, based on conditional random fields combined with four feature templates and incorporating regular expressions and a known-entity dictionary for preprocessing. Specifically, the rule-based approach first extracts named entities with good accuracy in simpler situations, and then the dictionary-based method matches common and specific security entities. After matching by the rule-based and dictionary-based methods, the word sequence is more accurately matched to the feature templates by considering contextual information, so that the CRF-based model can further improve recognition performance. To demonstrate the effectiveness of the proposed model, extensive experiments are performed on a security dataset collected from security websites. The experimental results show that the proposed method can achieve better performance than state-of-the-art methods.
The contributions of this paper are summarized as follows.
• We propose a novel security named entity recognition model by using a combination of regular expressions, a known-entity dictionary and conditional random fields. In the proposed model, the entities identified by the rule-based and dictionary-based approaches can further assist the CRF-based model in improving the performance of cybersecurity entity recognition.
• We also design four feature templates for unstructured security entity recognition, including atomic features, combination features, marker features, and semantic features, to filter the feature vectors of the current word for conditional random fields.
• Various experiments are conducted on a real-world cybersecurity dataset, and the results demonstrate that our proposed model can achieve better prediction performance than the state-of-the-art methods.
The remainder of this paper is organized as follows: Section II reviews related work. Section III describes the proposed model and provides an efficient optimization method for the solution. We empirically evaluate our method on a real-world dataset in Section IV, including a comparison to competing methods. We conclude the paper in Section V.
II. RELATED WORK
Studies on security named entity recognition fall into two categories: rule-based and machine learning-based approaches. Next, we briefly review these works.
A. RULE-BASED ENTITY EXTRACTION METHODS
The rule-based matching methods locate and extract information by constructing regular expressions or other heuristic rules. For example, Liao et al. [17] propose a fully automated Indicators of Compromise (IOC) [26] extraction tool, named iACE. iACE uses a set of regular expressions and common context terms extracted from iocterms to identify
the IOC tokens, such as IP addresses and MD5 strings. Balduccini [18] designs a set of regular expressions for matching each entity contained in files of cyber assets. However, due to the unstructured characteristics and diversity of many security entities, it is very difficult to construct rules for all these types of entities. As a result, the heuristic strategy is expensive and impractical in large-scale applications.
B. MACHINE LEARNING-BASED ENTITY EXTRACTION METHODS
The machine learning-based approaches use a training corpus to construct statistical learning models, which can realize automatic information extraction. Many efforts have been made in the task of cybersecurity named entity recognition. For instance, Lal et al. [20] utilize the conditional random fields algorithm to extract cybersecurity-related concepts and entities by using a set of features from manually annotated security texts. Joshi et al. [19] use conditional random fields to identify cybersecurity-related entities, concepts and relations from the National Vulnerability Database and from text sources. Deliu et al. [16] extract cyber threat intelligence from hacker forums based on support vector machines and convolutional neural networks. Jones et al. [21] implement a bootstrapping algorithm for extracting security entities and their relationships from security texts. Ritter et al. [14] propose a weakly supervised seed-based approach to event extraction from Twitter. Mittal et al. [1] analyze tweets about cybersecurity and issue timely threat alerts to security analysts. Weerawardhana et al. [27] present machine learning-based and part-of-speech tagging approaches to information extraction from online vulnerability databases. Bridges et al. [22] propose a Maximum Entropy Model trained on security corpora and achieve high performance in the identification and classification of appropriate entities. Gasmi et al. [23] combine the advantages of Long Short-Term Memory (LSTM) and Conditional Random Field (CRF) methods to improve the accuracy of NER extraction compared with the traditional purely statistical CRF method. Furthermore, Qin et al. [24] propose a combined neural network model called FT-CNN-BiLSTM-CRF. When training the models, they use feature templates to extract context features, as we do, and achieve an F-score of 0.86 on their network security dataset.
In conclusion, although the above-mentioned methods work well to some extent by incorporating one or two of the three components (i.e., rule-based, dictionary-based, and machine learning-based methods), none of them integrates all the information from these three components into a unified learning framework for cybersecurity named entity recognition, resulting in unsatisfactory results. To the best of our knowledge, there is still a lack of cybersecurity named entity recognition methods that extract entities from security texts at a high level of precision.
III. THE PROPOSED MODEL
In this section, we present a novel ensemble learning approach for security entity extraction from documents. The proposed model consists of a rule-based extractor, a dictionary-based extractor and a CRF-based extractor. The rule-based extractor is designed based on regular expressions, the dictionary-based extractor includes known-entity lists, and the CRF-based extractor leverages the entities identified by the rule-based and dictionary-based extractors to improve recognition performance. The overall architecture of the model is illustrated in Figure 1.
A. RULE-BASED EXTRACTOR
Many entities in the cybersecurity domain follow regular patterns. Through a large number of observations of unstructured security texts, we find that a URL starts with the string http or https, an email address contains the symbol @ in the middle of a string, and a CVE identifier follows a specific naming format. Hence, these security entities can be extracted by regular expression matching. According to the naming rules of specific security entities, we design a template of regular expression rules, as shown in Table 1. The rule-based extractor has the properties of high precision and high recall as well as scalability.
TABLE 1: The example of regular expression
Entity Types   Regular Expression
Filename       [A-Za-z0-9-_\.]+\.(txt|php|exe|dll|bat|sys|htm|html|js|jar|jpg|png|vb|scr|pif|chm|zip|rar|cab|pdf|doc|docx|ppt|pptx|xls|xlsx|swf|gif)
Filepath       [a-zA-Z]:(\\([0-9a-zA-Z]+)
Email          [a-z][_a-z0-9-.]+@[a-z0-9-]+\.[a-z]+
SHA1           [a-f0-9]{40}|[A-F0-9]{40}
SHA256         [a-f0-9]{64}|[A-F0-9]{64}
CVE            CVE-[0-9]{4}-[0-9]{4,6}
URL            (https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%?=~_|]
IPv4           (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(/([0-2][0-9]|3[0-2]|[0-9]))?
B. DICTIONARY-BASED EXTRACTOR
As far as we know, many existing named entities are well-known concepts in the cybersecurity domain, including large security companies (e.g., Cisco, FireEye, and IBM), software products (e.g., operating systems, firewalls, and anti-virus software) and hacker groups (e.g., OurMine, Anonymous, and DCLeaks). Based on these observations, we also design a known-entity dictionary including various entities. The entities can be categorized into the following categories: company, hardware, software, attack means, operating system, protocol, hacker groups and so on.
C. CONDITIONAL RANDOM FIELDS-BASED EXTRACTOR
The CRF model can further extract the undiscovered entities on the basis of the entities identified by the rule-based extractor and
dictionary-based extractor. We propose four feature templates to filter the feature vectors of the current word for the CRF model.
FIGURE 1: Overall architecture of security entity recognition model. Our proposed framework consists of three components: (1) rule-based extractor, (2) dictionary-based extractor and (3) CRF-based extractor.
1) Atomic Features Template
A simple but powerful method is to use tokenization and a Part-Of-Speech (POS) tagger for named entity recognition. Because they cannot be decomposed further, we treat the part-of-speech and morphological features of words as atomic features. Table 2 summarizes the detailed information of the atomic features. According to Table 2, when the current word is "Google", which is a standalone organization word, the corresponding feature function can be generated as follows:
\[
f(x, y) =
\begin{cases}
1, & \text{if } \mathrm{Word}(0) = \text{``Google''} \text{ and } y = \mathrm{Org} \\
0, & \text{otherwise}
\end{cases}
\tag{1}
\]
where the variable y represents the label of the current word. The template describes the individual morphology or part of speech of each word in the current word and its context window, but it cannot adequately describe the complex phenomena of language.
TABLE 2: The template of atomic features
Atomic Features   Description
Word(0)           Current word
Word(-1)          The first word on the left of current word
Word(-2)          The second word on the left of current word
Word(1)           The first word on the right of current word
Word(2)           The second word on the right of current word
POS(0)            The part of speech of current word
POS(-1)           The part of speech of the first word on the left of current word
POS(-2)           The part of speech of the second word on the left of current word
POS(1)            The part of speech of the first word on the right of current word
POS(2)            The part of speech of the second word on the right of current word
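As context for the entity tags that the CRF features rely on, the pre-matching performed by the rule-based and dictionary-based extractors (Sections III-A and III-B) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the patterns are a subset transcribed from Table 1, while the word-boundary anchors (`\b`), the rule ordering, the overlap check, the toy dictionary `KNOWN_ENTITIES`, and the function name `prematch` are assumptions introduced here.

```python
import re

# Subset of the Table 1 patterns. The \b anchors and the ordering (longer hash
# first) are assumptions added so that a SHA256 string is not also cut into a SHA1.
RULES = [
    ("CVE", re.compile(r"CVE-[0-9]{4}-[0-9]{4,6}")),
    ("SHA256", re.compile(r"\b(?:[a-f0-9]{64}|[A-F0-9]{64})\b")),
    ("SHA1", re.compile(r"\b(?:[a-f0-9]{40}|[A-F0-9]{40})\b")),
    # Same character set as the Table 1 class, with '-' moved to the end.
    ("Email", re.compile(r"[a-z][_a-z0-9.-]+@[a-z0-9-]+\.[a-z]+")),
    ("IPv4", re.compile(
        r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
        r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b")),
]

# Hypothetical known-entity dictionary; names are examples from Section III-B,
# labels are class names from Table 6.
KNOWN_ENTITIES = {
    "Cisco": "Organization",
    "FireEye": "Organization",
    "OurMine": "Hacker_Group",
}


def prematch(text):
    """Return sorted, non-overlapping (start, end, label, surface) spans."""
    spans = []

    def overlaps(start, end):
        return any(start < e and s < end for s, e, _, _ in spans)

    # Step 1: rule-based extractor (regular expressions).
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            if not overlaps(m.start(), m.end()):
                spans.append((m.start(), m.end(), label, m.group()))

    # Step 2: dictionary-based extractor (exact, whole-word lookup).
    for name, label in KNOWN_ENTITIES.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            if not overlaps(m.start(), m.end()):
                spans.append((m.start(), m.end(), label, m.group()))

    return sorted(spans)
```

Calling `prematch` on a sentence returns the matched spans in reading order; labels of this kind could then be supplied to the CRF stage as the entity tags of neighbouring words.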
TABLE 3: The template of combination features
Combination Features   Description
Word(0)+POS(0)         Current word and part of speech
Word(0)+Word(-1)       Current word and the first word on the left of current word
Word(0)+Word(1)        Current word and the first word on the right of current word
Word(-1)+POS(0)        The first word on the left of current word and part of speech of current word
Word(0)+POS(1)         Current word and part of speech of the first word on the right of current word
Word(-1)+POS(-1)       The first word and part of speech on the left of current word
Word(-1)+Word(-2)      The first word and the second word on the left of current word
Word(-2)+POS(-2)       The second word and part of speech on the left of current word
Word(1)+Word(2)        The first word and the second word on the right of current word
Word(-1)+Word(1)       The first word on the left of current word and the first word on the right of current word
Word(1)+POS(0)         The first word and part of speech on the right of current word
POS(-2)+POS(-1)        The part of speech of the second word and the first word on the left of current word
POS(-2)+POS(0)         The part of speech of current word and the part of speech of the second word on the left of current word
POS(-1)+POS(0)         The part of speech of the first word on the left of current word and the part of speech of the current word
POS(-1)+POS(1)         The part of speech of the first word on the left of current word and the part of speech of the first word on the right
POS(0)+POS(1)          The part of speech of current word and the part of speech of the first word on the right
POS(0)+POS(2)          The part of speech of current word and the second word on the right of current word
POS(1)+POS(2)          The part of speech of the first word and the second word on the right of current word
TABLE 4: The template of marker features
Marker Features          Description
Tag(-1)                  Entity tag of first word on the left of current word
Tag(-2)                  Entity tag of second word on the left of current word
Tag(-1)+Tag(-2)          Entity tags of the first word and the second word on the left of current word
POS(0)+Tag(-1)           The part of speech of current word and entity tag of the first word on the left of current word
POS(0)+Tag(-2)           The part of speech of current word and entity tag of second word on the left of current word
POS(0)+Tag(1)            The part of speech of current word and entity tag of first word on the right of current word
Word(0)+Tag(-1)          Current word and entity tag of first word on the left of current word
Word(0)+Tag(-2)          Current word and entity tag of second word on the left of current word
Word(0)+Tag(1)           Current word and entity tag of first word on the right of current word
POS(0)+Tag(-1)+Tag(-2)   The part of speech of current word and entity tags of first word and second word on the left of current word
Tag(-1)+POS(0)+POS(1)    Entity tag of first word on the left of current word, part of speech of current word, and part of speech of first word on the right of current word
Tag(-1)+POS(-1)+POS(0)   Entity tag of first word on the left of current word, part of speech of first word on the left of current word, and part of speech of current word
Tag(-1)+POS(0)+Word(0)   Entity tag of first word on the left of current word, part of speech of current word, and current word
Tag(-2)+Tag(-1)+POS(0)   Entity tags of first word and second word on the left of current word and part of speech of current word
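The atomic, combination, and marker templates of Tables 2 to 4 can be read as functions of a token position. The sketch below builds a feature dictionary for one position from a small subset of those templates. It is an illustration under assumptions: the function name `token_features`, the `"<PAD>"` and `"O"` padding values, and the `"|"` separator for combined features are hypothetical choices, and `entity_tags` is assumed to hold the labels produced by rule and dictionary pre-matching.

```python
def token_features(words, pos_tags, entity_tags, i):
    """Build features for position i from a subset of the Table 2-4 templates."""

    def at(seq, j, pad):
        # Out-of-range context positions fall back to a padding value (assumption).
        return seq[j] if 0 <= j < len(seq) else pad

    def w(k):
        return at(words, i + k, "<PAD>")

    def p(k):
        return at(pos_tags, i + k, "<PAD>")

    def t(k):
        return at(entity_tags, i + k, "O")

    feats = {}

    # Atomic features (Table 2): words and POS tags in a window of +/-2.
    for k in (-2, -1, 0, 1, 2):
        feats[f"Word({k})"] = w(k)
        feats[f"POS({k})"] = p(k)

    # A few combination features (Table 3).
    feats["Word(0)+POS(0)"] = f"{w(0)}|{p(0)}"
    feats["Word(0)+Word(-1)"] = f"{w(0)}|{w(-1)}"
    feats["POS(-1)+POS(0)"] = f"{p(-1)}|{p(0)}"

    # A few marker features (Table 4), using tags of neighbouring words.
    feats["Tag(-1)"] = t(-1)
    feats["Tag(-1)+Tag(-2)"] = f"{t(-1)}|{t(-2)}"
    feats["POS(0)+Tag(-1)"] = f"{p(0)}|{t(-1)}"
    return feats
```

A dictionary of string-valued features per token is the input shape accepted by common CRF toolkits, which is why it is used here; the paper does not state which toolkit or encoding the authors used.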
TABLE 5: The template of semantic features
Semantic Features              Description
CUR_PER_FRIST                  Whether the current word is a name
CUR_ORG_SUF                    Whether the current word is an organization name suffix
NEXT_ORG_SUF                   Whether the two words on the right side of current word contain organization suffix
LOC_INDICATION                 Whether the left or right words of current word contain place indicators
PER_INDICATION                 Whether the left or right words of current word contain name indication
ORG_INDICATION                 Whether the left or right words of current word contain organization indicator
CUR_LOC                        Whether the current word is a common place name
CUR_ORG                        Whether the current word is a common organization name
CUR_PER_NAME                   Whether the current word is a common name
CUR_LOC+LOC_INDICATION         Whether the current word is a common place name and whether the two words around the current word contain place name indicators
CUR_PER_FRIST+PER_INDICATION   Whether the current word is a Chinese surname and the left and right words contain a person name
Tag(-1)+CUR_ORG_SUF            The first word on the left side of current word is the named entity and the current word is the institutional feature suffix
Tag(-1)+CUR_LOC                The first word on the left side of current word is entity and the current word is the place name.
Algorithm 1 The learning algorithm for feature selection
Require: cybersecurity text corpus D, the library 𝒯 of the above four feature templates
Ensure: feature set F
choose a template T from the template library 𝒯;
read a word w from the vocabulary V generated by the cybersecurity text corpus D;
while T ∈ 𝒯 do
    while w ∈ V do
        match current template T and current word w, and then generate a feature f
        if f ∈ F then
            increment count for f
        else
            add f to F
        end if
    end while
end while
return F
and computational efficiency, we use the threshold method, and the threshold is set to 2.
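Algorithm 1 can be sketched in Python as follows. This is a hedged reading of the pseudocode rather than the authors' code: templates are modelled as hypothetical callables that map a token position to a feature (or `None` when the context is missing), the loop runs over token positions instead of vocabulary entries so that context templates have neighbours to look at, and the threshold of 2 is interpreted as "keep features seen at least twice", which the text does not spell out.

```python
from collections import Counter


def word0(words, i):
    """Hypothetical template for Word(0): the current word."""
    return ("Word(0)", words[i])


def word_prev(words, i):
    """Hypothetical template for Word(-1): the word to the left, if any."""
    return ("Word(-1)", words[i - 1]) if i > 0 else None


def select_features(sentences, templates, threshold=2):
    """Count every feature generated by every template; keep the frequent ones."""
    counts = Counter()
    for template in templates:            # outer loop: template library
        for words in sentences:
            for i in range(len(words)):   # inner loop: word positions
                feature = template(words, i)
                if feature is not None:
                    counts[feature] += 1  # first sight adds it, later sights increment
    # Threshold method: drop features seen fewer than `threshold` times (assumed >=).
    return {f for f, c in counts.items() if c >= threshold}
```

With `threshold=1` the function returns every generated feature, which corresponds to the feature set F of Algorithm 1 before any pruning.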
be found by using gradient descent on parameters λ as
\[
\frac{\partial L}{\partial \lambda}
= -\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{n} f_i(x^k, i, y_{i-1}^k, y_i^k)
+ \sum_{k=1}^{m} p\{y \mid x^k, \lambda\}\, f_i(x^k, i, y_{i-1}, y_i)
\tag{7}
\]
CRF estimates the global probability and establishes a unified probability model over all states. Hence, CRF is a relatively good model for named entity recognition.
IV. EXPERIMENTS
A. DATA PREPARATION
Unlike named entity recognition in the general domain, cybersecurity lacks large-scale publicly available datasets and annotation methods. Therefore, we construct a standard ground-truth dataset through the following process. First, we collect a large amount of security text from official security forums 1, software vendors' bulletin boards 2, and various blog articles. Second, we choose the collaborative text annotation system brat 3, an open-source Web annotation tool that can annotate a large number of texts online. Third, the annotators using brat are domain experts who have rich knowledge of cybersecurity. Each document is annotated by at least three users in turn. The ground-truth class labels are selected by majority vote. Finally, about 14,000 unstructured texts from the cybersecurity domain have been annotated, of which 70% form the training set and the remaining 30% the test set. We use the constructed dataset in the following experiments. The statistics of the dataset are summarized in Table 6.
1 http://www.cert.org.cn/
2 https://www.anquanke.com/
3 http://brat.nlplab.org/
TABLE 6: Statistics of the constructed dataset
Class       Number   Class          Number
CVE         68       Product        1402
AS          8        Organization   3047
Cert        10       Person         1372
Host        14       Place          518
Domain      25       Threat         21
Email       17       Hacker_Group   62
MD5         31       Attack         19
Registry    22       Software       427
SHA1        15       Protocol       25
SHA256      18       Conference     14
URL         42       Report         80
IP          24       File_Path      43
File_Name   71       Event          18
B. BASELINE METHODS
To select a model that balances accuracy and performance, we analyze the following models after applying the same rule and dictionary matching preprocessing to our security samples.
• HMM: The Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobservable (i.e., hidden) states. The hidden Markov model can be represented as the simplest dynamic Bayesian network [28].
• MEMM: The Maximum Entropy Markov Model (MEMM) makes use of the HMM framework to predict sequence labels given an observation sequence, but incorporates multinomial logistic regression (aka maximum entropy), which gives freedom in the type and number of features one can extract from the observation sequence [22].
• CRF: Conditional Random Fields (CRF) is a discriminative probabilistic graphical model. It uses contextual information from previous labels, thus increasing the amount of information the model has to make a good prediction [20].
The neural network method has recently become a major topic in the field of natural language processing (NLP), but its training complexity is often high, and it is generally used to solve complex and high-level tasks, such as machine translation and text understanding. At the expense of some complexity and computational speed, some researchers also use long short-term memory (LSTM) and its variants to extract cybersecurity entities, such as LSTM-CRF [23] and FT-CNN-BiLSTM-CRF [24], and the results show that such models have a certain degree of recognition ability on their datasets.
So we also compare the effectiveness of our proposed model with the following state-of-the-art baseline methods on the same dataset.
• LSTM-CRF: LSTM is a special recurrent neural network. The advantage of LSTM is that it captures the relationships between samples over a long time sequence, and BiLSTM can more effectively acquire the features before and after the input sentence. This model extracts features with the LSTM and predicts entity types with the CRF [23].
• FT-CNN-BiLSTM-CRF: In this model, a Convolutional Neural Network (CNN) is used to extract character-level features and the BiLSTM is used to capture long-term contextual features. Then CRF is applied for learning and inference. Furthermore, it adds feature templates and extracts contextual features of the security entity through them [24].
For the HMM, MEMM, and CRF models, we use the default recommended settings. For the LSTM-CRF and FT-CNN-BiLSTM-CRF models, we set the word embedding layers to 64 and the word embedding dimensions to 100. Meanwhile, for the CNN and LSTM models, we set batch_size to 32, Dropout to 0.5, learning rate to 0.01, and gradient to 5 in the following comparison experiments.
0.4
0.3
corpus. Hence, by incorporating regular expression, know-
entity dictionary and CRF model, our proposed model indeed
0.2
Precision
perform well on the cybersecurity entity recognition task.
0.1 Recall
F1
0 2) Comparisons with the state-of-the-art methods
Average Accuracy Overall Accuracy
In order to evaluate and compare the effectiveness, we con-
duct an experiment to compare our method to the latest
FIGURE 2: Performance of our proposed model on the tasks
methods in cybersecurity extraction entities mentioned in
of average classification and overall classification.
the last two years of papers on the same dataset. The first
is LSTM-CRF [23], and the second is FT-CNN-BiLSTM-
C. EVALUATION METRICS CRF [24]. The comparative experiment results are shown in
In this paper, we use three representative metrics to evaluate Table 7.
the performance: Precision, Recall, and F1-measure (F1). A As we can see, the performance metrics show that the
greater Precision, Recall, and F1-measure values mean better results for RDF-CRF are better than other state-of-the-art
performance. Without loss of generality, we split randomly methods. Even though the recall score of the FT-CNN-
with 80% as the training set and 20% as the testing set. BiLSTM-CRF is close to ours, its precision still have some
We repeat each experiment 5 times and report the average room for improvement. One of the reasons is that there are a
performance. large number of simple but regular entities in cybersecurity
texts, such as IP, domain, etc., and the use of complex model
D. PERFORMANCE AND ANALYSIS methods for these entities will reduce its precision. At the
1) The performance of cybersecurity entity recognition same time, due to the use of neural network for feature
The task of entity recognition is divided into two categories: extraction, the computational complexity of the model will
(1) be or not be a entity, which is a binary classification be greatly increased. The final results prove that in the case
task; (2) which entity class it belongs to, which is a multi-classification task. To this end, we conduct extensive experiments on these two tasks using the cybersecurity dataset. The experimental results are shown in Figure 2 and Table 8.
From Figure 2, we can see that the overall accuracy of deciding whether an entity is present is higher than the average accuracy of entity class recognition. We argue that this phenomenon may be caused by confusion in the process of entity classification, such as Person being classified as Organization or Threat being classified as Hacker_Group. We can also see that the binary classification accuracy is only 6% higher than that of multi-classification, which suggests that our proposed model has good robustness.
On the other hand, from Table 8 we can draw the following conclusions: (1) our proposed model achieves relatively high performance on most entity classes; (2) regular-expression-based entities such as CVE and Email can be extracted with the highest accuracy, which demonstrates that the rule-based extractor is a good strategy; (3) dictionary-based entities such as Product and Organization have relatively high accuracy, although the improvements are sometimes not statistically significant due to the lack of specific entities; (4) the CRF-based extractor obtains poor precision and recall because our dataset contains only a small number of instances of these entities. This problem can be solved given a larger amount of cybersecurity
of entity pre-matching using rules and dictionaries, the CRF model with feature templates can be used to obtain better recognition results at lower complexity.
3) Comparisons of different recognition models
In this section, we mainly compare the performance of cybersecurity entity recognition under the Hidden Markov Model (HMM), the Maximum Entropy Markov Model (MEMM) and Conditional Random Fields (CRF). The entity classes used in this comparison are those recognized only by the statistical model, namely Organization, Person, Report, Threat, Event, Conference and Hacker_Group. The experimental results are shown in Figure 3.
From the figure, we can see that the CRF model always outperforms the other methods on all metrics. The major reason is that the CRF model makes better use of the sequential state of sentences and its dependence on features, and therefore has the best effect on named entity recognition in unstructured cybersecurity texts. Further analysis shows that, for named entity recognition in unstructured cybersecurity texts, each observation has abundant interacting context features and dependencies. The HMM can choose the best path within the range of its inference sequence, but its independence assumption and its Markov (no-aftereffect) property restrict the selection of features.
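The contrast between the three models compared above can be stated with their standard textbook factorizations. The block below is a sketch for illustration only; the notation (observation sequence x, label sequence y of length T, feature functions f_k with weights lambda_k) is assumed here and is not taken from this paper.

```latex
% HMM: generative model; each observation x_t depends only on its own state y_t
\begin{equation*}
P(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)
\end{equation*}

% MEMM: discriminative model, normalized locally at every position t
\begin{equation*}
P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T}
  \frac{\exp\!\big(\sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\big)}
       {Z(y_{t-1}, \mathbf{x}, t)}
\end{equation*}

% Linear-chain CRF: discriminative model, normalized once over the whole sequence
\begin{equation*}
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\!\Big(\sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big)
\end{equation*}
```

In the HMM, a feature of x_t cannot depend on neighboring observations, which is the restriction on feature selection noted above. The MEMM and CRF both allow features over the whole observation sequence, and the CRF additionally replaces the per-position normalizer with a single sequence-level normalizer Z(x).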
TABLE 8: Performance of our proposed model with Precision, Recall, F1 on different entity classes
Class      Precision  Recall  F1        Class         Precision  Recall  F1
CVE        1.0000     1.0000  1.0000    Product       0.7579     0.7066  0.7314
AS         1.0000     1.0000  1.0000    Organization  0.8989     0.7366  0.8097
Cert       1.0000     1.0000  1.0000    Person        0.8399     0.7633  0.7998
Host       0.7800     0.8500  0.8135    Place         0.9028     0.8824  0.8925
Domain     0.8225     0.7433  0.7809    Threat        0.8729     0.7536  0.8089
Email      0.8895     0.7965  0.8404    Hacker_Group  0.7500     0.5742  0.6504
MD5        1.0000     1.0000  1.0000    Attack        0.6600     0.5400  0.5940
Registry   0.8901     0.8628  0.8762    Software      0.3396     0.3005  0.3189
SHA1       1.0000     1.0000  1.0000    Protocol      0.8200     0.7800  0.7995
SHA256     1.0000     1.0000  1.0000    Conference    0.6842     0.6023  0.6406
URL        0.9255     0.8700  0.8969    Report        0.6472     0.4821  0.5526
IP         0.9900     0.9900  0.9900    File_Path     0.8936     0.6200  0.7496
File_Name  0.8842     0.8925  0.8883    Event         0.6233     0.3900  0.4798
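As a reading aid for Table 8, the following minimal Python sketch recomputes F1 as the harmonic mean of precision and recall for a few rows. The function name f1_score and the choice of rows are illustrative assumptions; the precision, recall and reported F1 values are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 8
ROWS = {
    "Host": (0.7800, 0.8500, 0.8135),
    "Hacker_Group": (0.7500, 0.5742, 0.6504),
    "Attack": (0.6600, 0.5400, 0.5940),
}

for name, (p, r, reported) in ROWS.items():
    # Print the recomputed value next to the value reported in the table
    print(f"{name}: computed={f1_score(p, r):.4f} reported={reported:.4f}")
```

Because F1 is a harmonic mean, it stays close to the lower of the two inputs, which is why classes with low recall such as Event and Report also show low F1.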
FIGURE 3: Precision, Recall and F1 with different entity classes. Panels: (a) Precision vs. Entity Classes; (b) Recall vs. Entity Classes; (c) F1 vs. Entity Classes.
FIGURE 4: Performance of our proposed model with combination of different feature templates. (x-axis: combination of different feature templates, A, A+C, A+C+S, A+C+S+M; y-axis: performance of our proposed model; series: Precision, Recall, F1.)
FIGURE 5: Performance of our proposed model under different dataset size. (x-axis: size of dataset, 0 to 14000; series: Precision, Recall, F1.)
clear that (1) as more feature templates are combined, the performance of our proposed model improves, and the model achieves its best performance when all feature templates are used; (2) among the variants of our proposed model, the improvements are statistically significant when the marker feature templates are used; (3) the degree of improvement differs considerably between variants in some cases. From this, we conclude that our proposed model is a proper choice for improving the performance of cybersecurity entity recognition.
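To make the idea of combining feature templates concrete, the following minimal Python sketch builds per-token features from switchable template groups. The group names (atom, context, marker), the regular expressions and the function names are illustrative assumptions and are not the templates defined in this paper. In particular, the marker group is assumed here to record which rule pattern, if any, pre-matched the token.

```python
import re

# Hypothetical patterns for a few rule-matchable classes listed in Table 8;
# the regular expressions actually used by the model are not given here.
PATTERNS = {
    "CVE": re.compile(r"CVE-\d{4}-\d{4,}"),
    "MD5": re.compile(r"[0-9a-fA-F]{32}"),
    "SHA1": re.compile(r"[0-9a-fA-F]{40}"),
    "SHA256": re.compile(r"[0-9a-fA-F]{64}"),
}


def marker(token):
    """Return the class of the first pattern matching the whole token, else 'O'."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(token):
            return label
    return "O"


def token_features(tokens, i, groups):
    """Build a feature dict for tokens[i] from the enabled template groups."""
    feats = {}
    tok = tokens[i]
    if "atom" in groups:
        # Features of the current token only
        feats["w"] = tok.lower()
        feats["is_title"] = tok.istitle()
    if "context" in groups:
        # Neighboring tokens, with boundary placeholders
        feats["w-1"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
        feats["w+1"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    if "marker" in groups:
        # Tag produced by the (assumed) rule-based pre-matching step
        feats["marker"] = marker(tok)
    return feats


sentence = ["Exploit", "for", "CVE-2017-0144", "released"]
for i in range(len(sentence)):
    print(token_features(sentence, i, {"atom", "context", "marker"}))
```

Enabling the groups cumulatively only adds keys to each token's feature dictionary, so the same sequence labeler can be retrained on each combination and the results compared.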
5) Impact of dataset size
Figure 5 shows the impact of different dataset sizes on our proposed model. From the figure, we can observe that the size of the dataset significantly affects the results of entity recognition. As the amount of cybersecurity data increases, the recognition accuracy greatly improves, but once the amount of data surpasses a certain threshold, the recognition accuracy becomes stable as the dataset grows further. This phenomenon is consistent with the intuition that our proposed model can handle different dataset sizes effectively, with a significant improvement in recognition power as the data grows.
V. CONCLUSION
In this paper, we propose a novel security named entity recognition method that incorporates regular expressions, a known-entity dictionary and conditional random fields. The proposed model consists of a rule-based extractor, a dictionary-based extractor and a CRF-based extractor. In particular, the rule-based extractor is designed to locate specific entities, the dictionary-based extractor includes known-entity lists, and the CRF-based extractor leverages the entities identified by the rule-based and dictionary-based extractors to further improve the recognition performance. In order to verify the effectiveness of our proposed method, we construct a standard ground truth dataset through collaborative manual annotation and perform extensive experiments. The experimental results show that our proposed method can outperform the state-of-the-art baseline methods. In future work, we will focus on exploring neural network methods to deal with the problems of label imbalance and automatic feature extraction. The results of our work will have a positive effect on the extraction of security knowledge and the construction of knowledge graphs.
[20] R. Lal et al., “Information extraction of security related entities and
concepts from unstructured text,” 2013.
REFERENCES [21] C. L. Jones, R. A. Bridges, K. M. Huffer, and J. R. Goodall, “Towards a
[1] S. Mittal, P. K. Das, V. Mulwad, A. Joshi, and T. Finin, “Cybertwitter: relation extraction framework for cyber-security concepts,” in Proceedings
Using twitter to generate alerts for cybersecurity threats and vulnerabili- of the 10th Annual Cyber and Information Security Research Conference.
ties,” in Proceedings of the 2016 IEEE/ACM International Conference on ACM, 2015, p. 11.
Advances in Social Networks Analysis and Mining. IEEE Press, 2016, [22] R. A. Bridges, C. L. Jones, M. D. Iannacone, K. M. Testa, and J. R.
pp. 860–867. Goodall, “Automatic labeling for entity extraction in cyber security,” arXiv
[2] R. P. Khandpur, T. Ji, S. Jan, G. Wang, C.-T. Lu, and N. Ramakrishnan, preprint arXiv:1308.4941, 2013.
“Crowdsourcing cybersecurity: Cyber attack detection using social me- [23] H. Gasmi, A. Bouras, and J. Laval, “Lstm recurrent neural networks for
dia,” in Proceedings of the 2017 ACM on Conference on Information and cybersecurity named entity recognition,” in Proceedings of the Thirteenth
Knowledge Management. ACM, 2017, pp. 1049–1057. International Conference on Software Engineering Advances. IARIA
[3] G. Husari, X. Niu, B. Chu, and E. Al-Shaer, “Using entropy and mutual in- XPS Press, 2018, pp. 1–6.
formation to extract threat actions from cyber threat intelligence,” in 2018 [24] Ya, G. Qin, W. Shen, Y. Zhao, M. Chen, X. Yu, and Jin, “A network
IEEE International Conference on Intelligence and Security Informatics security entity recognition method based on feature template and cnn-
(ISI). IEEE, 2018, pp. 1–6. bilstm-crf,” Frontiers of IT & EE, vol. 20, no. 6, pp. 872–884, 2019.