Hand Written Recognition
ABSTRACT Named entity recognition (NER) is the task of categorizing named entities in a given text, and it has long suffered from a lack of labeled corpora. Deep neural networks have been successfully applied to NER tasks; however, they require large amounts of annotated data. Regardless of how much data is made available, annotation requires significant human effort, which is expensive and time-consuming. Moreover, collecting labeled data that reflect current, real-world circumstances requires exhaustive follow-up and incurs correspondingly higher costs. Current NER systems typically rely on supervised learning over hand-crafted data. The best-known dataset for NER shared tasks, released at the 2003 Conference on Natural Language Learning, is used for basic training and evaluation. Although the data are of high quality, the dataset has low coverage of timely material. In this paper, we present methods for swiftly labeling up-to-date data via distant supervision. To tackle the difficulty of annotating contemporary written texts, we generate labeled data from articles that reflect the latest issues. We evaluated the proposed methods with a bidirectional long short-term memory conditional random-field architecture using static and contextualized embedding methods. Our proposed models outperform state-of-the-art methods, with average F1-scores that are 3.09% better with weakly labeled Wikipedia data and 3.47% better with Cable News Network data. When using the NER model with Flair embedding, our method shows 1.50% and 3.26% higher F1-scores with weakly labeled Wikipedia and news data, respectively. Qualitatively, the proposed model also performs better when extracting contemporary keywords.
INDEX TERMS Computational and artificial intelligence, named entity recognition, natural language
processing, neural networks, transfer learning, weakly supervised learning.
also important. It is a well-known fact that performance improves when the training and test dataset distributions are similar [6]. In fields of natural language processing (e.g., document processing [7]–[10]), extant studies have primarily sought to overcome the differences in the domain distributions of training and test datasets. In this study, for NER, we focused not only on the domain distribution, but also on the changes in writing styles over time.

It is evident that the usage of words and styles changes over time [11]. Hence, NER is more affected by time than other natural language tasks. First, new NEs are consistently being produced, because new institutions, organizations, and sometimes locations are created and then named, and the uses and meanings of words continue to change. For example, "Apple" was just a fruit before the 1970s, but now it also refers to a prominent company. Second, the frequency of specific NEs changes with time. For example, the name "Dora" was the 51st most popular female name in the 1880s; by the 1990s, it had fallen to 972nd. If NER training data had been created in 1880, it would have included Dora as a named entity of type person, but likely not if it had been created in 1990. Therefore, when we apply NER to contemporary written texts, it is beneficial to use the latest texts as training data to ensure high performance.

Most recent studies trained and evaluated their NER models using the Conference on Natural Language Learning (CoNLL) 2003 benchmark dataset [1], which consists of Reuters news collected between August 1996 and August 1997. The distribution of the CoNLL 2003 dataset differs from contemporary written text. Thus, training with CoNLL 2003 can degrade the performance of an NER on current real-world issues. The best practice would be to annotate written text data manually and consistently update them over time as NEs emerge. However, this is infeasible, because manual annotation is costly and time-consuming. Furthermore, new text data, including news and user-generated texts, are continuously being generated.

A previous work by Kim et al. [12] proposed an NER that performed well on the latest generated texts by generating weakly labeled data and training on them. They manually constructed relations wherein both the subject and object implied an NE, using the Freebase [13] database. Two entities connected via a collected relation in Freebase were then used to automatically annotate the unlabeled data based on their co-occurrence in a sentence. Our study is based on this work, and we propose an advanced method.

The main differences between the method of Kim et al. and our method can be summarized in two points. The first is that we use Wikipedia as the knowledge source. Wikipedia is more useful than Freebase, because real-time information is updated quickly and there is a larger volume of information for the knowledge graph. Thus, it enables more accurate automatic labeling from the enlarged knowledge base. Second, our method does not require any human labeling. In the work of Kim et al., relations were collected via human labeling to obtain pairs of potential NEs. Alternatively, we devise a method that does not require this process. These changes allow the proposed method to be adopted more flexibly for newly generated texts.

Using weakly supervised data is advantageous in that human labeling is not required as long as there is a knowledge source and plenty of unlabeled data. However, this approach suffers from noisy labels that come from automatic alignment and incomplete knowledge. Therefore, we adopt a transfer-learning approach to utilize the weakly supervised data. Transfer learning is a method that transfers information from a source task or dataset to solve the data-shortage problem of a target task. Kim et al. [12] showed that transfer learning could be used to reduce noise while providing abundant information from weakly labeled data for the target task.

In our experiments, we generate weakly labeled data using unlabeled texts from two domains: Wikipedia and Cable News Network (CNN) news articles. The generated weakly labeled data are used to train a bidirectional (BI) long short-term memory (LSTM) conditional random-field (CRF) NER model using static and contextualized embedding methods. The proposed NER is evaluated on the CoNLL 2003 benchmark dataset and on the latest Wikipedia and CNN news texts. Experimental results show that our method is useful and flexible for recognizing NEs in up-to-date texts.

The remainder of this paper is organized as follows. Section II deals with related work on distant supervision and transfer learning. Section III describes the proposed method for constructing the NER, which is designed to perform well on contemporary written texts. The experimental settings and results are discussed in Section IV. Finally, in Section V, we present our conclusion and outline future work.

II. RELATED WORK
Traditional NER approaches were formulated as sequence-labeling tasks using an inside–outside–beginning tagging scheme. In early machine-learning studies, support vector machines [14], hidden Markov models [15], maximum entropy models [16], and CRFs [17], [18] required many handcrafted features, including part-of-speech tags, dependency parses, and external lexical resources known as gazetteers (e.g., WordNet [19]). The performance of these models was heavily dependent on hand-engineered features.

In recent years, BI-LSTM-CRF, which is based on a deep neural network (DNN), has exhibited outstanding performance. DNN-based models have been applied to NER without requiring handcrafted features [20]–[22]. LSTM is the most typically used architecture for sequence tagging. Huang et al. [3] proposed a BI-LSTM-CRF architecture that passes information in both directions and uses a CRF layer atop the BI-LSTM. BI-LSTM-CRF showed a remarkable improvement on the NER task, and it was the basis for many contemporary studies. Another modified RNN-based model is the stacked LSTM [23], which learns character-level features by concatenating the vector representations of a BI-LSTM over the characters of the input word. Chiu and Nichols [5]
and Ma and Hovy [24] applied a CNN architecture to the BI-LSTM-based NER model to extract more features from input words and characters.

Distant supervision [25], proposed by Mintz et al., is a type of weak supervision used to automatically obtain labeled data. In their work, textual features were extracted to train a relation classifier by exploiting the large Freebase knowledge base under the following assumption: if two entities are related in Freebase and a sentence contains both entities, the sentence is likely to express that relationship. The sentence can then be used as a training instance or for feature design. Our proposed method begins by applying similar assumptions; details are presented in Section III-A.

Many prior NER studies proposed methods for generating weakly supervised data based on NE dictionary matching. Yang et al. [26] constructed weakly supervised data by automatically mapping an entity dictionary containing entities extracted from a small existing NER dataset. Shang et al. [27] proposed neural models that used distant supervision from a manually constructed dictionary. However, simply matching terms from an NE dictionary against raw sentences generates significant amounts of noise. Recalling the "Apple" example, Disney's "Snow White" story can quickly mislabel "apple" as the global corporation. Therefore, we need a method for generating weakly labeled data that reflects the relationship between potential NEs and the current knowledge source in context, similar to the method of Mintz et al. [25].

Transfer learning enables the domains used for training and testing to be different. Studies on transfer learning have succeeded because the method allows users to apply previously learned knowledge to solve new problems faster, resulting in better solutions. The fundamental motivation for transfer learning in machine learning was first discussed at the NIPS-95 workshop "Learning to Learn" [28], which focused on the need for lifelong machine-learning methods that retain and reuse previously learned knowledge [6]. Since then, transfer learning has been used widely in image processing.

There are mainly two types of transfer learning: shallow and deep. Shallow transfer-learning methods connect the source and target domains without using target labels by learning an invariant representation or estimating instance importance [29]–[31]. Deep transfer-learning methods incorporate domain adaptation into the deep-learning pipeline to learn more transferable representations by leveraging deep networks, which can simultaneously disentangle the explanatory factors of variation behind the data and match the marginal distributions across domains [32]–[36].

Typical transfer-learning NLP methods incorporate pre-training and fine-tuning through deep transfer learning. During pre-training, the model learns from the training dataset. Then, the information is transferred to the target task. Fine-tuning is the process of tuning the parameters of the model to improve performance on the target task. Modern machine-learning models, especially DNNs, benefit significantly from transfer learning [37]–[39].

III. PROPOSED METHOD
Figure 1 presents an overview of our proposed method, which consists of two parts: weakly labeled data generation and model training via transfer learning. Weakly labeled data are generated from unlabeled data and knowledge gathered from Wikipedia by applying a distant supervision method. The training is based on transfer learning, which involves pre-training and fine-tuning.

A. WEAKLY LABELED DATA GENERATION BASED ON DISTANT SUPERVISION
Our proposed method first carries out automatic weak labeling of training data based on unlabeled data and a knowledge source. For weak labeling, we use the link and category information of Wikipedia as the knowledge source and automatically annotate the unlabeled data. We describe the characteristics of Wikipedia in the next subsection (Section III-A1) and present the gathering of potential NEs and a method for automatic labeling and constructing the weakly labeled data in Section III-A2.

1) CHARACTERISTICS OF WIKIPEDIA AS A KNOWLEDGE SOURCE
We use Wikipedia as the knowledge source for automatic labeling in the proposed distant supervision approach. Wikipedia is a freely available encyclopedia that hosts accessible and editable content. There are currently more than six million articles in the English version. An article is identified using a universal resource identifier (URI) to discriminate between
Figure 4 shows an example Wikipedia article, "Alice's Adventures in Wonderland," in which four entities carry hyperlinks: "novel," "Lewis Carroll," "Alice," and "anthropomorphic." "Novel" belongs to only one category, "History of literature in the United Kingdom," whose highest-level category is "Culture." Although the entities "Alice" and "Lewis Carroll" have subcategories such as "Teenage characters in film" and "British surrealist writers" (both subcategories of "People"), we cannot add < ("Alice," "Person") & ("novel," "Culture") >. However, < ("Alice," "Person") & ("Lewis Carroll," "Person") > can be added to the potential NE pairs that are used for weak labeling.
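As an illustration of this pair-gathering step, the following Python sketch shows one way potential NE pairs could be collected from an article's hyperlinks. The pre-resolved top-level categories, the CATEGORY_TO_NE mapping, and the potential_ne_pairs helper are assumptions for illustration, not the exact implementation used in this work.

```python
from itertools import combinations

# Assumed mapping from top-level Wikipedia categories to NE types; the
# PER/LOC/ORG tag set mirrors the labels used elsewhere in the paper.
CATEGORY_TO_NE = {"People": "PER", "Geography": "LOC", "Organizations": "ORG"}

def potential_ne_pairs(article_links):
    """Build potential NE pairs from one article's hyperlinked entities.

    `article_links` maps each hyperlinked entity to the top-level Wikipedia
    category it resolves to (a hypothetical, pre-computed structure), e.g.
        {"Alice": "People", "Lewis Carroll": "People", "novel": "Culture"}
    """
    typed = {e: CATEGORY_TO_NE[c] for e, c in article_links.items()
             if c in CATEGORY_TO_NE}          # "novel" (Culture) is dropped
    # Only pairs in which *both* entities received an NE type are kept,
    # e.g. <("Alice", PER), ("Lewis Carroll", PER)>.
    return [((e1, t1), (e2, t2))
            for (e1, t1), (e2, t2) in combinations(typed.items(), 2)]

# Example from the "Alice's Adventures in Wonderland" article:
pairs = potential_ne_pairs({"Alice": "People", "Lewis Carroll": "People",
                            "novel": "Culture"})
# -> [(("Alice", "PER"), ("Lewis Carroll", "PER"))]
```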
3) WEAK-LABELING STRATEGY
To obtain weakly labeled sentences, we automatically annotate using the following heuristic rule: if the two entities of a potential NE pair appear together in one unlabeled sentence, we automatically annotate them in that sentence. Among the automatically labeled results, we keep only positive samples as weakly labeled data. The aim is to transfer more specific information through the pre-training step to the target task. Therefore, negative samples are not used, because these samples do not contain NEs. Additionally, this approach reduces training costs.
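A minimal sketch of this heuristic, assuming whitespace-tokenized sentences and the inside–outside–beginning (BIO) convention mentioned in Section II, could look as follows. The helper names and the substring-based matching are illustrative assumptions rather than the exact pipeline used here.

```python
def tag_span(tokens, tags, entity, ne_type):
    """Mark every occurrence of `entity` in `tokens` with B-/I- tags; return True if found."""
    ent_tokens = entity.split()
    found = False
    for i in range(len(tokens) - len(ent_tokens) + 1):
        if tokens[i:i + len(ent_tokens)] == ent_tokens:
            tags[i] = f"B-{ne_type}"
            for j in range(i + 1, i + len(ent_tokens)):
                tags[j] = f"I-{ne_type}"
            found = True
    return found

def weak_label(sentences, potential_pairs):
    """Annotate sentences in which both entities of a potential NE pair co-occur.

    Sentences where no pair matches (negative samples) are discarded, as in
    the weak-labeling strategy described above.
    """
    labeled = []
    for sent in sentences:
        tokens = sent.split()
        tags = ["O"] * len(tokens)
        positive = False
        for (e1, t1), (e2, t2) in potential_pairs:
            if e1 in sent and e2 in sent:      # both entities appear together
                hit1 = tag_span(tokens, tags, e1, t1)
                hit2 = tag_span(tokens, tags, e2, t2)
                positive |= hit1 and hit2
        if positive:
            labeled.append(list(zip(tokens, tags)))
    return labeled

# Usage with the pair gathered earlier (hypothetical sentence):
weak = weak_label(["Lewis Carroll wrote about Alice ."],
                  [(("Alice", "PER"), ("Lewis Carroll", "PER"))])
# -> [[("Lewis","B-PER"), ("Carroll","I-PER"), ("wrote","O"),
#      ("about","O"), ("Alice","B-PER"), (".","O")]]
```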
B. LEARNING METHOD USING WEAKLY LABELED DATA
In our transfer-learning approach, we use weakly labeled data built from the latest generated texts for pre-training and manually labeled data consisting of outdated texts for fine-tuning. With general transfer learning, the knowledge gained in the source domain is used to pre-train the model, and data related to the target domain are used for fine-tuning, because pre-training is an auxiliary task for the target task. In contrast, we use target-related data that contain contemporary written text for pre-training, and we then use outdated data for fine-tuning. Considering that our purpose is to obtain an NER that works well on contemporary written texts, this approach differs from the original transfer learning in that we utilize data related to the target domain for the auxiliary training.

The reason for this learning approach is that the weakly labeled data contain noise; using them for fine-tuning would increase the chance of the noise being learned. Therefore, we intend for the parameters learned from the noise of the weakly labeled data to be readjusted in the right direction for NER via the manually labeled data. Because we assume that only outdated data are available for manual labeling, we use them for fine-tuning. This approach enables our model to utilize the abundant information of the weakly labeled data while correcting errors from the weak labels using the manually labeled data.
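The two-stage schedule can be summarized by the following sketch. The function name, the dataset objects, the training callback, and the epoch counts are hypothetical placeholders; the mini-batch sizes (128 for pre-training, 32 for fine-tuning) are the values reported later in Section IV-B.

```python
def pretrain_then_finetune(model, weak_data, clean_data, train_one_epoch,
                           pretrain_epochs=5, finetune_epochs=50):
    """Two-stage schedule: pre-train on weakly labeled data, then fine-tune.

    `model`, the two datasets, `train_one_epoch(model, data, batch_size)`,
    and the epoch counts are hypothetical placeholders; only the batch sizes
    follow the paper (Section IV-B).
    """
    # Stage 1: noisy but up-to-date supervision (weakly labeled Wikipedia/CNN news).
    for _ in range(pretrain_epochs):
        train_one_epoch(model, weak_data, batch_size=128)
    # Stage 2: clean but outdated supervision (CoNLL 2003) readjusts parameters
    # that were pulled in the wrong direction by label noise.
    for _ in range(finetune_epochs):
        train_one_epoch(model, clean_data, batch_size=32)
    return model
```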
C. NER MODELS
The proposed method is verified using NER models based on BI-LSTM-CRF with static and contextualized embeddings. The BI-LSTM-CRF model is the most widely used model for sequence-labeling problems, because it is specialized for capturing sequential information between input tokens. We adopt the model proposed by Lample et al. [4], which composes a BI-LSTM-CRF model with additional LSTM-based character-level embeddings. In the embedding layer, the input words are mapped to their vector representations, which are initialized with Glove [41]. The embeddings of the input tokens are fed into the LSTM network, which extracts the features. To predict the label sequence, the extracted features are passed to the CRF layer, which predicts the label of the current input token by considering the information of the surrounding labels.
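A compressed PyTorch sketch of such an encoder is given below, assuming the dimensions reported in Section IV-B. For brevity it stops at the per-token emission scores that a CRF layer would consume, and it is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Word + character-level BI-LSTM encoder producing per-token emission scores.

    Dimensions follow Section IV-B (300-d word LSTM states, 100-d character
    LSTM states, dropout 0.5); word embeddings would be initialized from
    pre-trained Glove vectors in practice. A CRF layer (not shown) would take
    the emission scores and predict the label sequence jointly.
    """

    def __init__(self, vocab_size, char_vocab_size, num_tags,
                 word_dim=300, char_dim=25, char_hidden=100, word_hidden=300):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)   # init from Glove in practice
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True,
                                 batch_first=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, word_hidden,
                                 bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.emissions = nn.Linear(2 * word_hidden, num_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        batch, seq_len, max_len = char_ids.shape
        chars = self.char_emb(char_ids.view(batch * seq_len, max_len))
        _, (h, _) = self.char_lstm(chars)                     # final fwd/bwd states
        char_repr = torch.cat([h[0], h[1]], dim=-1).view(batch, seq_len, -1)
        tokens = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
        features, _ = self.word_lstm(self.dropout(tokens))
        return self.emissions(features)                       # scores for the CRF layer
```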
Glove is a static word embedding, so every occurrence of a word is mapped to the same embedding vector. In addition, we utilize the contextualized embedding method of Flair [42] to show the effectiveness of our method across different models. Contextualized embedding methods generate different embedding vectors for the same word according to its surrounding words (i.e., its context). Flair embedding provides state-of-the-art performance on the CoNLL 2003 dataset by exploiting the contextualized information between input tokens. This model comprises a contextual embedding layer with the BI-LSTM-CRF model on top to predict the NEs.

IV. EXPERIMENTS
This section presents the detailed experimental settings and results with analysis.

A. EXPERIMENTAL SETTINGS
The statistics of the manually labeled, unlabeled, and weakly labeled data are presented in Table 1.

TABLE 1. The statistics of the used datasets.

• Manually labeled data: We utilized manually labeled data for training and evaluation. The training data comprised 14,987 sentences and were derived from the CoNLL 2003 dataset [1]. This dataset contains one training file and two testing files. Among the three, we used the training data to train the baseline model and to fine-tune our proposed model. To verify our proposed method, we used three types of datasets for evaluation: CoNLL, Wiki, and CNNnews. CoNLL covers four types of NE tags: "persons" (PER), "organizations" (ORG), "locations" (LOC), and "miscellaneous names" (MISC). The Wiki set comprises randomly selected sentences from Wikipedia articles in a January 2018 dump, and it contains three NE labels: PER, ORG, and LOC. The CNNnews dataset was manually labeled using news articles from the CNN website, and it covers the same NE labels as the Wiki test dataset. Note that the CoNLL test set is generated from outdated texts, whereas both Wiki and CNNnews are based on contemporary written texts.
• Unlabeled data: Two types of unlabeled datasets were used for automatic labeling. As before, the first comprised Wikipedia data dumped in January 2018. The second comprised CNN news articles crawled from August to September 2018.
• Weakly labeled data: The weakly labeled data were generated using the proposed method described in Section III and were used to pre-train the NER model. Note that we took only positive samples; thus, the number of weakly labeled sentences was smaller than the number of unlabeled sentences.

B. EXPERIMENTAL DETAILS
When adopting the BI-LSTM-CRF model, we used an LSTM cell [43] with 300-dimensional vectors for the forward/backward hidden states and a dropout value of 0.5. It was trained using the Adam optimizer [44] with a learning rate of 0.001 and an epsilon of 1e-08. As described in Section III-C, we added character-level representations using another BI-LSTM, whose forward and backward hidden-state dimensions were both 100. The hyperparameters for pre-training and fine-tuning were the same, except for a mini-batch size of 128 for pre-training and 32 for fine-tuning. We used pre-trained Glove word embeddings trained on a Common Crawl corpus (840B tokens, 2.2M vocabulary, cased, 300-dimensional vectors).

To train the Flair-based NER model, we adopted most of the hyperparameters from the study of Akbik et al. [42], including the LSTM hidden sizes, the dropouts, and all values for the character-level language model and the sequence-tagging model. This NER model was optimized using stochastic gradient descent with a learning rate of 0.1. The mini-batch size was the same as that of our BI-LSTM-CRF model.
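As a rough illustration, a Flair-based tagger with these settings could be set up as follows, assuming the flair library's API around the time of Akbik et al. [42] (it may differ in later versions). The corpus paths, column format, embedding names, hidden size, and epoch count are assumptions; only the SGD learning rate of 0.1 and mini-batch size of 32 follow the values above.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style column files (token, NER tag); folder and file names are hypothetical.
corpus = ColumnCorpus("data/weak_wiki", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")

embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),                  # static word embeddings
    FlairEmbeddings("news-forward"),          # contextual character-level LM
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=corpus.make_tag_dictionary("ner"),
                        tag_type="ner", use_crf=True)

# SGD with learning rate 0.1 and mini-batch size 32, as in Section IV-B.
ModelTrainer(tagger, corpus).train("models/flair_ner",
                                   learning_rate=0.1, mini_batch_size=32,
                                   max_epochs=100)
```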
C. EXPERIMENTAL RESULTS AND DISCUSSION
We conducted various experiments to demonstrate the impact of our proposed method, as described below.

1) EFFECTIVENESS OF WEAKLY LABELED DATA FOR TRAINING
The performance changes gained by adopting the proposed method for the BI-LSTM-CRF models using Glove and Flair are presented in Table 2. The hyperparameters for the baseline models were the same as the values described in Section IV-B. We marked the highest performance for each test dataset in bold. The results of the baseline models, which used only the manually labeled outdated data, are shown in rows 1-3 and 10-12. Additionally, we present the performance when using weakly labeled data from two different domains: Wikipedia (rows 4-6 and 13-15) and CNN news (rows 7-9 and 16-18).

TABLE 2. The performance of NER using CNN news articles for unlabeled data.

The results of the models trained using the proposed weakly labeled datasets, Wikipedia and CNN news, via transfer learning were better than those of the model trained with only manually labeled data. When Glove embedding was used, pre-training with weakly labeled Wikipedia data showed performance improvements of 0.18, 3.84, and 5.25 points on the CoNLL, Wiki, and CNNnews datasets, respectively. When weakly labeled data of the CNNnews domain were used, performance improvements of 0.38, 3.77, and 6.28 points were obtained. For the CoNLL and CNNnews test data, using weakly labeled data of the CNNnews domain was more helpful in improving performance. Experiments using Flair embedding showed similar results. Our proposed method improved performance on all three datasets.

Interestingly, the performance improved even on the outdated CoNLL test dataset. Although the improvements were not large, we can see that our method provided additional information to the NER. The results on the Wiki and CNNnews test datasets experimentally demonstrate that our method works well on contemporary written text. Additionally, we verified the performance improvement in both NER models, using static (e.g., Glove) and contextualized (e.g., Flair) embeddings. These results imply that the proposed method can be effective even when applied to various other models.

2) RELATIONSHIP BETWEEN DOMAINS OF UNLABELED AND TARGET DATA
We found that the relationship between the domain of the unlabeled data used for weak labeling and that of the test data affected performance. As expected, the performance improvement was greater when the domain of the unlabeled data was the same as that of the test data. When using the weakly labeled CNNnews dataset for pre-training, the BI-LSTM-CRF model performed 6.28 points better on that test set, which was 1.03 points higher than when using weakly labeled Wiki data for pre-training. In experiments using the Flair embedding, the performance change was even more pronounced: the difference in the domains of the weakly labeled data produced a performance difference of 2.63 points on the CNNnews test set. Using weakly labeled Wiki data also showed improved performance on the Wiki test set. Therefore, we empirically showed that it is better to utilize unlabeled data from the same domain as the target.

3) PERFORMANCE CHANGES BY WEAK LABELING
To verify the effects of our proposed method for generating weakly labeled data, we compared it with the previous work of Kim et al. [12] using the same BI-LSTM-CRF model and the Wikipedia unlabeled data. The results on the three test sets are summarized in Table 3. The main distinction between the work of Kim et al. and this work is the method used to generate the weakly labeled data. The previous work proposed an automatic labeling method that constructs a set of relevant relations in Freebase and then matches the candidates against plain texts; that work relied on manual labeling to construct the relevant relations. We used Wikipedia links and category information to generate weakly labeled data without the need for human labeling.

TABLE 3. The comparison of the existing method and the proposed method.

From the experimental results, our proposed method outperformed the work of Kim et al. on all test datasets. Additionally, our method improved performance on the CoNLL 2003 dataset, whereas the method of Kim et al. did not. We conclude that this result derives from the growth in the quantity and quality of the information contained in the weakly labeled data. That is, our proposed weak-labeling method is better and more useful for the target data. Additionally, our weak-labeling method improves over the method of Kim et al. in that ours does not require manual labeling at all.
4) PERFORMANCE DIFFERENCES FOR EACH NE TAG
Table 4 presents the NER performance by NE tag type, evaluated on the CoNLL 2003 dataset. The baseline models were trained with only the CoNLL 2003 training data, and the proposed models were pre-trained using weakly labeled CNNnews data and fine-tuned with the CoNLL 2003 training data. Note that the NE tags of the CoNLL 2003 dataset include PER, LOC, ORG, and MISC, whereas the weakly labeled data contain PER, LOC, and ORG (no MISC). As a result, performance increased for the three NE tags PER, LOC, and ORG, while the MISC tag performance decreased slightly.

TABLE 4. The comparison of the existing method and the proposed method by NE tag types.
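Entity-level, per-tag scores of the kind reported in Table 4 are commonly computed with a chunk-based scorer. The following sketch uses the seqeval package (an assumption, since the paper does not name its scoring tool) on toy BIO sequences.

```python
from seqeval.metrics import classification_report, f1_score

# Gold and predicted BIO sequences for two toy sentences (hypothetical data).
y_true = [["B-PER", "I-PER", "O", "B-ORG"], ["B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],     ["B-LOC", "O"]]

print(f1_score(y_true, y_pred))               # micro-averaged entity-level F1
print(classification_report(y_true, y_pred))  # per-tag (PER/LOC/ORG/...) breakdown
```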
We attribute the performance increase on the three tags to the NE information acquired from the weakly labeled data being effectively transferred through pre-training. More specifically, the proposed pre-training process provided more parameter-tuning opportunities for the three NE tags, and that tuning helped the model recognize NEs. If the information transferred from the weakly labeled data were inaccurate, performance might have declined; since it improved, we can see that the data contain useful information for recognizing NEs and that transfer learning is a useful way to reflect that information.

In this experiment, we confirmed that, even when the distributions of the training and test data were both those of outdated texts, the information in the weakly labeled data was transferred, thereby improving performance. However, the purpose of our study was not to improve performance on test data having the same distribution as the training data, but to work well on contemporary written texts. Therefore, we next present a qualitative evaluation to examine the effect of the proposed model.

5) QUALITATIVE ANALYSIS OF EFFECTS ON CONTEMPORARY WRITTEN TEXTS
Based on the results of our proposed method, we examined how the model adaptively learned from contemporary written texts. The CoNLL 2003 dataset is widely used as training data, but a model trained with it alone cannot broadly cope with neologisms. Because our weakly labeled data provide up-to-date text for the model to learn from, the model adapted easily to newly used expressions. For instance, "Project Jupyter" is a software organization founded in 2014. The baseline model could not extract it as an organization, whereas the proposed model predicted it to be an organization. "Parasite" was a movie released in 2019, premiering at the 2019 Cannes Film Festival. Our proposed model predicted
"Song Kang-ho," its main actor, to be a person, whereas the baseline model could not. Similarly, the keyword "COVID-19" has been prevalent this year, and our model predicted it as "miscellaneous," because an appropriate category does not yet exist.

V. CONCLUSION
In this paper, we proposed an NER method that works efficiently on contemporary written texts. Our main contribution is that we improved NER performance by automatically generating weakly labeled training data without the use of manual labeling. Distant supervision was applied to generate the weakly labeled data. With our distant supervision approach, we first collected potential NE pairs using links and category information from Wikipedia. We then automatically labeled unlabeled sentences in which two potential NEs appeared together. The weakly supervised data were used for pre-training. For fine-tuning, we used the CoNLL 2003 dataset, which is a manually labeled dataset.

We evaluated the proposed model on contemporary written texts using datasets collected from Wikipedia, CNN news, and CoNLL. The experimental results indicated that our proposed method improved NER performance on all datasets. Furthermore, we showed that our method outperformed the model that utilized other weakly labeled data. We further verified the effectiveness of our method through an example analysis of up-to-date text content. In future work, we plan to develop a method for excluding noisy examples from the weakly labeled training data.

REFERENCES
[1] E. F. Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," 2003, arXiv:cs/0306050. [Online]. Available: https://arxiv.org/abs/cs/0306050
[2] R. Weischedel, S. Pradhan, L. Ramshaw, M. Palmer, N. Xue, M. Marcus, A. Taylor, C. Greenberg, E. Hovy, R. Belvin, and A. Houston, "OntoNotes release 4.0," Linguistic Data Consortium, Philadelphia, PA, USA, Tech. Rep. LDC2011T03, 2011.
[3] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," 2015, arXiv:1508.01991. [Online]. Available: http://arxiv.org/abs/1508.01991
[4] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," 2016, arXiv:1603.01360. [Online]. Available: http://arxiv.org/abs/1603.01360
[5] J. P. C. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," Trans. Assoc. Comput. Linguistics, vol. 4, pp. 357–370, Jul. 2016. [Online]. Available: https://transacl.org/ojs/index.php/tacl/article/view/792
[6] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010, doi: 10.1109/TKDE.2009.191.
[7] J. Blitzer, M. Dredze, and F. Pereira, "Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification," in Proc. 45th Annu. Meeting Assoc. Comput. Linguistics (ACL), Prague, Czech Republic, Jun. 2007, pp. 1–8. [Online]. Available: https://www.aclweb.org/anthology/P07-1056/
[8] B. Myagmar, J. Li, and S. Kimura, "Cross-domain sentiment classification with bidirectional contextualized transformer language models," IEEE Access, vol. 7, pp. 163219–163230, 2019, doi: 10.1109/ACCESS.2019.2952360.
[9] C. Pan, J. Huang, J. Gong, and X. Yuan, "Few-shot transfer learning for text classification with lightweight word embedding based models," IEEE Access, vol. 7, pp. 53296–53304, 2019, doi: 10.1109/ACCESS.2019.2911850.
[10] X. Ma, P. Xu, Z. Wang, R. Nallapati, and B. Xiang, "Domain adaptation with BERT-based domain classification and data selection," in Proc. 2nd Workshop Deep Learn. Approaches Low-Resource NLP (DeepLo@EMNLP-IJCNLP), Hong Kong, Nov. 2019, pp. 76–83. [Online]. Available: https://doi.org/10.18653/v1/D19-6109
[11] J. Aitchison, Language Change: Progress or Decay? Cambridge, U.K.: Cambridge Univ. Press, 2001.
[12] J. Kim, S. Kang, Y. Park, and J. Seo, "Transfer learning from automatically annotated data for recognizing named entities in recent generated texts," in Proc. IEEE Int. Conf. Big Data Smart Comput. (BigComp), Feb. 2019, pp. 1–5.
[13] K. D. Bollacker, R. P. Cook, and P. Tufts, "Freebase: A shared database of structured general human knowledge," in Proc. 22nd AAAI Conf. Artif. Intell., Vancouver, BC, Canada, Jul. 2007, pp. 1962–1963. [Online]. Available: http://www.aaai.org/Library/AAAI/2007/aaai07-355.php
[14] H. Isozaki and H. Kazawa, "Efficient support vector classifiers for named entity recognition," in Proc. 19th Int. Conf. Comput. Linguistics (COLING), 2002, pp. 1–7. [Online]. Available: https://www.aclweb.org/anthology/C02-1054
[15] G. Zhou and J. Su, "Named entity recognition using an HMM-based chunk tagger," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, Philadelphia, PA, USA, Jul. 2002, pp. 473–480. [Online]. Available: https://www.aclweb.org/anthology/P02-1060
[16] O. Bender, F. J. Och, and H. Ney, "Maximum entropy models for named entity recognition," in Proc. 7th Conf. Natural Lang. Learn. (HLT-NAACL), 2003, pp. 148–151. [Online]. Available: https://www.aclweb.org/anthology/W03-0420
[17] R. Klinger, "Automatically selected skip edges in conditional random fields for named entity recognition," in Proc. Int. Conf. Recent Adv. Natural Lang. Process., Hissar, Bulgaria, Sep. 2011, pp. 580–585. [Online]. Available: https://www.aclweb.org/anthology/R11-1082
[18] M. Marcińczuk, "Automatic construction of complex features in conditional random fields for named entities recognition," in Proc. Int. Conf. Recent Adv. Natural Lang. Process., Hissar, Bulgaria, Sep. 2015, pp. 413–419. [Online]. Available: https://www.aclweb.org/anthology/R15-1054
[19] I. Feinerer and K. Hornik, WordNet: WordNet Interface, R package version 0.1-15, 2020. [Online]. Available: https://CRAN.R-project.org/package=wordnet
[20] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Aug. 2011.
[21] H. Wei, M. Gao, A. Zhou, F. Chen, W. Qu, C. Wang, and M. Lu, "Named entity recognition from biomedical texts using a fusion attention-based BiLSTM-CRF," IEEE Access, vol. 7, pp. 73627–73636, 2019, doi: 10.1109/ACCESS.2019.2920734.
[22] D. Zhang, C. Xia, C. Xu, Q. Jia, S. Yang, X. Luo, and Y. Xie, "Improving distantly-supervised named entity recognition for traditional Chinese medicine text via a novel back-labeling approach," IEEE Access, vol. 8, pp. 145413–145421, 2020, doi: 10.1109/ACCESS.2020.3015056.
[23] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol. (NAACL-HLT), San Diego, CA, USA, Jun. 2016, pp. 260–270. [Online]. Available: https://www.aclweb.org/anthology/N16-1030
[24] X. Ma and E. Hovy, "End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF," in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, Berlin, Germany, Aug. 2016, pp. 1064–1074. [Online]. Available: https://www.aclweb.org/anthology/P16-1101
[25] M. Mintz, S. Bills, R. Snow, and D. Jurafsky, "Distant supervision for relation extraction without labeled data," in Proc. Joint Conf. 47th Annu. Meeting ACL 4th Int. Joint Conf. Natural Lang. Process. (AFNLP), Singapore, Aug. 2009, pp. 1003–1011. [Online]. Available: https://www.aclweb.org/anthology/P09-1113/
[26] Y. Yang, W. Chen, Z. Li, Z. He, and M. Zhang, "Distantly supervised NER with partial annotation learning and reinforcement learning," in Proc. 27th Int. Conf. Comput. Linguistics (COLING), Santa Fe, NM, USA, Aug. 2018, pp. 2159–2169. [Online]. Available: https://www.aclweb.org/anthology/C18-1183/
[27] J. Shang, L. Liu, X. Gu, X. Ren, T. Ren, and J. Han, "Learning named entity tagger using domain-specific dictionary," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Brussels, Belgium, Oct./Nov. 2018, pp. 2054–2064. [Online]. Available: https://doi.org/10.18653/v1/d18-1230
[28] S. Thrun and L. Y. Pratt, "Learning to learn: Introduction and overview," in Learning to Learn, S. Thrun and L. Y. Pratt, Eds. Boston, MA, USA: Springer, 1998, pp. 3–17, doi: 10.1007/978-1-4615-5529-2_1.
[29] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf, "Correcting sample selection bias by unlabeled data," in Proc. 20th Annu. Conf. Neural Inf. Process. Syst. (NIPS), Vancouver, BC, Canada, Dec. 2006, pp. 601–608. [Online]. Available: http://papers.nips.cc/paper/3075-correcting-sample-selection-bias-by-unlabeled-data
[30] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011, doi: 10.1109/TNN.2010.2091281.
[31] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Columbus, OH, USA, Jun. 2014, pp. 1717–1724, doi: 10.1109/CVPR.2014.222.
[32] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, "Deep domain confusion: Maximizing for domain invariance," 2014, arXiv:1412.3474. [Online]. Available: http://arxiv.org/abs/1412.3474
[33] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, "Simultaneous deep transfer across domains and tasks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, Dec. 2015, pp. 4068–4076, doi: 10.1109/ICCV.2015.463.
[34] M. Long, H. Zhu, J. Wang, and M. I. Jordan, "Unsupervised domain adaptation with residual transfer networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Barcelona, Spain, Dec. 2016, pp. 136–144. [Online]. Available: http://papers.nips.cc/paper/6110-unsupervised-domain-adaptation-with-residual-transfer-networks
[35] Y. Ganin and V. S. Lempitsky, "Unsupervised domain adaptation by backpropagation," in Proc. 32nd Int. Conf. Mach. Learn. (ICML), vol. 37, Lille, France, Jul. 2015, pp. 1180–1189. [Online]. Available: http://proceedings.mlr.press/v37/ganin15.html
[36] N. Rupasinghe, A. S. Ibrahim, and I. Guvenc, "Optimum hovering locations with angular domain user separation for cooperative UAV networks," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Washington, DC, USA, Dec. 2016, pp. 1–6, doi: 10.1109/GLOCOM.2016.7842113.
[37] B. Zoph, D. Yuret, J. May, and K. Knight, "Transfer learning for low-resource neural machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Austin, TX, USA, Nov. 2016, pp. 1568–1575. [Online]. Available: https://www.aclweb.org/anthology/D16-1163
[38] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol. (NAACL-HLT), vol. 1, Minneapolis, MN, USA, Jun. 2019, pp. 4171–4186. [Online]. Available: https://www.aclweb.org/anthology/N19-1423
[39] H. Salman, A. Ilyas, L. Engstrom, A. Kapoor, and A. Madry, "Do adversarially robust ImageNet models transfer better?" in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2020. [Online]. Available: https://papers.nips.cc/
[40] A. Gupta, R. Lebret, H. Harkous, and K. Aberer, "Taxonomy induction using hypernym subsequences," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Singapore, Nov. 2017, pp. 1329–1338, doi: 10.1145/3132847.3133041.
[41] J. Pennington, R. Socher, and C. Manning, "Glove: Global vectors for word representation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1532–1543.
[42] A. Akbik, T. Bergmann, and R. Vollgraf, "Pooled contextualized embeddings for named entity recognition," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol. (NAACL-HLT), vol. 1, Minneapolis, MN, USA, Jun. 2019, pp. 724–728. [Online]. Available: https://doi.org/10.18653/v1/n19-1078
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[44] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980

JUAE KIM was born in Seoul, South Korea, in 1993. She received the B.S., M.S., and Ph.D. degrees in computer engineering from Sogang University, Seoul, in 2015, 2017, and 2021, respectively. Since February 2021, she has been a Senior Research Engineer with the Hyundai Motor Group, AIRS Company. Her research interests include natural language processing, question-answering systems, information extraction, and machine learning (including deep learning).

YEJIN KIM received the B.S. and M.S. degrees in computer science from Sogang University, Seoul, South Korea, in 2013 and 2015, respectively. She has been a Researcher with the Artificial Intelligence Laboratory, LG Electronics, since 2015. She is currently working on data generation for sequence-tagging tasks. Her primary research interest includes natural language processing, particularly information extraction in low-resource conditions, such as with user-generated text from social-networking services.

SANGWOO KANG received the Ph.D. degree in computer science from Sogang University. He was a Research Fellow Professor with Sogang University. He has been an Assistant Professor with the School of Computing, Gachon University, since September 2016. He is currently leading the Natural Language Processing Laboratory, Gachon University. His specialty is natural language processing, and he is interested in spoken dialogue interfaces, information retrieval, text mining, opinion mining, big data, and UI/UX. His recent research interest includes applying deep-learning techniques.

JUNGYUN SEO received the B.S. degree in mathematics and the M.S. and Ph.D. degrees in computer science from The University of Texas at Austin, in 1981, 1985, and 1990, respectively. In 1991, he returned to join the Korea Advanced Institute of Science and Technology, Daejeon, as a Faculty Member, where he headed the Natural Language Processing Laboratory, Department of Computer Science. In 1995, he moved to Sogang University, Seoul, and became a Full Professor, in 2001. He served as the President of the Korea Information Science Society, in 2013. His research interests include multi-modal dialogues, statistical methods for NLP, machine translation, and information retrieval.