
MedNER: Enhanced Named Entity Recognition in Medical Corpus via Optimized Balanced and Deep Active Learning

Published: 26 October 2024

Abstract

Ever-growing electronic medical corpora provide unprecedented opportunities for researchers to analyze patient conditions and drug effects. Meanwhile, severe challenges have emerged in processing large-scale electronic medical records. First, emerging medical terms, including informal descriptions, are difficult to recognize. Moreover, although deep models can help extract entities from medical texts, they require large-scale labels, which are time-intensive to obtain and not always available in the medical domain. When massive unseen concepts appear or labeled data is insufficient, the performance of existing algorithms suffers an intolerable decline. In this article, we propose a balanced and deep active learning framework for Medical Named Entity Recognition (MedNER) to alleviate these problems. Specifically, to describe our selection strategy precisely, we first define the uncertainty of a medical sentence as the labeling loss predicted by a loss-prediction module and define diversity as the least text distance between pairs of sentences in a sample batch, computed from word-morpheme embeddings. Furthermore, to make a trade-off between uncertainty and diversity, we formulate a Distinct-K optimization problem that maximizes the least uncertainty and diversity of the chosen sentences. Finally, we propose a threshold-based approximation selection algorithm, Distinct-K Filter, which selects the most beneficial training samples by balancing diversity and uncertainty. Extensive experimental results on real datasets demonstrate that MedNER significantly outperforms existing approaches.

1 Introduction

Entity extraction is a fundamental component in constructing medical data platforms, which are the crucial systems supporting the new research paradigm of big data in healthcare. A wide range of hospital applications can be built on it; for instance, it enables better structuring of electronic medical records and harnesses the potential of long medical texts. Furthermore, precise extraction of entities from clinical guidelines can help healthcare professionals comprehend guidelines swiftly and extract diagnosis and treatment pathways, thus contributing to the development of clinical decision support systems. Researchers have performed entity extraction on various sources, such as Medline case reports [37], X (formerly known as Twitter) posts [9], WebMD doctor consultations [12, 29], and Medicaid Electronic Health Records [20, 25, 31, 33]. Entity extraction in the medical field places higher demands on traditional methods. It is widely used for discovering unreported Adverse Drug Reactions (ADRs). Previous deep learning models [10, 39, 44] captured ADRs merely by identifying co-occurrences in sentences that describe medicines and ADRs, but they ignored complex semantic information such as diseases and symptoms. This complex semantic information is often conveyed through latent associations, particularly within medical terminologies. Medical Named Entity Recognition (MedNER) addresses this oversight by capturing such complex semantic information and integrating it with word-level semantic insights. By learning the semantic meanings of these factors, we reduce ambiguity surrounding entities, for instance, distinguishing between symptoms that are merely present after taking medicines and those that are actual ADRs. We incorporate extensive useful factors and divide them into four major categories: (1) Disease, (2) Symptom, (3) Medicine, and (4) Adverse Reaction. As shown in Figure 1, we formulate medical entity extraction as a one-pass Named Entity Recognition (NER) task in Natural Language Processing (NLP).
Fig. 1.
Fig. 1. Labeled medical sentence (b-/m-/e- indicate the position of a word in an entity and s- indicates a single-word entity. The phrase in red, “degenerative arthritis,” is the patient’s Disease. The Symptom, “spotty sleep,” is in green. The Medicines, “trazodone” and “nimesulide,” are colored in blue. The tokens in yellow, “weird dreams” and “night sweats,” are Adverse Reactions. O is for Other Objects and P is for Punctuation).
The past decade has witnessed the remarkable development of machine learning, which has achieved tremendous success in various downstream application areas. However, existing methods typically require large-scale training data with manual labels, and their performance suffers an unendurable decline when annotated data is insufficient, which significantly restricts their applicability [7]. Unfortunately, the amount of annotated data remains inadequate for complicated real-world applications, especially in the realm of medicine. Furthermore, obtaining annotated data is often extremely tedious and time-consuming, and in some specific domains it is practically unavailable. Consequently, preexisting approaches based on pre-trained techniques often struggle to break away from data dependencies, leading to suboptimal performance.
As plentiful new or informal terminologies emerge continuously, drug-tracing databases have to be updated dynamically. The limitations of traditional NER techniques arise from their reliance on predefined rules and patterns, which may not adequately capture the complexity and variability of natural language. These techniques cannot handle out-of-vocabulary words or phrases that were not present in the training data. As a result, they may fail to recognize and classify newly found entities accurately [5]. To address this, we propose morpheme embedding to recognize new entities based on morphemes shared with known terminologies. A morpheme is a root or affix of a word, conveying more information than a single character. For instance, if we know about myocarditis and lymphadenitis, it is natural to guess that arthritis in Figure 1 is another kind of inflammation from its suffix -itis.
An additional challenge is that medical annotation tasks require professional staff, which is both costly and time-consuming. To improve efficiency, active learning is applied to annotate the most valuable samples. Entity extraction by active learning methods often employs criteria of either uncertainty [2, 35] or diversity [34] to guide the selection of informative samples for labeling. These criteria are mathematically based on entropy and symmetric Kullback-Leibler (KL) divergence [45]. The uncertainty criterion measures the lack of confidence or predictability in the model’s predictions. Samples with high uncertainty are considered more informative, as they give the model an opportunity to learn from challenging or ambiguous instances. The diversity criterion aims to ensure that the selected samples cover a wide range of patterns or concepts in the data. It helps prevent overfitting to specific instances and promotes generalization. By maximizing the KL divergence between selected samples, active learning methods can encourage the exploration of different regions in the data space. Evaluating a single criterion ignores the other, making it difficult to improve overall performance comprehensively. By combining uncertainty and diversity, active learning methods can effectively select samples that are both informative and representative of the underlying data distribution, enabling efficient training of models with limited labels. We make contributions in two aspects: (1) to merge the various modules, we compute the two criteria at each layer of the model in the current training stage; and (2) to balance uncertainty and diversity, we define a Distinct-K optimization problem that maximizes the least uncertainty and diversity of the chosen unlabeled samples. In summary, the main contributions are as follows:
We incorporate multiple factors in the medical corpus and formulate a one-pass NER task to extract terminologies. Morpheme embedding is proposed to predict new entities; it is a general embedding for both English and Chinese medical corpora.
We propose a balanced and deep active learning framework (MedNER), including an NER model and a threshold-based approximation selection strategy Distinct-K Filter.
We demonstrate MedNER’s superior performance on real English and Chinese datasets, and it can easily be scaled to other text mining tasks.
The rest of this article is organized as follows. In Section 2, we introduce the preliminaries and related work. Section 3 describes the overall framework. The details of our model are discussed in Section 4. Section 5 presents the procedures of Distinct-K Filter. Experimental results are demonstrated in Section 6, and Section 7 concludes the article.

2 Preliminaries

In this section, we first introduce our NER task. Then, we present our MedNER framework and formally define the Distinct-K optimization problem. Finally, we review related work.

2.1 NER Task

NER is a task to identify text spans and classify them into predefined entity types [23]. Our NER task is to split the medical corpus into word sequences and annotate labels and text span of multi-class entities. The deep model will be trained on the annotated dataset and utilized to extract medical terminologies.
Definition 1
(Word Sequence). A word sequence \(t\) is generated by splitting a medical sentence word by word. \(|t|\) denotes the number of words in the sentence: \(t=\{w_{1},w_{2},\dots,w_{|t|}\}\).
Definition 2
(Labels for Annotation). The labels for annotation can be categorized into two types: prefix and tag (see Table 1). A prefix label indicates where the token is located in the current entity (i.e., “b-” for the starting position of an entity, “m-” for a middle position, “e-” for an ending position, and “s-” for a single-word entity). A tag label indicates the entity class (i.e., “D” for disease, “M” for medicine, “S” for symptom, “A” for adverse reaction, “P” for punctuation, “O” for other objects). The whole label set \(\Omega\) is defined as follows, where \(\times\) is the Cartesian product: \(\Omega=(\{b\text{-},m\text{-},e\text{-},s\text{-}\}\times\{D,M,S,A\})\cup\{P,O\}\).
Table 1. Labels for Annotation

| Label | Type | Detail | Label | Type | Detail |
|---|---|---|---|---|---|
| D | Tag | Disease | b- | Prefix | Beginning of Entity |
| M | Tag | Medicine | m- | Prefix | Middle of Entity |
| S | Tag | Symptom | e- | Prefix | Ending of Entity |
| A | Tag | Adverse reaction | s- | Prefix | Single-word Entity |
| P | Tag | Punctuation | | | |
| O | Tag | Other objects | | | |
Definition 3
(NER Problem). Given a word sequence \(t=\{w\}\), the embedding of word \(w\) is written as \(x_{w}\). The NER problem is to predict an optimal label sequence \(\{y_{w}\}\) according to the input sequence \(\{x_{w}\}\). The ground-truth label sequence is written as \(\{Y_{w}\}\). Both \(y_{w}\) and \(Y_{w}\) belong to the label set \(\Omega\).
For example, in Figure 1, we label “trazodone” as \({\lt}\)s-M\({\gt}\) and label “night sweats” as \({\lt}\)b-A, e-A\({\gt}\). An entity is labeled correctly only when its beginning and ending positions are both located exactly.

2.2 Balanced and Deep Active Learning

To train deep models more efficiently on limited annotated data, we want to obtain “high-quality” training samples. Conventional active learning selects the Top-K samples with the largest uncertainty. However, it may select similar samples, leading to redundancy and large overhead. To address this problem, we make a trade-off between uncertainty and diversity and select Distinct-K samples.
The active learning problem is formulated as follows. \(L\) denotes the number of training stages and is initialized to 0. \(C_{(L)}\) denotes the unlabeled set after the \(L\)-th training stage. The selection strategy of active learning is to select a subset \(M_{(L)}\) of size \(k\) from \(C_{(L)}\). Our MedNER framework differs in that it computes uncertainty and diversity by exploiting deep features from almost the whole model instead of using only the last prediction layer. First, the uncertainty \(\mathcal{U}(t)\) of a sentence \(t=\{w\}\) is measured by the model’s labeling loss \(l(t)=-\frac{1}{|t|}\sum_{w\in t}{\log P(Y_{w}|x_{w})}\), where \(\{x_{w}\}\) denotes the input sequence and \(\{Y_{w}\}\) denotes the ground-truth labels. Since we cannot compute the real loss of unlabeled data, we use a loss-prediction module as in [43] to provide a predicted loss \(\hat{l}(t)\): \(\mathcal{U}(t)=\hat{l}(t)\).
Second, the diversity of sentence \(t_{i}\) is defined as the least text distance to the other sentences in \(M_{(L)}\), denoted by \(\mathcal{D}(t_{i})\). \(\pi(t_{i},t_{j})\) denotes the text distance between \(t_{i}\) and \(t_{j}\): \(\mathcal{D}(t_{i})=\min_{t_{j}\in M_{(L)},\,t_{i}\neq t_{j}}{\pi(t_{i},t_{j})}\).
We measure \(\pi(t_{i},t_{j})\) by Word Mover’s Distance (WMD) [21] with the word-morpheme embeddings \(E\) of the current deep model. WMD decomposes the distance between medical sentences into sparse distances between terminologies in the embedding space. \(E\in R^{|Vob|\times m}\), where \(|Vob|\) is the size of the vocabulary and \(m\) is the dimensionality of the embeddings: \(\pi(t_{i},t_{j})={\text{\it WMD}}(t_{i},t_{j},E)\).
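To make the distance computation concrete, the snippet below sketches a relaxed lower bound on WMD with uniform word weights, a common simplification rather than the exact WMD used above; the function name, the `emb` matrix, and the `vocab` lookup are illustrative assumptions.

```python
import numpy as np

def relaxed_wmd(tokens_i, tokens_j, emb, vocab):
    """Relaxed lower bound on WMD with uniform word weights (a
    simplification of the exact pi(t_i, t_j) defined above)."""
    # Look up an embedding row for every token of the two sentences.
    X = np.stack([emb[vocab[w]] for w in tokens_i])   # (|t_i|, m)
    Y = np.stack([emb[vocab[w]] for w in tokens_j])   # (|t_j|, m)
    # Pairwise Euclidean distances between the words of the two sentences.
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # Each word "moves" to its nearest counterpart; symmetrize the bound.
    return max(D.min(axis=1).mean(), D.min(axis=0).mean())
```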
It is noted that \(\mathcal{U}(t_{i})\) and \(\mathcal{D}(t_{i})\) will be mapped into range \((0,1)\) by sigmoid normalization function respectively: \(norm(z)=\frac{1}{1+{\rm e}^{-\big{(}z-\frac{z_{min}+z_{max}}{2}\big{)}}}\).
Distinct-K Optimization Problem. The goal of our selection strategy is to maximize the least uncertainty and diversity of the sample batch \(M_{(L)}\) as follows: \(f(M_{(L)})=\min_{t_{i}\in M_{(L)}}{\min\big{(}\mathcal{U}(t_{i}),\mathcal{D}(t _{i})\big{)}}\).
The optimal sample batch to be annotated for \((L+1)\)-th stage is denoted as \(M^{*}_{(L)}\), with maximum least uncertainty and diversity: \(M^{*}_{(L)}=\mathop{\arg\max}_{\begin{subarray}{c}M_{(L)}\subseteq C_{(L)},|M_ {(L)}|=k\end{subarray}}{f(M_{(L)})}\).
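As a minimal sketch of how this objective can be evaluated for a candidate batch, assume `uncertainty` maps each sentence to its predicted loss and `distance` implements \(\pi\); normalization is done within the batch here purely for illustration.

```python
import numpy as np

def sigmoid_norm(scores):
    """Sigmoid normalization around the midpoint of the score range,
    mapping raw scores into (0, 1) as in the norm(z) definition."""
    z = np.asarray(scores, dtype=float)
    mid = (z.min() + z.max()) / 2.0
    return 1.0 / (1.0 + np.exp(-(z - mid)))

def batch_score(batch, uncertainty, distance):
    """f(M) = min over t_i in the batch of min(U(t_i), D(t_i)), where
    D(t_i) is the least distance from t_i to any other batch member."""
    u = sigmoid_norm([uncertainty[t] for t in batch])
    d = sigmoid_norm([min(distance(batch[i], batch[j])
                          for j in range(len(batch)) if j != i)
                      for i in range(len(batch))])
    return float(np.min(np.minimum(u, d)))
```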
Proof of NP-hardness. We can transform the Distinct-K optimization problem into a graph problem by viewing each sample \(t_{i}\) as a vertex and \(\pi(t_{i},t_{j})\) as a weighted edge. Under extreme conditions, we set all \(\mathcal{U}(t_{i})=1\) so that uncertainty can be ignored. Then, we remove all edges whose weights are less than a given threshold \(\theta\). The problem becomes whether we can find an independent set of size \(k\) in the graph for the given \(\theta\). In an independent set, no two different vertices are adjacent to each other. Since the independent set decision problem is NP-complete, the original problem is NP-hard. In Section 5, we propose an approximate solution by solving a threshold-based maximum independent set (MIS) problem.

2.3 Related Works

Word Representation. Text representation has been a fundamental part of NLP. Lexicon-based methods [41] manually collect concepts and simply search for keywords in documents. Manning et al. [26] represent each word as a one-hot vector of vocabulary size. These methods do not take semantic relatedness into account. Word embedding was proposed for NER in [36] to capture semantic information. Mikolov et al. [28] propose the continuous bag-of-words (CBOW) model, which combines context words to predict the target word. Recently, character embedding [5] and subword representation [3] have been proposed to supplement word embedding in word similarity and analogical reasoning tasks. Note that a character can embed rich internal meanings in a language like Chinese. However, English characters cannot convey as rich information as Chinese characters do, as the English alphabet has only 26 uppercase/lowercase letters. We observe that two similar medical terminologies often contain the same root or affix. We therefore propose word-morpheme embedding, which is a more general embedding for medical corpora and helps predict newly found entities.
NER. Early studies relied on hand-crafted features. The ProMiner system [16] recognizes entities by matching multi-word names against protein and gene identifiers from preprocessed synonym dictionaries. Soon after, Lafferty et al. introduced conditional random fields (CRFs) to address the label bias problem [22].
However, recurrent neural networks (RNNs) based on deep features show advantages on variable-length input, such as machine translation [8], language modeling [27], and speech recognition [13]. The reason is that Long Short-Term Memory (LSTM) units retain long-term memories of previous time steps and perform better in determining the output of the current step. Chiu et al. [6] combine word and character representations in a hybrid bidirectional LSTM (BiLSTM) and convolutional neural network (CNN) architecture. In this article, we adopt the encoder-decoder architecture of [1] with an attention mechanism. The attention distribution indicates where to attend by capturing context information [9, 24].
Active Learning. Active learning aims to achieve better performance with limited labeled data by optimally selecting \(k\) samples from a pool of \(n\) unlabeled data. There are two mainstream approaches: (1) selecting samples with the most uncertainty [2, 43] and (2) selecting an optimal subset based on diversity [14, 34]. The uncertainty-based framework prefers hard samples but easily falls into outliers. The diversity-based framework selects uniformly distributed samples in the global space but cannot filter out easy samples that are scarcely useful to the current model.
Pioneering research on combining the two criteria [11, 42] uses instance-mode active learning, which chooses a single data point at a time, leading to sub-optimal performance. However, the emergence of crowdsourcing platforms, such as Amazon Mechanical Turk, makes simultaneous multi-worker labeling a reality. Thus, batch mode active learning (BMAL) has been proposed in many fields, including text mining [17], image classification [4, 42], and domain adaptation [19]. Taking the state-of-the-art method BatchRand [4] as an example, it maximizes the summation of uncertainty and diversity and provides a solution based on semi-definite programming relaxation. Unfortunately, it ignores the weighting of uncertainty and diversity and may choose samples with high diversity but low uncertainty.
Another problem to be reckoned with is that most active learning methods measure the criteria using only the prediction layer of the deep model, such as maximum normalized log-probability for uncertainty [35] and symmetric KL divergence for diversity [45]. Our MedNER framework provides a feasible and efficient solution to criteria measurement and active selection.

3 Framework

Our MedNER framework includes an NER model and a threshold-based approximation selection strategy, Distinct-K Filter. The deep model is trained iteratively. As shown in Figure 2, suppose it is the \(L\)-th training stage and we have to select \(k\) samples from the remaining unlabeled data \(C_{(L)}\). We compute uncertainty with the loss-prediction module and diversity with WMD based on word-morpheme embeddings. Then, we select an optimal \(k\)-size batch \(M_{(L)}\) using Distinct-K Filter. Finally, \(M_{(L)}\) is annotated manually and is input to the model in the \((L+1)\)-th training stage.
Fig. 2.
Fig. 2. Balanced and deep active learning framework (The unlabeled data is denoted by \(C_{(L)}\) . Uncertainty is predicted by loss-prediction module (in blue). Diversity is computed by WMD (in yellow)).
NER Model. The NER model consists of embedding module, sequence-labeling module, and loss-prediction module. The architecture of our NER model is shown in Figure 3.
Fig. 3.
Fig. 3. NER architecture (morphemes of sizes 1 to 5 are used; for simplicity, only morpheme-1, 2, 3 are drawn).
First, a sentence is input into the embedding module as a word sequence. We take ‘ache’ as a whole to encode its semantic information into word embedding and split the word into morphemes to encode its internal meanings. We concatenate word embedding and its morpheme embeddings as final embedding vectors.
Second, in the sequence labeling module, we utilize a BiLSTM to encode word vectors into hidden states. For decoder, we utilize a uni-directional RNN layer with an attention mechanism to decode hidden states, producing outputs for labeling task.
Third, we attach the loss-prediction module to the sequence-labeling module to predict the labeling loss of an unlabeled sentence. It reduces the middle outputs between layers, colored blue in Figure 3, to feature vectors via a fully connected layer and a ReLU function. The feature vectors are then concatenated and passed through another fully connected layer to predict the labeling loss. We train the sequence-labeling module and the loss-prediction module jointly, so we define the training loss of the whole model as the summation of the labeling loss and the loss-prediction loss.
Selection Strategy. We fully leverage deep features to measure uncertainty and diversity. For uncertainty, the loss-prediction module is utilized to estimate how well the model at this stage predicts the sentences. For diversity, WMD [21] is utilized to compute the text distance between sentences because it can filter out similar sentences at the semantic level. Note that we do not need to consider the text distance between unlabeled samples in the current stage and labeled samples in previous stages, because if an unlabeled sentence \(t_{i}\) is similar to a labeled sentence \(t_{j}\), the model has learned the information of \(t_{j}\) in previous stages and will predict accurately on \(t_{i}\). The loss-prediction module will then predict a minor loss for sentence \(t_{i}\), so \(t_{i}\) can scarcely be selected.
The Distinct-K optimization problem is an NP-hard problem. We approximate it with a threshold-based MIS and then figure out an optimal threshold by searching its lower/upper bounds, which will be discussed in detail in Section 5.
Our framework is novel and efficient for two reasons. First, we merge the deep model and active learning by computing uncertainty and diversity via the loss-prediction module and the word-morpheme embedding module, respectively. We take almost all layers into consideration and make the best of deep features. Second, Distinct-K Filter maximizes the least uncertainty and diversity simultaneously.

4 NER Model

The NER architecture is described in Figure 3. We will discuss the word-morpheme embedding module, sequence-labeling module, and loss-prediction module, respectively.

4.1 Word-Morpheme Embedding Module

Word embeddings can help capture semantic relationships between medical terms, such as drug effects on diseases, while morpheme embeddings carry more internal meanings of entities and help discover new terms with known morphemes. Thus, we represent a word using the concatenation of word embeddings and morpheme embeddings.
Word Embedding. The deep model learns the representation of a word and encodes its information into a low-dimensional vector \(e(w)\in R^{m}\). \(m\) is defined as the dimension of word vector.
Morpheme Embedding. Considering that medical terminologies with the same root or affix are usually neighboring words or even synonyms, we propose morpheme embedding to supplement word embeddings. In English, we observe that “itis” may indicate inflammation (e.g., enteritis, myocarditis), “dysp” may indicate inability (e.g., dyspepsia, dyspnea), “ule” may indicate erythema (e.g., macule, papule, pustule), and so on. In Chinese, ‘疹’ may indicate skin symptoms (e.g., 风疹-rubella, 斑丘疹-maculopapule) and ‘注射液’ may indicate injection (e.g., 氯化钠注射液-sodium chloride injection).
For simplicity, we only consider morphemes of length \(d\) from 1 to 5 because most morphemes are not very long. Of course, morphemes of the same size may have different influences on the English and Chinese corpora, and we compare them in the experiments. Given a word \(w=\{c_{1},c_{2},\dots,c_{|w|}\}\), where \(c\) is a character, and \(d=3\), we obtain a morpheme sequence \(mor^{w,3}\) as follows, where \(\#\) denotes an empty character that distinguishes \(\#c_{1}c_{2}\in mor^{w,3}\) from \(c_{1}c_{2}\in mor^{w,2}\): \(mor^{w,3}=\{\#c_{1}c_{2},c_{1}c_{2}c_{3},\dots,c_{|w|-1}c_{|w|}\#\}\).
We treat each morpheme just as a word to learn morpheme embedding \(e(mor^{w,d}_{i})\). The word representation contributed by a \(d\)-size morpheme sequence is denoted as \(\tau^{w,d}\) and \(d\in\{1,2,3,4,5\}\): \(\tau^{w,d}=\sum_{i=1}^{|w|}e(mor^{w,d}_{i})\).
Finally, the embedding vector \(x_{w}\) of word \(w\) is defined as the concatenation of word vector and five morpheme vectors: \(x_{w}=[e(w),\tau^{w,1},\tau^{w,2},\tau^{w,3},\tau^{w,4},\tau^{w,5}]\).
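A minimal sketch of how \(x_{w}\) can be assembled follows; the vocabularies `word2id` and `mor2id`, the per-part dimension, and the symmetric ‘#’ padding for even sizes are illustrative assumptions beyond the \(mor^{w,3}\) example above.

```python
import torch
import torch.nn as nn

def morphemes(word, d):
    """Contiguous character n-grams of size d; '#' padding follows the
    mor^{w,3} example so that a word of length |w| yields |w| morphemes."""
    left = (d - 1) // 2
    right = d - 1 - left
    padded = "#" * left + word + "#" * right
    return [padded[i:i + d] for i in range(len(word))]

class WordMorphemeEmbedding(nn.Module):
    """x_w = [e(w), tau^{w,1}, ..., tau^{w,5}]; word2id / mor2id are
    assumed pre-built vocabularies, and dim is a per-part dimension."""
    def __init__(self, word2id, mor2id, dim):
        super().__init__()
        self.word2id, self.mor2id = word2id, mor2id
        self.word_emb = nn.Embedding(len(word2id), dim)
        self.mor_emb = nn.Embedding(len(mor2id), dim)

    def forward(self, word):
        parts = [self.word_emb(torch.tensor(self.word2id[word]))]
        for d in range(1, 6):                       # morpheme sizes 1..5
            ids = torch.tensor([self.mor2id[g] for g in morphemes(word, d)])
            parts.append(self.mor_emb(ids).sum(dim=0))   # tau^{w,d}
        return torch.cat(parts)                     # final embedding x_w
```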
The word-morpheme embedding module lays the foundation of our NER model. Moreover, the embeddings we obtained in this step will be utilized to compute the text distance between sentences.

4.2 Sequence Labeling Module

The sequence-labeling module is the target model in our task. It first encodes the word-morpheme embeddings and then decodes hidden states to predict the tag of each word. Besides, it provides middle outputs for the loss-prediction module described in Section 4.3.
Encoder. We pass the input sequence \(\{x_{i}\}\) through a BiLSTM, which encodes contextual embeddings for each time step: \(\overrightarrow{h}_{i}=\overrightarrow{LSTM}(\overrightarrow{h}_{i-1},x_{i})\), \(\overleftarrow{h}_{i}=\overleftarrow{LSTM}(\overleftarrow{h}_{i+1},x_{i})\), \(h_{i}=[\overrightarrow{h}_{i},\overleftarrow{h}_{i}]\).
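A minimal PyTorch sketch of such an encoder is shown below; the default sizes are illustrative (they mirror the settings reported in Section 6) rather than the exact configuration.

```python
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """BiLSTM encoder producing h_i = [forward h_i, backward h_i]."""
    def __init__(self, input_dim=200, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, x):              # x: (batch, seq_len, input_dim)
        h, _ = self.lstm(x)            # h: (batch, seq_len, 2 * hidden_dim)
        return h
```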
Decoder. Since a fixed-length decoder vector is a bottleneck, we utilize a uni-directional LSTM with the attention mechanism of [1]: the attention mechanism helps the model make predictions based on context vectors gathered from the LSTM hidden states.
We define the alignment vector \(A_{ij}\) as how well the input vector at position \(j\) and the output vector at position \(i\) match. \(h_{j}\) is the encoder hidden state at time step \(j\) and \(s_{i-1}\) is previous decoder hidden state. And the attention distribution \(\alpha_{ij}\) is calculated by softmax function. \(\alpha_{ij}\) reflects the importance of \(h_{j}\) with respect to \(s_{i-1}\) in determining the hidden state in current step \(s_{i}\): \(A_{ij}=align(s_{i-1},h_{j})\), \(\alpha_{ij}=\frac{e^{A_{ij}}}{\sum_{k=1}^{|t|}e^{A_{ik}}}\). And we compute \(con_{i}\) as the attention weighted sum of hidden states \(h_{i}\). The context vector \(con_{i}\) focuses on \(i\)-th word in decoder and gathers semantic information from the whole input sequence: \(con_{i}=\sum_{j=1}^{|t|}\alpha_{ij}h_{j}\).
Next, we compute the decoder hidden state \(s_{i}\) based on the decoder state on previous time-step \(s_{i-1}\), the label predicted on previous step \(y_{i-1}\) and the context vector on current step \(con_{i}\): \(s_{i}=net(s_{i-1},y_{i-1},con_{i})\).
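A minimal sketch of one such decoding step follows; `align` stands for any scoring network producing \(A_{ij}\) and is an assumption, not the exact architecture.

```python
import torch
import torch.nn.functional as F

def attention_step(s_prev, H, align):
    """One decoding step: scores A_ij = align(s_{i-1}, h_j), attention
    weights alpha via softmax, and the context vector con_i as the
    alpha-weighted sum of encoder states H (shape: seq_len x enc_dim)."""
    scores = torch.stack([align(s_prev, h_j) for h_j in H])   # (seq_len,)
    alpha = F.softmax(scores, dim=0)                          # attention distribution
    context = (alpha.unsqueeze(1) * H).sum(dim=0)             # con_i
    return context, alpha
```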
Finally, the sequence-labeling module outputs the label sequence \(\{y_{w}\}\) for the word sequence \(t=\{w\}\). It also outputs the labeling loss \(l(t)\) mentioned in Section 2.2.

4.3 Loss-Prediction Module

In order to estimate the labeling loss without labels, we attach a loss-prediction module as in [43] to predict the labeling loss. However, we use a logMSE function to smooth the scale change and gradient of the MSE loss.
As shown in Figure 3, we add a fully connected layer and a ReLU function after each middle output to generate feature vectors and produce a predicted labeling loss \(\hat{l}(t)\). The rationale is that features from the middle outputs contain information about how well the current model understands the input sentence.
Since the loss-prediction module is trained jointly with our target model, its negative disturbance should be reduced as much as possible. Therefore, we carefully consider how to define the loss of the loss-prediction module, i.e., the loss-prediction loss \(\Gamma(l,\hat{l})\). The conventional MSE loss \((l-\hat{l})^{2}\) does not work well because its rough scale changes disturb the training of the sequence-labeling module. The gradient of MSE is derived as follows: \(\frac{\partial(\hat{l}-l)^{2}}{\partial\hat{l}}=2(\hat{l}-l)\).
The difference between \(\hat{l}\) and \(l\) may change sharply during warm-up epochs, which is inevitable when training a model from scratch. In view of this, we propose the logMSE function to smooth the MSE gradient: \(\Gamma(l,\hat{l})={\text{logMSE}}(l,\hat{l})=\ln(1+(\hat{l}-l)^{2})\).
The gradient of the logMSE function is derived below. When \(|\hat{l}-l|\) is large, the gradient of logMSE does not exceed 1; when \(|\hat{l}-l|\) becomes small enough, the gradient is close to that of the MSE loss. Consequently, it does not change drastically and disturb the target model: \(\frac{\partial\ln(1+(\hat{l}-l)^{2})}{\partial\hat{l}}=\frac{2(\hat{l}-l)}{1+(\hat{l}-l)^{2}}\).

4.4 Training Loss

Finally, we are going to discuss the training loss of the whole model, including sequence-labeling module and loss-prediction module.
The proposed logMSE function is adopted, and the training loss \(l_{train}\) is calculated as \(l_{train}=l+\Gamma(l,\hat{l})\). Having described the NER model, we discuss our active learning strategy in Section 5.
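A minimal PyTorch sketch of the logMSE loss and the joint training loss follows; detaching the labeling-loss target is our own implementation assumption, in the spirit of [43].

```python
import torch

def log_mse(l_true, l_pred):
    """logMSE(l, l_hat) = ln(1 + (l_hat - l)^2); its gradient is bounded
    by 1 in magnitude, so the auxiliary branch disturbs training less."""
    return torch.log1p((l_pred - l_true) ** 2)

def training_loss(labeling_loss, predicted_loss):
    """l_train = l + Gamma(l, l_hat); detaching the labeling-loss target
    (an assumption of this sketch) keeps the loss-prediction branch from
    back-propagating through the labeling loss itself."""
    return labeling_loss + log_mse(labeling_loss.detach(), predicted_loss)
```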

5 Active Learning

In this section, active learning is introduced to train our deep model at lower cost and with higher efficiency. We propose Distinct-K Filter, which approximates the Distinct-K optimization problem with a threshold-based MIS. We then discuss how to search the lower and upper bounds of the optimal threshold and prove the correctness of this search. We also analyze the comparative ratio between the result of Distinct-K Filter and the optimal result.

5.1 Selection by Distinct-K Filter

Suppose the unlabeled data is \(C_{(L)}\) after the \(L\)-th training stage. We approach the Distinct-K optimization problem in Section 2.2 by constructing a threshold-based graph \(G(C_{(L)},\theta)\) accordingly.
First, we model each sentence as a vertex with value \(\hat{l}(t_{i})\) and the distance between each pair of samples as an edge with weight \(\pi(t_{i},t_{j})\), thus generating the undirected graph \(G(C_{(L)},\theta)\). Then we remove vertices with \(\hat{l}(t_{i}){\lt}\theta\) and remove edges satisfying \(\pi(t_{i},t_{j})\geq\theta\). If we now choose vertices that are not adjacent to each other, it is easy to see that the distance between any two of them is no less than \(\theta\). A set of vertices that are not adjacent to each other is an independent set of graph \(G\), defined below.
Definition 4
(Independent Set). An independent set \(I\) is a subset of a graph \(G\)’s vertices that for any two different vertices \(u,v\in I\), there is no edge in \(G\) connecting \(u\) and \(v\) directly. An independent set \(I\) of \(G\) is denoted as \(I\rightarrow G\).
Definition 5
(Maximum Independent Set). A maximum independent set \(I\) of a graph \(G\) means that for any independent set \(I^{\prime}\) of \(G\), the size of \(I\) or the number of vertices in \(I\) is no less than that of \(I^{\prime}\). A maximum independent set \(I\) of \(G\) is denoted as \(I\mapsto G\).
Example 1. As shown in Figure 4, vertices of independent sets are colored in red. In (a), \(V_{1}=\{B,C,J,H\}\rightarrow G\) and its size is 4. In (b), \(V_{3}=\{B,C,D,G,I,K,L\}\mapsto G\) because we cannot find an independent set of size larger than 7.
Fig. 4.
Fig. 4. Examples of independent sets (vertices of independent sets are painted in red).
Obviously, a vertex set \(M_{(L)}\) of size \(k\) is a solution to our Distinct-K problem, and what remains is to find one with the largest threshold \(\theta\). We achieve this by running the Distinct-K Filter step by step.
1. Construct a graph. We view each sample \(t_{i}\) from \(C_{(L)}\) as a vertex and each pair \((t_{i},t_{j})\) as a weighted edge.
2. Remove edges. We first choose a threshold \(\theta\ (0{\lt}\theta{\lt}1)\) and then remove every edge \((t_{i},t_{j})\) with \(\pi(t_{i},t_{j})\geq\theta\).
3. Remove vertices. We remove a vertex \(t_{i}\) if \(\hat{l}(t_{i}){\lt}\theta\). The result is the graph \(G(C_{(L)},\theta)\).
4. Find a maximum independent set. We compute a maximum independent set \(I\mapsto G(C_{(L)},\theta)\); \(I\) is exactly the largest sample batch satisfying \(\min\big{(}\mathcal{U}(t_{i}),\mathcal{D}(t_{i})\big{)}\geq\theta\) for each \(t_{i}\).
5. Adapt the threshold. If the size of the independent set \(M_{(L)}\) is less than \(k\), we decrease \(\theta\); otherwise, we increase \(\theta\). We re-run the algorithm to find an \(M_{(L)}\) of size \(k\) with a maximal threshold \(\theta\).
The pseudo code for Distinct-K Filter is shown in Algorithm 1. In Section 5.2, we will prove the correctness of the method to adapt threshold \(\theta\). \({{ApprMIS}}(G,k)\) is an approximate solution to the MIS, which will be discussed in Section 5.3.
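Since Algorithm 1 is only referenced here, the following is a minimal sketch of one filtering pass for a fixed threshold; the helper names, the dictionary-based adjacency structure, and the `appr_mis` callback (the greedy approximation of Section 5.3) are our own illustrative choices, and samples are assumed to be hashable identifiers.

```python
def build_graph(samples, loss_pred, distance, theta):
    """Keep vertices with predicted loss >= theta; connect two kept
    vertices iff their text distance is below theta (edges with
    distance >= theta are the ones removed in Step 2)."""
    vertices = [t for t in samples if loss_pred[t] >= theta]
    edges = {t: set() for t in vertices}
    for i, ti in enumerate(vertices):
        for tj in vertices[i + 1:]:
            if distance(ti, tj) < theta:
                edges[ti].add(tj)
                edges[tj].add(ti)
    return edges

def distinct_k_filter(samples, loss_pred, distance, theta, k, appr_mis):
    """One pass of the filter for a fixed theta: an (approximate) maximum
    independent set of the threshold graph, truncated to k samples."""
    graph = build_graph(samples, loss_pred, distance, theta)
    independent = appr_mis(graph)
    return independent[:k]
```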

5.2 Estimating Lower/Upper Bounds of \(\boldsymbol{\theta}\)

Now, we explain how to adapt the threshold \(\theta\) on graph \(G(C_{(L)},\theta)=\{V(C_{(L)},\theta),E(C_{(L)},\theta)\}\) based on our two observations. \(V(C_{(L)},\theta)\) is the vertex set and \(E(C_{(L)},\theta)\) is the edge set. Furthermore, we formulate these two observations into theorems on threshold-based monotonicity and prove them respectively.
Observation 1. We claim that the smaller \(\theta\) becomes, the more vertices graph \(G\) will have.
Example 2. As compared between Figure 5(a) and (b), this conclusion is easy to see. \(\{A,B,C\}\), drawn with dotted lines, exist in \(V(C_{(L)},0.4)\) because their \(\hat{l}\) values are greater than or equal to 0.4. However, they are not found in \(V(C_{(L)},0.5)\) because their \(\hat{l}\) values are smaller than 0.5. The higher \(\theta\) is, the more vertices are removed from \(G\). In other words, \(V(C_{(L)},\theta_{1})\subseteq V(C_{(L)},\theta_{2})\) when \(\theta_{1}{\gt}\theta_{2}\).
Fig. 5.
Fig. 5. Threshold adaption (There are two observations when \(\theta\) changes from 0.5 to 0.4. (1) Vertices \(V(C_{(L)},0.5)\) (a), in blue, is a subset of vertices \(V(C_{(L)},0.4)\) (b). (2) Suppose vertices \(I\) (in red) is an independent set of \(G(C_{(L)},0.5)\) (c), then \(I\) is also an independent set of \(G(C_{(L)},0.4)\) (d). We can even find a larger independent set \(I\cup\{D,E\}\) from \(G(C_{(L)},0.4)\) ).
Theorem 1
(\(V(C_{(L)},\theta)\)-Monotonicity). Given an unlabeled data pool \(C_{(L)}\) and two thresholds \(0\leq\theta_{2}{\lt}\theta_{1}\leq 1\), if \(V(C_{(L)},\theta_{1})\) is the set of vertices in \(G(C_{(L)},\theta_{1})\) and \(V(C_{(L)},\theta_{2})\) is the set of vertices in \(G(C_{(L)},\theta_{2})\), then \(V(C_{(L)},\theta_{1})\) is a subset of \(V(C_{(L)},\theta_{2})\): \(V(C_{(L)},\theta_{1})\subseteq V(C_{(L)},\theta_{2})\).
Proof.
Consider a vertex \(t_{i}\in V(C_{(L)},\theta_{1})\). The remove-vertices operation guarantees that its labeling loss satisfies \(\hat{l}(t_{i})\geq\theta_{1}\). Since \(\theta_{2}{\lt}\theta_{1}\), it is obvious that \(\hat{l}(t_{i})\geq\theta_{1}{\gt}\theta_{2}\), so the vertex \(t_{i}\) also exists in \(V(C_{(L)},\theta_{2})\). Because \(t_{i}\) can be any vertex in \(V(C_{(L)},\theta_{1})\), we have \(V(C_{(L)},\theta_{1})\subseteq V(C_{(L)},\theta_{2})\). □
Observation 2. We claim that given \(\theta_{1}{\gt}\theta_{2}\), if \(I\) is an independent set of \(G(C_{(L)},\theta_{1})\), then \(I\) must be an independent set of \(G(C_{(L)},\theta_{2})\) as well.
Example 3. As shown in Figure 5, we denote the vertex sets in red from (c) and (d) as \(I(C_{(L)},0.5)\) and \(I(C_{(L)},0.4)\), respectively.
Compared to \(G(C_{(L)},0.5)\), \(G(C_{(L)},0.4)\) has 7 new edges, drawn as blue dotted lines, but none of these edges connects two vertices that are both from \(I(C_{(L)},0.5)\). The reason is that a new edge must connect a new vertex (i.e., one of \(\{A,B,C\}\)).
To see this, take two vertices \(F\) and \(G\) from \(I(C_{(L)},0.5)\) as an example. \(F\) and \(G\) are disconnected in \(G(C_{(L)},0.5)\), indicating that the distance \(\pi(F,G)\geq 0.5\). According to Theorem 1, \(F\) and \(G\) both belong to \(V(C_{(L)},0.4)\). When \(\theta_{2}=0.4\), \(F\) and \(G\) are still disconnected because \(\pi(F,G)\geq 0.5{\gt}0.4\). In other words, \(I(C_{(L)},0.5)\) in (c) is still an independent set in (d). Indeed, \(I(C_{(L)},0.5)\cup\{D,E\}\) is an even larger independent set in (d). So we come to the conclusion that \(I(C_{(L)},0.5)\subseteq I(C_{(L)},0.4)\).
Theorem 2
(\(I(C_{(L)},\theta)\)-Monotonicity). Given an unlabeled data pool \(C_{(L)}\) and two thresholds \(0\leq\theta_{2}{\lt}\theta_{1}\leq 1\), if an independent set \(I(C_{(L)},\theta_{1})\rightarrow G(C_{(L)},\theta_{1})\), then we have \(I(C_{(L)},\theta_{1})\rightarrow G(C_{(L)},\theta_{2})\) as well.
Proof.
Now consider any pair of vertices \(t_{i},t_{j}\in I(C_{(L)},\theta_{1})\). \(t_{i},t_{j}\) are also in the set \(V(C_{(L)},\theta_{1})\) since \(I(C_{(L)},\theta_{1})\subseteq V(C_{(L)},\theta_{1})\).
By Theorem 1, we know that \(V(C_{(L)},\theta_{1})\subseteq V(C_{(L)},\theta_{2})\). Thus, \(t_{i},t_{j}\) both belong to \(V(C_{(L)},\theta_{2})\), \(t_{i},t_{j}\in V(C_{(L)},\theta_{2})\).
\(t_{i},t_{j}\in I(C_{(L)},\theta_{1})\) also means that \(\pi(t_{i},t_{j})\geq\theta_{1}\). On the assumption \(\theta_{1}{\gt}\theta_{2}\), we conclude that \(\pi(t_{i},t_{j})\geq\theta_{1}{\gt}\theta_{2}\). \(\pi(t_{i},t_{j}){\gt}\theta_{2}\) means \(t_{i}\) and \(t_{j}\) are not adjacent to each other in \(G(C_{(L)},\theta_{2})\). So \(I(C_{(L)},\theta_{1})\) is also an independent set of \(G(C_{(L)},\theta_{2})\): \(I(C_{(L)},\theta_{1})\rightarrow G(C_{(L)},\theta_{2})\). □
Search lower/upper bounds. So far, we can estimate the bounds of the threshold \(\theta\). We initialize the lower bound \(\theta_{lb}=0\) and the upper bound \(\theta_{ub}=1\). In the \(L\)-th training stage, we find a maximum independent set \(I(C_{(L)},\theta)\) of \(G(C_{(L)},\theta)\). If the size of \(I(C_{(L)},\theta)\) is smaller than \(k\), then we set the upper bound \(\theta_{ub}=\theta\) and choose a smaller threshold \(\theta^{-}\); otherwise, we set \(\theta_{lb}=\theta\) and choose a bigger threshold \(\theta^{+}\). We keep running the algorithm until \(\theta_{ub}-\theta_{lb}\leq\epsilon\), where \(\epsilon\) denotes the gap limit. We adopt a binary search, which is guaranteed to reach this gap within \(\log_{2}\frac{1}{\epsilon}\) iterations. The pseudo code for bound estimation is shown in Algorithm 2, and the comparative ratio (CPR) can be bounded as: \(\theta=\theta_{lb}\leq\theta^{*}{\lt}\theta_{ub}\leq\theta+\epsilon\), \(CPR=\frac{f(M^{\text{\it DKF}})}{f(M^{*})}=\frac{\theta}{\theta^{*}}{\gt}\frac{\theta}{\theta+\epsilon}\).
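A minimal sketch of this bound-estimation loop (Algorithm 2 is only referenced above) follows, reusing the hypothetical `distinct_k_filter` and `appr_mis` helpers from the earlier sketches.

```python
def search_threshold(samples, loss_pred, distance, k, appr_mis, eps=0.01):
    """Binary search for the largest theta whose threshold graph still
    admits an independent set of size >= k; terminates after at most
    about log2(1/eps) iterations."""
    lb, ub = 0.0, 1.0
    best_batch = []
    while ub - lb > eps:
        theta = (lb + ub) / 2.0
        batch = distinct_k_filter(samples, loss_pred, distance,
                                  theta, k, appr_mis)
        if len(batch) < k:
            ub = theta                       # infeasible: lower the threshold
        else:
            lb, best_batch = theta, batch    # feasible: try a larger threshold
    return best_batch
```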

5.3 Computing Maximum Independent Set

Finally, we discuss how to compute an approximate maximum independent set of the constructed graph \(G(C_{(L)},\theta)\) efficiently.
An exact algorithm for MIS was proposed by Xiao [40], a polynomial-space solution with time complexity \(O(1.1996^{n})\). When \(n\) reaches 130, the running time already exceeds \(10^{10}\) operations, so the algorithm cannot scale to larger datasets.
For time efficiency, we provide a greedy algorithm, ApprMIS, to approximate the maximum independent set. At each step, we choose the vertex with the minimum degree in the graph \(G(C_{(L)},\theta)\), add it to the independent set, and remove it together with its neighbor vertices. This greedy strategy has been proved to achieve an approximation ratio of \(\frac{\Delta+2}{3}\) on a graph with degree \(\Delta\) [15].
Definition 6
(Neighbors of vertex). The neighbors of vertex \(t_{i}\) denotes the vertices that are adjacent to \(t_{i}\) in the graph \(G(C_{(L)},\theta)\): \(\mathcal{N}(t_{i})=\{t_{j}|\pi(t_{j},t_{i}){\lt}\theta,t_{j}\in V(C_{(L)},\theta), t_{j}\neq t_{i}\}\).
Definition 7
(Degree of vertex). The degree \(\delta(t_{i})\) of vertex \(t_{i}\) denotes the number of its neighbors in graph \(G(C_{(L)},\theta)\), \(\delta(t_{i})=|\mathcal{N}(t_{i})|\).
Definition 8
(Degree of graph). The degree \(\Delta\) of graph \(G(C_{(L)},\theta)\) is denoted as the most neighbors a vertex has: \(\Delta=\max_{t_{i}\in G(C_{(L)},\theta)}{\delta(t_{i})}\).
Analysis of Time Complexity. The ApprMIS algorithm maintains the neighbors \(\mathcal{N}(t_{i})\) and degrees \(\delta(t_{i})\) to find the vertex with the fewest neighbors, so its time complexity is \(O(n^{2})\). Besides, we have to search the lower/upper bounds of the desired threshold and run the ApprMIS algorithm \(\log_{2}\frac{1}{\epsilon}\) times. Thus, the total time complexity of our algorithm is \(O(n^{2}\cdot\log_{2}\frac{1}{\epsilon})\). The pseudo code for ApprMIS is shown in Algorithm 3.
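A minimal sketch of this greedy minimum-degree heuristic (Algorithm 3 is only referenced above) follows; `graph` is assumed to be an adjacency dictionary such as the one built in the earlier sketch.

```python
def appr_mis(graph):
    """Greedy approximate maximum independent set: repeatedly pick the
    vertex of minimum degree, keep it, and delete it with its neighbors."""
    graph = {v: set(nbrs) for v, nbrs in graph.items()}   # work on a copy
    independent = []
    while graph:
        v = min(graph, key=lambda u: len(graph[u]))       # minimum degree
        independent.append(v)
        removed = graph[v] | {v}
        for u in removed:
            graph.pop(u, None)
        for nbrs in graph.values():
            nbrs -= removed
    return independent
```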

6 Experiment

Datasets. For the English dataset, we use the dataset from the PSB 2016 Social Media Shared Task [32] and re-label 3,316 X posts; 3,285 posts contain medicine, 1,043 contain disease, 790 contain symptom, and 1,643 contain adverse reaction. For the Chinese dataset, we use a clinical dataset from the State Food and Drug Administration of China.2 It consists of 3,870 clinical records; 3,822 records contain medicine, 1,594 contain disease, 1,955 contain symptom, and 1,341 contain adverse reaction. Chinese records are preprocessed by the state-of-the-art Chinese word segmentation tool THULAC.3 The ratio of training and test sets is 80% and 20%, respectively. This preprocessing is implemented with the Python pandas and regex modules.
Training Details. We obtain initial word embeddings for the NER model using GloVe [30]. The dimension of the word-morpheme embedding is 200, and the number of units in the LSTM cell is 128. We perform mini-batch training with the batch size set to 8. For the active learning part, each fully connected layer of the loss-prediction module except the last one outputs a 256-dimension feature vector, as shown in Figure 3. The initial learning rate is 0.02, and the gap limit \(\epsilon\) for bound estimation in Section 5 is set to 0.01. We used PyTorch, the Python math, NumPy, and SciPy modules, and the OpenAI Automations module in the implementation.
Metrics. We compute precision (P), recall (R), and F-1 score (F1) for each type of entity. Given a particular entity type \(c\), precision \(P_{c}\) and recall \(R_{c}\) are computed as \(P_{c}=\frac{TP_{c}}{TP_{c}+FP_{c}}\) and \(R_{c}=\frac{TP_{c}}{TP_{c}+FN_{c}}\), respectively, where \(TP_{c}\) counts the true positive samples of class \(c\), \(FP_{c}\) the false positives, and \(FN_{c}\) the false negatives. The F-1 score is computed from the precision and recall as \(F1_{c}=\frac{2P_{c}R_{c}}{P_{c}+R_{c}}\).
There are two evaluation metrics for multi-class tasks: Macro F-1 and Micro F-1. We adopt Micro F-1 to compare the overall performance of our approach and the baselines, because Macro F-1 is mainly influenced by the imbalance among entity types, while Micro F-1 treats all instances equally. Micro precision (\(\mu P\)), Micro recall (\(\mu R\)), and Micro F-1 (\(\mu F1\)) are defined as follows, where \(TP\), \(FP\), and \(FN\) count the number of instances regardless of entity type: \(\mu P=\frac{TP}{TP+FP}\), \(\mu R=\frac{TP}{TP+FN}\), \(\mu F1=\frac{2\cdot\mu P\cdot\mu R}{\mu P+\mu R}\). We employ the standard threshold of 0.5 for discerning positive and negative predictions. The bold entries in Tables 2–5 represent the best results of the different models or strategies in the experiments. The Python scikit-learn and NumPy modules are employed in the implementation.
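As an illustration, micro scores over entity labels can be computed per token with scikit-learn as sketched below; note that the paper’s metric counts an entity as correct only when both boundaries match exactly, so this token-level version is a simplification.

```python
from sklearn.metrics import precision_recall_fscore_support

def micro_scores(y_true, y_pred, entity_labels):
    """Token-level micro precision/recall/F1 restricted to entity labels,
    pooling TP/FP/FN across entity types (an approximation of the
    boundary-exact entity-level metric used in the paper)."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=entity_labels,
        average="micro", zero_division=0)
    return p, r, f1
```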
Table 2. Results in Type on English Dataset

| Model | Disease P(%) | Disease R(%) | Disease F1(%) | Symptom P(%) | Symptom R(%) | Symptom F1(%) |
|---|---|---|---|---|---|---|
| CNN+BiLSTM+CRF | 73.79 | 61.34 | 66.99 | 48.25 | 54.37 | 51.13 |
| Word+Char+Atten | 74.23 | 62.06 | 67.60 | 49.17 | 56.31 | 52.50 |
| Word+Morpheme+Atten | 76.09 | 62.34 | 68.53 | 51.43 | 59.94 | 55.36 |
| Word+Morpheme+Atten+CRF | 75.03 | 63.15 | 68.58 | 51.58 | 59.72 | 55.35 |

| Model | Medicine P(%) | Medicine R(%) | Medicine F1(%) | Adverse Reaction P(%) | Adverse Reaction R(%) | Adverse Reaction F1(%) |
|---|---|---|---|---|---|---|
| CNN+BiLSTM+CRF | 96.51 | 98.45 | 97.47 | 71.54 | 84.93 | 77.66 |
| Word+Char+Atten | 98.02 | 98.80 | 98.41 | 78.15 | 88.79 | 83.13 |
| Word+Morpheme+Atten | 98.02 | 98.80 | 98.41 | 78.15 | 88.79 | 83.13 |
| Word+Morpheme+Atten+CRF | 97.23 | 98.64 | 97.93 | 78.63 | 89.23 | 83.60 |
Table 3. Results in Type on Chinese Dataset

| Model | Disease P(%) | Disease R(%) | Disease F1(%) | Symptom P(%) | Symptom R(%) | Symptom F1(%) |
|---|---|---|---|---|---|---|
| CNN+BiLSTM+CRF | 82.14 | 79.29 | 80.69 | 67.40 | 65.39 | 66.38 |
| Word+Char+Atten | 84.52 | 79.04 | 81.69 | 70.18 | 66.72 | 68.41 |
| Word+Morpheme+Atten | 87.29 | 81.45 | 84.27 | 73.36 | 69.25 | 71.25 |
| Word+Morpheme+Atten+CRF | 86.54 | 81.57 | 83.98 | 73.44 | 69.93 | 71.64 |

| Model | Medicine P(%) | Medicine R(%) | Medicine F1(%) | Adverse Reaction P(%) | Adverse Reaction R(%) | Adverse Reaction F1(%) |
|---|---|---|---|---|---|---|
| CNN+BiLSTM+CRF | 95.12 | 92.79 | 93.94 | 73.75 | 81.11 | 77.26 |
| Word+Char+Atten | 96.51 | 94.48 | 95.48 | 77.59 | 82.70 | 80.06 |
| Word+Morpheme+Atten | 96.32 | 93.37 | 94.82 | 81.29 | 86.41 | 83.77 |
| Word+Morpheme+Atten+CRF | 96.58 | 94.02 | 95.28 | 81.15 | 85.83 | 83.42 |
Table 4. Comparison of NER Models on English and Chinese Datasets

| Model | English \(\mu P\)(%) | English \(\mu R\)(%) | English \(\mu F1\)(%) | Chinese \(\mu P\)(%) | Chinese \(\mu R\)(%) | Chinese \(\mu F1\)(%) |
|---|---|---|---|---|---|---|
| CNN+BiLSTM+CRF | 71.37 | 70.33 | 70.85 | 79.17 | 78.03 | 78.60 |
| Word+Char+Atten | 71.92 | 71.65 | 71.79 | 81.93 | 79.83 | 80.87 |
| Word+Morpheme+Atten | 74.77 | 72.66 | 73.70 | 84.66 | 82.17 | 83.39 |
| Word+Morpheme+Atten+CRF | 74.46 | 73.11 | 73.77 | 84.32 | 82.26 | 83.28 |
Table 5. AUC of Performance Curves

| Selection Strategy | AUC for English (mean \(\pm\) std, %) | AUC for Chinese (mean \(\pm\) std, %) |
|---|---|---|
| Random | 59.63 \(\pm\) 0.53 | 68.35 \(\pm\) 0.46 |
| Diversity | 59.04 \(\pm\) 0.68 | 67.64 \(\pm\) 0.77 |
| Uncertainty | 61.06 \(\pm\) 0.21 | 69.81 \(\pm\) 0.22 |
| MNLP | 61.63 \(\pm\) 0.33 | 70.92 \(\pm\) 0.41 |
| BR | 62.26 \(\pm\) 0.25 | 71.43 \(\pm\) 0.32 |
| DKF | 63.10 \(\pm\) 0.27 | 72.87 \(\pm\) 0.30 |

6.1 Evaluation on NER Models

Competitors. In this part, we disable the active learning module and compare our NER model against the following baselines.
CNN+BiLSTM+CRF (CBC) [6], which employs CNNs to extract character features.
Word+Char+Atten (WCA) [9], which uses an encoder-decoder structure with an attention mechanism.
Word+Morpheme+Atten (WMA), our deep model, which uses word-morpheme embeddings and adopts an attention mechanism over BiLSTM networks.
Word+Morpheme+Atten+CRF (WMAC), which adds a CRF layer to our model to see how much it helps.
Table 2 presents results in entity type on the English dataset, and Table 3 showcases results on the Chinese dataset. It is observed that symptom is the most challenging entity type to recognize in both languages. The average F-1 score of 8 symptoms provided by these four algorithms reaches 69.42% on the Chinese dataset, whereas it is only 53.59% on the English dataset. Notably, the discrepancy in the observed performance can be attributed to the difference in data sources. The English dataset encompasses X posts penned by diverse users, mainly encapsulating subjective symptom descriptions or expressing personal sentiments. Conversely, the Chinese dataset comprises clinical records meticulously authored by seasoned medical professionals.
Table 4 provides an overview of the \(\mu F1\) performance on the English and Chinese corpora, respectively. Models incorporating the attention mechanism exhibit 0.94%–4.11% higher \(\mu F1\) scores than the CBC baseline. The attention mechanism is beneficial because it enables the models to learn context information and identify the starting and ending positions of entities at a semantic level. Among the models, WMAC achieves the highest \(\mu F1\) score of 73.77% on the English dataset, while WMA achieves the highest \(\mu F1\) score of 83.39% on the Chinese dataset, outperforming both CBC and WCA.
On the English benchmark, WMA outperforms CBC by 2.85% in terms of \(\mu F1\) score, while on the Chinese benchmark, it achieves a remarkable improvement of 4.22%. The discrepancy in performance can be attributed to the difference in average text length between the two languages. The average length of English texts is around 22, while that of Chinese texts is around 41. The LSTM network in CBC can only encode limited knowledge, and its performance declines as the length of the input sequence increases. In contrast, the attention mechanism can adaptively choose important vectors during the decoding stage and is not influenced by sequence lengths.
Now, we evaluate the performance of morpheme embeddings of different sizes \(d\in\{1,2,3,4,5\}\) based on the WMA model. Each time, we concatenate the word embedding with only one kind of morpheme embedding to represent the input word as \(x^{d}_{w}=[e(w),\tau^{w,d}]\). We evaluate WMA with word-morpheme embedding \(x^{d}_{w}\) on each entity type in Figure 6, where the x-axis marks the morpheme size \(d\). For the English dataset, \(x^{3}_{w},x^{4}_{w},x^{5}_{w}\) perform better than \(x^{1}_{w},x^{2}_{w}\) on all types. \(x^{4}_{w}\) reaches the best F-1 score on Disease, Medicine, and Adverse Reaction, and \(x^{3}_{w}\) reaches the best on Symptom. However, for the Chinese dataset, \(x^{1}_{w},x^{2}_{w},x^{3}_{w}\) perform better than \(x^{4}_{w},x^{5}_{w}\). Intuitively, a single Chinese character conveys richer information than a single English character, since there are over 3,000 commonly used Chinese characters but only 52 English letters. Nevertheless, the 3-size morpheme embedding \(x^{3}_{w}\) performs well in both languages.
Fig. 6.
Fig. 6. Ablation results of morpheme module on F1-score.
To meticulously assess the efficacy of the morpheme design, we executed a series of ablation experiments. As depicted in Figure 6, for the English dataset, the identification tasks for symptoms and diseases proved more challenging than those for medicines and ADRs. In the Chinese dataset, the symptom task remained difficult, while disease identification showed generally higher performance. Overall, both WMA and WMAC achieved significantly higher micro-F1 scores than WCA on both the English and Chinese datasets. Furthermore, the micro-F1 scores for all three pipelines were significantly higher on the Chinese dataset than on the English dataset. These results suggest that the proposed method is more suitable for Chinese corpora and writing styles than for English corpora. Notably, there was no significant difference between WMA and WMAC, indicating that the CRF layer brings no clear benefit. In summary, these results underscore that the module-based integration of word-morpheme embedding can remarkably improve the overall performance of NER.
In summary, we compared our models, WMA and WMAC, with the SOTA pipelines CBC and WCA. The architecture combining word-morpheme embedding and attention remarkably improved the performance of the NER task. The morpheme module, a novel design inspired by language writing and composition practices, played a key role in the in-depth understanding of linguistics.

6.2 Active Learning

Competitors. Deep models are initialized with the same 2% of the training set. In each stage, models are trained on the selected \(k=64\) samples for 20 epochs with a stochastic gradient descent optimizer. We evaluate Distinct-K Filter and six baselines with the \(\mu F1\) score on the test sets.
Random, sampling randomly from the unlabeled data.
Diversity, setting all \(\mathcal{U}(t_{i})=1\) to ignore uncertainty.
Uncertainty, selecting the \(k\) samples with the largest \(\mathcal{U}(t_{i})\).
BatchRand (BR) [4], a state-of-the-art method for BMAL.
MNLP [35], a state-of-the-art uncertainty-based active learning method for NER.
DKF, short for our Distinct-K Filter.
GPT 3.5 Turbo, an advanced language model developed by OpenAI.
As depicted in Figure 7, we run five trials for each algorithm and record the average results on both the English and Chinese datasets. Overall, the uncertainty-aware strategies BR, MNLP, and DKF outperformed the random baseline, while Diversity performed even worse than random selection. When 10% of the medical sentences were labeled, MNLP performed best, achieving a \(\mu F1\) score of 39.69% on the English dataset and 49.50% on the Chinese dataset. This uncertainty-based strategy selects more informative samples and updates the deep model quickly. However, when approximately 15% of the sentences were labeled, DKF began to surpass MNLP, because MNLP was overwhelmed by uncertain or even edge data points and struggled to discriminate among diverse entities effectively.
Fig. 7.
Fig. 7. Comparison of active learning methods on the English (left) and Chinese (right) datasets.
On the English dataset, DKF achieved a \(\mu F1\) score of 70% with around 44% of the samples labeled, while BR and MNLP required 57% and 61% of labeled samples, respectively, to reach the same \(\mu F1\). On the Chinese dataset, DKF reached a \(\mu F1\) of 80% with 39% labeled samples, whereas BR and MNLP required 54% and 62%, respectively. These results demonstrate that DKF efficiently saves annotation cost by filtering out more redundant samples.
More precisely, Table 5 presents the results obtained from five trials of each selection strategy. The area under the curve (AUC) of \(\mu F1\) was calculated to reflect the performance of each selection strategy, together with its mean and standard deviation. It is evident that DKF achieved the highest AUCs of 63.10% on the English dataset and 72.87% on the Chinese dataset.
The BR algorithm tends to neglect the weighting of uncertainty and diversity, consequently selecting samples with elevated diversity yet diminished uncertainty. To demonstrate this, we created a dataset comprising 10 kinds of medicines, with 100 medical texts for each kind. First, we randomly selected the texts of 1–5 kinds as the unknown part and trained our model on the remaining texts (the known part). Subsequently, we employed both the BR and DKF selection strategies to select \(k=64\) samples from both parts, the objective being to determine which strategy would choose more samples from the unknown part. After conducting 5 trials, we observed that DKF consistently selected more samples from the unknown part than BR did, particularly when the number of unknown kinds decreased (Figure 8). In other words, BR is more susceptible to being influenced by diversity and assigns less importance to uncertainty.
Fig. 8.
Fig. 8. Comparison between BR and DKF.
We also endeavored to utilize contemporary large language models for NER. Prompt engineering effectively bridges the gap between sequence labeling tasks and generation tasks [18, 38]. To evaluate the capacity for recognizing new entities, we conducted zero-shot and one-shot experiments using ChatGPT 3.5 turbo (Figure 9). Using partially anonymized data with a temperature of 0 and top\(\_\)p of 1, we observed that ChatGPT initially failed to differentiate symptoms from ADRs when the prompts lacked definitions from the medical domain. Upon providing explicit prompts for distinguishing symptoms and ADRs, the micro-F1 score eventually reached 65.19%. A subsequent one-shot learning experiment yielded an improved micro-F1 score of 78.00%. Nevertheless, the performance on symptoms and ADRs in both the zero-shot and one-shot experiments was significantly lower than that of DKF. These results highlight ChatGPT’s robust generalization capacity and usability in practice, but they also underscore its relatively constrained ability to recognize neologisms beyond its knowledge base, whereas DKF demonstrates proficiency in handling the neologism identification task.
Fig. 9.
Fig. 9. F1 scores of the zero-shot and one-shot experiments with ChatGPT.

6.3 Deep Insight into Uncertainty-Diversity Balance

To get deeper insight into how DKF benefits active learning, we visualize the average uncertainty \(\frac{1}{k}\sum_{t_{i}\in M_{(L)}}{\mathcal{U}(t_{i})}\) and average diversity \(\frac{1}{k}\sum_{t_{i}\in M_{(L)}}{\mathcal{D}(t_{i})}\) of the chosen sample batches on the English dataset. The uncertainty-diversity balance of DKF is compared against the baseline selections Uncertainty and Diversity in Figure 10. Besides, we dynamically track the effect of DKF from start to finish.
Fig. 10.
Fig. 10. Visualization of uncertainty-diversity balance (We compute the average uncertainty/diversity of chosen sample batches in 40 stages on the English dataset. We run five trials and obtain 200 data points for Uncertainty/Diversity/DKF).
We compute the average uncertainty and diversity of the batches over 40 stages on the English dataset. As the scatter plot in Figure 10 shows, we record the results of 5 trials, 200 data points in total, for Uncertainty, Diversity, and DKF, respectively. The Uncertainty selection achieves about 0.9 average uncertainty overall, almost reaching the limit. It pays no attention to the distribution of samples and ranges between 0.24 and 0.68 in average diversity. The Diversity selection reaches 0.58\(-\)0.79 average diversity, but in 71% of the cases its average uncertainty is lower than 0.4. The reason it performs poorly in Section 6.2 is that Diversity simply selects diverse samples even though the current model has already learned them well. Figure 10(b) demonstrates that DKF reaches at least 0.4 in both criteria.
We analyze the dynamic effect of DKF over all training stages in Figure 10(b), where light-colored points indicate early stages and dark-colored points indicate later stages. During the first six stages, when the first 15% of labeled data is added, DKF selects batches with average uncertainty between 0.49 and 0.87, still below that of Uncertainty, so DKF does not yet perform best. In later stages, however, the average diversity of DKF lies in \((0.56,0.69)\), clearly exceeding that of Uncertainty, which lies in \((0.24,0.57)\). We attribute this to the fact that in later stages the deep model needs evenly distributed data to distinguish entities more accurately; DKF supplies sample batches of high diversity and therefore surpasses the other baselines from that point on.
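As a concrete reading of these statistics, the sketch below computes the batch averages plotted in Figure 10, under the assumption that \(\mathcal{U}(t_{i})\) is the loss predicted by the loss-prediction module and \(\mathcal{D}(t_{i})\) is the embedding distance from \(t_{i}\) to its nearest neighbor in the batch.

```python
# Sketch of the per-batch statistics visualized in Figure 10.
import numpy as np

def batch_uncertainty(pred_losses: np.ndarray) -> float:
    """Average predicted labeling loss over the k chosen sentences."""
    return float(pred_losses.mean())

def batch_diversity(embeddings: np.ndarray) -> float:
    """Average, over the k sentences, of the distance to the nearest
    other sentence in the batch: (1/k) * sum_i min_{j != i} d(t_i, t_j)."""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)        # ignore each sentence's self-distance
    return float(dists.min(axis=1).mean())
```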

6.4 Case Study

In the case study depicted in Figure 11, MedNER extracts the pivotal clinical information from a long, unstructured medical narrative. The case involves a patient with chronic HBsAg positivity who presented with fatigue and yellow urine and received liver-protection and enzyme-reduction therapies, all documented in the clinical reports. The subsequent intervention, an intravenous compound glycyrrhizic acid glycoside, led to scattered red rashes and itching, a noteworthy adverse reaction that was managed by stopping the infusion and administering chlorpheniramine maleate tablets, which relieved the symptoms. MedNER accurately delineates and categorizes this information into discrete, actionable data points: the disease (HBsAg positive), symptoms (fatigue, yellow urine), pharmacological interventions (compound glycyrrhizic acid glycoside, chlorpheniramine maleate), and adverse reactions (skin rashes, itching). Such structured extraction not only supports clinical decision-making but also strengthens the monitoring of patient safety and treatment efficacy. We have deployed MedNER in the Food and Drug Safety Administration System of Guangdong Province to automatically generate ADR reports, which has greatly improved the efficiency and response speed of drug safety monitoring. The case underscores the potential of advanced NLP tools such as MedNER to provide a robust framework for the nuanced interpretation of medical text and to move clinical practice toward more personalized, precision-driven care.
Fig. 11. The procedure of MedNER for systematic information extraction from clinical datasets.
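For illustration, the structured record below shows the kind of output MedNER produces for this case. The field names simply follow our four entity categories; the exact schema of the deployed system may differ.

```python
# Illustrative structured record for the case above (schema names are ours).
case_extraction = {
    "Disease":         ["HBsAg positive"],
    "Symptom":         ["fatigue", "yellow urine"],
    "Medicine":        ["compound glycyrrhizic acid glycoside",
                        "chlorpheniramine maleate tablets"],
    "AdverseReaction": ["scattered red rashes", "itching"],
}
```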

7 Conclusion

We introduce MedNER, a deep active learning framework tailored to precise entity extraction from medical texts that significantly reduces the reliance on extensive manual annotation. Our contributions are twofold: a threshold-based approximation selection approach for the Distinct-K optimization problem, and a dual-metric design that evaluates both uncertainty and diversity throughout the model, providing a balance between the two criteria. Experiments on real datasets verify that MedNER not only achieves superior performance but also economizes on annotation.
In future work, we will extend MedNER to relation and event extraction in medical texts, developing algorithms that discern intricate relationships and identify pivotal events to further enrich clinical decision-making. We also plan to refine medical documentation workflows and integrate MedNER with state-of-the-art healthcare technologies, bringing artificial intelligence into closer synergy with clinical practice and improving the precision and efficacy of patient care.

Footnotes

1. For Chinese sentences, we need to perform Chinese word segmentation first.

References

[1]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.
[2]
William H. Beluch, Tim Genewein, Andreas Nürnberger, and Jan M. Köhler. 2018. The power of ensembles for active learning in image classification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9368–9377.
[3]
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[4]
Shayok Chakraborty, Vineeth Nallure Balasubramanian, Qian Sun, Sethuraman Panchanathan, and Jieping Ye. 2015. Active batch selection via convex relaxations with guaranteed solution bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 10 (2015), 1945–1958.
[5]
Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huan-Bo Luan. 2015. Joint learning of character and word embeddings. In Proceedings of the 24th International Conference on Artificial Intelligence, 1236–1242.
[6]
Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4 (2016), 357–370.
[7]
Junghwan Cho, Kyewook Lee, Ellie Shin, Garry Choy, and Synho Do. 2015. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv:1511.06348. Retrieved from https://arxiv.org/abs/1511.06348
[8]
Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 1724–1734.
[9]
Shaika Chowdhury, Chenwei Zhang, and Philip S. Yu. 2018. Multi-task pharmacovigilance mining from social media posts. In Proceedings of the 2018 World Wide Web Conference, 117–126.
[10]
Fenia Christopoulou, Thy Thy Tran, Sunil Kumar Sahu, Makoto Miwa, and Sophia Ananiadou. 2020. Adverse drug events and medication relation extraction in electronic health records with ensemble deep learning methods. Journal of the American Medical Informatics Association 27, 1 (2020), 39–46.
[11]
Pinar Donmez, Jaime G. Carbonell, and Paul N. Bennett. 2007. Dual strategy active learning. In Proceedings of the 18th European conference on Machine Learning, 116–127.
[12]
Joseph Gatto, Parker Seegmiller, Garrett Johnston, and Sarah M. Preum. 2022. HealthE: Classifying entities in online textual health advice. arXiv:2210.03246. Retrieved from https://arxiv.org/abs/2210.03246
[13]
Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649.
[14]
Yuhong Guo. 2010. Active instance sampling via matrix partition. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, 802–810.
[15]
Magnús M. Halldórsson and Jaikumar Radhakrishnan. 1997. Greed is good: Approximating independent sets in sparse and bounded-degree graphs. Algorithmica 18, 1 (1997), 145–163.
[16]
Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck. 2005. ProMiner: Rule-based protein and gene entity recognition. BMC Bioinformatics 6, S-1 (2005), S14.
[17]
Steven C. H. Hoi, Rong Jin, and Michael R. Lyu. 2009. Batch mode active learning with applications to text categorization and image retrieval. IEEE Transactions on Knowledge and Data Engineering 21, 9 (2009), 1233–1248.
[18]
Yan Hu, Iqra Ameer, Xu Zuo, Xueqing Peng, Yujia Zhou, Zehan Li, Yiming Li, Jianfu Li, Xiaoqian Jiang, and Hua Xu. 2023. Zero-shot clinical entity recognition using ChatGPT. arXiv:2303.16416v2. Retrieved from https://arxiv.org/abs/2303.16416
[19]
Sheng-Jun Huang, Jia-Wei Zhao, and Zhao-Yang Liu. 2018. Cost-effective training of deep CNNs with active model adaptation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1580–1588.
[20]
Adina R. Kern-Goldberger, Sindhu K. Srinivas, Elizabeth A. Howell, Michael Harhay, and Lisa D. Levine. 2023. Validation of maternal co-morbidity diagnoses using differential data extraction strategies across a large health system. American Journal of Obstetrics & Gynecology 228, 1 (2023), S247–S248.
[21]
Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, 957–966.
[22]
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, 282–289.
[23]
Chenliang Li, Aixin Sun, Jianshu Weng, and Qi He. 2015. Tweet segmentation and its application to named entity recognition. IEEE Transactions on Knowledge and Data Engineering 27, 2 (2015), 558–570.
[24]
Huayu Li, Martin Renqiang Min, Yong Ge, and Asim Kadav. 2017. A context-aware attention network for interactive question answering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 927–935.
[25]
Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. 2017. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1903–1911.
[26]
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
[27]
Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of the Interspeech, 1045–1048.
[28]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, 3111–3119.
[29]
Priyanka C. Nair and B. Indira Devi. 2022. Automatic symptom extraction from unstructured web data for designing healthcare systems. In Proceedings of the Emerging Research in Computing, Information, Communication and Applications (ERCICA ’20), Vol. 2, Springer, 599–608.
[30]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
[31]
Sarah Riepenhausen, Cornelia Mertens, and Martin Dugas. 2021. Comparing SDTM and FHIR® for real world data from electronic health records for clinical trial submissions. Studies in Health Technology and Informatics 281 (2021), 585–589.
[32]
Abeed Sarker, Azadeh Nikfarjam, and Graciela Gonzalez-Hernandez. 2016. Social media mining shared task workshop. In Proceedings of the Pacific Symposium on Biocomputing, 581–592.
[33]
Max Schumm, Ming-Yeah Hu, Vivek Sant, Jiyoon Kim, Chi-Hong Tseng, Javier Sanz, Steven Raman, Run Yu, and Masha Livhits. 2023. Automated extraction of incidental adrenal nodules from electronic health records. Surgery 173, 1 (2023), 52–58.
[34]
Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In Proceedings of the International Conference on Learning Representations.
[35]
Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. 2018. Deep active learning for named entity recognition. In Proceedings of the International Conference on Learning Representations.
[36]
Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 384–394.
[37]
Liqin Wang, Sheril Varghese, Sonam Bassir, Kimberly G. Blumenthal, Elizabeth J. Phillips, and Li Zhou. 2022. Stevens-Johnson syndrome and toxic epidermal necrolysis: A systematic review of PubMed/MEDLINE case reports from 1980 to 2020. Frontiers in Medicine 9 (2022), Article 949520.
[38]
Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang, Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. 2023. GPT-NER: Named entity recognition via large language models. arXiv:2304.10428. Retrieved from https://arxiv.org/abs/2304.10428
[39]
Jun Wen, Xiang Zhang, Everett Rush, Vidul A. Panickan, Xingyu Li, Tianrun Cai, Doudou Zhou, Yuk-Lam Ho, Lauren Costa, Edmon Begoli, Chuan Hong, J. Michael Gaziano, Kelly Cho, Junwei Lu, Katherine P. Liao, Marinka Zitnik, and Tianxi Cai. 2023. Multimodal representation learning for predicting molecule–disease relations. Bioinformatics 39, 2 (2023), Article btad085.
[40]
Mingyu Xiao and Hiroshi Nagamochi. 2017. Exact algorithms for maximum independent set. Information and Computation 255 (2017), 126–146.
[41]
Christopher C. Yang, Haodong Yang, Ling Jiang, and Mi Zhang. 2012. Social media mining for drug safety signal detection. In Proceedings of the 2012 International Workshop on Smart Health and Wellbeing, 33–40.
[42]
Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G. Hauptmann. 2015. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision 113, 2 (2015), 113–127.
[43]
Donggeun Yoo and In So Kweon. 2019. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 93–102.
[44]
Tongxuan Zhang, Hongfei Lin, Yuqi Ren, Liang Yang, Bo Xu, Zhihao Yang, Jian Wang, and Yijia Zhang. 2019. Adverse drug reaction detection via a multihop self-attention mechanism. BMC Bioinformatics 20, 1 (2019), 479:1–479:11.
[45]
Zongwei Zhou, Jae Y. Shin, Lei Zhang, Suryakanth R. Gurudu, Michael B. Gotway, and Jianming Liang. 2017. Fine-tuning convolutional neural networks for biomedical image analysis: Actively and incrementally. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4761–4772.
