research-article

Open access

Ensemble Active Learning by Contextual Bandits for AI Incubation in Manufacturing

Authors:

Yingyan Zeng,

Xiaoyu Chen,

Ran JinAuthors Info & Claims

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 1

Article No.: 7, Pages 1 - 26

https://doi.org/10.1145/3627821

Published: 19 December 2023 Publication History

PDF eReader

Abstract

An Industrial Cyber-physical System (ICPS) provides a digital foundation for data-driven decision-making by artificial intelligence (AI) models. However, the poor data quality (e.g., inconsistent distribution, imbalanced classes) of high-speed, large-volume data streams poses significant challenges to the online deployment of offline-trained AI models. As an alternative, updating AI models online based on streaming data enables continuous improvement and resilient modeling performance. However, for a supervised learning model (i.e., a base learner), it is labor-intensive to annotate all streaming samples to update the model. Hence, a data acquisition method is needed to select the data for annotation to ensure data quality while saving annotation efforts. In the literature, active learning methods have been proposed to acquire informative samples. Different acquisition criteria were developed for exploration of under-represented regions in the input variable space or exploitation of the well-represented regions for optimal estimation of base learners. However, it remains a challenge to balance the exploration-exploitation trade-off under different online annotation scenarios. On the other hand, an acquisition criterion learned by AI adapts itself to a scenario dynamically, but the ambiguous consideration of the trade-off limits its performance in frequently changing manufacturing contexts. To overcome these limitations, we propose an ensemble active learning method by contextual bandits (CbeAL). CbeAL incorporates a set of active learning agents (i.e., acquisition criteria) explicitly designed for exploration or exploitation by a weighted combination of their acquisition decisions. The weight of each agent will be dynamically adjusted based on the usefulness of its decisions to improve the performance of the base learner. With adaptive and explicit consideration of both objectives, CbeAL efficiently guides the data acquisition process by selecting informative samples to reduce the human annotation efforts. Furthermore, we characterize the exploration and exploitation capability of the proposed agents theoretically. The evaluation results in a numerical simulation study and a real case study demonstrates the effectiveness and efficiency of CbeAL in manufacturing process modeling of the ICPS.

1 Introduction

Industrial Cyber-physical Systems (ICPSs) integrate the cyber and physical worlds, which serve as the backbone of the Fourth Industrial Revolution [19]. By embracing the Internet of Things (IoT), an ICPS interconnects manufacturing equipment with ubiquitous sensors, actuators, and computing units, forming a low-cost, high-availability, and high-accessibility network [73]. The high-speed and large-volume sensing data collected from such a network have advanced many data-driven decision-making methods to support manufacturing efficiency, quality improvement, and cost reduction. For example, artificial intelligence (AI) models such as support vector machine (SVM) and deep neural networks have been employed for quality modeling and process monitoring of fused deposition modeling (FDM) processes [26, 65], Aerosol® Jet Printing processes [59], and so on. However, most AI models are proposed following an offline-training-online-deployment (OTOD) strategy, which is only effective when the quality of the training data set is guaranteed (e.g., training data can provide adequate estimations of the underlying true model of the variable relationships in supervised learning). In practice, various factors can change such an underlying model in manufacturing processes (e.g., the degradation of manufacturing equipment or change of product design), which results in an erratic performance of AI models during their online deployment.

To improve the modeling performance, one can either build a dynamic model [37] or investigate an online model training mechanism to adapt existing models to online data streams. However, constructing a dynamic model highly depends on the prior knowledge of the distribution changing patterns and root causes. Instead of focusing on creating a better model, data-centric AI has been proposed as a more general approach to engineer the data needed to successfully build an AI model [57]. From a holistic view, we envision a resilient AI system to identify and mitigate the performance fluctuation of AI models caused by abrupt changes (i.e., data distribution, learning algorithms, and computational resources), which jointly consider managing the data quality as well as adapting the existing models during the online deployment. Therefore, in this paper we focus on online model training by actively acquiring samples to ensure data quality such that the resilient AI performance can be achieved.

Here, we focus on the supervised learning model as the base learner, where the data quality is considered from the aspect of representativeness (i.e., consistent distribution between the training and testing data sets) and class imbalance [28]. In the offline-training step, the high-quality data set can be defined when there are sufficient representative samples for training, such as sufficient samples for multimodal distributions [30], a distribution of the training samples close to that of the testing samples, and when there are balanced class distributions [11]. However, OTOD cannot support the AI modeling since in the context of high-speed, large-volume streaming data, the data quality of training data sets needs to be evaluated continuously. As a motivation example, Figure 1 demonstrates the multimodal distribution and the imbalanced class of samples collected from a highly personalized FDM process, where the samples are projected to the principle component directions of the input variable space by Principle Component Analysis (PCA). The two clusters are generated from two layers in FDM due to different product geometric designs. The objective of the data analysis in Figure 1 is to use in situ process variables to predict the layer-to-layer binary quality variable, which indicates the surface roughness as a classification problem. Assume that the samples collected until time t will be used to train the model, where the samples from only one cluster have been observed. After time \(t+1\), the streaming samples are from the other cluster, i.e., the FDM goes to the next layer with another design in that slice. The collected training data set is not representative due to the shift of distributions, thus resulting in a sudden decrease of the predictionperformance of the pre-trained model.

Fig. 1.

Motivated by Figure 1, it is important to actively select the streaming samples for annotation which ensures data quality. In earlier studies, Design of Experiments (DoE) was proposed to improve the supervised learning models by generating samples to identify significant variables [24, 36]. Recent efforts in data filtering, either model-based [44] or model-free methods [45, 62] aim at accurately modeling the underlying system by sampling a subset with good representativeness of the population distribution. However, DoE focus on actively generating the data while data filtering methods require a completely collected data set before the selection. Neither method can be directly applied to acquire the streaming data in the ICPS.

To obtain high-quality data, humans also play an indispensable role in the data annotation with their domain knowledge. In particular, the online data annotation requires real-time experimentation and a human-machine interface for domain experts to interact with [63], which is time-consuming and labor-intensive. While automatic annotation methods employ semi-supervised learning methods to annotate the samples by the most confident predictions [40], one potential disadvantage is the deteriorating performance of AI models caused by mislabelled cases [13].

Recognizing the significance of human-in-the-loop, we proposed an AI incubation framework [18], designed to foster interaction between domain experts and AI systems throughout the training and deployment phases. This framework facilitates AI model development through human-AI collaboration, including model structure inspired by human decision-making processes [18], and feature generation based on human visual searching patterns [16]. In this work, we further explore the aspect of training data selection and annotation within AI incubation. Here, the human annotator acts as an “incubator”, labeling the acquired samples for training AI models. Our aim is to reduce human efforts while efficiently improving the learning performance of the base learner.

To create an online data acquisition method, an acquisition criterion needs to be designed to determine whether a sample should be selected for human annotation in order to acquire only informative samples and to provide high-quality training data sets for the base learner. This acquisition decision can be viewed as a dilemma between exploration and exploitation of the input variable space [10]. Here, exploitation is defined as acquiring a sample around the conceptual boundary for boundary learning, whereas exploration is defined as acquiring a sample located in the under-represented region for the input variable space discovery. Exploitation-oriented criteria work well when the base learner can easily detect the important regions [48]. Otherwise, exploration is required for more complex scenarios such as the exclusive XOR problem. For the scenario of motivation example shown in Figure 1, if we concentrate exclusively on the samples near the decision boundary for exploitation, samples from another cluster may be overlooked throughout the streaming process, leading to an inaccurate estimate of the decision boundary and poor performance in under-represented regions. On the other hand, exclusive exploration will easily lead to a base learner with high uncertainty. Therefore, a well-balanced exploration-exploitation trade-off is essential for guiding the online annotation of AI models with complex distribution in the input variable space [46].

In this paper, we propose an ensemble active learning method by contextual bandits (CbeAL) to improve the exploration and exploitation trade-off under various scenarios, thus guiding an efficient and effective human annotation process. In CbeAL, a set of active learning agents with human-designed criteria is incorporated by contextual bandits [9], where a joint acquisition decision is made by the weighted combination of individual decisions. Here, we use “agents” as an umbrella term to refer the acquisition criteria in active learning methods. The candidate active learning agents incorporated are designed to pursue an explicit objective of exploration or exploitation with theoretical justification, respectively. Thus, during the annotation process, the weight (i.e., decision power) of each agent indicates the current tendency for exploration or exploitation, which will be updated dynamically by the bandits solver subject to the historical reward. To improve the learning performance of the base learner, the reward is defined as the usefulness of the acquisition behaviour of CbeAL where acquiring a sample which would be wrongly predicted by the base learner is considered useful. Therefore, the online data acquisition problem can be effectively addressed by CbeAL, which pursues the exploration and exploitation trade-off through the ensemble of a set of active learning agents. In this sense, CbeAL is a generic active learning framework that reduces manual adjustment of active learning agents under frequently changed manufacturing data distributions. In addition to improving learning performance, CbeAL also increases the interpretability of AI models from the data perspective [74], as the weight of each agent in each acquisition step explains whether the sample is annotated for exploration or exploitation.

The remainder of this paper is organized as follows. Section 2 summarizes the related work. Section 3 introduces the proposed CbeAL method and provides the theoretical justification. Section 4 evaluates the performance of CbeAL by simulation studies. Section 5 validates CbeAL via a real case study of online quality modeling of FDM in the ICPS. We conclude this work with some discussion of future work in Section 6.

2 Related Work

2.1 Online Model Updating in Industrial Cyber-physical Systems

In the past decades, the ICPS has integrated physical manufacturing equipment with sensing and actuation networks as well as ubiquitous computational resources, which provides the digital foundation for the online updating of AI models [51, 67]. With the streaming observational data and online computational resources, the online updating techniques of AI models have been investigated to enable the close modeling of manufacturing processes and facilitates the efficient decision-making in ICPSs. For example, Bastani et al. [7] proposed an online classification model for real-time monitoring in additive and semiconducting manufacturing processes. Wang et al. [64] developed a large-scale online multitask learning model to coordinate machine actions in the ICPS online. Online model updating strategies have also been developed for model calibration and predictive maintenance [43, 71]. However, the aforementioned studies focus on developing the online updating algorithm of the AI model via Bayesian methods or distributed optimization methods to reduce the computational burden with large-volume streaming data, which are effective for unsupervised learning problems or supervised learning scenarios with easily collected responses. Yet for many supervised learning scenarios in the ICPS, the passively collected data need to be annotated via real-time experimentation by domain experts, e.g., the inspection of a batch of 400 wafers may take more than 8 hours [35]. The lack of consideration of human annotation efforts renders these online updating methods inefficient for supervised AI models, especially in highly personalized manufacturing environments with rapid product and process changes [2].

2.2 Data Quality and Data Acquisition Methods

Compared to the accuracy and efficiency of learning algorithms, validating and monitoring the quality of data fed to AI models is an equally important problem [14]. Metrics for assessing the data quality for classification tasks include outlier detection, boundary complexity, label noise, shifting distribution, class imbalance, etc. [28]. In the context of streaming data, Caveness et al. [14] developed a data analysis and validation system to monitor significant changes between successive batches of the training data by summary statistics (i.e., mean, variance, etc.) with human investigation for a machine learning pipeline. However, without considering the informativeness of the data related to the AI model, the data collected by such a system cannot effectively improve the modeling performance.

On the other hand, to improve the performance of supervised learning models and reduce human annotation labor efforts, methods have been developed to facilitate effective data acquisition for high-quality informative data. These methods include providing acquisition recommendations for human annotation (e.g., sequential design and active learning) and automatic annotation (e.g., semi-supervised learning) [20], where a limited approach suits the online streaming data. Sequential design focuses on selecting the samples in a sequential manner to achieve certain optimality criteria such as maximum entropy, maxmin distance [41, 56]. For example, Yan et al. [72] proposed an adaptive sequential sampling method to balance sampling efforts between the exploration and exploitation of anomalous regions for anomaly detection in the ICPS. However, these methods provide active recommendations that require experiments to be conducted at selected points in the input variable space, which cannot address passively collected data. Semi-supervised learning has been employed to automate the annotating process such that the base learner can learn from both labeled and unlabeled data [75]. However, adding mislabelled cases by a semi-supervised learner to the training set may hamper the base learner’s learning performance [13]. Due to the aforementioned limitations, we focus on active learning methods which provide acquisition recommendations for the passively collected data. Active learning has been leveraged for minimizing the human effort as well as improving the modeling performance in various applications including human activity recognition [1], threatening surveillance event detection [47], wearable sensing platforms annotation [55], and so on.

2.3 Exploration and Exploitation in Active Learning

Active learning reduces annotation efforts for supervised learning models by evaluating the informativeness of samples and acquiring the most informative ones [52]. The decision on whether one sample should be labelled can be viewed as a dilemma between the exploration and exploitation of the input variable space. In the earlier work, most efforts exploit samples with large amount of information about the base learner as an exploitation-oriented strategy. Metrics such as classification uncertainty [42], margin [5], and entropy [25] of the base learner have been adopted to measure the informativeness and compared with corresponding thresholds to make the acquisition decision. To achieve exploration for online streaming data, Ienco et al. [34] modeled the local density of a sample to acquire the sample lying in a dense region with a small classification margin. For trivial scenarios where only parts of the input variable space have to be known in order to perform optimally, exploitation-oriented acquisition criteria can be more effective to avoid exploring regions that are irrelevant for the decision boundary estimation [60]. However, in nontrivial scenarios, exploration is crucial to uncover relevant, unknown regions when the base learner’s estimation of the decision boundary is imprecise. Thus, the exploration-exploitation trade-off becomes vital under scenarios with exclusive XOR problem, clusterwise structure, imbalanced class distribution, and the like [46, 48]. Relying solely on either exploration or exploitation falls short in achieving optimal learning outcomes due to incompatibility with varied online annotation scenarios. To achieve a compromise, a common acquisition strategy is to conduct exploration and exploitation simultaneously. Considering two acquisition criteria which are dedicated to exploration (e.g., random sampling) and exploitation (e.g., uncertainty sampling) respectively, the compromise can be achieved by selecting one criterion with a certain probability for each streaming sample. One typical example is the \(\epsilon\)-greedy policy which enforces the input variable space exploration with probability \(\epsilon\) in each round [60]. Representative sampling methods have also been designed as the combination of exploitation-oriented and exploration-oriented criteria [68]. However, the ambiguity of the weight of each objective requires further fine-tuning for each learning scenario. To avoid ambiguity, Loy et al. [46] extended the Query-by-Committee (QBC) [53] paradigm to a nonparametric Bayesian model to address unknown class discovery and imbalanced class distribution for the online annotation. However, without taking into account the informativeness (i.e., uncertainty) of a sample about the base learner, the proposed QBC-PYP cannot adjust the exploration-exploitation to the learning performance of the base learner. In brief, the simple combination of active learning criterion cannot handle challenging online data acquisition scenarios even with fine-tuning, which is due to the lack of compatibility with the data stream and the lack of adaptiveness to the learning performance of the base learner [22]. Therefore, an ensemble of multiple criteria is desired to guide the dynamic exploration-exploitation trade-off in an adaptive and data-dependent manner.

As a promising approach, active learning has been recently formulated in the framework of reinforcement learning (RL) and multi-armed bandits where the objective is to learn the optimal acquisition criterion as a policy to maximize the cumulative reward [21]. However, these methods can also lack systematic and explicit considerations for both exploration and exploitation objectives. Wassermann et al. [70] proposed Reinforced Active Learning (RAL) which modeled the stream-based active learning as a contextual bandits problem. In RAL, a set of base learners was gathered as the committee to provide acquisition advice based on the certainty degree of the sample to each learner. The acquisition criterion can be viewed as the weighted combination of different uncertainty sampling policies. In spite of its adaptiveness to the data stream, RAL is highly exploitation-oriented since it mainly focuses on decision boundary learning for each learner. Baram et al. [6] first proposed COMB to blend multiple acquisition criteria as experts and consider samples as arms in multi-armed bandits. Later, Hsu and Lin [31] refined COMB with ALBL using the bandits analogy, treating acquisition criteria as experts. While the bandits framework allows for dynamic adjustment of the exploration-exploitation trade-off, neither method provided guidance on selecting criteria nor addressed the explicit objectives of exploration and exploitation. A random selection of general acquisition criteria with a small size may not well address different online annotation scenarios, while a large size of experts may cause problematic performance of the bandits solver.

3 Methodology

To develop the active learning agents for CbeAL and derive the theoretical characterization of the agents, we make the following assumptions: (i) The sample size of the initial training set \(\mathcal {D}_0\) is not large enough to guarantee satisfactory modeling performance and the samples in \(\mathcal {D}_0\) are not uniformly distributed in the input variable space. (ii) The streaming data have highly imbalanced class distribution. (iii) There are multiple clusters in the input data distribution. One common example is that the input data follow a Gaussian mixture distribution. This assumption is validated by the simulation setup and validated in the case study. Note that the proposed CbeAL framework is designed for general online annotation scenarios and does not require the assumptions on the input data distribution.

3.1 Overview of the Proposed Methodology

Consider the online data annotation scenario with a sample \(\mathbf {x}_t\) collected at time \(t, t=1,2,\ldots , T\), where \(\mathbf {x}_t \in \mathbb {R}^p\) is the input for the base learner (i.e., the classification model) \(f_t\). We assume that the classification problem has c classes, and \(y_t \in \mathcal {C}=\lbrace 1, 2,\ldots , c\rbrace\) is the label of the sample \(\mathbf {x}_t\). Denote the labelled data pool at time t as \(\mathcal {D}_t=\lbrace (\mathbf {x}_1, y_1),\ldots ,(\mathbf {x}_{n_t}, y_{n_t})\rbrace\) with \(|\mathcal {D}_t| = n_t\). The base learner \(f_0\) is pretrained by an initial \(\mathcal {D}_0\), which contains a limited number of labelled samples. Under the aforementioned setting, we propose an active learning strategy to make the acquisition decision [69]. The strategy is applied with (i) a data source from which one unlabelled sample \(\mathbf {x}_t\) streams at each time stamp without a cost, (ii) a labelled data pool \(\mathcal {D}_t\), (iii) human annotators who can provide the label of a sample \(\mathbf {x}_t\) if an acquisition decision is made to acquire it, (iv) a proposed ensemble acquisition method, and (v) the base learner \(f_t\) to be updated online with the lableled data set \(\mathcal {D}_t\). We have a budget of B samples forannotation during the streaming process.

As an overview (Figure 2), our key idea is to ensemble the acquisition decisions made by the exploration- and exploitation-oriented agents and adaptively balance the two aspects based on the context of incoming samples, the learning performance of the base learner, and the historical performance of the agents. During the online annotation process, at each time point t, (i) we receive a sample \(\mathbf {x}_t\). (ii) Afterwards, one can obtain the predicted label \(\hat{y}_t\) and the side information, such as the predicted probability \(\mathbf {P}^f(\hat{y}_t|\mathcal {D}_t)\) of each class from \(f_t\) and take \(\mathbf {x}_t\) and other information as the context input for the proposed contextual bandits solver Exp4.P-EWMA. And (iii) we make the acquisition decision as a weighted majority of the decisions obtained from the set of candidate agents \(\lbrace AG1, AG2, \ldots \rbrace\). If the decision is to acquire the sample, we acquire \(y_t\) from human annotation and obtain the reward \(r_t\), update CbeAL with \(r_t\), and retrain the classifier with \((\mathbf {x}_t, y_t)\); otherwise, we pass this sample without annotation. The advantage of the proposed framework lies in three aspects: (i) CbeAL explicitly pursues the input variable space discovery and the decision boundary learning via incorporating exploration- and exploitation-oriented agents while it balances the overall exploration-exploitation trade-off adaptive to the data stream and the learning performance of the base learner by contextual bandits. (ii) The systematic ensemble of multiple pairs of agents save the efforts for agents selection under various learning scenarios with different input data distribution, feature dimension, signal-to-noise ratio, and so on. Therefore, CbeAL is developed as a generic active learning framework to achieve an adaptive and well-balanced exploration-exploitation trade-off for the incubation of classification models with streaming data. (iii) CbeAL is scalable in terms of the number of active learning agents to enhance exploration or exploitation.

Fig. 2.

3.2 The Ensemble Active Learning by Contextual Bandits

During the online annotation process, the needs for exploration and exploitation change over time, which depend on the observed samples, the incoming sample, and the updated learning performance of the base learner. Since there does not exist a consistent optimization criterion to adjust the trade-off, the shift between the two aspects is nontrivial.

To address the challenge of achieving a good exploration-exploitation trade-off adaptive to various online annotation scenarios, we formulate the shift between two objectives as a contextual multi-armed bandits problem. In the bandits problem, the bandits solver needs to make the decision of pulling one of the K arms as the action for time point t based on the received contextual information. With each arm characterized by an unknown reward distribution, the objective of the solver is to gain the highest cumulative reward \(R = \sum _{t=1}^{\infty }r_t\). Under the framework of CbeAL, we consider the decision to acquire or not acquire a sample as the arms and propose to ensemble exploration- and exploitation-oriented agents as candidate policies (i.e., experts). The acquisition decision (i.e., decision of pulling one arm) is jointly made by a weighted combination of the individual decisions from each agent. Thus, the adjustment of exploration-exploitation trade-off is converted to the selection among different types of agent with the goal of gaining a higher reward, which can be solved by some well-developed bandits solvers.

In CbeAL, the arm pulled at time t (i.e., \(a_t \in A = \lbrace 1, 2\rbrace , |A|=K=2\)) represents the overall acquisition decision, where \(a_t=1\) refers to acquiring the sample \(\mathbf {x}_t\) and otherwise \(a_t=2\). At each time point t, the incoming sample \(\mathbf {x}_t\), prediction \(\hat{y_t}\), and the predicted probability \(\mathbf {P}^f(\hat{y}|\mathbf {x}_t) \in \mathbb {R}^c\) obtained by the base learner \(f_t\) are considered as the observed contextual information that affects the exploration-exploitation trade-off. Each incorporated active learning agent (\(AG_i, i\in \lbrace 1, \ldots , N\rbrace\)) makes its own decision \(\mathbf {\xi }^i_t \in \mathbb {R}^K\) based on these contextual information \(\lbrace \mathbf {x}_t, \hat{y_t}, \mathbf {P}^f(\hat{y}|\mathbf {x}_t)\rbrace\). Here \({\xi }^i_{a,t}\) represents the probability of the i-th agent taking the a-th action. Specifically, we have the decision vector \(\mathbf {\xi }^i_t = [p^i_t, 1-p^i_t]\), where \(p^i_t\) is the acquisition probability for sample \(\mathbf {x}_t\). Simultaneously, each agent is assigned with a decision power \(\alpha _{i, t}\) at time t. The overall decision is a majority voting of all agents’ decisions weighted by their decision power, which leads to a decision vector \(\mathbf {P}_t\in \mathbb {R}^K\). If the overall decision asks for the ground-truth label, a reward \(r_t\) is received after execution. Afterwards, the proposed bandits solver Exp4.P-EWMA updates the decision power of each agent based on its decision \(\mathbf {\xi }^i_t\) in this iteration and the reward \(r_t\). With the objective of gaining a high cumulative reward, the ensemble of agents is the same as to combine the decisions made by each agent such that the reward gained in each iteration is close to the highest we can get from the best agent in the agent set. The execution of CbeAL is summarized as Algorithm 1.

Notably, the online updating of the base learner is not the focus of this study and we simply retrain the base learner based on all annotated samples from \(\mathcal {D}_{t+1}\). For a more efficient updating, online learning algorithms, such as first-order algorithms [76] and Bayesian-based approaches [15], can be adopted depending on the base learner.

To integrate the goals of active learning and multi-armed bandits in order to provide informative acquisition, the design and characterization of the reward are critical. We define the reward \(r_t\) as suggested in [70]:

\begin{equation} r_t = {\left\lbrace \begin{array}{ll} \rho _{}^+,& \text{if } \hat{y_t}\ne y_t\\ \rho _{}^-, & \text{if } \hat{y_t} = y_t. \end{array}\right.} \end{equation}

(3.1)

The reward \(r_t\) can only be obtained if the overall decision made by CbeAL acquires the ground-truth label. Otherwise, it is zero. Intuitively, the acquisition action will be rewarded if the base learner would have made a wrong prediction, otherwise it will be penalized since this acquisition is considered unnecessary. Therefore, it measures both the informativeness and the usefulness of an acquisition decision. Based on this design, the reward of acquiring a sample is determined by the performance of the current base learner \(f_t\), the incoming sample \((\mathbf {x}_t, y_t)\), and also the performance of the bandits learner CbeAL. Thus, the sequence of reward \(\lbrace r_1, r_2, \ldots , r_t\rbrace\) is autocorrelated. This characteristic is another reason that we adopt the setting of adversarial bandits [3], where one active learning agent (\(AG_i\)) is considered as one expert and decisions from each expert are simultaneously considered to make a joint decision, since no statistical assumption is made on reward generation in this setting. Additionally, instead of considering one agent as one arm, incorporating agents as experts also makes the number of agents scalable. In summary, with the designed setting of contextual information and the reward, the updated decision power of each agent adjusts the exploration-exploitation trade-off to improve the learning performance of the base learner.

To solve the formulated contextual bandits problem in CbeAL, we propose the Exp4.P-EWMA solver, where we embed a control chart-based flipping mechanism to Exp4.P[9]. To balance the overall exploration-exploitation behaviour, the exploration- and exploitation-oriented agents are incorporated in CbeAL by pairs, which can be easily adjusted for a specific online annotation scenario. However, with the pair ensemble, the direct application of Exp4.P can easily lead to a dominant agent (i.e., an agent consistently has the highest decision power) from an early stage, which makes CbeAL act no difference from a single active learning agent dedicated to exploration or exploitation. This can be expected since the pure exploration strategy in the early stage may cause the acquisition of samples with low uncertainty, so that the decision power of exploration-oriented agents keeps decreasing until a level too low to contribute to the overall acquisition decision any longer. To avoid the early convergence in Exp4.P, a control chart-based flipping mechanism is integrated into the solver. Denote the standardized weight (i.e., standardized decision power) of the i-th agent at time t as \(\alpha ^s_{i, t}\), where \(\alpha ^s_{i, t} = \frac{\alpha _{i,t}}{\sum _{i=1}^N\alpha _{i,t}}\). We monitor each standardized weight by an EWMA chart [33], which detects weight drift over time. The intuition is that if the decision power of one agent keeps decreasing or increasing from the beginning, the decision power of all pairs of agents will be flipped so that the agents with lower power have more chances to lead the decision in the following period. Note this forced-exploration phase will only happen in a short period during the whole process, which is controlled by a hyperparameter \(\gamma\). Denote the weighting factor for EWMA as \(\lambda\), the size factor of shift to detect as h, and the estimated variance of \(\alpha ^s_{i, t}\) as \(s^2_{i, t}\). With the flipping mechanism, the proposed Exp4.P-EWMA solver is detailed as follows:

As listed in Algorithm 2, at each time point t, the solver will first execute Exp4.P to make the acquisition decision \(a_t\) and update the decision power of each agent \(\alpha _{i, t+1}\) based on its decision vector \(\mathbf {\xi }^i_{t}\), the final decision probability \(\mathbf {P}_t\), and the reward \(r_t\). In the second step, the flipping will be triggered if the standardized weight of any agent is outside the updated control limits.

3.3 The Exploration and Exploitation Agents

To balance the exploration-exploitation trade-off under different online annotation scenarios, we design distinguished active learning agents with exploration or exploitation objectives to be incorporated into CbeAL so that a systematic approach is developed without ambiguous selection. Another advantage of this design is the tendency for exploration or exploitation can be directly implied by the decision power of different types of agent.

3.3.1 Low-density Based Exploration Agent (LD-Agent).

The objective of exploration is to identify the structure of the input data distribution during the learning process. Two types of agents are proposed to encourage the exploration of the input variable space. The first type adopts a density-based criterion, which encourages the labelling efforts around each cluster boundary to discover new clusters by annotating samples lying in a sparse region with low density. We adopt the idea in [34] to model the density of a sample.

Denote the set \(\mathcal {W}\) as a sliding window of L previously observed samples, \(d(\cdot , \cdot)\) as the distance between two samples. Denote MaxDist as a function, where \(MaxDist(\mathbf {x}_i, \mathcal {W})\) returns the maximum distance between \(\mathbf {x}_i\) and other samples in the sliding window \(\mathcal {W}\). To approximate the local density for a new coming sample, we define local sparsity (i.e., low-density) factor of a sample \(\mathbf {x_i}\) as the number of times \(\mathbf {x}_i\) is the farthest away from other samples in \(\mathcal {W}\) as follows:

\begin{equation} lsf(\mathbf {x}_i) = \sum _{\mathbf {x}_j \in \mathcal {W}}\mathbb {I}\lbrace MaxDist(\mathbf {x}_j,\mathcal {W})\lt d(\mathbf {x}_i, \mathbf {x}_j)\rbrace . \end{equation}

(3.2)

Algorithm 3 provides the pseudocode to acquire samples with lower local density. Given a streaming sample \(\mathbf {x}_t\) at time t, low-density based exploration agent first calculates the local sparsity factor \(lsf(\mathbf {x}_t)\) to determine the acquisition probability \(p_t\) as the output. Then, the sliding window \(\mathcal {W}\) and the maximum pairwise distance between each sample in \(\mathcal {W}\) will be updated. The sliding window mechanism is adopted to adjust the approximated density based on the most recent data stream. Note the window length L, and the sparsity fraction \(\delta _L\) are hyperparameters that affect the acquisition probability, which can be tuned to best suit the scenario.

3.3.2 Space-Filling Based Exploration Agent (SPF-Agent).

The second type of exploration-oriented agent is based on a space-filling criterion. In the DoE literature, space-filling designs are applied to fully explore the response surface of computer experiments [54]. Therefore, as an alternative strategy to explore the input variable space, a space-filling based exploration-oriented agent is developed to acquire samples uniformly distributed in the space. We adopt the idea of minimum pairwise distance criterion [38] and propose a corresponding criterion to minimize the pairwise distance between acquired samples during the online data acquisition.

Similarly, a sliding window \(\mathcal {W}\) keeps the most recent L samples. Denote MinDist as a function where \(MinDist(\mathbf {x}_i, \mathcal {W})\) returns the minimum distance between \(\mathbf {x}_i\) and all samples in \(\mathcal {W}\).

In Algorithm 4, with a coming sample \(\mathbf {x}_t\) at time t, its minimum distance from the samples in \(\mathcal {W}\) is compared with the largest minimum pairwise distance of samples in \(\mathcal {W}\) to obtain the acquisition probability, leaving a higher probability for samples distant from the observed ones in \(\mathcal {W}\).

Intuitively, the density-based criterion will explore the boundary of the input variable space faster at an early stage, whereas the space-filling criterion allows for a more uniform exploration during the process. The combination of two exploration criteria will enhance the compatibility and adaptiveness of CbeAL to various learning scenarios. In practical, an \(\epsilon\)-greedy policy can also be embedded which forces a sample to be acquired with probability \(\epsilon\) for further exploration.

3.3.3 Reinforced Exploitation Agent (RAL-Agent).

The goal of exploitation in active learning is to capture the decision boundary, which is generally achieved by acquiring samples with ambiguous class membership. To enable the agent to intelligently identify the acquisition demand, we formulate it as an RL problem that aims at learning an adaptive threshold as the optimal policy to maximize the cumulative reward. As suggested by [70], an RL-based controller is designed to adjust the certainty threshold \(\theta\) based on the contribution of historical acquisition decisions. In detail, upon receiving a sample \(\mathbf {x}_t\) at time t, the prediction certainty \(ct(\mathbf {x}_t)=\max {\mathbf {P}^f(\hat{y}|\mathbf {x}_t)}\) obtained by the base learner \(f_t\) is compared with the current certainty threshold \(\theta _t\) to make the acquisition decision. The reward \(r_t\) will be received if \(\mathbf {x}_t\) is acquired, which follows a consistent definition (i.e., \(r_t\in \lbrace 0, \rho ^+, \rho ^-\rbrace\)) as defined in (3.1). Afterwards, the certainty threshold will be updated as:

\begin{equation} \theta _{t+1} = \min \left\lbrace \theta _t (1+\eta \cdot (1-2^{\frac{r_t}{\rho ^{-}}})), 1\right\rbrace . \end{equation}

(3.3)

Note that the threshold will increase slightly with a positive reward and vice versa, which enables a policy adaptive to the decision boundary learned by the base learner. The algorithm of the reinforced exploitation agent is detailed in Algorithm 5.

3.4 Characterization of Agents

To characterize the exploration and exploitation capability of the proposed agents, the variance of the acquired samples \(\mathcal {D}_t\) by one agent is selected as an appropriate metric for assessing its exploration and exploitation activity. A higher variance suggests a learner’s ability to explore the input variable space via acquiring samples in a larger region, whereas a lower variance implies a high frequency of acquisition in a small region for exploitation. To compare the variance of \(\mathcal {D}_t\), the probability of a single sample being acquired by the proposed agents is examined.

We assume that the streaming data belong to a mixture of Gaussian distributions. Denote the previously observed samples stored in the sliding window \(\mathcal {W}\) before time t as the set \(\lbrace \mathbf {x}_1, \mathbf {x}_2, \ldots , \mathbf {x}_L\rbrace\), where \(\mathbf {x}_i\) belongs to the i-th Gaussian distribution (i.e., \(\mathbf {x}_i\sim \mathcal {N}_q(\boldsymbol{\mu }_i, \boldsymbol{\Sigma }^{(i)})\)). Given a streaming sample \(\mathbf {x}_t\) at time t which follows another Gaussian distribution (i.e., \(\mathbf {x}_t\sim \mathcal {N}_q(\boldsymbol{\mu }_k, \boldsymbol{\Sigma }^{(k)})\)), with the Euclidean distance \(d(\mathbf {x}_i, \mathbf {x}_j)=\sqrt {\Vert \mathbf {x}_i-\mathbf {x}_j\Vert _2^2}\), for probability that LD-Agent acquires \(\mathbf {x}_t\), we proved:

Theorem 1.

If the streaming samples follow an independent multivariate Gaussian distribution (i.e., \(\boldsymbol{\Sigma ^{(i)}}=\sigma _i^2\mathbf {I}\)), then there exist \({{\boldsymbol{M}}}_1, {\boldsymbol{M}}_2 \in \mathbb {R}^L\) such that if \(\Vert \boldsymbol{\mu }_i- \boldsymbol{\mu }_k\Vert ^2 \gt M_{2,i}, \forall i \in \lbrace 1,\ldots ,L\rbrace\), then the expected acquisition probability of an LD-Agent \(\mathbb {E}_{\mathbf {x}_i, \mathbf {x}_j, \mathbf {x}_k}[p_t]\) will exceed 1, where \({\boldsymbol{M}}_{1}, {\boldsymbol{M}}_{2}\) satisfies:

\begin{align} &M_{2,i} + \operatorname{erf}^{-1}{(1-2\delta _L)}\cdot \sqrt {2\cdot (4(\sigma _i^2+\sigma _k^2)M_{2,i}+2q(\sigma _i^2+\sigma _k^2)^2)} + (\sigma _i^2+\sigma _k^2)q - M_{1,i} = 0\nonumber \nonumber\\ & M_{1,i}\gt \Vert \boldsymbol{\mu }_i- \boldsymbol{\mu }_j\Vert ^2 + (\sigma _i^2+\sigma _j^2)q + \sqrt {4(\sigma _i^2+\sigma _j^2)\Vert \boldsymbol{\mu }_i- \boldsymbol{\mu }_j\Vert ^2+2q(\sigma _i^2+\sigma _j^2)^2} \nonumber \nonumber\\ &\cdot \left(\Phi ^{-1}\left(1-\frac{1}{L-1}\right)+\gamma \left[\Phi ^{-1}\left(1-\frac{1}{L-1}\cdot e^{-1}\right)-\Phi ^{-1}\left(1-\frac{1}{L-1}\right) \right]\right),\forall i \in \lbrace 1,\ldots ,L\rbrace . \end{align}

(3.4)

This result illustrates that with the increasing of the distance between the center of the distribution of the observed samples and that of the incoming sample, the expected acquisition probability approaches and exceeds 1. This ensures the acquisition of samples from a remote cluster, resulting in an increased variance and, thus, the exploration of the input variable space.

For the reinforced exploitation agent, assume that a logistic regression model is selected as the base learner and at time t the base learner \(f_t\) is parameterized by \(\boldsymbol{\beta }_t\). Given the labeled data pool \(\mathcal {D}_t\) at time t, for the expectation of the probability that the RAL-Agent acquires \(\mathbf {x}_t\), we proved:

Theorem 2.

Given the labeled data pool \(\mathcal {D}_t\) at time t, assume the center of the labeled samples in \(\mathcal {D}_t\) is \(\boldsymbol{\mu }_i\in \mathbb {R}^q\) and the incoming sample \(\mathbf {x}_t\sim \mathcal {N}_q(\boldsymbol{\mu }_k, \sigma _k^2\mathbf {I})\). With the increase of the distance between two centers \(\left\Vert \boldsymbol{\mu }_i- \boldsymbol{\mu }_k\right\Vert ^2\), there does not exist \(M_3 \in \mathbb {R}\) such that \(P\lbrace |\mathbb {E}_{\mathbf {x}_t}[p_{t}] - M_3|\ge \epsilon \rbrace = 0, \forall \epsilon \in \mathbb {R}\).

Since \(p_t\) belongs to \([0, 1]\) for a RAL-Agent, the result implies that the acquisition probability of the incoming \(\mathbf {x}_t\) will not converge with the increase of the distance between the center of the distribution of \(\mathcal {D}_t\) and that of \(\mathbf {x}_t\). Hence, for one sample from a remote cluster, the acquisition decision made by a RAL-Agent does not necessarily lead to an increasing variance.

In summary, the theoretical analysis justifies the exploration and exploitation capability of the proposed agents. Therefore, with the ensemble of two types of agents, the trade-off can be dynamically adjusted to the human annotation process. We also include the theoretical justification on the EWMA mechanism where we prove it does not affect the regret bound of the Exp4.p solver. The proof and numerical study can be found in the supplemental material due to the page limit.

4 Numerical Simulation

4.1 Simulation Setup

Suppose that we have a binary classifier as the base learner that requires online updating. Recall the third assumption that multiple clusters exist in the input variable space. Therefore, we adopt a cluster-based classification data set generation method [29, 49] to generate the input \(\mathbf {X}\in \mathbb {R}^{n\times p}\) and the corresponding label \(\mathbf {y}\in \mathbb {R}^n\), where n is the sample size and p is the dimension of the input variable. We assume that there are two clusters in each class, thus we have \(2\times 2=4\) clusters in total. In brief, the centroids of four Gaussian clusters are first generated as the vertices of one polytope. The input variables are then independently drawn from each Gaussian cluster with unit variance and then multiplied by a random matrix to introduce the random covariance. Then, the samples in two of the four clusters will be assigned with the same label as \(\mathbf {y}\).

To evaluate CbeAL comprehensively, four settings are varied to generate different online annotation scenarios: (i) training sample size n, which includes both the initial training set and the streaming training set; (ii) the percentage of samples in the positive class pc, which determines the balanceness of the two classes; (iii) the percentage of disturbance ds; and (iv) the percentage of sparsity sp, which is defined as the percentage of insignificant input variables among total p input variables. Note that disturbances are added by flipping the labels of randomly selected samples. Additionally, to control the sparsity level, insignificant variables are randomly generated and concatenated to informative ones.

The data set generated for each online annotation scenario is subdivided into three subsets: the initial training set, the streaming training set, and the testing set. The initial training set has a constant size of 20 and the testing set has a size of 500. The number of samples in each class is balanced to be equal in the testing set to better illustrate the classification performance of the base learner. For all simulation scenarios, the budget is set to be \(10\%\) which gives the number of samples available to be labeled as \(B = 10\%\cdot (n-20)\) during the streaming process. All scenarios are replicated 10 times with a randomly generated data set in each replication.

Based on the suggestion in [34, 70] and grid search in simulation experiments, we set the following values for the hyperparameters in CbeAL: \(p_{\text{min}}=\sqrt {\frac{\ln {N}}{KT}}, T = 2000, \delta =0.1, \lambda = 0.3, h=5, \gamma =t/T\), reward \(\rho ^+=1\), penalty \(\rho ^-=0.5\). Meanwhile, three pairs of agents with recommended hyperparameter values are incorporated into CbeAL, forming the set of six agents in Table 1. Note that the hyperparameters can be further tuned for different learning scenarios.

Table 1.

Pair Index	Agent Index	Agent	Hyperparameters
1	\(AG_1\)	\(LD_1\)	\(L=100, \delta _L = 0.01\)
	\(AG_2\)	\(RAL_1\)	\(\theta _0=0.95, \eta =0.005\)
2	\(AG_3\)	\(LD_2\)	\(L=150, \delta _L = 0.005\)
	\(AG_4\)	\(RAL_2\)	\(\theta _0=0.95, \eta =0.01\)
3	\(AG_5\)	\(SPF_1\)	\(L=60\)
	\(AG_6\)	\(RAL_3\)	\(\theta _0=0.90, \eta =0.01\)

Table 1. Agent Set Adopted in CbeAL

To demonstrate the simulation setup, Figure 3 visualizes the generated imbalanced and clusterwise input data of a toy example with 2 input variables (i.e., \(p=2, n=500, pc=10\%, sp=0\%, ds=0\%, B=48\)). The logistic regression model with default hyperparameters is selected as the base learner [39, 49]. The set of blue lines in Figure 3 traces the base learner’s evolving decision boundary with samples acquired by the candidate agents and CbeAL. The line’s color depth indicates time, with the darkest being the final boundary tested for accuracy. The decision boundary estimated with all training data is shown as orange lines. The results of the agents with the best performance in the agent set are presented (i.e., Table 1). CbeAL ensembles one exploration agent and one exploitation agent with the best performance in pairs, marked as CbeAL-2.

Fig. 3.

It is clearly shown that the LD-Agent actively seeks samples around the boundary of the input variable space, whereas the samples acquired by SPF-Agent are more evenly distributed. Both exploration-oriented agents successfully acquire the samples in both clusters of each class, but very limited samples around the ground-truth decision boundary are selected. In regard to the RAL-Agent, although the uncertainty threshold is supposed to be updated adaptively, Figure 3 reveals that it is stuck in one cluster of the positive class while the agent keeps acquiring around the wrongly estimated decision boundary due to the lack of explicit exploration. Combining the two strategies together with the proposed ensemble mechanism, CbeAL-2 acquires samples from both clusters in each class with a focus around the ground-truth decision boundary. This also validates that a well-balanced dynamic trade-off between exploration and exploitation is key to the active learning process with imbalanced and clusterwise input data distribution.

4.2 A Comprehensive Simulation Study

In the comprehensive simulation study, the settings are varied with the following levels: \(n\in \lbrace 500, 1000, 1500\rbrace\); \(pc\in \lbrace 10\%, 5\%\rbrace\); \(ds\in \lbrace 0\%, 3\%\rbrace\); \(sp\in \lbrace 30\%, 70\%\rbrace\). The dimension of the input variable is set as \(p=15\). SVM is selected as the base learner with default parameters [27, 49], which validates the effectiveness of CbeAL as a generic framework for classification models. Note that the prediction probability \(\mathbf {P}^f(\hat{y}_t|x_t)\) of SVM is estimated and calibrated by Platt scaling [50]. Denote CbeAL-2 as the ensemble of the first pair of agents in Table 1 (i.e., \(AG_1\) and \(AG_2\)), CbeAL-4 as the ensemble of the first two pairs, and so on for CbeAL-6. Specifically, CbeAL-6 is proposed as the recommended configuration due to its superior performance enhanced by the ensemble of multiple distinguished agents, which will be detailed in the scalability study.

In this study, CbeAL is firstly compared with the incorporated agents and their variants to study the effectiveness of the ensemble mechanism. Then, we investigate the impact of the number of agents in the ensemble to study the scalability of CbeAL. Finally, as the recommended configuration, CbeAL-6 is compared with four benchmark methods from literature (i.e., uncertainty sampling [42] and random sampling, DBALStream [34] and QBC-PYP [46]) to test the general performance.

In summary, nine benchmark methods are compared with CbeAL-6 where the first three are the candidate agents (i.e., LD, SPF and RAL-Agent), the middle two are the ensemble models with different numbers of agents (i.e., CbeAL-2 and CbeAL-4), and the last four are methods from the literature. Among the benchmarks, RAL-Agent employs an acquisition criterion learned by multi-armed bandits as a cutting-edge AI-guided active learning method; CbeAL-2 and CbeAL-4 adopt the proposed ensemble framework; LD-Agent, SPF-Agent and Random Sampling (RS) focus on the exploration of the input variable space while Uncertainty Sampling (US) [42] caters to exploitation; DBALStream [34] and QBC-PYP [46] are two state-of-the-art composite active learning methods which integrate the objective of exploration and exploitation in their design of acquisition criteria. In detail, in RS, each streaming-in sample is annotated based on a probability derived from the ratio of the budget b to the training sample size n. QBC-PYP and DBALStream are configured with default parameters. US employs an uncertainty threshold of 0.7, determined through cross-validation.

To demonstrate the effectiveness of CbeAL in achieving high learning performance of the base learner with limited budgets, the classification accuracy of the base learner trained by the samples acquired by each method on the testing set is evaluated as a metric. We further investigate the percentage of positive samples acquired during the learning process as another metric to illustrate the exploration-exploitation trade-off in the supplemental material.

4.2.1 Compared with Individual Agents.

First, the comparison of the classification accuracy of the base learners between CbeAL, the incorporated agents, and their variants is shown in Table 2. The performance of the agents that achieve the highest accuracy on average among the exploration- and exploitation-oriented agents in the agent set is selected to be reported as “Opt. Explor.” and “Opt. Exploit.” The results of other individual agents are omitted here for better readability. It is observed that in the toy example, some of the agents do not use up the budget B. To validate that it is a fair comparison with different numbers of acquired samples, we create “Opt. Explor. (Full)” and “Opt. Exploit. (Full)” as two variants where random sampling is used to artificially acquire from the unselected samples after the agents finish their acquisition of the streaming data, until the budget is used up. Besides, the \(\epsilon\)-greedy policy with \(\epsilon =0.01\) is applied to RAL-Agents to effectively improve their learning performance to be a more competitive benchmark [70]. They are also applied to CbeAL methods for a fair comparison.

Table 2.

Disturbance	Percentage of Positive Samples	Method	Training Sample Size			Training Sample Size
Level			Sparsity = 30%			Sparsity = 70%
	Percentage of Positive Samples		500	1000	1500	500	1000	1500

		Opt. Explor.	60.1% (0.02)	61.5% (0.02)	63.3% (0.03)	69.8% (0.04)	72.2% (0.03)	69.7% (0.03)
		Opt. Explor. (Full)	60.1% (0.02)	61.6% (0.02)	63.3% (0.03)	69.8% (0.04)	72.2% (0.03)	69.7% (0.03)
		Opt. Exploit.	58.1% (0.02)	70.2% (0.02)	70.4% (0.02)	67.4% (0.03)	72.6% (0.03)	73.7% (0.01)
	10%	Opt. Exploit. (Full)	58.5% (0.03)	70.1% (0.03)	70.4% (0.02)	68.7% (0.02)	72.9% (0.03)	74.3% (0.01)
		CBEAL-2	58.1% (0.02)	70.3% (0.03)\(^{\mathbf {*}}\)	71.8% (0.03)\(^{\mathbf {*}}\)	67.5% (0.03)	75.9% (0.02)\(^{\mathbf {*}}\)	74.3% (0.03)\(^{\mathbf {*}}\)
		CBEAL-4	61.5% (0.03)	67.3% (0.03)	74.3% (0.02)	69.6% (0.03)	77.4% (0.02)	73.1% (0.03)
0%		CBEAL-6	61.2% (0.02)	73.9% (0.03)	72.5% (0.02)	69.9% (0.03)	73.9% (0.03)	76.7% (0.02)
		Opt. Explor.	52.6% (0.01)	60.8% (0.03)	60.5% (0.03)	59.8% (0.02)	64.3% (0.03)	66.3% (0.03)
		Opt. Explor. (Full)	52.6% (0.01)	60.8% (0.03)	60.5% (0.03)	59.8% (0.02)	64.3% (0.03)	66.2% (0.03)
		Opt. Exploit.	61.0% (0.03)	70.2% (0.03)	64.6% (0.03)	64.5% (0.03)	68.4% (0.04)	70.5% (0.03)
	5%	Opt. Exploit. (Full)	61.1% (0.03)	70.2% (0.03)	64.6% (0.03)	65.2% (0.03)	68.8% (0.04)	70.6% (0.03)
		CBEAL-2	60.6% (0.02)	70.4% (0.03)\(^{\mathbf {*}}\)	65.2% (0.02)\(^{\mathbf {*}}\)	63.9% (0.03)	68.7% (0.04)\(^{\mathbf {*}}\)	67.4% (0.03)
		CBEAL-4	60.7% (0.03)	67.9% (0.03)	64.8% (0.02)	63.5% (0.03)	67.6% (0.03)	66.7% (0.03)
		CBEAL-6	65.3% (0.02)	72.0% (0.02)	66.7% (0.02)	68.9% (0.03)	69.4% (0.03)	69.0% (0.03)
		Opt. Explor.	60.4% (0.02)	60.8% (0.02)	62.1% (0.03)	62.0% (0.02)	62.1% (0.02)	71.7% (0.03)
		Opt. Explor. (Full)	60.5% (0.02)	60.6% (0.03)	62.1% (0.03)	62.0% (0.03)	62.4% (0.02)	71.7% (0.03)
		Opt. Exploit.	68.1% (0.03)	69.3% (0.03)	72.9% (0.03)	68.9% (0.03)	69.2% (0.02)	75.0% (0.04)
	10%	Opt. Exploit. (Full)	60.5% (0.03)	69.8% (0.03)	73.3% (0.02)	68.9% (0.03)	69.3% (0.02)	75.3% (0.04)
		CBEAL-2	67.7% (0.03)	72.2% (0.03)\(^{\mathbf {*}}\)	73.1% (0.03)\(^{\mathbf {*}}\)	65.1% (0.04)	69.3% (0.02)\(^{\mathbf {*}}\)	77.5% (0.01)\(^{\mathbf {*}}\)
		CBEAL-4	68.2% (0.03)	67.0% (0.03)	73.2% (0.03)	66.7% (0.03)	66.4% (0.02)	74.5% (0.03)
3%		CBEAL-6	69.0% (0.04)	71.6% (0.03)	73.8% (0.02)	68.0% (0.03)	70.8% (0.02)	79.4% (0.02)
		Opt. Explor.	53.5% (0.01)	57.6% (0.02)	60.8% (0.03)	54.8% (0.02)	63.9% (0.03)	65.8% (0.02)
		Opt. Explor. (Full)	53.5% (0.01)	57.5% (0.02)	60.8% (0.03)	54.8% (0.02)	63.9% (0.03)	65.8% (0.02)
		Opt. Exploit.	61.0% (0.03)	67.4% (0.03)	69.2% (0.02)	62.5% (0.03)	70.7% (0.03)	77.1% (0.02)
	5%	Opt. Exploit. (Full)	62.4% (0.03)	67.4% (0.03)	69.0% (0.02)	62.1% (0.03)	71.3% (0.04)	78.3% (0.02)
		CBEAL-2	61.5% (0.02)\(^{\mathbf {*}}\)	65.2% (0.04)	65.6% (0.03)	60% (0.03)	72.3% (0.03)\(^{\mathbf {*}}\)	77.9% (0.02)\(^{\mathbf {*}}\)
		CBEAL-4	57.7% (0.02)	63.2% (0.03)	65.5% (0.03)	58.7% (0.04)	72.7% (0.02)	73.5% (0.03)
		CBEAL-6	61.9% (0.02)	68.3% (0.03)	64.5% (0.03)	58.8% (0.03)	74.6% (0.03)	80.1% (0.03)

Table 2. The Average Values and Standard Errors (in parentheses) of Classification Accuracy in the Simulation Study Reported over 10 Replications

Best results (excluding CbeAL-2 and CbeAL-4) are highlighted in bold.

Table 2 summarizes the averages of the classification accuracy and standard errors over 10 replications of the base learner trained by \(\mathcal {D}_t\). It can be observed that the proposed CbeAL-6 outperforms the best individual agent in 20 of the 24 scenarios, which verifies that the ensemble of multiple agents with explicit consideration for both exploration and exploitation can effectively enhance the learning performance under a highly imbalanced class distribution. The advantage on learning performance compared to benchmarks is more significant when there is no disturbance and the class proportion is more balanced (i.e., \(ds = 0\%, pc=10\%\)). However, with a more severe imbalance (i.e., \(pc=5\%\)), “Opt. Exploit.” and its variants sometimes achieve slightly higher accuracy. One possible reason is that under such scenarios, it will be more efficient to only focus on decision boundary learning since the number of positive samples is limited. Besides, the inferior performance of CbeAL-6 under the scenario \(ds = 3\%, pc=10\%, n=500, sp=70\%\) can be caused by the disturbance. It can be found that, in general, the learning performance of the base learner is improved with a data stream with a large size and a higher sparsity. However, when the sparsity is low, the accuracy sometimes decreases as the training sample size increases, which can be caused by the high imbalance and the lack of degree of freedom.

Another finding is CbeAL-2 achieves comparable performance with the better of the two incorporated agents, which implies that the proposed ensemble framework enables the intelligent selection among candidate agents in an adaptive manner. Under 14 out of 24 scenarios, it outperforms both “Opt. Explor.” and “Opt. Exploit.”, where the results are marked with \(^{\mathbf {*}}\).

Comparing the “Opt. Explor.” with “Opt. Explor. (Full)” and “Opt. Exploit.” with “Opt. Exploit. (Full)”, we find that consuming the remaining budget by random acquisition will not make a significant improvement on the learning performance under most scenarios. Sometimes it will select less contributive samples, which leads to a lower accuracy because of the highly imbalanced distribution. Therefore, it validates that the agents have acquired the most informative samples based on their criteria. Thus, this variant will not be considered in the following analysis.

We also find that CbeAL-6 obtains a significantly better balanced labeled data set \(\mathcal {D}_t\) with a higher percentage of positive samples compared to the benchmarks under most scenarios. Detailed results can be found in Table A1 in the supplementary material.

4.2.2 Scalability Study.

It has been observed in the previous results (i.e., Table 2) that CbeAL-6 achieves better performance than CbeAL-2 in general. Here, we further investigate the following two questions: What will be the impact of the ensemble of varying numbers of agent pairs and how should the agents be selected.

Rechecking the results in Table 2, we observe that CbeAL-6 demonstrates a dominant superiority in the learning performance, which indicates the advantage brought by multiple agents. However, comparing the result of CbeAL-4 with CbeAL-2, CbeAL-4 achieves better performance in fewer than half of all scenarios. The counterintuitive result indicates that the ensemble of more agents may not improve the performance. To identify the reason, we investigate the acquisition decision made by each agent in the agent set and their standardized weights in CbeAL-6 under one scenario in Figure 4 where CbeAL-4 showsinferior performance than CbeAL-2 but CbeAL-6 performs better.

Fig. 4.

It can be observed from the bar charts (Figure 4(a)-(c)) that in CbeAL-6, the first two pairs of agents (\(LD_1\) and \(LD_2\), \(RAL_1\) and \(RAL_2\)) make similar acquisition decisions while the third pair behaves differently. As a direct result of homogeneity, the standardized weights of the first two pairs of exploration- and exploration-oriented agents will be close and change in a similar pattern in CbeAL-6 (i.e., Figure 4(d)), which also causes a comparable performance of CbeAL-4 and CbeAL-2. This also explains that the superior performance of CbeAL-6 lies in the heterogeneous decisions brought by the third pair of agents (\(SPF_1\) and \(RAL_3\)). Besides, the weights of agents in CbeAL-6 indicate that at the beginning, the exploration dominates the active learning process. Later, the proposed CbeAL switch its tendency to exploitation, and the exploration capability still remains adaptive to the data stream, which contributes to the effective and efficient online annotation.

In summary, the ensemble of distinguished agents provides comprehensive criteria to evaluate the informativeness of each streaming sample in terms of exploration and exploitation, thus achieving a well-balanced trade-off. Since the acquisition behaviour of one active learning agent varies under different scenarios and there does not exist one overall winner, CbeAL-6 is recommended as a default configuration to solve these challenging learning tasks.

4.2.3 Comparison Study with Benchmark Methods.

Finally, CbeAL is compared with four other benchmark methods (i.e., RS, US, DBALStream and QBC-PYP) and Table 3 summarizes the classification accuracy of the base learner trained by samples acquired by each method.

Table 3.

Disturbance	Percentage of Positive Samples	Method	Training Sample Size			Training Sample Size
Level			Sparsity = 30%			Sparsity = 70%
	Percentage of Positive Samples		500	1000	1500	500	1000	1500

		Initial	51.4% (0.00)	52.3% (0.01)	52.2% (0.01)	51.0% (0.00)	51.1% (0.01)	50.9% (0.00)
		RS	52.0% (0.01)	57.3% (0.02)	55.6% (0.03)	57.4% (0.02)	60.6% (0.02)	62.1% (0.02)
		US	60.6% (0.03)	70.3% (0.04)	67.9% (0.04)	65.6% (0.04)	72.8% (0.04)	72.4% (0.04)
	10%	DBALStream	54.6% (0.01)	60.6% (0.02)	58.1% (0.02)	56.7% (0.01)	60.7% (0.02)	59.4% (0.02)
		QBC-PYP	51.2% (0.01)	67.8% (0.04)	68.1% (0.03)	63.5% (0.04)	70.3% (0.04)	66.1% (0.04)
		CBEAL-6	61.2% (0.02)	73.9% (0.03)	72.5% (0.02)	69.9% (0.03)	73.9% (0.03)	76.7% (0.02)
0%		All Training Data	76.7% (0.01)	78.9% (0.02)	81.5% (0.01)	77.7% (0.01)	81.5% (0.01)	80.7% (0.01)
		Initial	52.5% (0.01)	52.9% (0.01)	51.1% (0.00)	51.7% (0.01)	52.1% (0.01)	51.5% (0.01)
		RS	51.9% (0.01)	53.7% (0.01)	55.1% (0.02)	54.9% (0.02)	55.6% (0.03)	53.8% (0.01)
		US	62.2% (0.04)	70.3% (0.04)	61.6% (0.04)	69.8% (0.03)	66.6% (0.04)	68.3% (0.04)
	5%	DBALStream	54.7% (0.01)	56.3% (0.00)	54.2% (0.01)	54.4% (0.01)	55.2% (0.01)	54.6% (0.01)
		QBC-PYP	53.4% (0.03)	61.9% (0.03)	60.2% (0.03)	58.0% (0.03)	64.0% (0.06)	63.8% (0.05)
		CBEAL-6	65.3% (0.02)	72.0% (0.02)	66.7% (0.02)	68.9% (0.03)	69.4% (0.03)	69.0% (0.03)
		All Training Data	75.1% (0.02)	78.7% (0.01)	78.4% (0.01)	77.7% (0.01)	80.2% (0.01)	80.7% (0.01)
		Initial	51.9% (0.01)	51.3% (0.01)	50.7% (0.00)	51.1% (0.00)	50.9% (0.00)	51.2% (0.00)
		RS	53.0% (0.01)	56.2% (0.02)	57.7% (0.01)	54.3% (0.01)	53.8% (0.01)	60.6% (0.03)
		US	60.7% (0.04)	68.0% (0.04)	70.5% (0.04)	67.6% (0.04)	66.2% (0.04)	71.0% (0.05)
	10%	DBALStream	56.9% (0.02)	60.1% (0.03)	60.7% (0.02)	56.1% (0.01)	57.1% (0.02)	61.3% (0.02)
		QBC-PYP	61.2% (0.03)	56.8% (0.03)	68.8% (0.04)	64.5% (0.04)	63.6% (0.04)	66.4% (0.04)
		CBEAL-6	69.0% (0.04)	71.6% (0.03)	73.8% (0.02)	68.0% (0.03)	70.8% (0.02)	79.4% (0.02)
3%		All Training Data	79.2% (0.02)	78.8% (0.01)	81.0% (0.01)	80.9% (0.01)	76.4% (0.01)	81.1% (0.01)
		Initial	52.2% (0.01)	53.2% (0.01)	54.1% (0.01)	51.5% (0.01)	51.2% (0.01)	51.5% (0.00)
		RS	52.9% (0.01)	54.4% (0.01)	52.8% (0.02)	53.2% (0.02)	54.5% (0.02)	52.6% (0.01)
		US	59.4% (0.03)	67.9% (0.04)	68.0% (0.03)	59.9% (0.04)	62.6% (0.04)	73.9% (0.04)
	5%	DBALStream	54.3% (0.01)	56.0% (0.01)	54.7% (0.01)	55.6% (0.01)	55.1% (0.01)	57.6% (0.02)
		QBC-PYP	62.3% (0.04)	58.1% (0.04)	62.5% (0.04)	60.2% (0.04)	64.2% (0.04)	61.3% (0.04)
		CBEAL-6	61.9% (0.02)	68.3% (0.03)	64.5% (0.03)	58.8% (0.03)	74.6% (0.03)	80.1% (0.03)
		All Training Data	73.4% (0.02)	78.0% (0.01)	76.6% (0.01)	73.1% (0.02)	78.1% (0.01)	80.5% (0.01)

Table 3. The Average Values and Standard Errors (in Parentheses) of the Classification Accuracy in the Simulation Study Reported over 10 Replications

Significant best results are highlighted in bold.

By investigating the results in Table 3, it is concluded that CbeAL-6 achieves significantly better performance compared to the benchmarks under most scenarios. With a limited budget, CbeAL-6 can achieve high classification accuracy close to that of using all the training data with a data stream of larger size and higher sparsity (i.e., \(n=1500, sp=70\%\)).

Considering the benchmark methods, US demonstrates its competitive performance compared to other benchmarks, but with higher standard errors. This indicates the importance of exploitation for the online annotation scenarios, and this also explains the superiority of RAL-Agents in Table 2. However, the lack of adaptiveness to the data stream causes its inferior performance compared to CbeAL-6. The inferior performance of DBALStream might attribute to its concentration on samples with both high local density and large margin, which does not perform effective input variable space exploration. On the contrary, the proposed LD-Agent is able to complete this exploration task. QBC-PYP underperforms in comparison to US, primarily due to its inadequate exploitation capability under a highly imbalanced class distribution. During the streaming-in process, it takes a mixture of Gaussians to quantify the ambiguity of the class membership of one sample. The lack of connection between its acquisition decision and the evolving performance of the base learner results in the lack of effective exploitation.

Additional results of other benchmarks on the toy example are included in the supplementary material, where DBALStream and QBC-PYP outperform US and RS in terms of learning performance. The result implies that US will achieve poor performance if the base learner is confident about its classification result at the beginning with the initial training data.

Overall, the results validate that the proposed method can acquire samples effectively and efficiently in an adaptive manner under various circumstances, confirming the benefits of the ensemble of designed exploration-oriented and exploitation-oriented agents.

5 Case Study

The proposed CbeAL method is applied to an FDM process for online quality modeling and inspection [17], which is introduced as the motivation example in Section 1. During the printing process, various in situ process variables (i.e., vibration, nozzle temperature, etc.) are collected in the ICPS to monitor the process and predict the quality of the FDM part [58]. Here, we focus on the layerwise surface roughness as a binary quality indicator, which is judged and annotated as conforming/nonconforming by domain experts. Figure 5 shows the example normal and rough surfaces of the printed FDM part. To enable real-time quality prediction and online modeling, the in situ measurements are registered and divided into 10-second windows as samples. The online updating of the quality model requires experts to consistently observe and examine the surface roughness during the printing process for the window-wise annotation, which is labor-intensive and time-consuming. Therefore, CbeAL is employed to develop an accurate quality model with less labeling effort and high-quality training data through wisely selecting the samples for annotation.

Fig. 5.

We refer [17] for the details of the data collection. During the process, the in situ extruder vibration, table vibration, nozzle temperature and table temperature are measured and collected in a functional data format. Considering the wavelet analysis applied to the functional measures, the process setting variables (i.e., feed/flow ratio and layer thickness) and summary statistics for each functional measurement (i.e., mean, standard deviation, skewness, and kurtosis), 519 features are obtained in total as the model input. 48 FDM parts are printed successfully in total with 1,588 samples (i.e., windows of measurements). Label the nonconforming samples by 1 and the conforming samples by 0. As a common scenario in quality inspection, we find there are 180 nonconforming samples in total, which implies a highly imbalanced class distribution. We also notice that the positive labels appear in succession during the process (i.e., \(\dots 0000011100\dots\)) because the malfunction of the printer in a period of time will affect the quality of several consecutive layers. Therefore, we maintain the original order of samples in sequence when we separate the training and testing set. A logistic regression model with L1 penalty is adopted as the base learner for the quality online modeling. In brief, the highly imbalanced data stream with patterns in sequence and an underlying multimodal distribution (i.e., Figure 1) brings a challenging online annotation scenario.

We evaluate the classification accuracy of the base learner of the proposed CbeAL-6 under different level of budgets (i.e., \(\lbrace 3\%, 5\%, 10\%, 15\%, 20\%\rbrace\)) with 10 replications. In each replication, 1/3 samples are extracted with a random starting point from the whole data set in time order as the testing data set, with the remaining for training. The first 10 samples in the training set will be used for the model pretraining as \(D_0\). The best performed individual agents in Table 1 and the other four benchmarks in Section 4 are employed for the comparison.

Figure 6 demonstrates the averages of the training and testing classification accuracy of the base learners, where the error bars represent the standard errors over 10 replications. The dash-dotted red line represents the testing accuracy of the base learner trained by all training data and the dotted red line is the standard error. Correspondingly, the testing accuracy of the base learner trained by the initial training data is marked with black lines in a similar way. Here, the result of QBC-PYP is not included since it consistently refuses to acquire samples during the streaming process, which might be caused by the pattern of positive labels in the sequence that requires the method to adapt its exploration-exploitation tendency to the data stream.

Fig. 6.

By investigating the results, it can be observed that the proposed method consistently outperforms its incorporated agents and the rest of benchmark methods in testing accuracy under different levels of budgets. The testing classification accuracy of most benchmark methods has an increasing trend as the budget increases from \(3\%\) to \(10\%\) and then the trend goes smoother, which indicates the samples acquired in the \(10\%\) budget make the most of the contribution to the quality modeling. However, the testing accuracy of CbeAL-6 keeps increasing with a higher budget, which validates the continuous acquisition of informative samples. Notably, the testing accuracy of CbeAL-6 exceeds the accuracy of the base learner trained by all available samples when the budget is higher than \(10\%\). This implies, despite the high dimension, a training set with a smaller sample size but better balanced samples has better quality, thus improving the modeling accuracy. Therefore, the proposed method not only reduces the labelling efforts but also contributes to the online modeling performance via acquiring high-quality samples.

In conclusion, the case study verifies that CbeAL achieves a well-balanced exploration-exploitation trade-off during the streaming process in an adaptive manner, which enables the highly accurate online quality modeling with limited labelling efforts for the FDM quality inspection.

6 Conclusions

While the high-speed, high-volume streaming data brought by the ICPS enhance the data-driven decision-making by AI models, the quality of online data may hamper the modeling performance for manufacturing. To provide resilient AI modeling performance, informative samples need to be acquired from the streaming data to provide a high-quality training data set as well as reducing the human annotation efforts required for AI incubation in an online manner. Existing active learning methods cannot balance the exploration-exploitation trade-off in the challenging online annotation scenario. In this work, we propose an ensemble of exploration- and exploitation-oriented active learning agents as CbeAL. With the ensemble of agents considering each objective explicitly and the proposed Exp4.P-EWMA solver, CbeAL adjusts the exploration and exploitation tendency adaptive to different online annotation scenarios. Simulation studies and the case study in FDM processes demonstrate the advantage of CBEAL over benchmarks in the learning accuracy with a limited acquisition budget.

We notice some limitations of the proposed method. First, CbeAL shows its superiority under learning scenarios with a complex input data distribution. Under other generic scenarios without a demanding exploration-exploitation trade-off, CbeAL may lose its advantage due to the encoded explicit exploration objective. In this case, CbeAL can enhance its concentration on exploitation by adding exploitation-oriented agents or removing exploration-oriented agents. Second, the hyperparameters for the agents are optimized with the grid search. We will formulate the hyperparameter tuning as a meta-learning problem for different base learners as future work [61].

The work leaves us with several future research directions. Firstly, the reward function in CbeAL can be adjusted to quantify regression model performance, such as the root mean squared error of importance-weighted training samples [8]. Concurrently, the exploitation-oriented agent’s reward function could be set as either the base learner’s bootstrap uncertainty [23] or the expected model change [12]. These adjustments extend the framework to generic supervised models. Second, we will investigate formulating the budget as a hard constraint to maintain strict control over the acquisition cost. To realize this, we propose modifying our Exp4.P-EWMA algorithm and integrating it with the knapsack bandit algorithm [4] to effectively address this hard constraint. Furthermore, we consider the ensemble of multiple modalities as the second level actions in CbeAL, enabling the learner to decide not only whether to acquire a sample but also its data source, inspired by [66].

Acknowledgments

The authors acknowledge Dr. Xinwei Deng, Ruojun Wang, and Rou Wen for their comments and suggestions to improve the paper.

Supplementary Material

3627821.supp (3627821.supp.pdf)

Supplementary material

Download
4.96 MB

References

[1]

Rebecca Adaimi and Edison Thomaz. 2019. Leveraging active learning and conditional mutual information to minimize data annotation in human activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 3 (2019), 1–23.

Abstract

1 Introduction

2 Related Work

2.1 Online Model Updating in Industrial Cyber-physical Systems

2.2 Data Quality and Data Acquisition Methods

2.3 Exploration and Exploitation in Active Learning

3 Methodology

3.1 Overview of the Proposed Methodology

3.2 The Ensemble Active Learning by Contextual Bandits

3.3 The Exploration and Exploitation Agents

3.3.1 Low-density Based Exploration Agent (LD-Agent).

3.3.2 Space-Filling Based Exploration Agent (SPF-Agent).

3.3.3 Reinforced Exploitation Agent (RAL-Agent).

3.4 Characterization of Agents

4 Numerical Simulation

4.1 Simulation Setup

4.2 A Comprehensive Simulation Study

4.2.1 Compared with Individual Agents.

4.2.2 Scalability Study.

4.2.3 Comparison Study with Benchmark Methods.

5 Case Study

6 Conclusions

Acknowledgments

Supplementary Material

References

Index Terms

Recommendations

INN: An Interpretable Neural Network for AI Incubation in Manufacturing

Contextual Bandits with Cross-Learning

Active Learning for Streaming Data in A Contextual Bandit Framework

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations