Learning from streaming data requires handling changes in distribution, known as concept drift, in order to maintain performance. Adaptive learning approaches retain performance across concept drift by explicitly changing the classifier used to handle incoming observations, allowing changes in distribution to be tracked and adapted to.However, previous methods fail to select the optimal classifier in many common scenarios due to sparse evaluations of stored classifiers, leading to reduced performance. We propose a probabilistic framework, SELeCT, which is able to avoid these failure cases by continuously evaluating all stored classifiers.
The SELeCT framework uses a Bayesian approach to assign each classifier a probability of representing a similar distribution to incoming data, combining a prior probability based on the current state of the system with the likelihood of drawing recent observations. A continuous selection procedure based on the Hoeffding bound ensures that each observation is classified by the classifier trained on the most similar distribution. SELeCT achieves accuracies up to 7% higher than the standard framework, with classifier use matching ground truth concept dynamics with over 90% recall and precision.
This figure shows how the probabilities of each classifier change over time. Notice that recurring concepts trigger resurgences in the probability of the classifier used on the last occurence.
The SELeCT framework retains the representation of concepts as states, and the overall aim of selecting the optimal state to handle each observation, but reformulates drift detection and re-identification as a continuous, probabilistic selection procedure. SELeCT solves the issue of sparse evaluations of states by providing a full probability distribution for all states at every observation and guarantees that the selected state is the optimal achievable state.
For each state, SELeCT computes the probability that the state is optimal for the next observation using a Bayesian framework. A state prior based on current knowledge of the system is combined with a state likelihood based on recent observations. Drift detection is integrated into the prior probability component along with other sources of prior knowledge, such as the current active state. Re-identification is reformulated as calculating the likelihood of drawing recent observations from each state. These components, along with a continuous selection procedure, allow SELeCT to guarantee, within some error bound, that the selected state is the optimal achievable state under a given similarity measure.
- Using python 3.7, install requirements using
pip install -r requirements.txt
- Install SELeCT module with
pip install -e .run inside the SELeCT-master directory.
- (Optional) Place desired datasets into the
RawDatadirectory. This is expected to containRealandSyntheticsubdirectories. These should each contain a directory for each data set to be tested on. The expected format for these is a.csvfile for each ground truth context. The system will work with only a single.csvif context is unknown, but some evaluation measures will not be able to be calculated. For synthetic datasets created with a known generator, an empty directory in theSyntheticdirectory is needed to store files. Each dataset folder should be named the name you will use to call it. The base data directory should be passed as the--datalocationcommandline argument. The dataset name is passed in the--datasetsargument. New datasets will need to be added to the relevent list of allowed datasets inrun_experiment.py,synthetic_MI_datasetsforSyntheticdatasets, orreal_drift_datasetsforRealdatasets. - Run code for demos, discussed below.
- Run main entry point,
run_experiment.py. Set commandline arguments. Most are set to reasonable defaults.
- Set seed using the
--seedscommand. This should be set to the number of runs if--seedactionis set tonew(will create new random seeds) orreuse(will reuse previously used seeds). Or should be a list if--seedactionislist, e.g.--seedaction list --seeds 1 2 3will run 3 runs using seeds 1, 2 and 3. - Set multiprocessing options.
--singlecan be set to turn off multiprocessing for simpler output and ease of cancelation. Or--cpuscan set the desired number of cpu cores to run on. - Set meta-information functions and behaviour sources to disable using
--ifeaturesand--isources. For Quick runs, disablingMIandIMFcan improve runtime significantly.
- Results are placed in
~\output\expDefault\[dataset_name]by default. This can be set with--outputlocationand--experimentname.
- The
results_run_....txtfile contains the calculated performance measures of the run. - The
run_...csvfile contains measures describing how each observation was handled. - The
run_...._options.txtfile contains the options used for a specific run.
The main evaluation entry point is run_experiment.py. This script runs SELeCT on a specified data set.
run_experiment_moa.py runs the same experiment calling a moa classifier on the commandline. Used to test against baselines.
The demo directory gives an example of recreating results from the paper. To recreate the results on the CMC dataset:
- Run the
demo\cmc_testbed.pyfile. This will runrun_experiment.pywith the correct settings to recreate the results from the paper.
- The test by default runs on 45 seeds. To speed up the test using parallel processing, pass the commandline argument
--cpus Xwhere X is the number of CPU cores you wish to use.
- The results of the test will print once done. The individual results for each seed can be found in
output\cmc_testbed\**. - The
--datasetcommandline argument can be passed to run other tests. The other realworld datasets will need the appropriate dataset to be downloaded. Synthetic datasets can be run following the above instructions.
We also include a demo of SELeCT running on our air quality simulation.
- Run the
demo\windSimStream.pyfile. A GUI should appear showing a view of the system over time.