Thomas Swearingen∗ , Will Drevo† , Bennett Cyphers† , Alfredo Cuesta-Infante‡ , Arun Ross∗ and Kalyan Veeramachaneni†
Abstract—In this paper, we present Auto-Tuned Models, or ATM, a distributed, collaborative, scalable system for automated machine learning. Users of ATM can simply upload a dataset, choose a subset of modeling methods, and choose to use ATM's hybrid Bayesian and multi-armed bandit optimization system. The distributed system works in a load-balanced fashion to quickly deliver results in the form of ready-to-predict models, confusion matrices, cross-validation results, and training timings. By automating hyperparameter tuning and model selection, ATM returns the emphasis of the machine learning workflow to its most irreducible part: feature engineering. We demonstrate the usefulness of ATM on 420 datasets from OpenML and train over 3 million classifiers. Our initial results show ATM can beat human-generated solutions for 30% of the datasets, and can do so in 1/100th of the time.

I. INTRODUCTION

One of the most common data science problems is that of deriving a predictive model from a labeled set of raw training data. From the beginning, a data scientist working on such a problem is flooded with choices and contingencies. The data set may comprise images, text, relational data, or a mix of these types. Pre-processing it often requires a number of steps, including data cleaning, feature engineering, and feature selection, before model training can even begin. These pre-processing steps are often domain-specific, and the space of possible options for them is vast. Formulating the correct set of steps, also known as a pipeline, requires iteration, collaboration and even verification.

In most cases, the last step in the pipeline is building and learning a model, given a feature matrix and labels. This last step presents its own choices and challenges, as there are now an overwhelming number of ways to learn a model. Over the years, many methods for classification have been developed, including support vector machines (svm), neural networks (nn), bayesian networks (bn), deep neural networks (dnn), and deep belief networks (dbn). Each method requires numerous hyperparameters to be set before learning a model. A model's performance can also be judged by many possible metrics, including accuracy, precision, recall, and F1-score.

It is important to note that this last step is abstracted away from domain-specific intricacies by relying on a domain-agnostic input data format, the feature matrix. This makes it a good target for automation. The past few years have seen the development of an overwhelming number of automatic method selection and hyperparameter tuning algorithms and systems¹ [1], [2], [3], each purporting to be better than the others in its ability to search the space. Data scientists hoping to take advantage of these advances are once again lost in an enormous option space: which AutoML method should one use?

¹ These algorithms and systems now define a sub-area within the machine learning community, called AutoML.

At the same time, several pragmatic data science needs remain chronically unaddressed. These are: (a) support for parallel exploration, in which multiple data scientists may simultaneously search through the model space for the same dataset but different pipelines (or even different datasets), (b) an ability to extend the model search space, and algorithms that allow for simultaneously searching through methods and their hyperparameters, and (c) abstractions for bringing together these numerous search approaches, so data scientists can explore multiple search methods as needed.

In this paper, we present Auto-Tuned Models, or ATM, a multi-user machine learning service that can be hosted on any distributed computing infrastructure, whether a cloud or a cluster. ATM is a multi-method, multi-parameter, self-optimizing machine learning system for the automation of classification modeling and hyperparameter tuning. By enabling the simultaneous execution of multiple methods
and providing a suite of best possible models, ATM returns the emphasis to the part of the machine learning workflow [...] similar to the way that scikit-learn has demystified and democratized machine learning. Second, it will provide a trusted open-source solution for enterprises to save time and resources and bring back focus to the impactful, domain-specific aspects of the data science endeavor: formulating the precise machine learning task and transforming their data into information-rich features.

Summary of results: We provided ATM with 420 datasets downloaded from the OpenML platform and trained models for several days.

– ATM generated the largest repository of trained models: We learned a total of 3 million models across these 420 datasets, and generated a total of 4 terabytes of files consisting of models and their performance metrics. We will release this data in a structured way and provide APIs to access it. To the best of our knowledge, this will be the largest repository of trained models available to date.

– ATM is able to beat human-generated solutions for 30% of the datasets: Once a dataset is available on the OpenML platform, data scientists can download the data, try different classifiers on their local machines, and upload their results. At the time of writing this paper, we were able to extract the first 500 submissions made to 47 datasets using their APIs. Figure 1 shows the percentage of datasets for which ATM was able to beat the best among these first 500 by using simple grid search or by an intelligent search technique called GP/Bandit.

– ATM is able to perform in 1/100th of the time: For every dataset for which ATM beats the human-submitted solution, we calculated the time difference between the first and the best submission made by humans to the OpenML platform (best within the first 500). On average, this difference was 243 days². A similar calculation reveals that ATM's grid search can generate a solution that beats the best human solution within a few minutes.

² We note that there was no particular incentive for humans to submit or enhance solutions. We also acknowledge that we do not know the skill level of the humans who submitted these solutions.

[Figure 1: Percentage of the 47 datasets for which ATM beat the best of the first 500 human submissions, for the Grid and GP/Bandit configurations (bars at 31.91% and 27.66%).]

Figure 2: The average time required for ATM to find a classifier that outperforms the first 500 human-in-the-loop attempts. The Human-500 bar shows the average time needed for the best-performing classifier to be found by an OpenML user in the first 500 attempts.

II. ATM OVERVIEW

We describe our system from the point of view of an end user. We imagine ATM drawing various categories of users, as detailed below.

Data scientists using ATM can simply upload a dataset, select their desired methods, and choose hyperparameter ranges to search over³. They can decide to use either of ATM's two optimization/model search/selection methods: (1) a hybrid Bayesian and multi-armed bandit optimization system, or (2) a model recommender system that works by exploiting the modeling techniques' previous performances on a variety of datasets. Multiple users can submit multiple disparate datasets to be run simultaneously. They can also submit a different version of the same dataset – for example, one that has undergone a different set of preprocessing and transformation steps. After submitting the dataset, the user can begin getting results (models) via streaming. The system works in a load-balanced fashion to quickly deliver results in the form of ready-to-predict models, confusion matrices, cross-validation results, and training timings. We describe this in detail in Section V.

³ Scaling and dimensionality reduction techniques are available to aid in the discovery process.

AutoML experts can contribute newer search methods by following the abstractions we define for tuning hyperparameters and selecting methods. In our current library, we propose and use two different methods, and provide abstractions so that anyone can modify and propose new ones. We describe this in detail in Section IV.
Figure 3: Overview of the ATM system. A user uploads their dataset to storage and registers it with the ModelHub database. The ATM workers query the ModelHub for classifiers to run and report the results back to the ModelHub.

Machine learning enthusiasts who know more about each of the individual methods used in our library can enhance them by (a) proposing appropriate ranges for their hyperparameters and (b) proposing newer methods or better implementations for them.

The contributions of our work include:

– An asynchronous, distributed system that simultaneously generates multiple classifiers (for multiple datasets), records classification results (including performance, models, and meta-information) and logs any errors as they occur.
– An innovative organization of the complex search space of classifiers. This allows users to extend this space to include newer classification approaches in a structured way.
– A hierarchical algorithm that allows for search at the method level and tunes hyperparameters.
– Abstractions to enable the contribution and integration of disparate AutoML techniques while simultaneously testing them.
– A database repository of several hundred thousand classifiers trained on hundreds of datasets.
– An open source software release at: https://github.com/HDI-Project/ATM

III. RELATED WORK

There are a few popular existing AutoML packages, such as Auto-WEKA [1], auto-sklearn [2], and TPOT [3]. Each package has its own proprietary model search/tuning method and set of hyperparameters. In general, these current AutoML routines present a few problems. First, they are developed as single-user systems run on local machines, as opposed to multi-user systems run in a cloud-based environment. Second, they do not track the current state of runs, including previously run experiments, results of those experiments, parameters for those experiments, etc. Third, they do not organize the solution space in an intelligent, search-supporting way. Fourth, they focus tuning solely on one metric, even though the user may be more concerned with a different metric.

The preponderance of existing work focuses on the classifier part of the pipeline, which accepts as its input a set of features with labels. Thornton et al. released the first major AutoML system, Auto-WEKA [1]. They develop the Combined Algorithm Selection and Hyperparameter optimization (CASH) objective function, and use two tree-based Bayesian optimization methods (SMAC and TPE) to solve the problem. With the goal of creating an AutoML system to remove the parameter tuning, they build their system around WEKA [4]. WEKA itself is a suite of machine learning algorithms intended for a large audience, with the caveat that parameters must be tuned using a trial-and-error process, which may be difficult for people with limited knowledge of machine learning to grasp. With the addition of Auto-WEKA, the parameter estimation stage of the ML pipeline is automated, and machine learning is more accessible to non-experts.

Feurer et al. later proposed the auto-sklearn system, which makes two substantive improvements to Auto-WEKA [2]. First, they add a meta-learning step to initialize the parameters, which they find improves the final classification performance. Second, they do not discard models once trained and tested on the data, but instead store them and use them all to build an ensemble classifier at the end. They replace WEKA with sklearn, which has become the dominant machine learning software package. They also follow the CASH model developed in Auto-WEKA, but only use the SMAC Bayesian optimization method, as it performed better than TPE [1].

There have been other approaches to and extensions of AutoML. Salvador et al. propose one such extension, for time-varying data [5]. They combine Bayesian optimization with four different adaptive strategies, which leads to improved predictive performance in the majority of test datasets. Olson et al. eschew the CASH objective function and instead evolve the machine learning pipeline using genetic algorithms [3]. They demonstrate the "evolved" machine learning pipeline's potential, and show that for some datasets, it can even outperform primitive machine learning. This is especially interesting because there is no strict, human-derived pipeline in the problem formulation – instead, the pipeline is learned by the genetic algorithm.

One drawback of AutoML can be the cost of performing multiple runs of the training stage. Some works circumvent this issue by using a meta-learning approach to recommend a classifier using a process called algorithm selection [6]. There is an offline setup stage where each base-AutoML procedure is run on a set of preselected datasets. A classifier is then recommended for an unseen query dataset. They find some correlation between the query dataset and the preselected set of datasets, with performances based on a few randomly selected classifier/parameter pairs on the query dataset.
Abdulrahman et al. evaluate this type of work when there is missing meta-information [7].

There have also been many studies which target specific parts of the AutoML pipeline. Dewancker et al. improve the AutoML pipeline with more sophisticated acquisition functions [8]. Kim et al. propose candidate parameter sets with a Random Space Partitioning Optimizer [9]. Kim et al. improve the scalability of a structure discovery algorithm [10]. Li et al. discretize the entire hyperparameter space and then apply a multi-armed bandit scheme to select the hyperparameters [11].

IV. DEFINING THE SEARCH SPACE

In order to explore the space of possible model pipelines for a given problem, we must first have a way to programmatically enumerate the sets of hyperparameters which describe the models. This is difficult because hyperparameters are hierarchical, and values selected at higher levels affect which choices must be made at lower levels. At the highest level, the choice of which modeling method to use affects which hyperparameters must be specified for that model. For example, a support vector machine (SVM) is described by a different set of hyperparameters than a decision tree. Furthermore, certain hyperparameters for a given model affect which other hyperparameters must be chosen. Consider a data scientist who is training an SVM. First she must choose which kernel to use. If she chooses a polynomial kernel, she must then specify the value for a discrete-ordinal hyperparameter, degree. If she chooses a sigmoid kernel instead, she must specify values for two additional continuous hyperparameters, γ and intercept. Therefore, choosing a set of hyperparameters is not as simple as sampling from a fixed vector space, even for a given type of model. We define three levels of hyperparameter:

methods: At the highest level, a modeling method must be selected first. ATM includes support for several common modeling methods, including support vector machines, deep belief networks, random forests, and naive Bayes classifiers. Table I lists the supported methods.

method hyperparameters: For a given modeling method, several method hyperparameters must be specified no matter what. Hyperparameters may be categorical choices, such as which kernel or which transfer function to use; discrete-ordinal choices, like the number of epochs or the number of hidden layers; or continuous-valued choices, like the learning rate or soft margin.

conditional hyperparameters: Some hyperparameters must only be specified if certain choices for method-level hyperparameters are made. These are conditional hyperparameters. For example, when a Gaussian kernel is chosen for an svm, a value for σ – the standard deviation of the kernel – needs to be specified.

We define two concepts to help explain the hyperparameter search space.

Definition 1: A conditional parameter tree (CPT) expresses the combined hyperparameter space for a given modeling method as a search tree. The root node of the CPT identifies the model-building method, and each node below the root represents a hyperparameter. Certain hyperparameter nodes also act as branches, with child nodes beneath them. Branch nodes are always categorical variables; the value chosen for a branch node determines which conditional hyperparameters below it must be specified. Nodes which do not have children are leaves. Method hyperparameters always descend directly from the root, while conditional hyperparameters always descend from branch nodes.

Figure 4: Hyperparameters and hyperpartitions within support vector machines. The leftmost diagram shows the full CPT, while the subfigures in the center and on the right show hyperpartitions 1 and 2. In hyperpartition 1, degree (φ) is a tunable parameter that is conditioned on the kernel being polynomial. In hyperpartition 2, γ and intercept are tunable parameters that are conditioned on the kernel type being sigmoid. Thus, kernel and soft margin are method hyperparameters, while degree, γ, and intercept are conditional hyperparameters.

A CPT can be traversed to enumerate all possible models for a given method, essentially defining its search space. Figure 4 shows a typical CPT for a support vector machine. There are two method hyperparameters – a continuous-valued soft margin, and a categorical-valued kernel. The kernel is represented as a branch node. The choice of kernel determines which conditional hyperparameters must be specified. If kernel is polynomial, degree must be specified; if sigmoid is chosen, both gamma and intercept must be specified instead.

Definition 2: A hyperpartition is a subset of a CPT which consists of choices for all branch nodes on a path from the root to the leaves. A hyperpartition fully defines a set of tunable conditional hyperparameters.
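To make Definitions 1 and 2 concrete, the sketch below encodes the SVM CPT from Figure 4 as a small Python structure and enumerates its hyperpartitions by fixing every categorical branch node. This is illustrative only: the node layout, value ranges, and function names are our own assumptions, not ATM's actual data structures.

# Illustrative sketch (not ATM's actual data structures): a minimal CPT for the
# SVM method from Figure 4, and a routine that enumerates its hyperpartitions.
from itertools import product

svm_cpt = {
    "method": "svm",
    "hyperparameters": {
        "c": {"type": "continuous", "range": (1e-5, 1e5)},   # soft margin (leaf)
        "kernel": {"type": "categorical",                    # branch node
                   "values": {
                       "linear": {},
                       "rbf": {"gamma": {"type": "continuous", "range": (1e-5, 1e5)}},
                       "sigmoid": {"gamma": {"type": "continuous", "range": (1e-5, 1e5)},
                                   "intercept": {"type": "continuous", "range": (-1.0, 1.0)}},
                       "polynomial": {"degree": {"type": "int", "range": (2, 5)}},
                   }},
    },
}

def enumerate_hyperpartitions(cpt):
    """Fix every categorical branch node; each combination is one hyperpartition."""
    branches = {name: spec for name, spec in cpt["hyperparameters"].items()
                if spec["type"] == "categorical"}
    tunables = {name: spec for name, spec in cpt["hyperparameters"].items()
                if spec["type"] != "categorical"}
    partitions = []
    for combo in product(*[spec["values"].items() for spec in branches.values()]):
        fixed = dict(zip(branches.keys(), [value for value, _ in combo]))
        conditionals = {}
        for _, children in combo:
            conditionals.update(children)
        partitions.append({"method": cpt["method"], "fixed": fixed,
                           "tunables": {**tunables, **conditionals}})
    return partitions

partitions = enumerate_hyperpartitions(svm_cpt)
print(len(partitions))  # 4 hyperpartitions, matching the "# hyp" entry for SVM in Table I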
Table I: Table showing the list of methods, their method hyperparameters ("m-hyp"), and conditional hyperparameters ("c-hyp"). The table also shows the number of hyperpartitions per method ("# hyp"). Categorical hyperparameters have the set of possible values; all other hyperparameters are continuous. Abbreviations used in the table are expanded as follows: SVM ← support vector machine, RF ← random forest, ET ← extreme trees, DT ← decision tree, SGD ← stochastic gradient descent, PA ← passive aggressive, KNN ← k-nearest neighbors, LR ← logistic regression, GNB ← gaussian naive bayes, MNB ← multinomial naive bayes, BNB ← bernoulli naive bayes, GP ← gaussian process, MLP ← multi-layer perceptron, DBN ← deep belief network.

Method | m-hyp | c-hyp | # hyp
SVM | kernel={linear, rbf, sigmoid, polynomial}, C | gamma, coef0, degree | 4
RF | criterion={gini, entropy}, min-samples-leaf, max-features, min-samples-split, max-depth | None | 2
ET | criterion={gini, entropy}, min-samples-leaf, max-features, min-samples-split, max-depth | None | 2
DT | criterion={gini, entropy}, min-samples-leaf, max-features, min-samples-split, max-depth | None | 2
SGD | loss={hinge, modified-huber, log, squared-hinge}, learning-rate={optimal, constant}, alpha, fit-intercept={False, True}, eta0, n-iter, penalty={l1, l2, elasticnet}, l1-ratio | None | 48
PA | loss={hinge, squared-hinge}, C, n-iters | None | 2
KNN | weights={uniform, distance}, n-neighbors, algorithm={brute, kd-tree, ball-tree}, metric={euclidean, manhattan, minkowski, chebyshev} | p, leaf-size | 24
LR | penalty={l1, l2}, fit-intercept={True, False}, C, tol | None | 4
GNB | None | None | 1
MNB | alpha, fit-prior | None | 1
BNB | binarize, alpha, fit-prior | None | 1
GP | kernel={constant, rbf, matern, rational-quadratic, exp-sine-squared} | nu, length-scale, alpha, periodicity | 5
MLP | num-hidden-layers={1, 2, 3}, activation={relu, logistic, identity, tanh}, solver={lbfgs, sgd, adam}, alpha | layer-1-hidden-size, layer-2-hidden-size, layer-3-hidden-size, beta-1, beta-2, learning-rate={constant, invscaling, adaptive}, learning-rate-init | 84
DBN | num-hidden-layers={1, 2, 3}, epochs, learn-rates, output-activation-function={softmax, sigmoid, linear, tanh}, learn-rate-decays, learn-rate-pretrain | layer-1-hidden-size, layer-2-hidden-size, layer-3-hidden-size | 12
These abstractions allow us to break the process for model selection into two steps. First is hyperpartition selection, in which the CPT is traversed and a value is specified for every branch node encountered. Next is hyperparameter tuning. Once a hyperpartition has been chosen, the remaining unspecified parameters can be selected from a fixed vector space. The vector space may have continuous elements as well as discrete elements, and is likely to have some combination of both. The process of selecting hyperparameters from the space defined by a hyperpartition can be optimized with Gaussian Processes and other sampling methods, as described in Section VI. An example of a full set of hyperparameter choices is as follows:

{
  "method": "svm",
  "method-hyperparameters": {
    "c": 0.87,
    "kernel": "polynomial"
  },
  "conditional-hyperparameters": {"degree": 3}
}

Figure 4 shows two distinct hyperpartitions for a support vector machine CPT.

V. ATM: AUTO-TUNED MODELS

We present ATM, a distributed, collaborative, scalable system for hyperparameter tuning and model generation. ATM consists of a cluster of workers which train and test predictive models with different sets of hyperparameters and sync up with a master database, the ModelHub. ATM can process multiple prediction problems and multiple datasets at once. ATM's worker cluster is persistent: a data scientist can decide to upload specifications for a variation of an unfinished problem or a different problem altogether while ATM is running, and workers will adopt the task on the fly.

A. Components

ATM comprises the following components:

Worker cluster: A worker is a persistent process which runs on a compute instance (in our tests, an EC2 instance). Workers execute model training and communicate with the rest of the system by appending to a central, immutable log in the ModelHub, as described below. When activated, a worker queries the ModelHub for a task to work on. It chooses an incomplete datarun from the database based on the dataruns' priority scores, then queries for the results of
all classifiers that have already been trained on the datarun. It runs hyperpartition selection to choose a set of partition hyperparameters, then runs hyperparameter tuning to select values for the remaining unspecified hyperparameters. Once it has chosen a full set of hyperparameters, it trains and tests a classifier to produce a model and various performance metrics. It saves the model and the metrics to storage (in our test, Amazon S3) and logs the results to the ModelHub. We implemented all workers in Python 2.7, and we used the scikit-learn library for specific classifier method implementations. Algorithm 1 presents the algorithmic logic within a worker node.

ModelHub: The ModelHub is a central database that stores information about data sets, task configurations, hyperpartitions, and previously-trained classification models, including links to the models themselves and information about their performance metrics. Our implementation of ModelHub uses a MySQL database in conjunction with Amazon S3 to store this information. The ModelHub does not perform any actual computation on the classifiers; its role is to provide a single, consistent log which all of the workers can query and with which they can sync their results. The data structures it holds are described in more detail below.

The ModelHub database contains four tables, each representing a logical data structure. Together they describe all the classifiers that have been tested for as long as the ModelHub has been running. Figure 5 shows the database schema, and the tables are summarized below.

• datasets: A dataset represents a set of train/test data. The table contains its name, paths to train and test data files (if applicable), and the name of the column containing the label. All datasets must be in tabular format. Currently, our implementation only accepts data in csv format.

• dataruns: A datarun is the logical abstraction for a single run of a machine-learning problem. The table stores configuration and settings specified by the user, including the methods for hyperpartition selection (e.g. "UCB1") and hyperparameter tuning (e.g. "Gaussian Process"), and their associated parameters. Each datarun has an associated budget, which tells the worker how much work to perform. If budget_type == "time", the column budget_amount bounds the amount of wall clock time workers may spend on the datarun; if budget_type == "classifiers", it bounds the total number of classifiers which may be computed. The priority column tells the workers which runs to work on first.

• hyperpartitions: A hyperpartition is a set of hyperparameters which define a path through a CPT, as described in Section IV. The table contains the fixed values for the hyperparameters corresponding to the branch nodes in the CPT, as well as data about the unspecified conditional hyperparameters. Each hyperpartition is linked to a datarun.

• classifiers: A classifier represents a model trained on a dataset with a single, fully-qualified set of hyperparameters. It is also the smallest logical unit of work that can be assigned to a worker. Once a classifier is finished, model_location and metrics_location point to URLs in the cloud where the model and metrics objects for the classifier are stored. The status column can be set to "started", "completed", or "errored"; if it is set to the last, an error message is also stored.

To summarize, when a data scientist wants to solve a prediction problem with ATM, he or she will upload a set of data and create a corresponding entry in the datasets table. They will then create a new datarun which includes configurations for budget, priority, and hyperparameter selection and links to the dataset. The script for registering a datarun automatically registers each possible hyperpartition based on the datarun's configuration. Once the datarun has been registered, workers will automatically begin training, testing, and saving classifiers for the datarun, and will continue until the datarun's budget has been exhausted.

Algorithm 1 ATMWorker
1: procedure ATMWorker
2:   while True do
3:     D = ModelHub.GetDataruns()
4:     for d ∈ D do
5:       if IsBudgetExceeded(d) then
6:         D = D \ d
7:       end if
8:     end for
9:     if D = {∅} then
10:      Sleep()    ▷ Wait for dataruns to be added
11:      Continue
12:    end if
13:    d = SelectPriorityDatarun(D)
14:    ᾱ_p^j ← Search(d.dataset_id, ModelHub.classifiers)
15:    M_p ← LearnClassifier(ᾱ_p^j)
16:    y_p^j ← Evaluate(M_p, d.dataset_id)
17:    ModelHub.StoreResult(M_p, y_p^j, ᾱ_p^j)
18:  end while
19: end procedure

B. Outputs

ATM generates three different types of files during its operation: data, models, and metrics. We used Amazon's S3 service for storage. Workers and end users can access the S3 bucket with credentials established during the initial setup. Labeled data files must be provided by the user, but ATM performs one-time preprocessing and cleaning on the data to convert it into a consistent csv format. For each data and model pair (dataset_id, classifier_id), the following additional elements are stored:

– Metrics: To enable post-hoc analysis of the models, we store a set of several metrics generated by each round of a k-fold cross-validation on the available training data.
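As a rough illustration of the storage step described above, the snippet below pickles a trained model and uploads it together with its metrics to S3 under a hash-based key built from the method name, hyperparameters, and dataset path. The bucket name, key layout, and helper names are assumptions for illustration, not ATM's actual storage code.

# Hypothetical sketch of persisting a classifier and its metrics to S3.
import hashlib
import json
import pickle

import boto3


def model_key(method, hyperparameters, dataset_path):
    """Unique identifier from the method name, hyperparameter settings, and dataset path."""
    payload = json.dumps([method, hyperparameters, dataset_path], sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def store_results(model, metrics, method, hyperparameters, dataset_path,
                  bucket="atm-models"):
    s3 = boto3.client("s3")
    key = model_key(method, hyperparameters, dataset_path)
    s3.put_object(Bucket=bucket, Key=key + ".pkl", Body=pickle.dumps(model))
    s3.put_object(Bucket=bucket, Key=key + "-metrics.json",
                  Body=json.dumps(metrics).encode("utf-8"))
    return key   # recorded in the classifiers table as model_location / metrics_location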
Figure 5: The ModelHub database schema. The datasets table has fields dataset_id, name, train_path, test_path, and class_column. The hyperpartitions table has fields hyperpartition_id, datarun_id, method, partition_hyperparameter_values, and conditional_hyperparameters. The dataruns table has fields datarun_id, dataset_id, hyperpartition_selection_scheme, hyperparameters_tuning_scheme, ts, r_minimum, priority, start_time, end_time, budget_type, and budget_amount. The classifiers table has fields classifier_id, datarun_id, hyperpartition_id, model_location, metrics_location, cv_judgment_metric, test_judgment_metric, hyperparameters_values, start_time, end_time, status, and error_message.
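For readers who want a concrete picture of the classifiers table above, the following is a minimal SQLAlchemy sketch of it. The column types and class name are our assumptions based on the field names in Figure 5; ATM's actual MySQL schema may differ.

# Hypothetical sketch of the ModelHub "classifiers" table; types are assumed.
from sqlalchemy import Column, DateTime, Float, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Classifier(Base):
    __tablename__ = "classifiers"

    classifier_id = Column(Integer, primary_key=True)
    datarun_id = Column(Integer)            # which datarun requested this classifier
    hyperpartition_id = Column(Integer)     # fixed categorical choices used
    model_location = Column(String(300))    # URL of the pickled model on S3
    metrics_location = Column(String(300))  # URL of the stored metrics object
    cv_judgment_metric = Column(Float)      # judgement metric averaged over folds
    test_judgment_metric = Column(Float)
    hyperparameters_values = Column(Text)   # full hyperparameter set, serialized
    start_time = Column(DateTime)
    end_time = Column(DateTime)
    status = Column(String(20))             # "started", "completed", or "errored"
    error_message = Column(Text)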
Table II: Metrics stored on S3 for each classifier trained. PR ← precision-recall, ROC ← receiver operating characteristic, AUC ← area under the curve.

Problem | Judgement metric (across folds) | Curves (for each fold) | Metrics (for each fold)
Binary | F1 | PR, ROC | Accuracy, F1-score, AUC-PR, AUC-ROC
Multi (3, 4, or 5 classes) | µ(F1c) − σ(F1c) | PR/class, ROC/pair | F1/class, AUC/class, AUC/pair
Big-Multi (> 5 classes) | µ(F1c) − σ(F1c) | - | Rank-N, F1/class

This allows data scientists to assess models based on the consistency of their performance as well as on their overall performance, and by a variety of different metrics. Data sets vary in number of classes, class balance, and number of training examples, which makes it impossible to decide on one universal "best" metric. While designing ATM, we consulted several data science experts about what metrics would be best. There was no agreement about an ideal subset of metrics, so we decided to include a wide range. The metrics computed for each classifier are shown in Table II. The judgement metric is a statistic computed for each fold, and its average across folds is stored in the ModelHub.classifiers database. For binary class problems, we use the F1 score achieved for each fold and calculate its average. For multi-class problems, for each fold, we (i) calculate the F1 scores for every pairing of classes, (ii) calculate the average and standard deviation of these F1 scores, and (iii) finally subtract the standard deviation from the mean. This judgement metric is calculated for every fold, and an average of this statistic across all folds is stored in the database.

– Final trained model: After performing cross-validation testing on the dataset, the worker trains a final model using all the training data available. This model is serialized as a Python pickle (.pkl) file. The file is saved to the cloud and is uniquely identified using a hash of the method name, the hyperparameter settings, and the dataset path.
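As a concrete illustration of the multi-class judgement metric µ(F1c) − σ(F1c) described above, the snippet below computes it for a single cross-validation fold using pairwise F1 scores. It is a minimal sketch under our own assumptions about how the class pairs are scored, not ATM's exact implementation.

# Minimal sketch: judgement metric mu(F1_c) - sigma(F1_c) for one fold.
from itertools import combinations

import numpy as np
from sklearn.metrics import f1_score


def judgement_metric(y_true, y_pred, labels):
    """Mean pairwise F1 minus its standard deviation for a multi-class fold."""
    pair_scores = []
    for a, b in combinations(labels, 2):
        mask = np.isin(y_true, [a, b])          # keep only samples of the two classes
        if not mask.any():
            continue
        pair_scores.append(
            f1_score(y_true[mask] == a, y_pred[mask] == a)  # binary F1 for the pair
        )
    pair_scores = np.asarray(pair_scores)
    return pair_scores.mean() - pair_scores.std()


y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 0, 1, 2])
print(judgement_metric(y_true, y_pred, labels=[0, 1, 2]))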
VI. AUTOMATIC SEARCH

Combining the search spaces of multiple modeling methodologies and hyperpartitions creates a number of challenges with regard to finding the best model, either in the isolation of one methodology or from a hyperpartition. In particular, one encounters:

Discontinuity and non-differentiability: Categorical hyperparameters make the search space non-differentiable, and do not yield to simple search techniques (e.g. hill climbing) or to methods that rely on learning about the search space (e.g. Bayesian optimization approaches).

Varying dimensions of the search space: By definition, conditional hyperparameters imply that the hyperpartitions within a methodology have different dimensions. Because choosing one categorical variable over another can imply a different set of hyperparameters, the dimensionality of a hyperpartition also varies.

Non-transferability of methodology performance: Unfortunately, when conducting searches among modeling methodologies, robust heuristics are limited. Training on the dataset with an svm model gives us no information about how well a dbn might perform.
Table III: Notation table.

Symbol | Description
t = 1 ... T | index into classifiers trained so far, ordered by time
i | index into datasets
j = 1 ... J | index into hyperpartitions
Y_j = y_1^j ... y_T^j | judgement metrics for classifiers tried so far for the jth hyperpartition
ᾱ^j | a hyperparameter set for a classifier for the jth hyperpartition
ᾱ_1^j ... ᾱ_T^j | hyperparameter sets for classifiers tried so far for the jth hyperpartition
z_j | reward for hyperpartition j
f_j | a meta model learned so far for hyperpartition j
ᾱ_p^j | proposed hyperparameters for hyperpartition j

Algorithm 2 ATM – Search algorithm
1: procedure Search(Dataset ← D_i, ModelHub.classifiers)
2:   Get judgement metrics y_1^j ... y_T^j ∀ j | D_i
3:   for j ∈ J do
4:     Evaluate-Reward: z_j ← Q(y_1^j ... y_T^j)
5:   end for
6:   Select hyperpartition:

       s_j = z_j + √(2 ln n / n_j)    (1)

     where j is the hyperpartition index, z_j is the reward gained from pulling the j-th arm n_j times, and n = Σ_{j=1}^{J} n_j over all J hyperpartitions.

       Select j = argmax_j s_j    (2)

7:   For selected j:
8:     Get hyperparameters ᾱ_1...T^j corresponding to y_1...T^j
9:     if T ≥ r_minimum then
10:      f_j ← Fit((ᾱ_1^j, y_1^j) ... (ᾱ_T^j, y_T^j))
11:      ᾱ_p^j ← Propose(f_j)
12:    else
13:      ᾱ_p^j ← Random(j)
14:    end if
15:    return ᾱ_p^j
16: end procedure

With a conditional parameter space framework in place, we now design automatic meta-learning techniques that iteratively select among hyperpartitions and tune hyperparameters. In our system, we use a bandit-based method to select amongst hyperpartitions and a meta-modeling technique to tune hyperparameters for a given hyperpartition. Algorithm 2 presents the combined bandit-based selection and metamodel-based tuning algorithm.

A. Hyperpartition selection using bandit learning

We employ bandit learning strategies which model each hyperpartition as an arm in a multi-armed bandit (MAB) framework.

Concept of reward: Given that each arm, when pulled, provides a randomized reward drawn from a hidden underlying distribution, the goal of a MAB problem is to decide which arm to pull to maximize long-term reward, re-evaluating the decision after each pull's reward is observed. In the context of ATM, UCB1 treats each hyperpartition as an arm with its own distribution of rewards. As time goes on, the bandit learns more about the distribution, and balances exploration and exploitation by choosing the most promising hyperpartitions to form classifiers.

Concept of memory: Memory (moving window) strategies exist in order to change the bandit formulation when reward distributions are non-stationary. The UCB1 algorithm assumes that the underlying distribution of rewards at each arm choice is stationary. In the context of ATM, during model optimization, as the meta model (GP) continues to learn about the hyperpartitions and is able to discover parameterizations that continue to score better and better, it essentially shifts the bandit's perceived reward distribution. The distributions of classifier performance from the set of all parameterizations within that hyperpartition have not changed, but the bias with which the GP samples has changed. This could effectively lead a hyperpartition to be selected based on stale rewards. Memory strategies have a parameter t_s that determines how many previously trained classifiers to use to calculate the rewards.

Given the ModelHub database, a worker first queries and aggregates the cross-validation scores y_1^j ... y_T^j for all the classifiers trained for a hyperpartition j thus far, and the corresponding hyperparameter sets for each of the classifiers, given by ᾱ_1^j ... ᾱ_T^j. Treating each possible hyperpartition as an arm, we next calculate the reward "accumulated" for each hyperpartition.

Calculation of reward Q: First, a subset of classifiers, corresponding to the number t_s, is selected by either picking the t_s most recently trained classifiers, or the best t_s classifiers from the ones trained so far for this hyperpartition. These selected t_s scores are denoted as ys^j.

Reward based on average: This is the traditional way to define rewards. The reward is calculated by simply averaging these scores, given by:

    z_j^T = (1 / |t_s|) Σ_{i=1}^{|ys^j|} ys_i^j    (3)

with |·| being the cardinality of a set.

Reward based on velocity: Velocity strategies seek to rank hyperpartitions by their rate of improvement. This is based on the idea that a hyperpartition whose last few evaluations have made large improvements should be exploited while it continues to improve. To calculate velocity, we first sort the scores in ys^j in ascending order. Using velocity, the reward is changed to:

    z̄_j^T = (1 / |t_s|) Σ_{i=2}^{|ys^j|} Δys_i^j    (4)

for Δys_i^j = ys_i^j − ys_{i−1}^j.
Velocity strategies are inherently very powerful because they introduce a natural feedback mechanism that controls exploration and exploitation. A velocity optimization strategy will explore each hyperpartition arm until that arm begins improving less quickly than others, going back and forth between hyperpartitions and wasting no time on hyperpartitions which are not promising.

Selection of a hyperpartition: Given the rewards z_j for each of the arms, each arm's (hyperpartition's) score is calculated using

    s_j = z̄_j + √(2 ln n / n_j)    (5)

where n is the total number of classifiers trained so far across all hyperpartitions for this dataset and n_j is the total number of classifiers trained for this hyperpartition. The hyperpartition is selected based on argmax_j s_j.

B. Hyperparameter tuning using meta modeling

Within a hyperpartition, we employ a Gaussian Process (GP)-based meta-modeling technique to identify the best hyperparameters given the performance of classifiers already built for that hyperpartition [12]. To accomplish this, we propose a Fit, Propose abstraction.

Fit: In Fit, a meta model f_j is learned between the hyperparameter sets tried so far, ᾱ_1^j ... ᾱ_T^j, and their corresponding scores, y_1^j ... y_T^j.

Propose: In Propose, a new hyperparameter set ᾱ_p is proposed. This usually involves creating candidate hyperparameter sets, making predictions using the learnt model f_j, and choosing among those candidates by applying an acquisition function to their predictions. ATM supports two acquisition functions: Expected Improvement, and Expected Improvement per Time [12].
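Below is a compact sketch of the Fit/Propose loop with an Expected Improvement acquisition function, using scikit-learn's Gaussian process regressor. It is a simplified illustration under our own assumptions (random candidate generation, default kernel), not ATM's Tuner code.

# Simplified Fit/Propose sketch with Expected Improvement; not ATM's Tuner.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor


def fit(alphas, y):
    """Fit: learn a meta model f_j from hyperparameter vectors to scores."""
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(np.asarray(alphas), np.asarray(y))
    return gp


def propose(gp, candidates, best_y):
    """Propose: score random candidates by Expected Improvement, return the best."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)          # avoid division by zero
    z = (mu - best_y) / sigma
    ei = (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)
    return candidates[int(np.argmax(ei))]


# Hyperparameter vectors (e.g. [C, gamma]) tried so far and their judgement metrics.
alphas = [[0.1, 0.01], [1.0, 0.10], [10.0, 0.05]]
y = [0.62, 0.71, 0.68]

gp = fit(alphas, y)
candidates = np.random.uniform([0.01, 0.001], [100.0, 1.0], size=(1000, 2))
print(propose(gp, candidates, best_y=max(y)))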
C. Abstractions in our software

One of the contributions of our work is to break the automatic machine learning process into two distinct sub-problems: hyperpartition selection (via multi-armed bandit methods) and hyperparameter tuning (via Gaussian process-based methods). While we provide several versions of both bandit-based selection and GP-based tuning mechanisms, there are many ways these approaches could be improved upon for different circumstances.

The problem of hyperpartition selection is essentially one of picking between multiple discrete choices, each of which is associated with a set of scores representing its past performance. Multi-armed bandit algorithms, used for this, usually involve first computing a set of rewards for each arm/choice, then using the set of all rewards to determine which arm to pull next. To tackle this problem, we define a Selector interface with the following methods:

• compute_rewards: Accepts a list of scores pertaining to a single choice and returns a list of rewards.
• bandit_select: Accepts a mapping of choices to lists of rewards, and returns the choice which it believes will maximize the expected reward and minimize expected regret.
• select: Accepts a mapping of choices to historical performance scores, calls compute_rewards and bandit_select in sequence, and returns a single recommended choice.

Hyperparameter tuning involves taking a set of hyperparameter vectors, ᾱ_1...T, and a set of corresponding scores, y_1...T, and generating a new hyperparameter vector which is expected to improve on the previously-achieved y-values. For this problem, we define a Tuner interface with the following methods:

• fit: Accepts historical hyperparameter performance data (ᾱ_1...T and y_1...T) and trains a model which can predict the performance of arbitrary hyperparameter vectors, e.g. a Gaussian process.
• predict: Accepts a set of proposed hyperparameter vectors and returns the predicted performance for each one.
• acquire: Accepts the predicted performances of a set of proposed hyperparameter vectors and returns the index of the vector which it believes an ATM worker should try next.
• create_candidates: Generates a list of random hyperparameter vectors, to be passed to predict and then to acquire.
• propose: Calls create_candidates, predict, and acquire in sequence and returns a single proposed set of hyperparameters.

Our system includes several implementations of both Selectors and Tuners by default. However, we also expect that AutoML experts will want to expand and improve upon our methods. An expert who wishes to contribute can do so by creating a subclass of Selector or Tuner and modifying one or more of the methods above. He or she can then use our testing functionality to automatically evaluate its efficacy.

VII. DATASETS

To demonstrate ATM, we use datasets from OpenML [13]. OpenML is a website (www.openml.org) that allows users to upload datasets for public viewing and track the results of various machine learning pipelines. OpenML has over 19,000 datasets available for download. The uploader specifies details about the dataset, such as which attribute must be predicted or which, if any, attributes should be ignored, the authors, citation information, collection date, or license information. Currently, the datasets must be uploaded in Attribute-Relation File Format.
Machine learning problems are specified in OpenML in what are called "tasks." A task defines whether the end result is classification or regression, what metrics must be returned, and what protocol to use (e.g., a classification task would typically include cross-validation). A potential solution for a task is called a "flow," which is a machine learning pipeline. An example could be: Data → PCA → SVM → Report Predictive Accuracy. A flow specifies default values for the hyperparameters of a pipeline.

OpenML tracks "runs," which are flows applied to a task – that is, a run is a fully-specified (i.e., with specific hyperparameter values) machine learning pipeline applied to a particular task. The hyperparameters may take on the default values from the flow, or they may be updated by the user to values he or she thinks are appropriate. This pipeline is executed on a user's personal computing hardware to generate results as specified in the task (e.g., predictive accuracy or F-measure). These results are uploaded to OpenML through the API. The ability to see previous performance for a dataset is especially helpful, so that other users do not have to rerun flows to determine what does or does not yield good performance. A user simply looks at the performance of previous flows to generate a potential new flow or different hyperparameter values that he or she thinks may improve performance.

A user can traverse the various runs through the API. A user can search by dataset, flow, or uploader. If we search by dataset, we can find the best performing run for a particular dataset. Since OpenML flows are designed by people, we can find a human baseline performance to compare our AutoML results against.

Selecting datasets for our experiment: We only select datasets which (1) are active, (2) have no missing values, and (3) have 2 to 4 classes. We exclude datasets which have missing values, as incomplete-data classification is not the goal of this work. After applying our filtering criteria, we downloaded 420 datasets from the OpenML website. The datasets fall within a wide range of domains, including cancer risk detection, diabetes detection, credit risk detection, car acceptability, chess, tic-tac-toe, and spam detection. Figure 6 shows additional information about the datasets.

Figure 6: Graph showing the number of instances (top), number of features (middle), and number of classes (bottom) in each dataset. The datasets were sorted by number of instances.

VIII. EXPERIMENTAL SETTINGS

We attempted to run ATM on the datasets using two selection schemes:

Grid Run: In this configuration, the hyperpartitions are selected randomly. The hyperparameter range is split into 3 values (start of range, end of range, and middle of range). One of the three values is then chosen randomly for each hyperparameter. For each dataset, we set a budget of 1,000 classifiers. Since there are 420 datasets, we attempt to learn 420,000 classifiers. Each classifier was trained using 10-fold cross-validation, thus our goal is to learn a total of 4.2 million models.

GP+Bandit Run: We run ATM with Best-t_s Velocity for hyperpartition selection and Gaussian Process-Expected Improvement for hyperparameter tuning. For each dataset, we set a budget of 100 classifiers. Since there are 420 datasets, we attempt to learn 42,000 classifiers.

IX. RESULTS AND DISCUSSION

Figure 7 shows the F1-score histograms of the best-performing classifiers for all datasets – both for the Grid run and the GP+Bandit run. In the GP+Bandit run, we attempted to learn 41,647 classifiers across all datasets, of which 39,342 were successfully learned while 2,305 logged errors. For the Grid run, we attempted to learn 286,577 classifiers across all datasets, of which 270,331 were successfully learned while 16,246 logged errors. Overall, we are able to achieve good classification performance for many datasets.

Figure 7: Histogram of the best F1-scores for each dataset using the Grid configuration (left) and GP+Bandit configuration (right).

Comparison to human baselines: Regardless of whether a Grid or GP+Bandit strategy was used, the process is automatic, with no human involvement. We next compare this with a "human-in-the-loop" baseline. These baseline metrics come from the OpenML site itself. Because datasets hosted on this site are publicly available, data scientists can download the data, try different methods, and upload the best classifier and the performance metrics they achieved. Using their API, we collected the best possible F1-score submitted for a dataset by anyone. Figure 8 shows a histogram of the best ATM performance minus the best human-discovered performance for 47 datasets (as reported on OpenML). Positive values occur when ATM finds a better-performing classifier than the best "human-in-the-loop" classifier uploaded to OpenML. Negative values occur when ATM does not find a better-performing classifier than the best "human-in-the-loop" classifier uploaded to OpenML.

Figure 8: Histogram of the difference between the best F1-score for each dataset and the best "human-in-the-loop" F1-score from OpenML for 47 datasets. The results for the Grid configuration are on the left while the results for the GP+Bandit configuration are on the right.

Comparing Grid and GP+Bandit: Our next question targets whether an intelligent selection strategy that uses GP+Bandit achieves better results within a given budget of classifiers. To compare both strategies across all data sets, we first calculated a best-so-far F1-score. Figure 9 compares the two selection methods across all datasets using this metric (the line represents the average across all datasets). Both selection strategies have a budget of 100 classifiers. We see that GP+Bandit has a slight advantage over the random strategy, in that it takes slightly fewer classifiers for it to get to the same result. Figure 10 shows the same information at specific iterations.

Figure 9: Comparing Grid and GP+Bandit strategies. The y axis is the average of the best-so-far F1-score achieved after x iterations for each problem.

Figure 10: Comparing Grid and GP+Bandit at specific iterations. The y axis is the best-so-far F1-score after x iterations.

X. CONCLUSION

In this paper, we presented ATM, a distributed AutoML system that can support multiple users, and search through multiple machine learning methods. The system is decentralized in that its many worker nodes work independently and asynchronously, only interacting through a shared database. ATM can learn hundreds of classifiers each for hundreds of datasets at a time. The system can assess performance on a variety of metrics, store metrics and models for future reference or analysis, and present the user with a final, optimized model to use for prediction. Users can configure ATM to search the vast hyperparameter space using several different strategies, and even to try multiple strategies at once in parallel.

Our main contribution was the design, implementation, and testing of the end-to-end system. In addition, we presented a novel method for organizing the hierarchical search space of machine learning methods. We defined the conditional parameter tree and hyperpartition and demonstrated how these abstractions can be used to better traverse complex hyperparameter spaces.

We designed several different search strategies and gave access to our end user through a simple API. To demonstrate our system, we ran the largest-ever training experiment, where we trained an aggregate of several million models across 420 data sets.

XI. FUTURE WORK: MODEL RECOMMENDER SYSTEM

Our current work provides a recommender approach, in which past ATM results are used to recommend models for a new dataset. We briefly describe this below, where we will attempt to generate models for a dataset using the results from the remaining 419 datasets.

1) Organize results of the Grid Run into a matrix M of size 420 × 159,838, where 159,838 is the number of every possible classifier definition (method and specific hyperparameter values) and 420 represents the number of datasets on which classifiers have been trained in ATM. M will be a very sparse matrix.
2) M will be split into a row vector p and a matrix G. The probe row (p) is of size 1 × 159,838, while the gallery matrix G is of size 419 × 159,838.
3) Sample the probe row p down so it only has 5 values and store the new row vector in p_s.
4) Record the best-so-far performance µ = max(p_s), where max(·) reports the maximum value among all the defined values and ignores the undefined values.
5) Create a recommender matrix R = [G; p_s] (the gallery matrix with the probe row appended as its last row).
6) Use Soft Imputation to complete the matrix and estimate values for the undefined entries (R′ = SoftImpute(R)).
7) Find the top 5 values from the last row in R′ (the probe row).
8) Run the classifiers corresponding to the top 5 values in Step 7.
9) Record the classifier performances in their corresponding columns in p_s.
10) Update µ = max(p_s).
11) Repeat steps 5-10 20 times.
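The sketch below mirrors steps 5-11 as a simple loop. The soft_impute helper is a placeholder for any soft-thresholded matrix-completion routine (for example, the SoftImpute implementation in the fancyimpute package); the run_classifier function and the matrix sizes are illustrative assumptions, not the experiment's actual code.

# Illustrative sketch of the recommender loop (steps 5-11); names are hypothetical.
import numpy as np


def soft_impute(R):
    """Placeholder for a soft-thresholded SVD matrix-completion routine."""
    return np.where(np.isnan(R), np.nanmean(R), R)   # crude stand-in


def run_classifier(column_index):
    """Placeholder: train the classifier definition at this column, return its score."""
    return np.random.rand()


def recommend(G, p_s, iterations=20, top_k=5):
    mu = np.nanmax(p_s)                                  # step 4: best-so-far score
    for _ in range(iterations):
        R = np.vstack([G, p_s])                          # step 5: gallery + probe row
        R_complete = soft_impute(R)                      # step 6: fill undefined entries
        probe_estimates = R_complete[-1]
        candidates = np.argsort(probe_estimates)[::-1][:top_k]   # step 7
        for c in candidates:                             # steps 8-9: run and record
            if np.isnan(p_s[c]):
                p_s[c] = run_classifier(c)
        mu = np.nanmax(p_s)                              # step 10
    return mu


G = np.where(np.random.rand(419, 1000) < 0.99, np.nan, np.random.rand(419, 1000))
p_s = np.full(1000, np.nan)
p_s[np.random.choice(1000, 5, replace=False)] = np.random.rand(5)
print(recommend(G, p_s))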
REFERENCES

[1] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Auto-WEKA," Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD '13, p. 847, 2013.

[2] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, "Efficient and Robust Automated Machine Learning," Advances in Neural Information Processing Systems 28, pp. 2944–2952, 2015.

[3] R. S. Olson and J. H. Moore, "TPOT: A Tree-based Pipeline Optimization Tool for Automating Machine Learning," Proceedings of the Workshop on Automatic Machine Learning, vol. 64, pp. 66–74, 2016.

[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software," ACM SIGKDD Explorations, vol. 11, no. 1, pp. 10–18, 2009.

[5] M. Martin Salvador, M. Budka, and B. Gabrys, "Adapting Multicomponent Predictive Systems using Hybrid Adaptation Strategies with Auto-WEKA in Process Industry," AutoML at ICML 2016, no. 2011, pp. 1–8, 2016.

[6] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta, Metalearning: Applications to Data Mining. Springer Science & Business Media, 2008.

[7] S. M. Abdulrahman and P. Brazdil, "Effect of Incomplete Meta-dataset on Average Ranking Method," Proceedings of the Workshop on Automatic Machine Learning, vol. 64, pp. 1–10, 2016.