
Proceedings of Machine Learning Research vol 168:1–12, 2022 4th Annual Conference on Learning for Dynamics and Control

Adaptive Model Predictive Control by Learning Classifiers

Rel Guzman²          rel.guzmanapaza@sydney.edu.au
Rafael Oliveira²     rafael.oliveira@sydney.edu.au
Fabio Ramos¹,²       fabio.ramos@sydney.edu.au

¹ NVIDIA, USA    ² The University of Sydney, Australia

Editors: R. Firoozi, N. Mehr, E. Yel, R. Antonova, J. Bohg, M. Schwager, M. Kochenderfer

Abstract
Stochastic model predictive control has been a successful and robust control framework for many
robotics tasks where the system dynamics model is slightly inaccurate or in the presence of envi-
ronment disturbances. Despite the successes, it is still unclear how to best adjust control parameters
to the current task in the presence of model parameter uncertainty and heteroscedastic noise. In this
paper, we propose an adaptive MPC variant that automatically estimates control and model param-
eters by leveraging ideas from Bayesian optimisation (BO) and the classical expected improvement
acquisition function. We leverage recent results showing that BO can be reformulated via density
ratio estimation, which can be efficiently approximated by simply learning a classifier. This is then
integrated into a model predictive path integral control framework yielding robust controllers for a
variety of challenging robotics tasks. We demonstrate the approach on classical control problems
under model uncertainty and robotics manipulation tasks.
Keywords: Bayesian methods, Gaussian process, model predictive control, optimisation

1. Introduction
Reinforcement learning, as a framework, concerns learning how to interact with the environment
through experience, while optimal control emphasises sequential decision making and optimisation
methods. The boundaries between the two fields have blurred as the understanding of each has
deepened and their typical applications have converged. Model predictive control (MPC) is an optimisation strategy for behaviour
generation that consists of planning actions ahead by minimising costs throughout a horizon. Re-
inforcement learning and robotics can benefit from MPC by correcting behaviours while constantly
estimating hyper-parameters. This controller learning capability can be achieved with data-driven
approaches for MPC optimisation (Görges, 2017).
We are particularly interested in path integral (PI) control (Kappen, 2005), which is a methodol-
ogy for solving nonlinear stochastic optimal control problems by sampling trajectories and comput-
ing costs. Using such methodology, model predictive path integral control (MPPI) was introduced
in (Williams et al., 2016). MPPI enables robots to navigate in stochastic and partially observable
environments, for example, in car racing (Williams et al., 2018b). MPPI is a sampling-based,
derivative-free method, which makes it a simple yet powerful strategy for simulating candidate actions.
Within data-driven approaches, deep reinforcement learning has been successful in solving high-
dimensional control problems in simulation (Duan et al., 2016). The main limitation of deep RL
is the need for many interactions with the environment, which can be impractical with a physical
system due to costly evaluations (Peng et al., 2018). An alternative to reduce evaluations is to have
a model of the system dynamics, also called a dynamics model or transition model. Modeling an

© 2022 R. Guzman, R. Oliveira & F. Ramos.



accurate transition model inevitably leads to errors. Even so, by using data-driven approaches, it is
possible to reduce the error or adapt to the expected errors in the environment (Lee et al., 2020).
Data-driven approaches have been proposed for automatic MPC tuning, which can be seen as
an intersection between machine learning and control since they make use of the transition model in
combination with a learnt model. For example, (Sorourifar et al., 2021) presents MPC under uncer-
tainty over model parameters with Bayesian optimisation (BO) that handles constraints of system
parameters in a tank reactor. (Lee et al., 2020) addresses different environment contexts where
a robot’s dynamics could change due to a component malfunctioning. Other approaches propose
inferring simulation parameters based on data instead of uniform parameter randomisation (Peng
et al., 2018; Ramos et al., 2019). In another example, the controller optimisation is able to handle
heteroscedastic noise for control tasks (Guzman et al., 2020). Intuitively, heteroscedastic noise is a
type of noise that changes with input variables. For example, in stochastic MPC, the noise associated
with the stochastic process changes significantly with the temperature hyper-parameter (Guzman
et al., 2020), making hyper-parameter tuning quite challenging from an optimisation perspective.
We propose a data-driven approach to optimise a stochastic controller by adapting the transi-
tion model parameters to the environment. For costly system evaluations, we use surrogate-based
optimisation. Instead of optimising the objective function directly, we optimise a cheaper surrogate
that approximates the more complex underlying model, allowing us to quickly find a local or global optimum.
Surrogate models are used when function evaluations are expensive and noisy due to the environment charac-
teristics, and in this context they can amortise the optimisation. Surrogate-based mod-
elling is usually associated with Gaussian processes (Rasmussen and Williams., 2006). We consider
single-objective optimisation problems with continuous inputs and noisy observations. We work on
the case where the noise variance depends on the input. That leads us to heteroscedastic optimisa-
tion, which is a more realistic approach than the typical homoscedastic assumption. Heteroscedastic
noise can cause problems for the surrogate model and make the optimisation method deviate from
the maximum (Wang and Ierapetritou, 2017). To solve this issue, we aim at finding an optimisation
method for stochastic simulations under heteroscedastic noise.
BO uses a Gaussian process (Rasmussen and Williams., 2006) as a surrogate model commonly
used for costly black-box functions. BO proposes solutions according to an acquisition function
that encodes the trade-off between exploration and exploitation of the objective function. GPs have
excellent generalisation properties, but their computational cost can be prohibitive for big data.
Additionally, standard GPs provide analytical solutions for posteriors under homoscedastic noise,
while heteroscedastic models typically require computationally expensive approximations.
An alternative formulation to BO, which allows the utilisation of simple classifiers within the
optimisation loop, was proposed in (Tiao et al., 2021) as Bayesian optimisation by Density Ratio
Estimation (BORE). The method introduces the concept of relative density ratio, which is used to
estimate the expected improvement acquisition function (Bull, 2011). The main advantage of this
formulation is that density ratios are bounded between 0 and 1 and can be estimated using any off-
the-shelf probabilistic classifier. Classifiers are easy to train and can handle a variety of input noise
types, including heteroscedastic, without major modifications to the classification function.
Contributions: The main contribution of this work is a new robust and adaptive MPC method
that automatically estimates distributions of model parameters and MPC hyper-parameters such as
the temperature by continuously updating a classifier that acts as a proxy for a Bayesian optimisation
step. We demonstrate that the approach provides superior performance in general control problems
and manipulation tasks under model uncertainty.


2. Background
2.1. Stochastic Model Predictive Control
Model predictive control rests on the idea of optimising an action sequence up to a certain horizon T
while minimising the sum of action costs, starting from the current state. MPC returns the next optimal
action a∗, which is sent to the system actuators. Unlike classical deterministic MPC, stochastic MPC
allows disturbances over the states. A stochastic MPC method models disturbances as random
variables. At each time step $t$, stochastic MPC generates sequences of perturbed actions $V_t = \{v_i\}_{i=t}^{t+T}$, where $v_i = a_i^* + \epsilon_i$ and $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)$, based on a roll-over sequence of optimal actions $\{a_i^*\}_{i=t}^{t+T}$ that starts at $t = 0$. Each action results in a state produced by a transition or dynamics model $s_{t+1} = f(s_t, a_t)$, and action sequences result in a state trajectory $S_t = \{s_{t+i}\}_{i=1}^{T}$. Each trajectory has a cumulative cost $C$ determined by instant costs $c$ and a terminal cost $q$:

$$C(S_t) = q(s_{t+T}) + \sum_{i=1}^{T-1} c(s_{t+i}) . \qquad (1)$$

In stochastic MPC, the goal is to minimise the expected C(St ). The stochastic MPC method known
as model predictive path integral (MPPI) and its variations (Williams et al., 2016, 2018b) provide
optimal actions for the entire horizon following an information-theoretic approach. Constraints over
the states are determined by the transition model, and the actions are constrained according to their
limits. After $M$ simulated rollouts, MPPI updates the sequence of optimal actions and the weights:

$$a_i^* \leftarrow a_i^* + \sum_{j=1}^{M} w(V_t^j)\,\epsilon_i^j, \qquad w(V_t^j) = \frac{1}{\eta}\exp\!\left(-\frac{1}{\lambda}\left(C(S_t^j) + \frac{\lambda}{\sigma_\epsilon^2}\sum_{i=t}^{t+T} a_i^* \cdot v_i\right)\right), \qquad (2)$$

where $j \in \{1, \ldots, M\}$ and $\eta$ is a normalisation constant so that $\sum_{j=1}^{M} w(V_t^j) = 1$. For the temperature hyper-parameter $\lambda \in \mathbb{R}^{+}$, $\lambda \to 0$ leads to more peaked distributions for actions (Williams et al., 2018a). The hyper-parameter $\sigma_\epsilon^2$ is the control variance, which leads to more varying and forceful actions when it is increased.
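To make the update in Equation 2 concrete, below is a minimal NumPy sketch of a single MPPI iteration; the dynamics and cost functions, the dimensions, and the min-shift used for numerical stability are illustrative assumptions rather than the implementation used in the paper.

```python
import numpy as np

def mppi_update(a_opt, state, dynamics, instant_cost, terminal_cost,
                M=100, lam=1.0, sigma_eps=0.5):
    """One MPPI iteration (sketch of Eq. 2): sample perturbed action sequences,
    roll them out through the dynamics model, and re-weight the nominal actions."""
    T, act_dim = a_opt.shape
    eps = np.random.normal(0.0, sigma_eps, size=(M, T, act_dim))  # perturbations eps_i^j
    v = a_opt[None] + eps                                         # perturbed actions v_i = a_i* + eps_i
    costs = np.zeros(M)
    for j in range(M):
        s = state
        for i in range(T - 1):
            s = dynamics(s, v[j, i])                              # s_{t+1} = f(s_t, a_t)
            costs[j] += instant_cost(s)                           # instant costs c(s)
        costs[j] += terminal_cost(dynamics(s, v[j, T - 1]))       # terminal cost q(s_{t+T})
        costs[j] += lam / sigma_eps**2 * np.sum(a_opt * v[j])     # control coupling term of Eq. 2
    w = np.exp(-(costs - costs.min()) / lam)                      # exp(-C/lambda), min-shifted for stability
    w /= w.sum()                                                  # eta normalisation so the weights sum to 1
    return a_opt + np.einsum('j,jta->ta', w, eps)                 # a_i* <- a_i* + sum_j w_j eps_i^j
```

The returned sequence is then rolled forward one step and the first action is sent to the actuators, as in standard receding-horizon control.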

2.2. Classic Bayesian Optimisation


Bayesian optimisation (BO) has been widely applied in robotics and control to optimise black-box
(i.e., derivative-free) functions that are costly to evaluate, which is usually due to the cost of running
experiments on real robots. We use BO to perform global optimisation
in a given search space X . BO uses a Gaussian process (GP) (Rasmussen and Williams., 2006)
as a surrogate model M to internally approximate an objective function g : X → R. BO defines
an optimisation problem x∗ ∈ argmaxx∈X g(x). Then, given a set of collected observations D1:t ,
BO constructs a surrogate model that provides a posterior distribution over the objective function g.
This posterior is used to construct the acquisition function h, which measures both performance and
uncertainty of unexplored points. The next step is optimising the acquisition function, obtaining a
sample (xt , yt ). BO is summarised in Algorithm 1.
A popular acquisition function in the BO literature is the expected improvement (EI) (Bull, 2011). At iteration $t$, one can define $y_t^* := \max_{i<t} y_i$ as the optimal incumbent. The expected improvement is then defined as:

$$h_{\mathrm{EI}}(x \mid \mathcal{D}_{t-1}) := \mathbb{E}\left[\max\{0, f(x) - y_t^*\} \mid \mathcal{D}_{t-1}\right] . \qquad (3)$$

In the case of a GP prior on $f \mid \mathcal{D}_{t-1} \sim \mathcal{GP}(\mu_{t-1}, k_{t-1})$, for any point $x$ where the predictive standard deviation of the GP is non-zero, i.e., $\sigma_{t-1}(x) = \sqrt{k_{t-1}(x, x)} > 0$, the EI is given by:

$$h_{\mathrm{EI}}(x \mid \mathcal{D}_{t-1}) = (\mu_{t-1}(x) - y_t^*)\,\Psi(s_t) + \sigma_{t-1}(x)\,\psi(s_t) , \qquad (4)$$

where $s_t := \frac{\mu_{t-1}(x) - y_t^*}{\sigma_{t-1}(x)}$ if $\sigma_{t-1}^2(x) := k_{t-1}(x, x) > 0$. For points where $\sigma_{t-1}(x) = 0$, i.e., there is no posterior uncertainty, we simply have $h_{\mathrm{EI}}(x \mid \mathcal{D}_{t-1}) = 0$. Here $\Psi(s_t)$ and $\psi(s_t)$ denote the cumulative distribution and probability density functions of the standard normal distribution evaluated at $s_t$.
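For reference, Equation 4 can be evaluated directly from a GP posterior; the short sketch below assumes the posterior mean `mu`, standard deviation `sigma`, and incumbent `y_best` are supplied by whatever GP library is in use.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI of Eq. 4 for arrays of posterior means/standard deviations and incumbent y_best."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    ei = np.zeros_like(mu)
    pos = sigma > 0                               # EI is zero where there is no posterior uncertainty
    s = (mu[pos] - y_best) / sigma[pos]           # s_t in the text
    ei[pos] = (mu[pos] - y_best) * norm.cdf(s) + sigma[pos] * norm.pdf(s)  # Psi and psi terms
    return ei
```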

2.3. Bayesian Optimisation by Learning Classifiers


BO is hindered by its GP surrogate model, which has limitations such as cubic computational cost in
training and not being directly amenable to variable noise structures such as heteroscedasticity.
The extensions proposed to address those issues are constrained by the need to ensure analytical
tractability and typically make strong, oversimplifying assumptions. For example, Roy et al. (2013)
proposed a heteroscedastic BO approach that uses a variational approximation which can be
expensive to compute. BO by Density Ratio Estimation (BORE) (Tiao
et al., 2021) was proposed based on a reformulation of the expected improvement acquisition func-
tion (Equation 3) and bypassed the challenges of analytical tractability in GP-based approaches.
BORE works by selecting points according to a density ratio similar to the Tree-structured Parzen
Estimator (TPE) proposed by Bergstra et al. (2011). TPE divides the observations, based on some
quantile hyper-parameter, into a first group that gave the best scores and a second group containing
the rest. Then, the goal is to find inputs that are more likely to be in the first group. In order to pro-
pose a new sampling point, TPE computes the ordinary density ratio from Equation 5 between the
probability a(x) of being in the first group, and the probability b(x) of being in the second group.
Instead, BORE uses the γ-relative density ratio rγ . The ordinary and γ-relative density ratios are
specified by:
$$r_0(x) = a(x)/b(x) , \quad (5) \qquad\qquad r_\gamma(x) := \frac{a(x)}{\gamma a(x) + (1-\gamma)\, b(x)} . \quad (6)$$
BORE approximates the γ-relative density ratio to a binary class posterior probability as rγ (x) ≃
γ −1 π(x), where π computes the probability of x belonging to a positive class π(x) = p(z = 1 | x).
The binary label z introduced here denotes a negative or positive class, and its meaning corresponds
to whether the point should be selected or not. In BORE, given a maximisation objective, we set
z := I[y ≥ τ ], indicating whether the corresponding observation y at a point x is above the γth
quantile τ of the (empirical) observations distribution, i.e., γ = p(y ≥ τ ). In the end, computing the
acquisition function h from the classical BO method (Algorithm 1) is reduced to classifier training.
As shown in Algorithm 2, the optimisation depends on the hyper-parameter γ ∈ (0, 1), which
influences the exploration-exploitation trade-off. A smaller γ encourages exploitation. In this work,
we anneal γ until it reduces to a minimum value close to 0. The reason is to induce more exploration
initially and avoid local minima as much as possible.
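To illustrate how the acquisition step reduces to classifier training, the sketch below labels the observed scores by their γ-quantile, fits any scikit-learn-style probabilistic classifier, and returns the candidate that maximises π(x); the random candidate search and the fallback for a degenerate label split are illustrative choices, not part of BORE itself.

```python
import numpy as np

def bore_step(X_obs, y_obs, classifier, bounds, gamma=0.25, n_candidates=2048, rng=None):
    """One BORE iteration: label observations by the gamma-quantile of their scores,
    train a probabilistic classifier, and return the candidate that maximises pi(x)."""
    rng = rng or np.random.default_rng()
    lo, hi = bounds                                   # bounds = (lower, upper) arrays over the search space X
    tau = np.quantile(y_obs, 1.0 - gamma)             # gamma = p(y >= tau) for a maximisation objective
    z = (np.asarray(y_obs) >= tau).astype(int)        # labels z = I[y >= tau]
    if z.min() == z.max():                            # degenerate split: fall back to a random proposal
        return rng.uniform(lo, hi)
    classifier.fit(np.asarray(X_obs), z)              # classifier training replaces fitting a GP surrogate
    cand = rng.uniform(lo, hi, size=(n_candidates, len(lo)))
    pi = classifier.predict_proba(cand)[:, 1]         # pi(x) = p(z = 1 | x), stands in for the EI
    return cand[np.argmax(pi)]
```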

Algorithm 1: Bayesian Optimisation
    input : Sampling iterations n; search space X; hyper-parameters of h
    output: (x∗, y∗)
    for t = 1 to n do
        Fit a GP model M on the data Dt−1
        Find xt = argmax x∈X h(x, M, Dt−1)
        yt ← f(xt)
        Dt ← Dt−1 ∪ {(xt, yt)}
    end

Algorithm 2: BORE
    input : Sampling iterations n; search space X; γ ∈ (0, 1); classifier to train π : X → [0, 1]
    output: (x∗, y∗)
    for t = 1 to n do
        Train the probabilistic classifier π
        Find xt = arg max x∈X π(x)
        yt ← f(xt)
        Dt ← Dt−1 ∪ {(xt, yt)}
    end

3. Methodology
3.1. Dynamics Model Uncertainty
The dynamics of the environment is modelled as a Markovian transition model. We consider a
transition model with states s ∈ S and admissible actions a ∈ A. The state follows Markovian

dynamics, st+1 = f (st , at ), with a transition function f and a reward function r that evaluates the
system performance given a state and action r : S × A → R. That transition model can be learned
or assumed from expert knowledge.
We propose adapting stochastic MPC to the environment by exposing the robot to different
possible scenarios by defining a transition model parameterised by a random variable θ. To find
an optimum θ, we add randomisation at each MPC trajectory rollout. We define a random vector
of transition model parameters θ and adapt them to the stochastic MPC controller. Each transition
model parameter follows a probability distribution parameterised by ψ:

θ ∼ pθ (θ; ψ), st+1 = f (st , at + ϵt , θ) , (7)

where st is the state obtained at time t, and at + ϵt is the perturbed action as described in stochastic
MPC (subsection 2.1). Note that θ is now an input to the dynamics model. Finally, optimal actions
found by MPC are sent to the system using the dynamics model f .
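As an illustration of Equation 7, a rollout can draw θ from pθ(θ; ψ) before stepping a parameterised transition model; the toy pendulum dynamics below and the Gamma parameterisation of pθ are purely illustrative assumptions (the experiments later describe how a mean/standard-deviation pair is mapped to a Gamma distribution).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_theta(psi):
    """Draw model parameters theta ~ p_theta(theta; psi). Here psi = (shape, rate) of a
    Gamma distribution, chosen for its positive support; the parameterisation is illustrative."""
    shape, rate = psi
    return rng.gamma(shape, 1.0 / rate)             # NumPy's Gamma sampler uses scale = 1/rate

def step(state, action, theta, dt=0.05, sigma_eps=0.1):
    """One step of a toy parameterised pendulum, s_{t+1} = f(s_t, a_t + eps_t, theta)."""
    angle, vel = state
    length = theta                                   # theta plays the role of the rod length here
    a = action + rng.normal(0.0, sigma_eps)          # perturbed action a_t + eps_t
    vel = vel + dt * (-9.81 / length * np.sin(angle) + a)
    return np.array([angle + dt * vel, vel])

# Example rollout: resample theta once per trajectory, as in the proposed randomisation.
theta = sample_theta((25.0, 25.0))                   # illustrative values giving a mean rod length of 1.0
s = np.array([np.pi / 4, 0.0])
for _ in range(50):
    s = step(s, action=0.0, theta=theta)
```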

3.2. An Adaptive Control Formulation


We do not directly aim at finding parameters that match the observed dynamics as we would do
in system identification (Romeres et al., 2016). Instead, we adapt model parameter distributions
to the controller, which means those resulting distributions may be close to their true values. The
uncertainty of those parameters would allow the controller to adapt under different environment
circumstances or characteristics, such as the size of an obstacle or the length of a robot component.
We find the optimal MPC hyper-parameters by adapting them to the transition model parameters.
In the case of MPPI (Williams et al., 2018a), the hyper-parameters considered are λ and σϵ as
described in subsection 2.1. All controller hyper-parameters are collectively described as ϕ. We
introduce the reinforcement learning objective of optimising the episodic cumulative reward R. An
overview of the proposed framework is shown in Figure 1. The objective is to solve the reward
optimisation problem where we jointly estimate ψ, which are hyper-parameters for the dynamics
model parameter distribution pψ (θ), and the controller hyper-parameters ϕ:

$$\psi^\star, \phi^\star = \operatorname*{argmax}_{\{\psi, \phi\}} R(\psi, \phi) . \qquad (8)$$

To perform the optimisation in Equation 8, we adopt a surrogate-based optimisation method


that can handle noisy gradient-free functions that are common in control tasks. To this end, we
use BORE since it provides a number of advantages over traditional GP-based BO, as discussed


Figure 1: An overview of adaptive MPC by learning classifiers.

previously, and it can handle heteroscedasticity without modifications to the core method. Het-
eroscedastic noise is common when tuning stochastic MPC in control problems (Guzman et al.,
2020). A difference between GP-based BO and BORE is that BO filters noise directly via the GP
model, while BORE relies on the classifier to account for label noise. Therefore, to mitigate noise
effects, we propose to optimise the objective function averaged over a small number ne of episodes.
To better understand the optimisation framework, Algorithm 3 describes how to estimate the optimal controller and dynamics hyper-parameters $x^* = \{\psi^*, \phi^*\}$ using a binary classifier. Following the reinforcement learning literature, we define the cumulative reward $R = \sum_{i=1}^{n_s} r_i$, where $n_s$ is
the number of time steps in an episode, and set our goal as maximising the expected cumulative
reward g = E[R]. We compute an empirical expected cumulative reward g by averaging R over ne
episodes. The classifier πt is trained by first assigning labels {zk }tk=1 to the data observed up to the
current iteration t. For training, the classifier uses the auxiliary dataset:

$$\{(\psi_k, \phi_k, z_k)\}_{k=1}^{t} , \qquad (9)$$

where the labels are obtained by separating the observed data according to $\gamma \in (0, 1)$ by computing
the $\gamma$th quantile of $\{g_k\}_{k=1}^{t}$:

$$\tau \leftarrow \Phi^{-1}(\gamma), \qquad z_k \leftarrow \mathbb{I}[g_k \ge \tau] \quad \text{for } k = 1, \ldots, t . \qquad (10)$$

The exploration-exploitation trade-off is balanced by γ, with small γ encouraging more exploitation.


Instead of keeping γ fixed, we define a strategy that first explores the search space with an initial
high γ1 that decays linearly across the iterations until a final γn . Inputs predicted as positive labels
are considered to have a higher reward, and one of them is selected by maximising the classifier
output:
$$\psi_t, \phi_t = \operatorname*{argmax}_{\{\psi, \phi\} \in \mathcal{X}} \pi_{t-1}(\psi, \phi) . \qquad (11)$$

Note that this is equivalent to acquisition function maximisation in conventional BO and allows the
method to suggest candidate solutions efficiently. For better performance, the maximisation can be
carried out with a global optimisation method (e.g., Arnold and Hansen, 2010).
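As one possible realisation of this maximisation step (the paper cites CMA-ES; SciPy's differential evolution is used here only for illustration), the classifier output can be maximised over the joint box-constrained space of ψ and ϕ as follows.

```python
import numpy as np
from scipy.optimize import differential_evolution

def propose_params(classifier, bounds):
    """Maximise pi(psi, phi) from Eq. 11 over the joint search space.
    `bounds` is a list of (low, high) tuples, one per dimension of (psi, phi)."""
    def neg_pi(x):
        # `classifier` is any trained model exposing predict_proba, e.g. an MLP or a Random Forest
        return -classifier.predict_proba(x.reshape(1, -1))[0, 1]
    result = differential_evolution(neg_pi, bounds, maxiter=50, seed=0)
    return result.x   # candidate (psi, phi) with the highest predicted positive-class probability
```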

3.3. Choosing the Classifier


The probabilistic binary classifier in Equation 11 has to be chosen considering the observation noise
in the task. Some methods used in Tiao et al. (2021) are XGBoost, multi-layer perceptron (MLP),
and Random Forests (RF). For example, as an ensemble method, RF combines decision trees via


Algorithm 3: Adaptive MPC by learning classifiers
    input : Sampling iterations n
            Search space X
            Initial γ1 and final γn
            Probabilistic binary classifier π : X → [0, 1]
    output: (ψ∗, ϕ∗, g∗)
    for t = 1 to n do
        γt = γ1 − (t−1)/(n−1) · (γ1 − γn)                       // linear γ decay
        τ ← Φ−1(γt)                                              // compute the γt-th quantile of {gk}, k = 1, ..., t
        zk ← I[gk ≥ τ] for k = 1, ..., t − 1                      // assign labels to the observed data points
        Train the classifier πt−1 using {(ψk, ϕk, zk)}, k = 1, ..., t − 1   // acquisition function according to BORE
        ψt, ϕt = arg max {ψ,ϕ}∈X πt−1(ψ, ϕ)                       // estimate new hyper-parameters and model parameters
        for j = 1 to ne do
            Rj(t) = 0
            for i = 1 to ns do
                θ ∼ pθ(θ; ψt)                                     // sample from the parameter distributions
                a∗i = MPC(f, θt, ϕt)                              // use estimated parameter distributions in the new transition model
                ri = SendToActuators(a∗i)                         // evaluate the first action to take in the optimal trajectory
                Rj(t) += ri                                       // accumulate rewards
            end
        end
        Decrease γ to reduce explorability
        gt = (1/ne) Σ_{j=1}^{ne} Rj(t)
        Dt ← Dt−1 ∪ {(ψt, ϕt, gt)}
    end

bagging. The number of decision trees should be sufficiently large to reduce the classification variance
without increasing the bias. We highlight this method because it works well as an out-of-the-box
classifier.
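As a concrete example, a Random Forest can serve as the probabilistic classifier π with an off-the-shelf implementation; the hyper-parameter values below are placeholders to be tuned against the variance/bias trade-off just discussed, not the settings used in the experiments.

```python
from sklearn.ensemble import RandomForestClassifier

# A bagged ensemble of decision trees acting as the probabilistic classifier pi.
# More trees reduce the variance of the class-probability estimates at extra compute cost.
pi = RandomForestClassifier(n_estimators=200, min_samples_leaf=2, random_state=0)

# After pi.fit(X, z) on the labelled data of Eq. 9, pi.predict_proba(x)[:, 1] gives
# p(z = 1 | x), which Algorithm 3 maximises to propose the next (psi, phi).
```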

4. Experiments
We consider the problem of optimising the function R(x) = g (x) + ϵ with heteroscedastic input-
dependent noise ϵ ∼ N (0, σϵ2 (x)), where σϵ2 (x) denotes an input-dependent noise variance. In
these experiments the variables we optimise are the controller hyper-parameters ϕ, and the variable
ψ that parameterises transition model parameter distributions. We aim at optimising the true noise-
free function g although we only have access to the cumulative reward R. We use R as the objective
to optimise in the simulator experiments. We evaluate the performance of the proposed adaptive
model predictive control framework in several experiments described below.

4.1. Simulation Experiments


In this section, we assess the performance of the adaptive controller framework in control and
robotic tasks. Specifically, we ran tests in simulated environments, such as Pendulum, Half-Cheetah,
and Fetchreach from OpenAI1 with dense reward functions. The same functions taken from Guzman
et al. (2020) are used to compute instant costs for MPPI. We also experimented with the reaching
task for the Panda robot environment from Bhardwaj et al. (2021), with a single obstacle, a fixed
1. OpenAI Gym: https://gym.openai.com


Environment  | ne | T   | M  | Control hyp.                        | Distribution parameter range                                                                                      | True parameter
Pendulum     | 1  | 10  | 10 | λ ∈ [0.01, 50], σϵ ∈ [1.0, 10]      | µl ∈ [0.5, 1.6], σl ∈ [0.001, 0.1]                                                                                | l = 1.0
Half-Cheetah | 18 | 15  | 10 | λ ∈ [0.01, 1.0], σϵ ∈ [0.05, 2.0]   | κm,µ ∈ [0.2, 2.0], κm,σ ∈ [0.001, 0.1]                                                                            | κm = 1.0
Fetchreach   | 90 | 12  | 3  | λ ∈ [0.01, 0.03], σϵ ∈ [0.001, 0.5] | κd,µ ∈ [1.0, 50], κd,σ ∈ [0.001, 0.6]                                                                             | κd = 1.0
Panda        | 10 | 150 | 20 | λ ∈ [0.01, 2.0]                     | xµ ∈ [0.3, 0.32], xσ ∈ [0.001, 0.05]; yµ ∈ [0.1, 0.12], yσ ∈ [0.001, 0.01]; zµ ∈ [0.6, 0.62], zσ ∈ [0.001, 0.03]  | x = 0.3, y = 0.1, z = 0.6

Table 1: Ranges of hyper-parameters and model parameters.

target location, and a fixed initial robot position. MPPI trajectory evaluations are done on the GPU,
which helped overcome efficiency issues.
The purpose of the Panda task shown in Figure 2 is to reach the yellow target while avoiding obstacle collision. The obstacle has true length x = 0.3, width y = 0.1, and height z = 0.6. We assume partial observability for such obstacle dimension sizes and attempt to infer them as part of the transition model parameters, for which we define the search spaces shown in Table 1. The transition model parameter l is the rod length for Pendulum, κm is the mass scaling factor for all the links in Half-Cheetah, and κd is a damping ratio scaling factor for all components in Fetchreach. For Panda, we optimise the obstacle dimensions x, y, and z. Each transition model parameter is a random variable parameterised by ψ; for example, ψ = {κd,µ, κd,σ} are the damping ratio mean and standard deviation for Fetchreach.

Figure 2: Panda robot setup.
4.2. Method Configuration
We compare the proposed adaptive MPC framework with other surrogate-based methods used in
robotics. All the compared methods are configured as follows.

Adaptive MPC framework: First, we have to determine a classifier that can deal with the stochasticity of robotic tasks. We compare two probabilistic classifiers: a Random Forest (RF) with 50 decision trees, denoted BORE-RF, and a Multi-layer Perceptron (MLP) classifier, denoted BORE-MLP, with 2 hidden layers, each with 32 units, ReLU activations, and a sigmoid output layer. The weights were optimised for 1000 epochs using the binary cross-entropy loss and the ADAM optimiser with a batch size of 32. Second, we start by exploring with γ1 = 0.5, which decays linearly across the iterations until a reasonable final γn = 0.05. We compare different initial γ1 values in Figure 3 for the Panda task. With a low γ1, BORE could stay stuck in some local minimum, while with a higher γ1, BORE would do more exploration before exploiting some region. A reasonable initial value is γ1 = 0.5, corresponding to the median, which showed an optimal compromise according to the preliminary results in Figure 3. Finally, we set a parameter distribution pθ with positive support, since we deal with physics variables (mass, damping ratio) and sizes. We choose the gamma distribution Γ(α, β), and we transform the provided mean µ and standard deviation σ via α = µ²/σ² and β = µ/σ².

Figure 3: Some γ evaluations (averaged episodic reward over iterations on the Panda task for initial values γ1 = 0.15, 0.5, 0.85).

Figure 4: Averaged cumulative rewards per iteration for BORE-MLP, BORE-RF, BOhetero, BOhomo, TPE, and CMA-ES on Pendulum, Half-Cheetah, Fetchreach, and Panda, where the shaded areas correspond to 1.5 standard deviations. Each method started at a point with minimum expected cumulative reward.
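A quick way to sanity-check this moment matching is to build the distribution and read back its mean and standard deviation; the numbers below are arbitrary example values, and note that SciPy parameterises the Gamma by shape and scale (the inverse of the rate β).

```python
from scipy import stats

mu, sigma = 1.0, 0.1                             # example mean and standard deviation (illustrative values)
alpha, beta = mu**2 / sigma**2, mu / sigma**2    # moment-matched shape and rate
dist = stats.gamma(a=alpha, scale=1.0 / beta)    # SciPy's Gamma takes shape `a` and scale = 1/beta
print(dist.mean(), dist.std())                   # recovers approximately (1.0, 0.1)
```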

BO methods: We compare the proposed method with the traditional homoscedastic BOhomo and
a heteroscedastic BOhetero version from Guzman et al. (2020). We collected 400 data points for
the control and robotic tasks over the search spaces shown in Table 1. Then, using such data, we
optimise BO's hyper-parameters via maximum GP marginal likelihood optimisation. The marginal
likelihood and the acquisition function are optimised using multi-start L-BFGS-B. We used a UCB
acquisition function $h_{\mathrm{UCB}}(x) = \mu(x) + \delta\sigma(x)$ with balance factor $\delta = 3.0$. For both BO versions, we used
the anisotropic squared exponential kernel $k(x, x') = \sigma_n \exp\!\left(-\tfrac{1}{2}(x - x')^{\top} \operatorname{diag}(\ell)^{-2} (x - x')\right)$.
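For concreteness, a minimal sketch of the UCB acquisition and the anisotropic squared exponential kernel used by the BO baselines is given below; it assumes the lengthscale vector ℓ enters through diag(ℓ)⁻², and the array shapes are illustrative.

```python
import numpy as np

def ard_se_kernel(X1, X2, signal_var, lengthscales):
    """Anisotropic squared exponential kernel with per-dimension lengthscales (ARD)."""
    d = (X1[:, None, :] - X2[None, :, :]) / lengthscales      # scaled pairwise differences
    return signal_var * np.exp(-0.5 * np.sum(d**2, axis=-1))  # sigma_n * exp(-1/2 ||d||^2)

def ucb(mu, sigma, delta=3.0):
    """UCB acquisition h_UCB(x) = mu(x) + delta * sigma(x) used for the BO baselines."""
    return mu + delta * sigma
```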


Other methods: We use TPE optimisation (Bergstra et al., 2011) with quantile value 0.5 since
it is what BORE is based on, and finally, we use covariance matrix adaptation evolution strategy
(CMA-ES) (Arnold and Hansen, 2010) as a non-BO baseline set with σ0 = 10 and population size
2. CMA-ES has been widely used for hyper-parameter tuning in robotics (Modugno et al., 2016;
Sharifzadeh et al., 2021).

4.3. Optimisation Assessment


To quantitatively assess optimisation performance, we report averaged cumulative rewards, defined
as $\frac{1}{t'}\sum_{i=1}^{t'} g_i$ for $t' = 1, \ldots, n$. To account for the stochasticity of evaluations, we repeat the
50 iterations 3 times, which gives averaged cumulative rewards with their respective
standard deviations. We use the cumulative rewards R to approximate g by averaging over ne = 10
episodes. Each episode consists of 480 time steps for Panda and 200 time steps for the other tasks.
We compare the averaged cumulative reward against the number of iterations. Figure 4 shows that
BOhomo and BOhetero perform similarly mainly in Pendulum because of homoscedastic noise across
the search space. However, BOhomo tends to converge to a local minimum in other tasks, which is
expected since it does not account for heteroscedasticity. It is possible to achieve better or equal
results with TPE, although it seems to also get stuck since it only divides observations based on
the output and chooses the best next point without considering unseen regions. Both BO versions


Method    | Rmax      | Rσ     | λ    | xµ     | xσ     | yµ     | yσ     | zµ     | zσ
BORE-MLP  | -21106.36 | 724.10 | 1.65 | 0.3104 | 0.0010 | 0.1000 | 0.0010 | 0.6000 | 0.0058
BORE-RF   | -21891.82 | 291.64 | 1.67 | 0.3036 | 0.0431 | 0.1001 | 0.0013 | 0.6145 | 0.0234
BOhetero  | -23025.47 | 162.22 | 1.73 | 0.3185 | 0.0500 | 0.1200 | 0.0028 | 0.6145 | 0.0118
BOhomo    | -22870.14 | 158.46 | 2.00 | 0.3200 | 0.0010 | 0.1000 | 0.0010 | 0.6200 | 0.0010
TPE       | -23438.34 | 121.26 | 1.63 | 0.3124 | 0.0159 | 0.1152 | 0.0045 | 0.6047 | 0.0067
CMA-ES    | -23779.35 | 481.55 | 1.47 | 0.3200 | 0.0010 | 0.1200 | 0.0010 | 0.6200 | 0.0142

Table 2: Maximum reward found at the last iteration for the Panda task.

are outperformed by BORE-MLP and BORE-RF. BORE-MLP converges considerably faster to an
optimum in most tasks. In the Panda environment, the difference is larger, which suggests that
the proposed framework performs better in higher-dimensional problems.

4.4. Evaluating the Optima


The previous section emphasised the proposed MPC framework and its ability to explore efficiently
compared to other optimisation methods. This section describes the optima found by the methods in
the Panda environment, where the improvement is more noticeable. In Table 2, we show the control
hyper-parameters ϕ = {λ} and the transition model parameters ψ = {xµ , xσ , yµ , yσ , zµ , zσ } that
give the maximum reward Rmax at the last iteration after running each method for 50 iterations. Rσ
is the observed standard deviation of the reward at the respective iteration. BORE-MLP is able to
find an optimum close to the one found by BORE-RF. Rσ is higher for BORE-MLP as the method
is still exploring new unseen regions at the end, and it can still improve its current maximum. The
table also shows the optimised parameters for the distribution-based sizes: orange for the length x,
green for the width y, and cyan for the height z. BORE-MLP and almost all the other methods found
that considering more uncertainty in the obstacle height z would provide a higher reward, which is
understandable considering that the gripper could find convenient trajectories by moving over the
obstacle. The most relevant dimension size is the width y, since a wrong y would result in obstacle
collision. Meanwhile, all methods can allow more uncertainty about the length of the obstacle as it
does not affect the collision. Most methods converge to a similar controller hyper-parameter λ.

5. Conclusion
This paper presented an adaptive variant of model predictive control that automatically estimates
model parameter distributions and tunes MPC hyper-parameters within a Bayesian optimisation
framework. In contrast to previous approaches, our formulation is the first to show that global
optimisation can be accomplished by learning a classifier that estimates density ratios. We studied
the empirical performance of the framework with different classifiers and against benchmark BO
versions. The proposed method was able to surpass the performance of the traditional BO and a
heteroscedastic BO variation. Our results indicate the flexibility of using density-ratio estimation to
optimise MPC and how it can impact the performance of MPC in control and robotic tasks under
dynamics model uncertainty. Future research directions include obtaining theoretical results on
the effects of heteroscedasticity and exploring alternative non-normal distributions for the
actions that may be better suited to the tasks.


References
Dirk V. Arnold and Nikolaus Hansen. Active covariance matrix adaptation for the (1+1)-CMA-
ES. In Proceedings of the 12th annual conference on Genetic and evolutionary computation -
GECCO ’10, page 385, Portland, OR, 2010. ACM.

James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-
parameter Optimization. In Advances in Neural Information Processing Systems, pages 2546–
2554, 2011.

Mohak Bhardwaj, Balakumar Sundaralingam, Arsalan Mousavian, Nathan D Ratliff, Dieter Fox,
Fabio Ramos, and Byron Boots. Storm: An integrated framework for fast joint-space model-
predictive control for reactive manipulation. In 5th Annual Conference on Robot Learning, 2021.

Adam D. Bull. Convergence Rates of Efficient Global Optimization Algorithms. Journal of Machine
Learning Research (JMLR), 12:2879–2904, 2011.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep rein-
forcement learning for continuous control. 33rd International Conference on Machine Learning,
ICML 2016, 3:2001–2014, 2016.

Daniel Görges. Relations between Model Predictive Control and Reinforcement Learning. IFAC-
PapersOnLine, 50(1):4920–4928, 2017. ISSN 24058963. doi: 10.1016/j.ifacol.2017.08.747.

Rel Guzman, Rafael Oliveira, and Fabio Ramos. Heteroscedastic Bayesian Optimisation for
Stochastic Model Predictive Control. IEEE Robotics and Automation Letters, 6(1):1–1, 2020.
doi: 10.1109/lra.2020.3028830.

Hilbert J Kappen. Linear theory for control of nonlinear stochastic systems. Physical review letters,
95(20):200201, 2005.

Kimin Lee, Younggyo Seo, Seunghyun Lee, Honglak Lee, and Jinwoo Shin. Context-aware Dy-
namics Model for Generalization in Model-Based Reinforcement Learning. In Proceedings of the
37th International Conference on Machine Learning (ICML), 2020.

Valerio Modugno, Gerard Neumann, Elmar Rueckert, Giuseppe Oriolo, Jan Peters, and Serena
Ivaldi. Learning soft task priorities for control of redundant robots. In 2016 IEEE International
Conference on Robotics and Automation (ICRA), pages 221–226. IEEE, 2016.

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-Real Transfer
of Robotic Control with Dynamics Randomization. In 2018 IEEE International Conference on
Robotics and Automation (ICRA), pages 3803–3810, Brisbane, Australia, 2018.

Fabio Ramos, Rafael Carvalhaes Possas, and Dieter Fox. BayesSim : adaptive domain randomiza-
tion via probabilistic inference for robotics simulators. In Robotics: Science and Systems (RSS),
Freiburg im Breisgau, Germany, 2019.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning.
The MIT Press, 2006. ISBN 026218253X.


D. Romeres, G. Prando, G. Pillonetto, and A. Chiuso. On-line bayesian system identification. In


2016 European Control Conference (ECC), pages 1359–1364, 2016. doi: 10.1109/ECC.2016.
7810478.

Nicholas Roy, Paul Newman, and Siddhartha Srinivasa. Variational bayesian optimization for run-
time risk-sensitive control. 2013.

Mohammad Sharifzadeh, Yuhao Jiang, Amir Salimi Lafmejani, Kevin Nichols, and Daniel Aukes.
Maneuverable gait selection for a novel fish-inspired robot using a cma-es-assisted workflow.
Bioinspiration & Biomimetics, 16(5):056017, 2021.

Farshud Sorourifar, Georgios Makrygirgos, Ali Mesbah, and Joel A Paulson. A data-driven auto-
matic tuning method for mpc under uncertainty using constrained bayesian optimization. IFAC-
PapersOnLine, 54(3):243–250, 2021.

Louis C Tiao, Aaron Klein, Cédric Archambeau, Edwin V Bonilla, Matthias Seeger, and Fabio
Ramos. Bayesian optimization by density ratio estimation. In Proceedings of the 38th Interna-
tional Conference on Machine Learning (ICML2021), Virtual (Online), 2021.

Zilong Wang and Marianthi Ierapetritou. A novel surrogate-based optimization method for black-
box simulation with heteroscedastic noise. Industrial & Engineering Chemistry Research, 56
(38):10720–10732, 2017. doi: 10.1021/acs.iecr.7b00867.

Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. Ag-
gressive driving with model predictive path integral control. Proceedings - IEEE International
Conference on Robotics and Automation, 2016-June:1433–1440, 2016. ISSN 10504729. doi:
10.1109/ICRA.2016.7487277.

Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou.
Information-Theoretic Model Predictive Control: Theory and Applications to Autonomous
Driving. IEEE Transactions on Robotics, 34(6):1603–1622, 2018a. ISSN 15523098. doi:
10.1109/TRO.2018.2865891.

Grady Williams, Brian Goldfain, Paul Drews, Kamil Saigol, James Rehg, and Evangelos
Theodorou. Robust Sampling Based Model Predictive Control with Sparse Objective Informa-
tion. In Robotics: Science and Systems (RSS), 2018b. doi: 10.15607/rss.2018.xiv.042.

