Abstract
Stochastic model predictive control has been a successful and robust control framework for many
robotics tasks where the system dynamics model is slightly inaccurate or the environment is subject
to disturbances. Despite these successes, it is still unclear how to best adjust control parameters
to the current task in the presence of model parameter uncertainty and heteroscedastic noise. In this
paper, we propose an adaptive MPC variant that automatically estimates control and model param-
eters by leveraging ideas from Bayesian optimisation (BO) and the classical expected improvement
acquisition function. We leverage recent results showing that BO can be reformulated via density
ratio estimation, which can be efficiently approximated by simply learning a classifier. This is then
integrated into a model predictive path integral control framework yielding robust controllers for a
variety of challenging robotics tasks. We demonstrate the approach on classical control problems
and robotic manipulation tasks under model uncertainty.
Keywords: Bayesian methods, Gaussian process, model predictive control, optimisation
1. Introduction
Reinforcement learning, as a framework, concerns learning how to interact with the environment
through experience, while optimal control emphasises sequential decision making and optimisation
methods. The boundary between the two fields has blurred as their theory and typical applications
increasingly overlap. Model predictive control (MPC) is an optimisation strategy for behaviour
generation that plans actions ahead by minimising a cost over a finite horizon. Reinforcement
learning and robotics can benefit from MPC by correcting behaviours while continually estimating
hyper-parameters. This controller learning capability can be achieved with data-driven approaches
for MPC optimisation (Görges, 2017).
We are particularly interested in path integral (PI) control (Kappen, 2005), which is a methodol-
ogy for solving nonlinear stochastic optimal control problems by sampling trajectories and comput-
ing costs. Using such methodology, model predictive path integral control (MPPI) was introduced
in (Williams et al., 2016). MPPI enables robots to navigate in stochastic and partially observable
environments, for example in car racing (Williams et al., 2018b). MPPI is a sampling-based,
derivative-free method, which makes it a simple yet powerful strategy for simulating and selecting actions.
Within data-driven approaches, deep reinforcement learning has been successful in solving high-
dimensional control problems in simulation (Duan et al., 2016). The main limitation of deep RL
is the need for many interactions with the environment, which can be impractical with a physical
system due to costly evaluations (Peng et al., 2018). An alternative to reduce evaluations is to have
a model of the system dynamics, also called a dynamics model or transition model. Any such model,
however, inevitably contains errors. Even so, data-driven approaches make it possible to reduce
these errors or adapt to the errors expected in the environment (Lee et al., 2020).
Data-driven approaches have been proposed for automatic MPC tuning, which can be seen as
an intersection between machine learning and control since they make use of the transition model in
combination with a learnt model. For example, (Sorourifar et al., 2021) presents MPC under uncertainty
over model parameters, using Bayesian optimisation (BO) to handle constraints on system
parameters in a tank reactor. (Lee et al., 2020) addresses different environment contexts where
a robot’s dynamics could change due to a component malfunctioning. Other approaches propose
inferring simulation parameters based on data instead of uniform parameter randomisation (Peng
et al., 2018; Ramos et al., 2019). In another example, the controller optimisation is able to handle
heteroscedastic noise for control tasks (Guzman et al., 2020). Intuitively, heteroscedastic noise is a
type of noise that changes with input variables. For example, in stochastic MPC, the noise associated
with the stochastic process changes significantly with the temperature hyper-parameter (Guzman
et al., 2020), making hyper-parameter tuning quite challenging from an optimisation perspective.
We propose a data-driven approach to optimise a stochastic controller by adapting the transi-
tion model parameters to the environment. For costly system evaluations, we use surrogate-based
optimisation. Instead of optimising the expensive objective directly, we optimise a cheaper surrogate
function that approximates it, so that a local or global optimum can be found with far fewer evaluations.
Surrogates are appropriate when function evaluations are expensive and noisy due to the characteristics
of the environment; in this context, a surrogate model can amortise the optimisation. Surrogate-based
modelling is usually associated with Gaussian processes (Rasmussen and Williams, 2006). We consider
single-objective optimisation problems with continuous inputs and noisy observations. We work on
the case where the noise variance depends on the input. That leads us to heteroscedastic optimisation,
a more realistic setting than the typical homoscedastic assumption. Heteroscedastic
noise can cause problems for the surrogate model and make the optimisation method deviate from
the maximum (Wang and Ierapetritou, 2017). To solve this issue, we aim at finding an optimisation
method for stochastic simulations under heteroscedastic noise.
BO commonly uses a Gaussian process (GP) (Rasmussen and Williams, 2006) as a surrogate model
for costly black-box functions. BO proposes solutions according to an acquisition function
that encodes the trade-off between exploration and exploitation of the objective function. GPs have
excellent generalisation properties, but their computational cost can be prohibitive for big data.
Additionally, standard GPs provide analytical posteriors under homoscedastic noise, while handling
heteroscedastic noise typically requires computationally expensive approximations.
An alternative formulation to BO, which allows the utilisation of simple classifiers within the
optimisation loop, was proposed in (Tiao et al., 2021) as Bayesian optimisation by Density Ratio
Estimation (BORE). The method introduces the concept of relative density ratio, which is used to
estimate the expected improvement acquisition function (Bull, 2011). The main advantage of this
formulation is that the relative density ratio can be expressed through a class-posterior probability,
which is bounded between 0 and 1 and can be estimated with any off-the-shelf probabilistic classifier.
Classifiers are easy to train and can handle a variety of input noise types, including heteroscedastic
noise, without major modifications to the classification function.
Contributions: The main contribution of this work is a new robust and adaptive MPC method
that automatically estimates distributions over model parameters and MPC hyper-parameters, such as
the temperature, by continuously updating a classifier that acts as a proxy for a Bayesian optimisation
step. We demonstrate that the approach provides superior performance in general control problems
and manipulation tasks under model uncertainty.
2. Background
2.1. Stochastic Model Predictive Control
Model predictive control rests on the idea of optimising an action sequence up to a certain horizon $T$
while minimising the sum of action costs, starting from the current state. MPC returns the next optimal
action $a^*$ that is sent to the system actuators. Unlike classical deterministic MPC, stochastic MPC
allows disturbances over the states. A stochastic MPC method models disturbances as random
variables. At each time step $t$, stochastic MPC generates sequences of perturbed actions
$V_t = \{v_i\}_{i=t}^{t+T}$, where $v_i = a_i^* + \epsilon_i$ and $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)$,
based on a roll-over sequence of optimal actions $\{a_i^*\}_{i=t}^{t+T}$ that starts at $t = 0$.
Each action results in a state produced by a transition or dynamics model $s_{t+1} = f(s_t, a_t)$,
and action sequences result in a state trajectory $S_t = \{s_{t+i}\}_{i=1}^{T}$. Each
trajectory has a cumulative cost $C$ determined by instant costs $c$ and a terminal cost $q$:
$$C(S_t) = q(s_{t+T}) + \sum_{i=1}^{T-1} c(s_{t+i}). \tag{1}$$
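As a point of reference, the following minimal sketch evaluates the cumulative cost of Equation (1) for one sampled trajectory; the quadratic instant cost and zero terminal cost are placeholders for illustration, not the costs used in the experiments.

```python
import numpy as np

def trajectory_cost(states, instant_cost, terminal_cost):
    """Cumulative cost C(S_t) of Eq. (1): terminal cost of the final state
    plus the sum of instant costs over the preceding states."""
    return terminal_cost(states[-1]) + sum(instant_cost(s) for s in states[:-1])

# Hypothetical example: T = 10 states of dimension 4, quadratic instant cost.
states = np.random.randn(10, 4)
C = trajectory_cost(states,
                    instant_cost=lambda s: float(s @ s),
                    terminal_cost=lambda s: 0.0)
```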
In stochastic MPC, the goal is to minimise the expected cost $\mathbb{E}[C(S_t)]$. The stochastic MPC method known
as model predictive path integral control (MPPI) and its variations (Williams et al., 2016, 2018b) provide
optimal actions for the entire horizon following an information-theoretic approach. Constraints over
the states are determined by the transition model, and the actions are constrained according to their
limits. After $M$ simulated rollouts, MPPI computes importance weights and updates the sequence of optimal actions:
$$a_i^* \leftarrow a_i^* + \sum_{j=1}^{M} w(V_t^j)\,\epsilon_i^j, \qquad
w(V_t^j) = \frac{1}{\eta}\exp\!\left(-\frac{1}{\lambda}\left(C(S_t^j) + \frac{\lambda}{\sigma_\epsilon^2}\sum_{i=t}^{t+T} a_i^* \cdot v_i^j\right)\right), \tag{2}$$
where $j \in \{1, \dots, M\}$ indexes the rollouts and $\eta$ is a normalisation constant such that
$\sum_{j=1}^{M} w(V_t^j) = 1$. The hyper-parameter $\lambda$ is known as the temperature.
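The sketch below illustrates one MPPI update as in Equation (2); the array shapes, the baseline subtraction for numerical stability, and the function name are implementation choices made for illustration rather than the exact implementation used here.

```python
import numpy as np

def mppi_update(a_opt, eps, costs, v, lam, sigma_eps):
    """One MPPI update of the optimal action sequence, following Eq. (2).

    a_opt: (T, d) current optimal actions a*_i
    eps:   (M, T, d) sampled perturbations eps^j_i
    costs: (M,) trajectory costs C(S^j_t)
    v:     (M, T, d) perturbed actions v^j_i = a*_i + eps^j_i
    """
    # Action penalty term (lambda / sigma_eps^2) * sum_i a*_i . v^j_i, one value per rollout.
    penalty = (lam / sigma_eps**2) * np.einsum('td,mtd->m', a_opt, v)
    total = costs + penalty
    total -= total.min()              # baseline subtraction for numerical stability
    w = np.exp(-total / lam)
    w /= w.sum()                      # eta normalisation, so the weights sum to one
    return a_opt + np.einsum('m,mtd->td', w, eps)
```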
is no posterior uncertainty, we simply have $h_{\mathrm{EI}}(x \mid \mathcal{D}_{t-1}) = 0$. Here $\Psi(s_t)$ and $\psi(s_t)$ denote the
cumulative distribution and probability density functions of the standard normal distribution evaluated at $s_t$.
3. Methodology
3.1. Dynamics Model Uncertainty
The dynamics of the environment are modelled as a Markovian transition model with states $s \in \mathcal{S}$
and admissible actions $a \in \mathcal{A}$. The state follows Markovian
dynamics, $s_{t+1} = f(s_t, a_t)$, with a transition function $f$ and a reward function
$r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ that evaluates the system performance given a state and an action.
The transition model can be learned from data or specified from expert knowledge.
We propose adapting stochastic MPC to the environment by exposing the robot to different
possible scenarios through a transition model parameterised by a random variable $\theta$. To find
an optimal $\theta$, we add randomisation at each MPC trajectory rollout: we define a random vector
of transition model parameters $\theta$ and adapt it to the stochastic MPC controller. Each transition
model parameter follows a probability distribution parameterised by $\psi$:
$$s_{t+1} = f(s_t, a_t + \epsilon_t, \theta), \qquad \theta \sim p(\theta \mid \psi),$$
where $s_t$ is the state obtained at time $t$, and $a_t + \epsilon_t$ is the perturbed action as described in stochastic
MPC (Section 2.1). Note that $\theta$ is now an input to the dynamics model. Finally, the optimal actions
found by MPC are sent to the system using the dynamics model $f$.
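A minimal sketch of this randomisation step is given below: before each rollout, a transition model parameter θ is drawn from a distribution parameterised by ψ and passed to the dynamics function. The Gaussian parameterisation and the toy linear dynamics are assumptions made for illustration.

```python
import numpy as np

def rollout(s0, actions, dynamics, psi, rng=np.random.default_rng()):
    """Simulate one MPC rollout with a sampled transition model parameter theta.

    psi = (mu, sigma) parameterises the distribution of theta, e.g. the
    pendulum rod length l drawn from N(mu_l, sigma_l^2).
    """
    mu, sigma = psi
    theta = rng.normal(mu, sigma)          # one theta sampled per trajectory rollout
    states = [s0]
    for a in actions:
        states.append(dynamics(states[-1], a, theta))
    return np.asarray(states[1:])

# Hypothetical linear dynamics where theta scales the effect of the action.
f = lambda s, a, theta: s + theta * a
traj = rollout(np.zeros(2), np.random.randn(10, 2), f, psi=(1.0, 0.05))
```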
previously, and it can handle heteroscedasticity without modifications to the core method. Heteroscedastic
noise is common when tuning stochastic MPC in control problems (Guzman et al., 2020).
A difference between GP-based BO and BORE is that GP-based BO filters noise directly via the GP
model, while BORE relies on the classifier to account for label noise. Therefore, to mitigate noise
effects, we propose optimising the objective function averaged over a small number $n_e$ of episodes.
To better understand the optimisation framework, Algorithm 3 describes how to estimate the optimal
controller and dynamics hyper-parameters $x^* = \{\psi^*, \phi^*\}$ using a binary classifier. Following
the reinforcement learning literature, we define the cumulative reward $R = \sum_{i=1}^{n_s} r_i$, where $n_s$ is
the number of time steps in an episode, and set our goal as maximising the expected cumulative
reward $g = \mathbb{E}[R]$. We compute an empirical expected cumulative reward $g$ by averaging $R$ over $n_e$
episodes. The classifier $\pi_t$ is trained by first assigning labels $\{z_k\}_{k=1}^{t}$ to the data observed up to the
current iteration $t$. For training, the classifier uses the auxiliary dataset
$$\mathcal{D}_t = \{(x_k, z_k)\}_{k=1}^{t},$$
where the labels are obtained by separating the observed data according to $\gamma \in (0, 1)$, using the
$\gamma$-th quantile $\tau$ of $\{g_k\}_{k=1}^{t}$:
$$z_k = \mathbb{1}\left[g_k \geq \tau\right], \qquad \tau = Q_\gamma\!\left(\{g_k\}_{k=1}^{t}\right).$$
The next candidate solution is then suggested by maximising the classifier output, $x_{t+1} = \arg\max_x \pi_t(x)$.
Note that this is equivalent to acquisition function maximisation in conventional BO and allows the
method to suggest candidate solutions efficiently. For better performance, the maximisation can be
carried out with a global optimisation method (e.g., Arnold and Hansen, 2010).
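As an illustration of one iteration of this loop, the sketch below labels the observed rewards with the quantile rule above, fits a probabilistic classifier, and returns the candidate that maximises the predicted class probability. The scikit-learn random forest and the random-search maximiser are stand-ins for the classifiers and the global optimiser discussed in the paper, and the default values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bore_step(X, g, bounds, gamma=0.15, n_candidates=2000, rng=np.random.default_rng()):
    """Suggest the next hyper-parameter vector x given past observations.

    X: (t, d) evaluated hyper-parameters x_k; g: (t,) empirical expected rewards g_k.
    bounds: (lower, upper) arrays of shape (d,) defining the search box.
    """
    tau = np.quantile(g, gamma)              # gamma-th quantile of the observed rewards
    z = (g >= tau).astype(int)               # labels of the auxiliary dataset D_t
    if z.min() == z.max():                   # degenerate labels: fall back to exploration
        return rng.uniform(bounds[0], bounds[1], size=X.shape[1])
    clf = RandomForestClassifier(n_estimators=50).fit(X, z)

    # Maximising the predicted class-1 probability plays the role of acquisition
    # maximisation; a simple random search stands in for a global optimiser here.
    cand = rng.uniform(bounds[0], bounds[1], size=(n_candidates, X.shape[1]))
    scores = clf.predict_proba(cand)[:, 1]
    return cand[np.argmax(scores)]
```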
bagging. The number of decision trees should be large enough to reduce classification variance
without increasing the bias. We highlight this method because it works as an out-of-the-box
classifier.
4. Experiments
We consider the problem of optimising the function $R(x) = g(x) + \epsilon$ with heteroscedastic noise
$\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2(x))$, where $\sigma_\epsilon^2(x)$ denotes an input-dependent noise variance. In
these experiments, the variables we optimise are the controller hyper-parameters $\phi$ and the variables
$\psi$ that parameterise the transition model parameter distributions. We aim to optimise the true noise-free
function $g$, although we only have access to the cumulative reward $R$, which we use as the objective
in the simulator experiments. We evaluate the performance of the proposed adaptive
model predictive control framework in several experiments described below.
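The sketch below emulates this noisy objective; the quadratic noise-free function and the noise profile are illustrative stand-ins for the simulator, and averaging over n_e episodes implements the variance-reduction step described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_return(x, g=lambda x: -np.sum(x ** 2),
                 noise_std=lambda x: 0.1 + np.abs(x[0])):
    """One episode return R(x) = g(x) + eps, with eps ~ N(0, sigma_eps^2(x))."""
    return g(x) + rng.normal(0.0, noise_std(x))

def empirical_g(x, n_episodes=10):
    """Empirical expected cumulative reward: average of R(x) over n_e episodes."""
    return float(np.mean([noisy_return(x) for _ in range(n_episodes)]))
```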
Table 1: Environment settings (number of episodes n_e, horizon T, number of rollouts M), control hyper-parameter search spaces, transition model parameter distribution ranges, and true parameter values.

Pendulum (n_e = 1, T = 10, M = 10): control hyp. λ ∈ [0.01, 50], σ_ϵ ∈ [1.0, 10]; distribution parameters µ_l ∈ [0.5, 1.6], σ_l ∈ [0.001, 0.1]; true parameter l = 1.0.
Half-Cheetah (n_e = 18, T = 15, M = 10): control hyp. λ ∈ [0.01, 1.0], σ_ϵ ∈ [0.05, 2.0]; distribution parameters κ_{m,µ} ∈ [0.2, 2.0], κ_{m,σ} ∈ [0.001, 0.1]; true parameter κ_m = 1.0.
Fetchreach (n_e = 90, T = 12, M = 3): control hyp. λ ∈ [0.01, 0.03], σ_ϵ ∈ [0.001, 0.5]; distribution parameters κ_{d,µ} ∈ [1.0, 50], κ_{d,σ} ∈ [0.001, 0.6]; true parameter κ_d = 1.0.
Panda (n_e = 10, T = 150, M = 20): control hyp. λ ∈ [0.01, 2.0]; distribution parameters x_µ ∈ [0.3, 0.32], x_σ ∈ [0.001, 0.05], y_µ ∈ [0.1, 0.12], y_σ ∈ [0.001, 0.01], z_µ ∈ [0.6, 0.62], z_σ ∈ [0.001, 0.03]; true parameters x = 0.3, y = 0.1, z = 0.6.
target location, and a fixed initial robot position. MPPI trajectory evaluations are done on the GPU,
which helped overcome efficiency issues.
The purpose of the Panda task, shown in Figure 2, is to reach the yellow target while avoiding
collision with an obstacle. The obstacle has true length x = 0.3, width y = 0.1, and height z = 0.6. We
assume these obstacle dimensions are only partially observable and infer them as part of the transition
model parameters, with the search spaces shown in Table 1. The transition model parameter l is the rod
length for Pendulum, κ_m is the mass scaling factor for all links in Half-Cheetah, and κ_d is a damping
ratio scaling factor for all components in Fetchreach. For Panda, we optimise the obstacle dimensions
x, y, and z. Each transition model parameter is a random variable parameterised by ψ; for example,
ψ = {κ_{d,µ}, κ_{d,σ}} are the damping ratio mean and standard deviation for Fetchreach.

Figure 2: Panda robot setup.
4.2. Method Configuration
We compare the proposed adaptive MPC framework with other surrogate-based methods used in
robotics. All the compared methods are configured as follows.
BORE classifiers: The proposed framework requires a probabilistic classifier that can deal with the
stochasticity of robotic tasks. We compare two probabilistic classifiers: a Random Forest (RF) with
50 decision trees, denoted BORE-RF, and a Multi-Layer Perceptron (MLP), denoted BORE-MLP,
with the quantile threshold set to γ = 0.15.
[Figure: reward versus optimisation iterations (0–50) for each task; panels include Fetchreach, x-axis: Iterations.]
BO methods: We compare the proposed method with a traditional homoscedastic BO (BO_homo) and
the heteroscedastic version (BO_hetero) from Guzman et al. (2020). We collected 400 data points for
the control and robotic tasks over the search spaces shown in Table 1. Using these data, we
optimise BO's hyper-parameters via maximisation of the GP marginal likelihood. The marginal
likelihood and the acquisition function are optimised using multi-start L-BFGS-B. We used a UCB
acquisition function $h_{\mathrm{UCB}}(x) = \mu(x) + \delta\sigma(x)$ with balance factor $\delta = 3.0$. For both BO variants, we used
the anisotropic squared exponential kernel $k(x, x') = \sigma_n \exp\!\left(-\tfrac{1}{2}(x - x')^\top \operatorname{diag}(\ell)^{-2} (x - x')\right)$.
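For reference, the kernel and acquisition above can be written compactly as in the sketch below; the Gaussian process posterior mean and standard deviation are assumed to be supplied by a standard GP library, so only the kernel and the UCB rule are shown.

```python
import numpy as np

def se_kernel_aniso(x, x_prime, sigma_n, lengthscales):
    """Anisotropic squared exponential kernel with per-dimension lengthscales."""
    d = (np.asarray(x) - np.asarray(x_prime)) / np.asarray(lengthscales)
    return sigma_n * np.exp(-0.5 * float(d @ d))

def ucb(mu, sigma, delta=3.0):
    """UCB acquisition h_UCB(x) = mu(x) + delta * sigma(x), with balance factor delta."""
    return mu + delta * sigma
```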
Other methods: We use TPE (Bergstra et al., 2011) with quantile value 0.5, since it is the method on
which BORE builds, and finally, we use covariance matrix adaptation evolution strategy
(CMA-ES) (Arnold and Hansen, 2010) as a non-BO baseline, configured with $\sigma_0 = 10$ and population size
2. CMA-ES has been widely used for hyper-parameter tuning in robotics (Modugno et al., 2016;
Sharifzadeh et al., 2021).
Method   R_max   R_σ   λ   x_µ   x_σ   y_µ   y_σ   z_µ   z_σ
BORE-MLP -21106.36 724.10 1.65 0.3104 0.0010 0.1000 0.0010 0.6000 0.0058
BORE-RF -21891.82 291.64 1.67 0.3036 0.0431 0.1001 0.0013 0.6145 0.0234
BOhetero -23025.47 162.22 1.73 0.3185 0.0500 0.1200 0.0028 0.6145 0.0118
BOhomo -22870.14 158.46 2.00 0.3200 0.0010 0.1000 0.0010 0.6200 0.0010
TPE -23438.34 121.26 1.63 0.3124 0.0159 0.1152 0.0045 0.6047 0.0067
CMA-ES -23779.35 481.55 1.47 0.3200 0.0010 0.1200 0.0010 0.6200 0.0142
Table 2: Maximum reward found at the last iteration for the Panda task.
are outperformed by BORE-MLP and BORE-RF. BORE-MLP converges noticeably faster to an
optimum in most tasks. In the Panda environment, the difference is larger, suggesting that
the proposed framework performs better in higher-dimensional problems.
5. Conclusion
This paper presented an adaptive variant of model predictive control that automatically estimates
model parameter distributions and tunes MPC hyper-parameters within a Bayesian optimisation
framework. In contrast to previous approaches, our formulation is the first to show that this global
optimisation of controller and model parameters can be accomplished by learning a classifier that estimates density ratios. We studied
the empirical performance of the framework with different classifiers and against benchmark BO
versions. The proposed method was able to surpass the performance of the traditional BO and a
heteroscedastic BO variation. Our results indicate the flexibility of using density-ratio estimation to
optimise MPC and how it can impact the performance of MPC in control and robotic tasks under
dynamics model uncertainty. Future research directions include obtaining theoretical results on
the effects of heteroscedasticity and exploring alternative, non-normal distributions for the
actions that may be better suited to particular tasks.
References
Dirk V. Arnold and Nikolaus Hansen. Active covariance matrix adaptation for the (1+1)-CMA-
ES. In Proceedings of the 12th annual conference on Genetic and evolutionary computation -
GECCO ’10, page 385, Portland, OR, 2010. ACM.
James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-
parameter Optimization. In Advances in Neural Information Processing Systems, pages 2546–
2554, 2011.
Mohak Bhardwaj, Balakumar Sundaralingam, Arsalan Mousavian, Nathan D Ratliff, Dieter Fox,
Fabio Ramos, and Byron Boots. Storm: An integrated framework for fast joint-space model-
predictive control for reactive manipulation. In 5th Annual Conference on Robot Learning, 2021.
Adam D. Bull. Convergence Rates of Efficient Global Optimization Algorithms. Journal of Machine
Learning Research (JMLR), 12:2879–2904, 2011.
Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep rein-
forcement learning for continuous control. 33rd International Conference on Machine Learning,
ICML 2016, 3:2001–2014, 2016.
Daniel Görges. Relations between Model Predictive Control and Reinforcement Learning. IFAC-
PapersOnLine, 50(1):4920–4928, 2017. ISSN 24058963. doi: 10.1016/j.ifacol.2017.08.747.
Rel Guzman, Rafael Oliveira, and Fabio Ramos. Heteroscedastic Bayesian Optimisation for
Stochastic Model Predictive Control. IEEE Robotics and Automation Letters, 6(1):1–1, 2020.
doi: 10.1109/lra.2020.3028830.
Hilbert J Kappen. Linear theory for control of nonlinear stochastic systems. Physical review letters,
95(20):200201, 2005.
Kimin Lee, Younggyo Seo, Seunghyun Lee, Honglak Lee, and Jinwoo Shin. Context-aware Dy-
namics Model for Generalization in Model-Based Reinforcement Learning. In Proceedings of the
37th International Conference on Machine Learning (ICML), 2020.
Valerio Modugno, Gerard Neumann, Elmar Rueckert, Giuseppe Oriolo, Jan Peters, and Serena
Ivaldi. Learning soft task priorities for control of redundant robots. In 2016 IEEE International
Conference on Robotics and Automation (ICRA), pages 221–226. IEEE, 2016.
Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-Real Transfer
of Robotic Control with Dynamics Randomization. In 2018 IEEE International Conference on
Robotics and Automation (ICRA), pages 3803–3810, Brisbane, Australia, 2018.
Fabio Ramos, Rafael Carvalhaes Possas, and Dieter Fox. BayesSim : adaptive domain randomiza-
tion via probabilistic inference for robotics simulators. In Robotics: Science and Systems (RSS),
Freiburg im Breisgau, Germany, 2019.
Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning.
The MIT Press, 2006. ISBN 026218253X.
Scott Kuindersma, Roderic Grupen, and Andrew Barto. Variational bayesian optimization for run-
time risk-sensitive control. In Robotics: Science and Systems (RSS), 2013.
Mohammad Sharifzadeh, Yuhao Jiang, Amir Salimi Lafmejani, Kevin Nichols, and Daniel Aukes.
Maneuverable gait selection for a novel fish-inspired robot using a cma-es-assisted workflow.
Bioinspiration & Biomimetics, 16(5):056017, 2021.
Farshud Sorourifar, Georgios Makrygirgos, Ali Mesbah, and Joel A Paulson. A data-driven auto-
matic tuning method for mpc under uncertainty using constrained bayesian optimization. IFAC-
PapersOnLine, 54(3):243–250, 2021.
Louis C Tiao, Aaron Klein, Cédric Archambeau, Edwin V Bonilla, Matthias Seeger, and Fabio
Ramos. BORE: Bayesian optimization by density-ratio estimation. In Proceedings of the 38th
International Conference on Machine Learning (ICML), 2021.
Zilong Wang and Marianthi Ierapetritou. A novel surrogate-based optimization method for black-
box simulation with heteroscedastic noise. Industrial & Engineering Chemistry Research, 56
(38):10720–10732, 2017. doi: 10.1021/acs.iecr.7b00867.
Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. Ag-
gressive driving with model predictive path integral control. In IEEE International Conference on
Robotics and Automation (ICRA), pages 1433–1440, 2016. doi: 10.1109/ICRA.2016.7487277.
Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou.
Information-Theoretic Model Predictive Control: Theory and Applications to Autonomous
Driving. IEEE Transactions on Robotics, 34(6):1603–1622, 2018a. doi: 10.1109/TRO.2018.2865891.
Grady Williams, Brian Goldfain, Paul Drews, Kamil Saigol, James Rehg, and Evangelos
Theodorou. Robust Sampling Based Model Predictive Control with Sparse Objective Informa-
tion. In Robotics: Science and Systems (RSS), 2018b. doi: 10.15607/rss.2018.xiv.042.