Reinforcement Learning and Optimization-based
Control
                   Assoc. Prof. Dr. Emre Koyuncu
                    Department of Aeronautics Engineering
                        Istanbul Technical University
                        Lecture 1: Introduction
Table of Contents
1   Optimal Control and RL
2   Adaptive Control
3   Reinforcement Learning
4   RL Applications
5   About this Course
Adaptive and Optimal Control
Optimal Control
  • Minimizes a prescribed performance function
  • Usually designed offline by solving the HJB equation
  • Uses complete knowledge of the system
  • Solving the nonlinear HJB equation is often hard or impossible

Adaptive Control
  • Learns online via a feedback function
  • Not usually designed to be optimal
  • First identifies the system, then uses the model
MPC and RL
   • Both are frameworks to solve sequential decision making problems
   • Both automatically design controllers based on desired outcomes
     (reward/cost, constraints, etc.)
Reinforcement Learning
  • Controller directly learned from data via exploration and exploitation
  • Both continuous and binary/sparse rewards
  • Constraints imposed via penalties
  • Mostly parameterized controllers; Deep Learning integrated cheaply
  • Usually history included in the definition of the state

Model Predictive Control
  • System identification precedes control implementation; the model is fixed during execution
  • Typically convex stage costs
  • Constraints imposed explicitly
  • Online optimization over the prediction horizon - expensive?
  • Usually combined with a state estimator
Linear Quadratic Regulators (LQR)
The most basic optimal controller for LTI systems. Consider the following system

    ẋ = Ax(t) + Bu(t)

where the state x(t) ∈ R^n and the control input u(t) ∈ R^m. The system is associated with the infinite-horizon quadratic cost function

    V(x(t_0), t_0) = ∫_{t_0}^{∞} (x^T(τ) Q x(τ) + u^T(τ) R u(τ)) dτ

with weighting matrices Q ≥ 0 and R > 0.
  • it is assumed that (A, B) is stabilizable - there exists a control input that makes the system stable
  • (A, √Q) is detectable - unstable modes are observable through the output y = √Q x
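As a quick numerical check of these assumptions (not part of the slides), the stabilizability of (A, B) and the detectability of (A, √Q) can be tested with the PBH rank test; the double-integrator matrices below are a hypothetical example.

```python
# Minimal sketch: PBH rank tests for stabilizability/detectability.
# The matrices are a hypothetical double-integrator example.
import numpy as np
from scipy.linalg import sqrtm

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
C = np.real(sqrtm(Q))        # "output" y = sqrt(Q) x used in the detectability condition

n = A.shape[0]
for lam in np.linalg.eigvals(A):
    if lam.real >= 0:        # only unstable/marginal modes need to be checked
        stab = np.linalg.matrix_rank(np.hstack([lam * np.eye(n) - A, B])) == n
        detect = np.linalg.matrix_rank(np.vstack([lam * np.eye(n) - A, C])) == n
        print(f"mode {lam:+.2f}: stabilizable={stab}, detectable={detect}")
```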
Linear Quadratic Regulators (LQR)
The LQR optimal control problem requires finding the policy that minimizes the cost

    u*(t) = arg min_{u(t), t_0 ≤ t ≤ ∞} V(t_0, x(t_0), u(t))

The solution is given by u(t) = −Kx(t), where the gain matrix is

    K = R^{-1} B^T P

and P is the positive definite solution of the Algebraic Riccati Equation (ARE)

    A^T P + PA + Q − PBR^{-1}B^T P = 0

  • under the stabilizability and detectability conditions there is a unique positive semi-definite solution
  • the closed-loop system A − BK is asymptotically stable
  • this is an offline solution that requires complete knowledge of the system dynamics
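The ARE above is an offline computation, so it is straightforward to reproduce numerically. A minimal sketch (not from the slides) using SciPy's ARE solver on the same hypothetical double-integrator example:

```python
# Minimal sketch: offline LQR design by solving A^T P + P A + Q - P B R^{-1} B^T P = 0.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)                      # state weighting, Q >= 0
R = np.array([[1.0]])              # control weighting, R > 0

P = solve_continuous_are(A, B, Q, R)      # positive definite ARE solution
K = np.linalg.solve(R, B.T @ P)           # optimal gain, control law u = -K x

# The closed-loop matrix A - B K should be Hurwitz (asymptotically stable).
print("K =", K)
print("closed-loop eigenvalues:", np.linalg.eigvals(A - B @ K))
```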
Linear Quadratic Zero-sum Games
The LQ zero-sum game has the following linear dynamics

    ẋ = Ax(t) + Bu(t) + Dd(t)

where the state x(t) ∈ R^n, the control input u(t) ∈ R^m, and the disturbance d(t) ∈ R^k. The system is associated with the infinite-horizon quadratic cost function

    V(x(t), u, d) = (1/2) ∫_t^∞ (x^T Qx + u^T Ru − γ²‖d‖²) dτ ≡ ∫_t^∞ r(x, u, d) dτ

with the control weighting matrix R = R^T > 0 and a scalar γ > 0.
Linear Quadratic Zero-sum Games
The LQ zero-sum game requires finding the control policy that minimizes the cost with respect to the control and maximizes it with respect to the disturbance

    V*(x(0)) = min_u max_d J(x(0), u, d)
             = min_u max_d ∫_0^∞ (x^T Qx + u^T Ru − γ²‖d‖²) dt

The solution of this optimal control problem is given by

    u(x) = −R^{-1} B^T P x = −Kx
    d(x) = (1/γ²) D^T P x = Lx

where P is the solution to the game ARE

    0 = A^T P + PA + Q − PBR^{-1}B^T P + (1/γ²) PDD^T P
Linear Quadratic Zero-sum Games
  • There exists a solution P > 0 if (A, B) is stabilizable, (A, √Q) is observable, and γ > γ*, the H-infinity gain.
  • this is an offline solution that requires complete knowledge of the system dynamics (A, B, D)
  • if the system dynamics (A, B, D) change or the performance index (Q, R, γ) varies, a new optimal control solution is needed.
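One standard offline way to obtain P (a sketch, not the method prescribed by the slides) is through the stable invariant subspace of the associated Hamiltonian matrix. The matrices below are hypothetical, and γ is assumed to be above the H-infinity level γ* so that a stabilizing solution exists.

```python
# Minimal sketch: solve the game ARE
#   0 = A^T P + P A + Q - P B R^{-1} B^T P + (1/gamma^2) P D D^T P
# via the stable invariant subspace of the Hamiltonian matrix.
import numpy as np
from scipy.linalg import schur

A = np.array([[0.0, 1.0], [-1.0, -1.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
gamma = 5.0                                  # assumed > gamma*

n = A.shape[0]
S = B @ np.linalg.solve(R, B.T) - (1.0 / gamma**2) * (D @ D.T)

H = np.block([[A, -S], [-Q, -A.T]])          # Hamiltonian matrix
T, Z, sdim = schur(H, sort="lhp")            # order stable eigenvalues first
X, Y = Z[:n, :sdim], Z[n:, :sdim]
P = Y @ np.linalg.inv(X)                     # stabilizing solution P = Y X^{-1}
P = 0.5 * (P + P.T)                          # symmetrize against round-off

K = np.linalg.solve(R, B.T @ P)              # minimizing player: u = -K x
L = (1.0 / gamma**2) * (D.T @ P)             # maximizing player: d =  L x
print("ARE residual:", np.linalg.norm(A.T @ P + P @ A + Q - P @ S @ P))
```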
Model Reference Adaptive Controller (MRAC)
Consider the simple scalar case

    ẋ = ax + bu

where the state x(t) ∈ R, the control input u(t) ∈ R, and the input gain b > 0. It is desired for the plant state to follow the state of a reference model given by

    ẋ_m = −a_m x_m + b_m r

where r(t) ∈ R is the reference input signal. Take the controller structure as

    u = −kx + dr

which has a feedback term and a feedforward term. The gains k and d are unknown and are to be determined so that the state tracking error e(t) = x(t) − x_m(t) goes to zero.
Model Reference Adaptive Controller (MRAC)
Tune the controller parameters online. E.g., using Lyapunov techniques, the parameters are tuned according to

    k̇ = αex,    ḋ = −βer

where α, β > 0 are tuning parameters; then the tracking error e(t) goes to zero with time.
  • the feedback gain k is tuned by the product of the state x(t) and the tracking error e(t)
  • the feedforward gain d is tuned by the product of the reference input r(t) and the tracking error e(t)
  • the plant dynamics (a, b) are not needed in the tuning laws!
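The scalar MRAC above is easy to simulate. A minimal sketch (not from the slides; the plant, reference model, and adaptation gains are hypothetical) integrates the tuning laws k̇ = αex and ḋ = −βer with forward Euler; note that the plant parameters (a, b) never enter the adaptation:

```python
# Minimal sketch: scalar MRAC with the tuning laws k_dot = alpha*e*x, d_dot = -beta*e*r.
import numpy as np

a, b = 1.0, 2.0            # plant x_dot = a x + b u (unknown to the controller)
am, bm = 4.0, 4.0          # reference model xm_dot = -am xm + bm r
alpha, beta = 10.0, 10.0   # adaptation gains
dt, T = 1e-3, 20.0

x = xm = k = d = 0.0
for step in range(int(T / dt)):
    r = np.sign(np.sin(0.5 * step * dt))   # square-wave reference keeps the signals exciting
    e = x - xm                              # state tracking error
    u = -k * x + d * r                      # feedback + feedforward controller
    x += dt * (a * x + b * u)               # plant
    xm += dt * (-am * xm + bm * r)          # reference model
    k += dt * (alpha * e * x)               # tuning laws: no knowledge of (a, b) needed
    d += dt * (-beta * e * r)

print(f"final tracking error e = {x - xm:.4f}")
print(f"gains: k = {k:.3f} (matching value (a+am)/b = {(a+am)/b:.3f}), "
      f"d = {d:.3f} (matching value bm/b = {bm/b:.3f})")
```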
Reinforcement Learning
RL has close connections to both optimal and adaptive control.
  • allows the design of adaptive controllers that learn online, in real time
  • provides solutions to user-prescribed optimal control problems
E.g., the actor-critic structure:
  • policy evaluation, executed by the critic, determines how close to optimal the current action is
  • policy improvement, performed by the actor, modifies the control policy using the learned value function
AI/RL vs Control Terminology
RL uses max value, Control uses min cost
  • Reward of a stage → Cost of a stage
  • State value → State cost
  • Value function → Cost function
System terminology
  • Agent → Controller or decision maker
  • Action → Control or decision
  • Environment → Dynamic system
Learning/Planning terminology
  • Learning → Solving a problem with simulation
  • Self-learning → Solving a problem with simulation-based policy iteration
  • Planning vs Learning → Solving a problem with model-based vs. model-free simulation
Value Functions
  • Value functions measure the goodness of a particular state or state/action pair: how good it is for the agent to be in a particular state, or to execute a particular action at a particular state, for a given policy.
  • Optimal value functions measure the best possible goodness of states
    or state/action pairs under all possible policies.
  • Prediction: For a given policy, estimate state and state/action value
    functions
  • Control (Optimal): Estimate the optimal state and state/action value
    functions
Sequential Decision
Optimal decision
  • At the current state, apply the decision that minimizes
    Current stage cost + J*(Next state)
    where J*(Next state) is the optimal future cost, starting from the next state
  • This defines the optimal policy - an optimal control to apply at each state
Principle of Optimality
Principle of optimality
Let {u_0*, ..., u_{N−1}*} be an optimal control sequence with corresponding state sequence {x_0*, ..., x_N*}. Consider the tail subproblem that starts at x_k* at time k and minimizes over {u_k, ..., u_{N−1}} the cost-to-go from k to N

    g_k(x_k*, u_k) + Σ_{m=k+1}^{N−1} g_m(x_m, u_m) + g_N(x_N)

Then the tail optimal control sequence {u_k*, ..., u_{N−1}*} is optimal for the tail subproblem.
Dynamic Programming
Solve all the tail subproblems of a given time length using the solution of all the tail subproblems of shorter time length.

By the principle of optimality
  • Consider every possible u_k and solve the tail subproblem that starts at the next state x_{k+1} = f_k(x_k, u_k)
  • Optimize over all u_k

The DP algorithm
Start with

    J_N*(x_N) = g_N(x_N),    for all x_N

and for k = 0, ..., N − 1, let

    J_k*(x_k) = min_{u_k ∈ U_k(x_k)} [ g_k(x_k, u_k) + J_{k+1}*(f_k(x_k, u_k)) ],    for all x_k.

The optimal cost J*(x_0) is obtained at the last step: J_0*(x_0) = J*(x_0).
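The backward recursion is mechanical once the dynamics, stage costs, and terminal cost are tabulated. A minimal sketch (a hypothetical toy problem, not from the slides) with five integer states and three actions:

```python
# Minimal sketch: exact DP backward recursion
#   J_N(x) = g_N(x),   J_k(x) = min_u [ g_k(x, u) + J_{k+1}(f_k(x, u)) ]
N = 5
states = range(5)
controls = (-1, 0, 1)

def f(x, u):              # dynamics, saturated at the state-space boundary
    return min(max(x + u, 0), 4)

def g(x, u):              # stage cost: distance from target state 2 plus control effort
    return (x - 2) ** 2 + abs(u)

def gN(x):                # terminal cost
    return 10 * (x - 2) ** 2

J = [[0.0] * 5 for _ in range(N + 1)]   # tail costs J[k][x]
mu = [[0] * 5 for _ in range(N)]        # an optimal policy mu[k][x]
for x in states:
    J[N][x] = gN(x)
for k in reversed(range(N)):
    for x in states:
        costs = {u: g(x, u) + J[k + 1][f(x, u)] for u in controls}
        mu[k][x] = min(costs, key=costs.get)
        J[k][x] = costs[mu[k][x]]

print("optimal cost from x0 = 0:", J[0][0])
print("first optimal action at x0 = 0:", mu[0][0])
```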
Constraints via Infinite Cost Values
Can assign infinite cost to infeasible points, using the extended reals R̄ := R ∪ {∞, −∞}.

Constrained Optimal Control Problem

    min_{s,a}  Σ_{k=0}^{N−1} c(s_k, a_k) + E(s_N)
    s.t.  s_0 = s̄_0
          s_{k+1} = f(s_k, a_k)
          0 ≥ h(s_k, a_k),  k = 0, ..., N − 1
          0 ≥ r(s_N)

Equivalent Unconstrained Formulation

    min_{s,a}  Σ_{k=0}^{N−1} c̄(s_k, a_k) + Ē(s_N)
    s.t.  s_0 = s̄_0
          s_{k+1} = f(s_k, a_k),  k = 0, ..., N − 1

with

    c̄(s, a) = c(s, a) if h(s, a) ≤ 0, ∞ otherwise
    Ē(s) = E(s) if r(s) ≤ 0, ∞ otherwise
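In code, the reformulation amounts to wrapping the stage and terminal costs so that infeasible points return +∞. A minimal sketch (the functions c, h, E, r below are placeholders for the problem data, not names from the slides):

```python
# Minimal sketch: extended-real cost wrappers for the unconstrained reformulation.
import math

def c_bar(s, a, c, h):
    """Stage cost on the extended reals: c(s, a) if h(s, a) <= 0, else +inf."""
    return c(s, a) if h(s, a) <= 0 else math.inf

def E_bar(s, E, r):
    """Terminal cost on the extended reals: E(s) if r(s) <= 0, else +inf."""
    return E(s) if r(s) <= 0 else math.inf
```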
Model-free vs Model-based
George Box
"All models are wrong but some models are useful"
  • Due to model error, model-free methods often achieve better policies, though they are more time consuming
  • (Adaptivity) We will examine the use of (inaccurate) learned models and ways to accelerate learning without hindering the final policy
Bellman’s curse of dimensionality
  • Exact Dynamic Programming is an elegant and powerful way to solve any optimal control problem to global optimality, independent of convexity. It can be interpreted as an efficient implementation of an exhaustive search that explores all possible control actions for all possible circumstances.
  • However, it requires the tabulation of cost-to-go functions for all possible states s ∈ S. Thus, it is exactly implementable only for discrete state and action spaces, and otherwise requires a discretization of the state space. Its computational complexity grows exponentially in the state dimension. This "curse of dimensionality", a phrase coined by Richard Bellman, unfortunately makes exact DP impossible to apply to systems with larger state dimensions.
  • Classical MPC circumvents this problem by restricting itself to finding only the optimal trajectory that starts at the current state s_0.
  • Explicit MPC suffers from the same curse of dimensionality as DP.
Reinforcement Learning History
Historical highlights
  • Exact DP, Optimal Control - Bellman, Shannon, others 1950s
  • AI/RL and Decision Making ideas - late 80s and early 90s
  • Backgammon programs - Tesauro, 1992
  • Algorithm era, analysis, applications, books - mid 90s
  • Machine Learning, Big Data, Neural Networks - mid 2000s
  • AlphaGo and AlphaZero - DeepMind, 2016, 2017
  • DARPA AlphaDogfight Trials against experienced F-16 pilots - 2019, 2020
Multiagent Reinforcement Learning
OpenAI Hide and Seek game with emergent behaviours
          https://openai.com/blog/emergent-tool-use
           https://www.youtube.com/watch?v=kopoLzvh5jY
RL-based Strategical War Gaming
  • Baspinar, B., Koyuncu, E., Survivability based Optimal Air Combat Mission Planning with Reinforcement Learning, IEEE Conference on Control Technology and Applications (CCTA), Copenhagen, Denmark, August 21-24, 2018.
RL-based Tactical Air Combat
  • Baspinar, B., Koyuncu, E., Assessment of Aerial Combat Game via Optimization-Based Receding Horizon Control, IEEE Access, vol. 8, pp. 35853-35863, 2020, doi: 10.1109/ACCESS.2020.2974792.
  • Baspinar, B., Koyuncu, E., Evaluation of Two-vs-One Air Combats Using Hybrid Maneuver-Based Framework and Security Strategy Approach, Journal of Aeronautics and Space Technologies, vol. 12, no. 1, pp. 95-107, January 2019.
  • Baspinar, B., Koyuncu, E., Differential Flatness-based Optimal Air Combat Maneuver Strategy Generation, AIAA Science and Technology Forum and Exposition (AIAA SciTech 2019), San Diego, California, 7-11 January 2019.
  • Baspinar, B., Koyuncu, E., Aerial Combat Simulation Environment for One-on-One Engagement, AIAA SciTech Forum and Exposition: Modelling and Simulation Technologies, Gaylord Palms, Kissimmee, FL, 8-12 January 2018.
RL-based Fast Flight Replanning
https://www.youtube.com/watch?v=8IiLQFQ3V0E
  • Hasanzade, M., Koyuncu, E., A Dynamically Feasible Fast Replanning Strategy with Deep Reinforcement Learning, Journal of Intelligent and Robotic Systems, vol. 101, issue 1, 2021.
Course Topics
  • Introduction; Optimal Control; Adaptive Control and RL
  • RL and Optimal Control of Discrete Systems
  • RL-based Optimal Adaptive Control for Linear Systems
  • RL-based Optimal Adaptive Control for Nonlinear Systems
  • Policy iteration for continuous-time systems
  • Value iteration for continuous-time systems
  • RL-based Optimal Adaptive Control with Online Learning
  • Online Learning for Zero-sum Games and H-infinity Control
  • Online Learning for Multiplayer Non-zero-sum Games
  • RL for Zero-sum Games
Grading Policy
  • 20% Paper abstract - problem selection and presentation, in class,
    Due date is April 15.
  • 40% Submission ready paper - 6 pages, including coding
    implementation, in IFAC CPHS template - Due date is May 15, strict.
  • 40% Paper presentation, including coding implementation - online, in
    final exam week.
  • Groups of 1 to 3 people
IFAC CPHS 2024, Antalya, Turkey