RL, DQN, PG
Supervised learning
    Needs labelled data: pairs (x, y).
    Goal: learn a function f that can map x → y.
    Examples: classification (cat vs dog), semantic segmentation, object detection, regression f(x) = y, image captioning. Anything in which the target is known.

Unsupervised learning
    No labels; the target is the data itself.
    Goal: learn some good underlying hidden relationship in the data that can be used to expose its structure, e.g. the probability distribution of the data.
    Examples: clustering, dimensionality reduction, feature learning, density estimation.
Reinforcement learning
It is a learning paradigm in which learning happens by exploration: the agent interacts with the environment repeatedly, without any prior knowledge (data and labels) w.r.t. the environment, completely relying on a hit-and-trial strategy to learn how to behave optimally within that environment.
B  Basic RL setup (framework)
(Figure: the agent-environment interaction loop.)
The environment gives the agent some state s_t. Depending upon its state, the agent takes some action a_t. In return the environment sends back a reward r_t and the next state s_{t+1}. These steps keep on repeating until the episode ends (episodic training).
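Below is a minimal Python sketch of this interaction loop, assuming a made-up ToyEnv with a Gym-style reset()/step() interface and a purely random agent (both are illustrative stand-ins, not part of the notes):

    import random

    class ToyEnv:
        """Hypothetical 1-D chain: 5 positions, move left/right, reward at the right end."""
        def reset(self):
            self.pos = 0
            return self.pos                          # initial state s_0

        def step(self, action):                      # action: 0 = left, 1 = right
            self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
            reward = 1.0 if self.pos == 4 else 0.0
            done = self.pos == 4                     # the episode ends at the goal
            return self.pos, reward, done            # (s_{t+1}, r_t, done)

    env = ToyEnv()
    state, done, total_reward = env.reset(), False, 0.0
    while not done:                                  # one episode of interaction
        action = random.choice([0, 1])               # a purely random "agent"
        state, reward, done = env.step(action)       # environment returns r_t and s_{t+1}
        total_reward += reward
    print("episode return:", total_reward)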
Markov property
The future is independent of the past, given the present:
    P(s_{t+1} | s_t) = P(s_{t+1} | s_1, s_2, ..., s_t)
An MDP (Markov Decision Process) is defined by the tuple (S, A, P, R, γ):
    S : set of possible states {s_1, s_2, ..., s_N}
    A : set of possible actions {a_1, a_2, ..., a_M}
    P : state-transition probability P(s_{t+1} | s_t, a_t)
    R : reward function, r_t = R(s_t, a_t)
    γ : discount factor, i.e. how much future rewards are worth compared to the current one

A Markov process (chain) is a memoryless random process of sampling iteratively as per the given state-transition matrix P, starting from some given seed/random state and following the Markov property. The full game dynamics can be encoded within P; e.g. a robot can be interacting with the (gaming) environment following the game rules.

Reward: the amount of reward an agent is going to get from the environment when it takes an action a_t at state s_t:
    r_t = r(s_t, a_t)
But in an MDP we are not interested in immediate rewards alone. An optimal behaviour must maximize the expected Discounted Cumulative Future Reward (DCFR):
    max E[DCFR] = max E[ Σ_{t ≥ 0} γ^t r_t ]
How an MDP operates and can be used to obtain episodes (MDP execution):
    At time t = 0 the environment may sample an initial state randomly, s_0 ~ p(s_0) (initial-state probability distribution, ISPD).
    Using some policy π, which we need to figure out so as to maximize E[DCFR], the agent chooses an action a_t at its current state s_t.
    Depending upon the agent's state and the action taken, the environment will return the immediate reward and the next state:
        r_t = R(s_t, a_t),    s_{t+1} ~ P(s' | s = s_t, a = a_t)
    These steps repeat until the episode terminates.
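A small NumPy sketch of this execution loop on a toy tabular MDP; the transition matrix P, rewards R and the uniform random policy are invented illustrative values:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 3, 2, 0.9

    # P[s, a, s'] = P(s' | s, a); each row over s' sums to 1 (illustrative values)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(0, 1, size=(n_states, n_actions))          # R[s, a] = immediate reward
    policy = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform random policy pi(a|s)

    def sample_episode(T=20):
        """Roll out one episode of length T and return its discounted return G_0."""
        s = rng.integers(n_states)                   # s_0 ~ p(s_0) (uniform here)
        G, discount = 0.0, 1.0
        for t in range(T):
            a = rng.choice(n_actions, p=policy[s])   # a_t ~ pi(.|s_t)
            G += discount * R[s, a]                  # accumulate gamma^t * r_t
            discount *= gamma
            s = rng.choice(n_states, p=P[s, a])      # s_{t+1} ~ P(.|s_t, a_t)
        return G

    print("DCFR of one sampled episode:", sample_episode())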
Policy
A policy is just a mapping of states to actions, enabling any agent to take an action:
    π : S → A,    π(state) = action
It can be deterministic or stochastic.
Goal of the MDP
Find the policy that maximizes the DCFR. The actual returns are always in the form of a DCFR:
    G_t = Discounted Cumulative Future Reward from time step t
        = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...

    γ close to 0: myopic evaluation, more importance to immediate returns.
    γ close to 1: far-sighted, care about all future rewards.

Requirement of the discount: the future is uncertain and we care more about what is near; e.g. with γ = 0.5, a reward of 10 is worth 5 if it arrives one step later and 2.5 if it arrives two steps later. Also, since we are predicting future returns using an inaccurate (still-learning) model, far-future terms should count for less.
Overall, the policy π(a|s) is a state-conditioned distribution over actions, and episodes start from s_0 ~ p(s_0).

Value function
    V^π(s) = E[ Σ_{t ≥ 0} γ^t r_t | s_0 = s, π ]
The value of a state s is the expected DCFR collected from state s onwards while following the policy π to choose all future actions.

D.2  Q function
Let us assume that we have S = {s_1, s_2, s_3} and A = {a_1, a_2}. We can then maintain one Q value for every (state, action) pair, i.e. (s_1, a_1), (s_2, a_1), (s_3, a_1), (s_1, a_2), (s_2, a_2), (s_3, a_2): a Q table of size |S| × |A|.

Q(s, a) is basically the expected DCFR when action a has been chosen at state s and the policy π is followed thereafter:
    Q^π(s, a) = E[ Σ_{t ≥ 0} γ^t r_t | s_0 = s, a_0 = a, π ]
It is used for policy optimization: at any state, prefer the action with the highest Q value. Equivalently, over a finite episode ending at t_final,
    Q^π(s, a) = E[ r_t + γ r_{t+1} + ... + γ^(t_final - t) r_{t_final} | s_t = s, a_t = a ]
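A quick NumPy sketch of estimating Q^π(s, a) by Monte Carlo, i.e. averaging sampled discounted returns that start with (s, a) and then follow a fixed random policy; all numbers are illustrative placeholders:

    import numpy as np

    rng = np.random.default_rng(2)
    n_states, n_actions, gamma = 3, 2, 0.9
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
    R = rng.uniform(0, 1, size=(n_states, n_actions))                  # R[s, a]
    policy = np.full((n_states, n_actions), 1.0 / n_actions)           # pi(a|s), uniform

    def rollout_return(s, a, T=50):
        """One sampled DCFR that starts by taking action a in state s, then follows pi."""
        G, discount = 0.0, 1.0
        for t in range(T):
            G += discount * R[s, a]
            discount *= gamma
            s = rng.choice(n_states, p=P[s, a])      # s' ~ P(.|s, a)
            a = rng.choice(n_actions, p=policy[s])   # next action from the policy
        return G

    # Q^pi(s, a) is approximately the average of many sampled returns
    q_est = np.mean([rollout_return(0, 1) for _ in range(2000)])
    print("Monte Carlo estimate of Q^pi(s_1, a_2):", q_est)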
D.3  Optimal Value function
The optimal value and Q functions are the best achievable over all policies:
    V*(s)    = max_π V^π(s)
    Q*(s, a) = max_π Q^π(s, a)
Q* satisfies the Bellman optimality equation:
    Q*(s, a) = E_{s' ~ P(.|s, a)} [ r + γ max_{a'} Q*(s', a') ]
Since at state s, after taking an action a, the agent may reach any one of a set of states s' ∈ S, expected values are considered in order to address the randomness in the environment, and the optimal Q fn is defined recursively.

Just see it once more, as we are going to use it: from state s with action a the agent may land in s_1, s_2 or s_3 (say), and in each of those it continues with the best action a'. The expected value is the average of all the (r + γ max_{a'} Q*(s', a')) terms we observe after taking that decision at state s.
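A minimal NumPy sketch of repeatedly applying this Bellman optimality backup (Q-value iteration) on a small made-up tabular MDP:

    import numpy as np

    rng = np.random.default_rng(1)
    n_states, n_actions, gamma = 3, 2, 0.9
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
    R = rng.uniform(0, 1, size=(n_states, n_actions))                  # R[s, a]

    Q = np.zeros((n_states, n_actions))
    for _ in range(500):
        # Bellman optimality backup: Q(s,a) <- E_{s'}[ r + gamma * max_a' Q(s',a') ]
        Q_new = R + gamma * (P @ Q.max(axis=1))      # shape (n_states, n_actions)
        converged = np.max(np.abs(Q_new - Q)) < 1e-8
        Q = Q_new
        if converged:                                # stop once the backup is a fixed point
            break

    print("Q* =\n", Q)
    print("greedy policy pi*(s) = argmax_a Q*(s,a):", Q.argmax(axis=1))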
D.4  Optimal Policy
MDP policies depend on the current state only (not on the history); hence the policies are stationary (time-independent):
    π(a|s) = P[ A_t = a | S_t = s ]
The optimal policy picks, at every state, the action with the best Q value:
    π*(s) = argmax_a Q*(s, a)
Such a policy ensures the agent behaves optimally at every state. Computing it exactly (e.g. by dynamic programming over the full state-action space) is not scalable, hence we approximate it, either through an approximate Q fn or directly using a policy network.
Policy evaluation (iterative update)
To evaluate a fixed policy π, start from an arbitrary V_0 and repeatedly apply the Bellman expectation backup:
    V_{k+1}(s) = R(s) + γ Σ_{s'} P^π(s' | s) V_k(s')    for all s ∈ S
In closed (vector) form:
    V_{k+1} = R + γ P^π V_k
where V_{k+1} is the value vector for all states at the (k+1)-th iteration, R is the reward function, P^π is the transition probability as per the MDP (under π), and V_k is the state value at the k-th iteration.

(Figure: a small grid world with states s_1 ... s_4 and the values V_π(s_1), ..., V_π(s_4) being updated iteratively as per the transitions.)

Let us evaluate a random policy in a small grid world.

E.2  Policy Evaluation Example
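A sketch of this iterative evaluation on a tiny 2x2 grid world under the uniform random policy; the grid layout, step reward of -1 and terminal state are assumptions for illustration and may differ from the notes' actual grid-world example:

    import numpy as np

    # 2x2 grid world, states 0..3 (row-major); state 3 is terminal, every move costs -1.
    # Actions: 0=up, 1=down, 2=left, 3=right; moves leaving the grid keep the agent in place.
    n_states, gamma = 4, 1.0
    moves = {0: -2, 1: +2, 2: -1, 3: +1}

    def next_state(s, a):
        if s == 3:                                   # terminal state is absorbing
            return s
        if a == 0 and s < 2:      return s           # can't go up from the top row
        if a == 1 and s >= 2:     return s           # can't go down from the bottom row
        if a == 2 and s % 2 == 0: return s           # can't go left from the left column
        if a == 3 and s % 2 == 1: return s           # can't go right from the right column
        return s + moves[a]

    # Build P^pi and R under the uniform random policy pi(a|s) = 1/4
    P_pi = np.zeros((n_states, n_states))
    R = np.array([-1.0, -1.0, -1.0, 0.0])
    for s in range(n_states):
        for a in range(4):
            P_pi[s, next_state(s, a)] += 0.25

    # Iterative policy evaluation: V_{k+1} = R + gamma * P^pi V_k
    V = np.zeros(n_states)
    for k in range(10_000):
        V_new = R + gamma * P_pi @ V
        converged = np.max(np.abs(V_new - V)) < 1e-10
        V = V_new
        if converged:
            break

    print("V_pi per state:", np.round(V, 2))         # converges to about [-8, -6, -6, 0]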
F.2  Deep Q-learning for Q-fn approximation
    Q(s, a; θ) ≈ Q*(s, a)
Use a NN with parameters θ to estimate the Q value for each (s, a) pair (Deep Q-learning). We need a Q-fn approximator that satisfies the Bellman equation, which is obtained by enforcing Bellman optimality at each iterative step.

Intuition: for a sampled transition (s, a, r, s'), push the prediction Q_θ(s, a) towards the bootstrapped target r + γ max_{a'} Q_θ(s', a'), i.e. minimize the squared Bellman error
    L(θ) = E[ ( r + γ max_{a'} Q_θ(s', a') - Q_θ(s, a) )^2 ]
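A minimal PyTorch sketch of this squared Bellman error computed on a batch of transitions; the tiny MLP, the dimensions and the random batch are placeholders:

    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 4, 2, 0.99
    q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))

    # A fake batch of transitions (s, a, r, s') just to exercise the loss
    s  = torch.randn(8, state_dim)
    a  = torch.randint(0, n_actions, (8,))
    r  = torch.randn(8)
    s2 = torch.randn(8, state_dim)

    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q_theta(s, a)
    with torch.no_grad():                                      # bootstrapped target
        q_target = r + gamma * q_net(s2).max(dim=1).values     # r + gamma * max_a' Q_theta(s', a')

    loss = ((q_target - q_pred) ** 2).mean()                   # squared Bellman error
    loss.backward()                                            # gradients w.r.t. theta
    print("Bellman loss:", loss.item())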
Solving for the Optimal Policy Directly

G.1  Introduction to policy gradient
Instead of using a DNN to approximate the Q value function, why can't we directly learn the suitable policy π_θ, parameterized by θ?

Class of parameterized policies:
    Π = { π_θ : θ ∈ R^m }
Value of the policy (parameterized by θ): the expected DCFR under π_θ,
    J(θ) = E[ Σ_{t ≥ 0} γ^t r_t | π_θ ]
Given any policy π_θ, J(θ) tells how much DCFR one can extract on average, i.e. the value of π_θ.

GOAL: the DNN needs to optimize for the parameter
    θ* = argmax_θ J(θ)
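A minimal PyTorch sketch of such a parameterized policy π_θ(a|s): an MLP whose softmax output is a categorical distribution over actions that we can sample from (sizes are placeholders):

    import torch
    import torch.nn as nn

    state_dim, n_actions = 4, 2

    class PolicyNet(nn.Module):
        """pi_theta(a|s): maps a state to a categorical distribution over actions."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_actions))

        def forward(self, s):
            logits = self.net(s)
            return torch.distributions.Categorical(logits=logits)

    pi = PolicyNet()
    s = torch.randn(1, state_dim)          # some state
    dist = pi(s)
    a = dist.sample()                      # a ~ pi_theta(.|s)
    print("sampled action:", a.item(), "log pi_theta(a|s):", dist.log_prob(a).item())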
DQN: learning the Q fn with a target network
Using the same network both to predict Q(s, a) and to build the bootstrapped target may introduce instability, so DQN keeps two copies: a main network θ and a target network θ'.

1. The main network θ interacts with the environment: at the current state s it chooses an action using an epsilon-greedy policy, a = argmax_a Q(s, a; θ) with probability 1 - ε, a random action otherwise.
2. Store this experience/transition (s, a, r, s') in the replay buffer.
3. Once we have enough samples, sample a random batch of transitions: (s_1, a_1, r_1, s'_1), (s_2, a_2, r_2, s'_2), (s_3, a_3, r_3, s'_3), ...
4. The current state and action go to the main network for its Q-value prediction: y_pred_i = Q(s_i, a_i; θ).
5. The next state s'_i goes to the target network θ' for the target-value computation using bootstrapping (Bellman unrolling): y_i = r_i + γ max_{a'} Q(s'_i, a'; θ').
6. Main-network update: minimize L(θ) = Σ_i ( y_i - Q(s_i, a_i; θ) )^2, i.e. θ ← θ - α ∇_θ L(θ).
7. Copy the parameters of the main network to the target network after some iterations; the target network θ' is just a time-delayed copy of the main network.
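A compact PyTorch sketch of this whole loop (replay buffer, epsilon-greedy action selection, target network and the periodic parameter copy); the toy chain environment and every hyperparameter are illustrative assumptions, and the comment numbers refer to the steps above:

    import random
    from collections import deque

    import torch
    import torch.nn as nn

    # Toy chain environment (illustrative stand-in): 5 positions, actions 0=left / 1=right,
    # reward 1 for reaching the rightmost position.
    N_POS, N_ACTIONS, GAMMA = 5, 2, 0.99

    def encode(pos):
        s = torch.zeros(N_POS); s[pos] = 1.0          # one-hot state encoding
        return s

    def env_step(pos, action):
        pos = max(0, min(N_POS - 1, pos + (1 if action == 1 else -1)))
        return pos, (1.0 if pos == N_POS - 1 else 0.0), pos == N_POS - 1

    def make_net():
        return nn.Sequential(nn.Linear(N_POS, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))

    q_main, q_target = make_net(), make_net()
    q_target.load_state_dict(q_main.state_dict())     # target network starts as a copy
    opt = torch.optim.Adam(q_main.parameters(), lr=1e-3)
    buffer = deque(maxlen=10_000)                     # replay buffer of (s, a, r, s', done)

    eps, batch_size, sync_every, step_count = 0.5, 32, 100, 0

    for episode in range(200):
        pos = random.randrange(N_POS - 1)             # random (non-goal) start state
        for t in range(50):                           # cap the episode length
            s = encode(pos)
            # 1) epsilon-greedy action from the MAIN network
            a = random.randrange(N_ACTIONS) if random.random() < eps else int(q_main(s).argmax())
            pos, r, done = env_step(pos, a)
            # 2) store the transition in the replay buffer
            buffer.append((s, a, r, encode(pos), float(done)))
            step_count += 1

            # 3)-6) once enough samples exist, learn from a random batch
            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)
                S  = torch.stack([b[0] for b in batch])
                A  = torch.tensor([b[1] for b in batch])
                R  = torch.tensor([b[2] for b in batch])
                S2 = torch.stack([b[3] for b in batch])
                D  = torch.tensor([b[4] for b in batch])
                q_pred = q_main(S).gather(1, A.unsqueeze(1)).squeeze(1)   # Q(s_i, a_i; theta)
                with torch.no_grad():                                     # TARGET net bootstraps
                    y = R + GAMMA * (1 - D) * q_target(S2).max(dim=1).values
                loss = ((y - q_pred) ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()

            # 7) delayed copy: theta' <- theta every `sync_every` steps
            if step_count % sync_every == 0:
                q_target.load_state_dict(q_main.state_dict())
            if done:
                break

    print("greedy action per position:", [int(q_main(encode(p)).argmax()) for p in range(N_POS)])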
Policy Gradient
In RL we need to learn either a Q-fn parameterization θ or a policy parameterization θ. In policy gradient (PG) we directly learn the policy.

Intuition: in PG we use a NN to approximate the optimal policy π*.
    Initialize θ randomly.
    Feed the state s as an input; the network outputs a probability distribution over actions.
    Also store (s, a, r, s') until the end of the episode; this is the training data.
    If the agent wins that episode, make the actions it took more probable (and less probable if it loses), as formalized in the update step of the REINFORCE algorithm below.
G.2  REINFORCE algorithm: loss function
Trajectory: τ = (s_0, a_0, r_0, s_1, a_1, r_1, ...), sampled from an example episode of game play. Assuming the agent starts from some initial game state and follows the policy π_θ (parameterized over θ), the reward of a trajectory is r(τ) = Σ_t r_t.

If all trajectories were equally probable, J(θ) would simply be the average of the r(τ_i). In general, considering all the trajectories that one can encounter while following π_θ (sampled according to p(τ|θ)),
    J(θ) = E_{τ ~ p(τ|θ)}[ r(τ) ] = ∫ r(τ) p(τ|θ) dτ
    θ* = argmax_θ J(θ)

Forward pass:    J(θ) = ∫ r(τ) p(τ|θ) dτ
Backward pass:   ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ|θ) dτ        ... (1)
This is intractable: it is the gradient of an expectation over all possible trajectories, which we cannot enumerate.

Trick (log-derivative):
    ∇_θ p(τ|θ) = p(τ|θ) ∇_θ p(τ|θ) / p(τ|θ) = p(τ|θ) ∇_θ log p(τ|θ)        ... (2)
Putting eq. (2) into eq. (1):
    ∇_θ J(θ) = ∫ r(τ) p(τ|θ) ∇_θ log p(τ|θ) dτ
             = E_{τ ~ p(τ|θ)}[ r(τ) ∇_θ log p(τ|θ) ]
i.e. the gradient of an expectation becomes an expectation of a gradient, which we can estimate by sampling.
G.4  Issues with the gradient and its computation
Sampling of trajectories: all trajectories τ_1, τ_2, ..., τ_∞ for some given θ are not equiprobable; instead they follow a distribution, say p(τ|θ).

Let us compute the probability of a given trajectory, say τ_i = (s_0, a_0, s_1, a_1, s_2, a_2, ...), given the policy π_θ parameterized by θ and learned by some deep policy network maximizing the reward fn J(θ):
    p(τ_i|θ) = π_θ(a_0|s_0) T(s_1|s_0, a_0) π_θ(a_1|s_1) T(s_2|s_1, a_1) π_θ(a_2|s_2) ...
Estimating the probability of a trajectory in general:
    p(τ|θ) = Π_{t ≥ 0} T(s_{t+1}|s_t, a_t) π_θ(a_t|s_t)
where T(s_{t+1}|s_t, a_t) is the transition probability of the environment and π_θ(a_t|s_t) is the probability of taking action a_t at state s_t. The transition model T is unknown while we are estimating π_θ by maximizing J(θ).
Now the question is: can we compute ∇_θ J(θ) without the transition probabilities T?
For backpropagation, only the gradient ∇_θ J(θ) of J(θ) is required, and it turns out not to depend on T.

As already shown (expectation of the gradient):
    ∇_θ J(θ) = E_{τ ~ p(τ|θ)}[ r(τ) ∇_θ log p(τ|θ) ]        ... (Eq. B)
Now
    log p(τ|θ) = Σ_t log T(s_{t+1}|s_t, a_t) + Σ_t log π_θ(a_t|s_t)
The first sum is independent of θ; only the second depends on θ. Just differentiating w.r.t. θ:
    ∇_θ log p(τ|θ) = Σ_t ∇_θ log π_θ(a_t|s_t)
Hence
    ∇_θ J(θ) = E_{τ ~ p(τ|θ)}[ r(τ) Σ_t ∇_θ log π_θ(a_t|s_t) ]
             ≈ (1/M) Σ_{i=1}^{M} G(τ_i) Σ_t ∇_θ log π_θ(a_t^i | s_t^i)
a Monte Carlo estimate over M sampled trajectories: no transition model T is needed.
(Figure: sampled game-play trajectories whose total rewards differ wildly from episode to episode.)

Issues with this estimator:
    High gradient variance: the per-episode returns vary greatly, and we average over only a few sampled trajectories.
    Exploration: the policy may stop exploring and get stuck in a local minimum.
    The samples within an episode are correlated.
(TODO: finish properly all the limitations/issues of the REINFORCE algorithm.)
Algorithm for Policy Gradient (REINFORCE)
    ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t|s_t) R(τ)
Update the network parameters:
    θ ← θ + α ∇_θ J(θ)
Actually, we are using the policy to generate the trajectory and then computing ∇_θ J(θ) to update the policy itself. This will in turn improve the policy after each iteration. Hence the returns vary greatly, introducing high variance in the gradient updates.
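A minimal PyTorch sketch of one REINFORCE update: sample an episode with the current policy, build the surrogate loss -Σ_t log π_θ(a_t|s_t) R(τ) whose gradient is the estimator above, and take an ascent step (the toy chain environment and sizes are illustrative stand-ins):

    import torch
    import torch.nn as nn

    N_POS, N_ACTIONS, ALPHA = 5, 2, 1e-2

    policy = nn.Sequential(nn.Linear(N_POS, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
    opt = torch.optim.Adam(policy.parameters(), lr=ALPHA)

    def encode(pos):
        s = torch.zeros(N_POS); s[pos] = 1.0
        return s

    def env_step(pos, action):                # toy chain: reward 1 for reaching the right end
        pos = max(0, min(N_POS - 1, pos + (1 if action == 1 else -1)))
        return pos, (1.0 if pos == N_POS - 1 else 0.0), pos == N_POS - 1

    for episode in range(300):
        log_probs, rewards, pos = [], [], 0
        for t in range(30):                              # roll out one episode under pi_theta
            dist = torch.distributions.Categorical(logits=policy(encode(pos)))
            a = dist.sample()
            log_probs.append(dist.log_prob(a))           # log pi_theta(a_t | s_t)
            pos, r, done = env_step(pos, a.item())
            rewards.append(r)
            if done:
                break
        R_tau = sum(rewards)                             # total episode reward R(tau)
        loss = -torch.stack(log_probs).sum() * R_tau     # -sum_t log pi(a_t|s_t) * R(tau)
        opt.zero_grad(); loss.backward(); opt.step()     # gradient ascent on J(theta)

    print("P(go right | leftmost state):",
          torch.softmax(policy(encode(0)), dim=-1)[1].item())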
Policy gradient with Reward-to-Go
For vanilla PG:
    ∇_θ J(θ) ≈ (1/M) Σ_{i=1}^{M} Σ_t ∇_θ log π_θ(a_t^i | s_t^i) R(τ_i),    where R(τ) = Σ_t r_t
Let us define the Reward-to-Go R̂_t as the sum of the rewards of the trajectory starting from the state s_t:
    R̂_t = Σ_{t' = t}^{T-1} r(s_{t'}, a_{t'})
Instead of R(τ), use R̂_t: an action is only responsible for the rewards that come after it, not for the ones collected before it was taken.
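A small helper computing the reward-to-go of one episode by a single backward pass, undiscounted as in the formula above and optionally with a discount γ (the reward list is made up):

    def rewards_to_go(rewards, gamma=1.0):
        """R_hat_t = sum_{t' >= t} gamma^(t'-t) * r_t', computed by a backward accumulation."""
        rtg = [0.0] * len(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            rtg[t] = running
        return rtg

    rewards = [0.0, 0.0, 1.0, 0.0, 2.0]           # example per-step rewards of one episode
    print(rewards_to_go(rewards))                 # undiscounted: [3.0, 3.0, 3.0, 2.0, 2.0]
    print(rewards_to_go(rewards, gamma=0.9))      # discounted reward-to-go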
Baseline
The baseline is the value that can give us the expected return from the state the agent is in. The simplest baseline can be the average return, b = (1/M) Σ_i R(τ_i); a better one is the state value itself:
    ∇_θ J(θ) ≈ (1/M) Σ_i Σ_t ∇_θ log π_θ(a_t|s_t) ( R̂_t - V(s_t) )
Now, since the value of a state is a floating-point number, the value network (with its own parameters, say φ) can be trained by minimizing the MSE between R_t (the actual return) and V_φ(s_t) (the predicted return):
    J(φ) = (1/2) Σ_t ( R_t - V_φ(s_t) )^2
We need to minimize J(φ), hence
    φ ← φ - β ∇_φ J(φ)
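A short PyTorch sketch of fitting such a value baseline V_φ by MSE regression onto observed returns; the states and returns below are random placeholders:

    import torch
    import torch.nn as nn

    state_dim = 4
    value_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

    states  = torch.randn(64, state_dim)      # states s_t collected from episodes
    returns = torch.randn(64)                 # their observed returns R_t (placeholders)

    for _ in range(100):
        v_pred = value_net(states).squeeze(1)             # V_phi(s_t)
        loss = 0.5 * ((returns - v_pred) ** 2).mean()     # J(phi) = 1/2 * MSE
        opt.zero_grad(); loss.backward(); opt.step()      # phi <- phi - beta * grad J(phi)

    print("final value-regression loss:", loss.item())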
Advantage fn A_t (at step t of the i-th episode)
When Q(s, a) > V(s), the action is better than the expected average at that state.
    A_t = Q(s_t, a_t) - V(s_t)
    ∇_θ J(θ) ≈ Σ_t ( Q(s_t, a_t) - V(s_t) ) ∇_θ log π_θ(a_t|s_t)
The advantage acts as a scaling factor for the log-likelihood ∇_θ log π_θ(a_t|s_t). Minimization of the advantage function will also enforce Bellman optimality.

At any step we may not have the full trajectory (or trajectories), so the value V(s) we can get from the value network, and the Q fn can be unrolled one step using the value fn:
    A_t ≈ A(s, a) = r + γ V(s') - V(s)
Actor-Critic Algorithm
A single network takes the observation (state) and has two heads: a Policy-Net head π(a|s) (the actor) and a Value-Net head V(s) (the critic).

Sample trajectories under the current policy: for episodes m_1, m_2, ..., m_M and time steps t = 1, 2, 3, ..., T, collect (s_t, a_t, r_t) and compute, for each (state, action), the reward and the advantage
    A_t = r_t + γ V(s_{t+1}) - V(s_t)
(Table in the notes: one row per episode m_1 ... m_M and one column per time step, holding the collected state/action/reward and the computed advantage.)
Initialize the policy network (actor) θ and the value network (critic) φ.
For each training iteration 1, 2, ..., I do:
    (After every iteration the policy will be updated, so new episodes need to be generated.)
    For t = 1, 2, ..., T do (for each step):
        The advantage for the t-th step of an episode depends on the future rewards only through the critic network:
            A_t = r_t + γ V_φ(s_{t+1}) - V_φ(s_t)
        Minimization of A_t enforces the Bellman equation.
    End for
    Actor update (gradient ascent): accumulate the policy updates from each episode's each time step, scaled by the advantage A_t (the gradient estimator):
        θ ← θ + α Σ_t A_t ∇_θ log π_θ(a_t|s_t)
    Critic update: minimize the squared advantage (the discounted accumulated future reward minus the predicted value fn), accumulated over all episodes and all time steps; this A_t minimization enforces the Bellman constraint:
        φ ← φ - β ∇_φ Σ_t A_t^2
End for

The critic also addresses a problem of deep Q-learning, which has to learn Q values for all state-action pairs: the critic only learns the values for the states actually generated by the current policy. Also, incorporating the one-step bootstrap R_t ≈ r_t + γ V(s'),
    A(s, a) = r + γ V(s') - V(s)
Now we don't need to wait till the end of the episode to compute the reward (this makes the method sample-efficient):
    Get an action a_0 ~ π(a | s_0)
    Get a reward r_0 and the next state s_1
    Get the value of s_1 as V_φ(s_1)
    Get Q(s_0, a_0) = r_0 + γ V_φ(s_1)
    Get the advantage fn A(s_0, a_0) = r_0 + γ V_φ(s_1) - V_φ(s_0)
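A compact PyTorch sketch of this online actor-critic update: a shared trunk with a policy head and a value head, a one-step advantage, an actor loss scaled by the detached advantage, and a critic loss that minimizes the squared advantage; the environment, sizes and loss weighting are illustrative assumptions:

    import torch
    import torch.nn as nn

    N_POS, N_ACTIONS, GAMMA = 5, 2, 0.99

    class ActorCritic(nn.Module):
        """Shared trunk with a policy head pi_theta(a|s) and a value head V_phi(s)."""
        def __init__(self):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(N_POS, 32), nn.ReLU())
            self.policy_head = nn.Linear(32, N_ACTIONS)
            self.value_head = nn.Linear(32, 1)

        def forward(self, s):
            h = self.trunk(s)
            dist = torch.distributions.Categorical(logits=self.policy_head(h))
            return dist, self.value_head(h).squeeze(-1)

    def encode(pos):
        s = torch.zeros(N_POS); s[pos] = 1.0
        return s

    def env_step(pos, action):                     # toy chain, reward at the right end
        pos = max(0, min(N_POS - 1, pos + (1 if action == 1 else -1)))
        return pos, (1.0 if pos == N_POS - 1 else 0.0), pos == N_POS - 1

    net = ActorCritic()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    pos = 0
    for step in range(5000):                       # fully online: update after every transition
        s = encode(pos)
        dist, v_s = net(s)
        a = dist.sample()                          # a_t ~ pi(.|s_t)
        next_pos, r, done = env_step(pos, a.item())

        with torch.no_grad():                      # bootstrap value of the next state
            _, v_next = net(encode(next_pos))
            v_next = torch.zeros(()) if done else v_next

        advantage = r + GAMMA * v_next - v_s                    # A(s,a) = r + gamma V(s') - V(s)
        actor_loss = -advantage.detach() * dist.log_prob(a)     # policy gradient (ascent via the minus sign)
        critic_loss = advantage ** 2                            # minimizing A^2 enforces Bellman
        loss = actor_loss + 0.5 * critic_loss

        opt.zero_grad(); loss.backward(); opt.step()
        pos = 0 if done else next_pos              # restart the episode at the goal

    print("P(go right | leftmost state):",
          torch.softmax(net.policy_head(net.trunk(encode(0))), dim=-1)[1].item())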