
Actor-Critic Methods

Shusen Wang
Value-Based Methods | Actor-Critic Methods | Policy-Based Methods
Value Network and Policy Network
State-Value Function Approximation

Definition: State-value function.
• V_π(s) = Σ_a π(a|s) · Q_π(s, a) ≈ Σ_a π(a|s; θ) · q(s, a; w).

Policy network (actor):
• Use a neural net π(a|s; θ) to approximate π(a|s).
• θ: trainable parameters of the neural net.

Value network (critic):
• Use a neural net q(s, a; w) to approximate Q_π(s, a).
• w: trainable parameters of the neural net.
Policy Network (Actor): π(a|s; θ)
• Input: state s, e.g., a screenshot of Super Mario.
• Output: probability distribution over the actions.
• Let 𝒜 be the set of all actions, e.g., 𝒜 = {"left", "right", "up"}.
• Σ_{a∈𝒜} π(a|s; θ) = 1. (That is why we use a softmax activation.)

[Figure] state s → Conv → Dense → Softmax → action probabilities, e.g., "left": 0.2, "right": 0.1, "up": 0.7.
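A minimal PyTorch sketch of such an actor, assuming an RGB image state and the three actions above; the class name PolicyNetwork and all layer sizes are illustrative choices, not taken from the slides:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor: maps a state (image) to a probability distribution over actions."""
    def __init__(self, num_actions: int = 3):
        super().__init__()
        # Convolutional feature extractor; pooling keeps the feature size fixed.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),   # -> 32-dim state feature
        )
        self.fc = nn.Linear(32, num_actions)              # dense layer -> action scores

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.fc(self.conv(state))
        return torch.softmax(logits, dim=-1)              # probabilities sum to 1
```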
Value Network (Critic): q(s, a; w)
• Inputs: state s and action a.
• Output: approximate action-value (a scalar).

[Figure] state s → Conv → state feature; action a → Dense → action feature; concatenate the two features → Dense → q(s, a; w) (scalar function value).
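Continuing the sketch above, a possible critic that mirrors this diagram: convolutional features for the state, a dense embedding of the (one-hot) action, concatenation, and a dense head producing a scalar. The class name ValueNetwork and the sizes are again illustrative assumptions:

```python
class ValueNetwork(nn.Module):
    """Critic: maps (state, one-hot action) to a scalar action-value estimate."""
    def __init__(self, num_actions: int = 3):
        super().__init__()
        self.conv = nn.Sequential(                        # state -> 32-dim feature
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        )
        self.action_fc = nn.Linear(num_actions, 32)       # one-hot action -> 32-dim feature
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state: torch.Tensor, action_onehot: torch.Tensor) -> torch.Tensor:
        features = torch.cat([self.conv(state), self.action_fc(action_onehot)], dim=-1)
        return self.head(features).squeeze(-1)            # q(s, a; w), one scalar per sample
```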
Actor-Critic Method

[Figure] An actor-critic method pairs the policy network (actor) with the value network (critic).
Train the Neural Networks
Train the networks

Definition: State-value function approximated using neural networks.
• V(s; θ, w) = Σ_a π(a|s; θ) · q(s, a; w).

Training: Update the parameters θ and w.
• Update the policy network π(a|s; θ) to increase the state-value V(s; θ, w).
  • The actor gradually performs better.
  • Supervision is purely from the value network (critic).
• Update the value network q(s, a; w) to better estimate the return.
  • The critic's judgement becomes more accurate.
  • Supervision is purely from the rewards.
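To make the definition concrete, the sketch below computes V(s; θ, w) by enumerating the actions, using the PolicyNetwork and ValueNetwork classes sketched above; the function name state_value and the three-action assumption are mine:

```python
import torch

def state_value(policy_net, value_net, state, num_actions: int = 3):
    """V(s; theta, w) = sum over a of pi(a|s; theta) * q(s, a; w)."""
    probs = policy_net(state)                              # shape: (batch, num_actions)
    eye = torch.eye(num_actions, device=state.device)      # one-hot codes of all actions
    # Evaluate q(s, a; w) for every action and take the probability-weighted sum.
    q_all = torch.stack(
        [value_net(state, eye[a].expand(state.shape[0], -1)) for a in range(num_actions)],
        dim=-1,
    )
    return (probs * q_all).sum(dim=-1)                     # shape: (batch,)
```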
Train the networks

Training: Update the parameters θ and w.
1. Observe the state s_t.
2. Randomly sample action a_t according to π(·|s_t; θ_t).
3. Perform a_t and observe the new state s_{t+1} and reward r_t.
4. Update w (in the value network) using temporal difference (TD) learning.
5. Update θ (in the policy network) using the policy gradient.
Update value network q using TD
• Compute q(s_t, a_t; w_t) and q(s_{t+1}, a_{t+1}; w_t).
• TD target: y_t = r_t + γ · q(s_{t+1}, a_{t+1}; w_t).
• Loss: L(w) = ½ · [q(s_t, a_t; w) − y_t]².
• Gradient descent: w_{t+1} = w_t − α · ∂L(w)/∂w |_{w=w_t}.
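A possible implementation of this TD step, continuing the PyTorch sketches above; the discount γ, the learning rate α, plain SGD, and the convention that actions are passed as one-hot vectors are all assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

policy_net = PolicyNetwork()          # actor sketched earlier
value_net = ValueNetwork()            # critic sketched earlier
gamma, alpha = 0.99, 1e-3             # assumed discount factor and critic learning rate
critic_opt = torch.optim.SGD(value_net.parameters(), lr=alpha)

def td_update_critic(s_t, a_t_onehot, r_t, s_next, a_next_onehot):
    """One TD step for the critic: minimize L(w) = 0.5 * (q(s_t, a_t; w) - y_t)^2."""
    with torch.no_grad():                                  # the TD target is held fixed
        y_t = r_t + gamma * value_net(s_next, a_next_onehot)
    q_t = value_net(s_t, a_t_onehot)
    loss = 0.5 * ((q_t - y_t) ** 2).mean()
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()                                      # w <- w - alpha * dL/dw
```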
Update policy network π using policy gradient

Definition: State-value function approximated using neural networks.
• V(s; θ, w) = Σ_a π(a|s; θ) · q(s, a; w).

Policy gradient: Derivative of V(s_t; θ, w) w.r.t. θ.
• Let g(a, θ) = ∂ log π(a|s_t; θ)/∂θ · q(s_t, a; w).
• ∂V(s_t; θ, w_t)/∂θ = 𝔼_A[g(A, θ)].

Algorithm: Update the policy network using the stochastic policy gradient.
• Random sampling: a ∼ π(·|s_t; θ_t). (Thus g(a, θ) is unbiased.)
• Stochastic gradient ascent: θ_{t+1} = θ_t + β · g(a, θ_t).
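One way this stochastic update can look in code, continuing the sketches above; descending the negative of log π(a|s_t; θ) · q is equivalent to ascending g(a, θ). The learning rate β and the use of plain SGD are assumptions:

```python
beta = 1e-3                                          # assumed actor learning rate
actor_opt = torch.optim.SGD(policy_net.parameters(), lr=beta)

def policy_gradient_update_actor(s_t):
    """One step of stochastic gradient ascent on V(s_t; theta, w)."""
    dist = torch.distributions.Categorical(probs=policy_net(s_t))
    a = dist.sample()                                # a ~ pi(.|s_t; theta_t), so g(a, theta) is unbiased
    a_onehot = F.one_hot(a, num_classes=3).float()
    with torch.no_grad():
        q_value = value_net(s_t, a_onehot)           # critic's score; no gradient flows into w
    loss = -(dist.log_prob(a) * q_value).mean()      # descend -log pi * q  <=>  ascend g(a, theta)
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()                                 # theta <- theta + beta * g(a, theta_t)
```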
Actor-Critic Method

[Figure] The policy network (actor) observes the state s from the environment and outputs an action a; the value network (critic) takes the state s and the action a and outputs the value q; the environment returns the reward r.

Actor-Critic Method: Update Actor

[Figure] The critic feeds its value q back to the actor; this value supervises the actor's update.

Actor-Critic Method: Update Critic

[Figure] The reward r from the environment supervises the critic's update.
Summary of Algorithm
1. Observe state s_t and randomly sample a_t ∼ π(·|s_t; θ_t).
2. Perform a_t; the environment gives the new state s_{t+1} and reward r_t.
3. Randomly sample ã_{t+1} ∼ π(·|s_{t+1}; θ_t). (Do not perform ã_{t+1}!)
4. Evaluate the value network: q_t = q(s_t, a_t; w_t) and q_{t+1} = q(s_{t+1}, ã_{t+1}; w_t).
5. Compute the TD error: δ_t = q_t − (r_t + γ · q_{t+1}), where r_t + γ · q_{t+1} is the TD target.
6. Differentiate the value network: d_{w,t} = ∂q(s_t, a_t; w)/∂w |_{w=w_t}.
7. Update the value network: w_{t+1} = w_t − α · δ_t · d_{w,t}.
8. Differentiate the policy network: d_{θ,t} = ∂ log π(a_t|s_t; θ)/∂θ |_{θ=θ_t}.
9. Update the policy network: θ_{t+1} = θ_t + β · q_t · d_{θ,t}.
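Putting the pieces together, a sketch of one full iteration of steps 1-9, continuing the earlier PyTorch snippets; env.step returning (next state, reward) is an assumed environment interface, and the TD error δ_t is returned for the baseline variant discussed next:

```python
def actor_critic_step(env, s_t):
    """One iteration of steps 1-9 (the version without a baseline)."""
    # 1-2. Sample a_t ~ pi(.|s_t; theta_t), act, observe s_{t+1} and r_t.
    dist = torch.distributions.Categorical(probs=policy_net(s_t))
    a_t = dist.sample()
    s_next, r_t = env.step(a_t)                            # assumed environment API
    a_t_onehot = F.one_hot(a_t, num_classes=3).float()
    # 3-4. Sample a~_{t+1} (never executed) and evaluate the critic.
    with torch.no_grad():
        next_dist = torch.distributions.Categorical(probs=policy_net(s_next))
        a_tilde_onehot = F.one_hot(next_dist.sample(), num_classes=3).float()
        q_next = value_net(s_next, a_tilde_onehot)
    q_t = value_net(s_t, a_t_onehot)
    # 5. TD error; r_t + gamma * q_{t+1} is the TD target.
    delta_t = q_t - (r_t + gamma * q_next)
    # 6-7. Critic update: w <- w - alpha * delta_t * dq(s_t, a_t; w)/dw.
    critic_opt.zero_grad()
    (delta_t.detach() * q_t).mean().backward()
    critic_opt.step()
    # 8-9. Actor update: theta <- theta + beta * q_t * dlogpi(a_t|s_t; theta)/dtheta.
    actor_opt.zero_grad()
    (-(q_t.detach() * dist.log_prob(a_t))).mean().backward()
    actor_opt.step()
    return s_next, delta_t.detach()
```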
Policy Gradient with Baseline

The steps are the same as above, except that step 9 uses the TD error δ_t in place of q_t: θ_{t+1} = θ_t + β · δ_t · d_{θ,t}. Here the TD target r_t + γ · q_{t+1} plays the role of a baseline; it does not bias the gradient but can reduce its variance (see the Policy Gradient with Baseline section after the summary).
Summary
Policy Network and Value Network

Definition: State-value function.
• V_π(s) = Σ_a π(a|s) · Q_π(s, a).

Function approximation using neural networks:
• Approximate the policy function π(a|s) by π(a|s; θ) (actor).
• Approximate the value function Q_π(s, a) by q(s, a; w) (critic).
Roles of Actor and Critic

During training
• The agent is controlled by the policy network (actor): a_t ∼ π(·|s_t; θ).
• The value network q (critic) provides the actor with supervision.

After training
• The agent is controlled by the policy network (actor): a_t ∼ π(·|s_t; θ).
• The value network q (critic) is no longer used.
Training

Update the policy network (actor) by policy gradient.
• Seek to increase the state-value: V(s; θ, w) = Σ_a π(a|s; θ) · q(s, a; w).
• Compute the policy gradient: ∂V(s; θ)/∂θ = 𝔼_A[∂ log π(A|s; θ)/∂θ · q(s, A; w)].
• Perform gradient ascent.

Update the value network (critic) by TD learning.
• Predicted action-value: q_t = q(s_t, a_t; w).
• TD target: y_t = r_t + γ · q(s_{t+1}, a_{t+1}; w).
• Gradient: ∂[(q_t − y_t)²/2]/∂w = (q_t − y_t) · ∂q(s_t, a_t; w)/∂w.
• Perform gradient descent.
Thank you!
Policy Gradient with Baseline

Definition: Approximated state-value function.
• V(s; θ) − b = Σ_a π(a|s; θ) · [Q_π(s, a) − b].
• Here, the baseline b must be independent of θ and a.

Policy gradient: Derivative of V(s; θ) w.r.t. θ.
• ∂V(s; θ)/∂θ = ∂[V(s; θ) − b]/∂θ = 𝔼_{a∼π(·|s;θ)}[∂ log π(a|s; θ)/∂θ · (Q_π(s, a) − b)].
• The baseline b does not affect correctness.
• A good baseline b can reduce variance.
• We can use b = r_t + γ · q_{t+1} (the TD target) as the baseline.
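Why the baseline does not affect correctness: the standard one-line argument (not spelled out on the slides) uses only Σ_a π(a|s; θ) = 1 and the independence of b from a and θ:

```latex
\mathbb{E}_{a\sim\pi(\cdot|s;\theta)}\!\left[\frac{\partial \log \pi(a|s;\theta)}{\partial\theta}\cdot b\right]
 = b\sum_{a}\pi(a|s;\theta)\,\frac{\partial \log \pi(a|s;\theta)}{\partial\theta}
 = b\sum_{a}\frac{\partial \pi(a|s;\theta)}{\partial\theta}
 = b\,\frac{\partial}{\partial\theta}\sum_{a}\pi(a|s;\theta)
 = b\,\frac{\partial 1}{\partial\theta}
 = \mathbf{0}.
```

Hence subtracting b changes only the variance of the stochastic gradient, not its expectation.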
Actor Critic Update (without baseline)
1. Observe the state s_t; randomly sample action a_t according to π(·|s_t; θ_t).
2. Perform a_t; observe the new state s_{t+1} and reward r_t.
3. Randomly sample a_{t+1} according to π(·|s_{t+1}; θ_t). (Do not perform a_{t+1}.)
4. Evaluate the value network: q_t = q(s_t, a_t; w_t) and q_{t+1} = q(s_{t+1}, a_{t+1}; w_t).
5. Compute the TD error: δ_t = q_t − (r_t + γ · q_{t+1}).
6. Differentiate the value network: d_{w,t} = ∂q(s_t, a_t; w)/∂w |_{w=w_t}.
7. Update the value network: w_{t+1} = w_t − α · δ_t · d_{w,t}.
8. Differentiate the policy network: d_{θ,t} = ∂ log π(a_t|s_t; θ)/∂θ |_{θ=θ_t}.
9. Update the policy network: θ_{t+1} = θ_t + β · q_t · d_{θ,t}.
Actor Critic Update (with baseline)
1. Observe the state s_t; randomly sample action a_t according to π(·|s_t; θ_t).
2. Perform a_t; observe the new state s_{t+1} and reward r_t.
3. Randomly sample a_{t+1} according to π(·|s_{t+1}; θ_t). (Do not perform a_{t+1}.)
4. Evaluate the value network: q_t = q(s_t, a_t; w_t) and q_{t+1} = q(s_{t+1}, a_{t+1}; w_t).
5. Compute the TD error: δ_t = q_t − (r_t + γ · q_{t+1}); the TD target r_t + γ · q_{t+1} is the baseline.
6. Differentiate the value network: d_{w,t} = ∂q(s_t, a_t; w)/∂w |_{w=w_t}.
7. Update the value network: w_{t+1} = w_t − α · δ_t · d_{w,t}.
8. Differentiate the policy network: d_{θ,t} = ∂ log π(a_t|s_t; θ)/∂θ |_{θ=θ_t}.
9. Update the policy network: θ_{t+1} = θ_t + β · δ_t · d_{θ,t}.
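In the actor_critic_step sketch given earlier, only the actor update in step 9 changes; the line below (reusing the δ_t already computed there, under the same assumptions as before) would replace the q_t-weighted one:

```python
# Step 9 with baseline: theta <- theta + beta * delta_t * dlogpi(a_t|s_t; theta)/dtheta.
(-(delta_t.detach() * dist.log_prob(a_t))).mean().backward()
```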
Deterministic Policy Gradient (DPG)

Reference

• Silver et al. Deterministic Policy Gradient Algorithms. In ICML, 2014.
• Lillicrap et al. Continuous Control with Deep Reinforcement Learning. arXiv:1509.02971, 2015.
Deterministic Policy Gradient (DPG)
• DPG is an actor-critic method.
• The policy network is deterministic: a = π(s; θ).
• Train the value network by TD learning.
• Train the policy network to maximize the value q(s, a; w).

[Figure] state s → Policy Network (parameters θ) → action a = π(s; θ); (s, a) → Value Network (parameters w) → value q(s, a; w).
Train Policy Network
• Train the policy network to maximize the value q(s, a; w).
• Gradient: ∂q(s, a; w)/∂θ = ∂a/∂θ · ∂q(s, a; w)/∂a, where a = π(s; θ).
• Update θ using gradient ascent.
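A minimal sketch of this DPG actor update, assuming a small continuous-action setting with dense networks (all names, sizes, and the Adam optimizer are illustrative; the critic would be trained separately by TD learning, as stated above). Backpropagating through q(s, π(s; θ); w) applies the chain rule ∂a/∂θ · ∂q/∂a automatically:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2                       # assumed dimensions
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def dpg_actor_update(s):
    """Gradient ascent on q(s, pi(s; theta); w) with respect to theta only."""
    a = actor(s)                                   # deterministic action a = pi(s; theta)
    q = critic(torch.cat([s, a], dim=-1))          # critic scores the chosen action
    loss = -q.mean()                               # ascend q  <=>  descend -q
    actor_opt.zero_grad()
    loss.backward()                                # chain rule: dq/dtheta = da/dtheta * dq/da
    actor_opt.step()                               # only theta is updated; w is untouched here

dpg_actor_update(torch.randn(32, state_dim))       # example call with a batch of random states
```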
