Actor-Critic Methods
Shusen Wang
[Diagram: spectrum of methods: value-based methods, actor-critic methods, policy-based methods.]
Value Network and Policy Network
State-Value Function Approximation
Definition: State-value function.
• $V_\pi(s) = \sum_a \pi(a \mid s) \cdot Q_\pi(s, a) \approx \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot q(s, a; \mathbf{w})$.
Policy network (actor):
• Use a neural net $\pi(a \mid s; \boldsymbol{\theta})$ to approximate $\pi(a \mid s)$.
• $\boldsymbol{\theta}$: trainable parameters of the neural net.
Value network (critic):
• Use a neural net $q(s, a; \mathbf{w})$ to approximate $Q_\pi(s, a)$.
• $\mathbf{w}$: trainable parameters of the neural net.
Policy Network (Actor): $\pi(a \mid s; \boldsymbol{\theta})$
• Input: the state $s$, e.g., a screenshot of Super Mario.
• Output: a probability distribution over the actions.
• Let $\mathcal{A}$ be the set of all actions, e.g., $\mathcal{A} = \{\text{"left"}, \text{"right"}, \text{"up"}\}$.
• $\sum_{a \in \mathcal{A}} \pi(a \mid s; \boldsymbol{\theta}) = 1$. (That is why we use a softmax activation.)
[Diagram: state $s$ → Conv → Dense → Softmax → action probabilities, e.g., "left": 0.2, "right": 0.1, "up": 0.7.]
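To make the architecture concrete, here is a minimal PyTorch sketch of such an actor, assuming the state is a stack of four 84×84 frames and a three-action game; the class name and layer sizes are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor pi(a | s; theta): maps a state image to action probabilities."""

    def __init__(self, in_channels: int = 4, num_actions: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # An 84x84 input becomes a 32 x 9 x 9 feature map after the two conv layers.
        self.dense = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.dense(self.conv(state))
        # Softmax so the outputs form a probability distribution over the actions.
        return torch.softmax(logits, dim=-1)

# Example: action probabilities for a batch of one dummy 4x84x84 state.
probs = PolicyNetwork()(torch.zeros(1, 4, 84, 84))
```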
Value Network (Critic): $q(s, a; \mathbf{w})$
• Inputs: the state $s$ and the action $a$.
• Output: an approximate action value (a scalar).
[Diagram: state $s$ → Conv → feature; action $a$ → Dense → feature; concatenate → Dense → $q(s, a; \mathbf{w})$ (scalar function value).]
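A matching sketch of the critic, again with illustrative layer sizes; encoding the action as a one-hot vector is one common choice and an assumption here, not something specified in the slides.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Critic q(s, a; w): maps a state image and an action to a scalar value."""

    def __init__(self, in_channels: int = 4, num_actions: int = 3):
        super().__init__()
        # State branch: conv feature extractor (same shape assumptions as the actor).
        self.state_branch = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 128), nn.ReLU(),
        )
        # Action branch: a one-hot action vector passed through a dense layer.
        self.action_branch = nn.Sequential(nn.Linear(num_actions, 32), nn.ReLU())
        # After concatenating the two feature vectors, output a single scalar.
        self.head = nn.Sequential(nn.Linear(128 + 32, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state: torch.Tensor, action_onehot: torch.Tensor) -> torch.Tensor:
        features = torch.cat(
            [self.state_branch(state), self.action_branch(action_onehot)], dim=-1
        )
        return self.head(features).squeeze(-1)  # shape: (batch,)

# Example: approximate value of action 2 in a dummy state.
q_value = ValueNetwork()(torch.zeros(1, 4, 84, 84), torch.eye(3)[[2]])
```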
Actor-Critic Method
[Diagram: the policy network (actor) and the value network (critic) side by side.]
Train the Neural Networks
Train the networks
Definition: State-value function approximated using neural networks.
• $V(s; \boldsymbol{\theta}, \mathbf{w}) = \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot q(s, a; \mathbf{w})$.
Training: Update the parameters $\boldsymbol{\theta}$ and $\mathbf{w}$.
• Update the policy network $\pi(a \mid s; \boldsymbol{\theta})$ to increase the state value $V(s; \boldsymbol{\theta}, \mathbf{w})$.
  • The actor gradually performs better.
  • Supervision is purely from the value network (critic).
• Update the value network $q(s, a; \mathbf{w})$ to better estimate the return.
  • The critic's judgement becomes more accurate.
  • Supervision is purely from the rewards.
Training procedure:
1. Observe the state $s_t$.
2. Randomly sample an action $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
3. Perform $a_t$ and observe the new state $s_{t+1}$ and the reward $r_t$.
4. Update $\mathbf{w}$ (value network) using temporal difference (TD) learning.
5. Update $\boldsymbol{\theta}$ (policy network) using the policy gradient.
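Steps 1–3 amount to one interaction with the environment. A minimal sketch, assuming a Gymnasium-style `env` and the actor sketched earlier; the function name and return convention are illustrative.

```python
import torch

def interact_one_step(env, policy_net, state: torch.Tensor):
    """Steps 1-3: observe s_t, sample a_t ~ pi(.|s_t; theta), act in the environment."""
    probs = policy_net(state.unsqueeze(0)).squeeze(0)          # pi(.|s_t; theta)
    action = torch.distributions.Categorical(probs).sample()   # random sampling of a_t
    # Gymnasium-style step: returns next observation, reward, and termination flags.
    next_obs, reward, terminated, truncated, _ = env.step(action.item())
    return action, reward, next_obs, terminated or truncated
```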
Update the value network $q$ using TD
• Compute $q(s_t, a_t; \mathbf{w}_t)$ and $q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
• TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
• Loss: $L(\mathbf{w}) = \frac{1}{2}\,\big[\,q(s_t, a_t; \mathbf{w}) - y_t\,\big]^2$.
• Gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \left.\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}_t}$.
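A minimal PyTorch sketch of this TD step, assuming a critic and optimizer like those sketched earlier; the function name and signature are illustrative.

```python
import torch

def td_update_critic(value_net, optimizer, s, a, r, s_next, a_next, gamma=0.99):
    """One TD(0) step for the critic q(s, a; w), following the slide above."""
    q_t = value_net(s, a)                              # q(s_t, a_t; w_t)
    with torch.no_grad():                              # the TD target is treated as a constant
        y_t = r + gamma * value_net(s_next, a_next)    # y_t = r_t + gamma * q(s_{t+1}, a_{t+1}; w_t)
    loss = 0.5 * (q_t - y_t).pow(2).mean()             # L(w) = 1/2 * (q - y)^2
    optimizer.zero_grad()
    loss.backward()                                    # dL/dw
    optimizer.step()                                   # gradient descent on w
    return loss.item()
```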
Update the policy network $\pi$ using the policy gradient
Definition: State-value function approximated using neural networks.
• $V(s; \boldsymbol{\theta}, \mathbf{w}) = \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot q(s, a; \mathbf{w})$.
Policy gradient: derivative of $V(s_t; \boldsymbol{\theta}, \mathbf{w})$ w.r.t. $\boldsymbol{\theta}$.
• Let $\mathbf{g}(a, \boldsymbol{\theta}) = \frac{\partial \log \pi(a \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot q(s_t, a; \mathbf{w})$.
• $\frac{\partial V(s_t; \boldsymbol{\theta}, \mathbf{w})}{\partial \boldsymbol{\theta}} = \mathbb{E}_A\big[\,\mathbf{g}(A, \boldsymbol{\theta})\,\big]$.
Algorithm: Update the policy network using the stochastic policy gradient.
• Random sampling: $a \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$. (Thus $\mathbf{g}(a, \boldsymbol{\theta})$ is an unbiased estimate of the policy gradient.)
• Stochastic gradient ascent: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \mathbf{g}(a, \boldsymbol{\theta}_t)$.
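A minimal PyTorch sketch of this stochastic policy-gradient step; `policy_net`, `value_net`, and `optimizer` are illustrative stand-ins for the networks above, and the critic's output is detached so only $\boldsymbol{\theta}$ receives a gradient.

```python
import torch
import torch.nn.functional as F

def policy_gradient_update_actor(policy_net, value_net, optimizer, s):
    """One stochastic policy-gradient (gradient ascent) step for the actor."""
    probs = policy_net(s)                              # pi(.|s_t; theta)
    dist = torch.distributions.Categorical(probs)
    a = dist.sample()                                  # a ~ pi(.|s_t; theta_t)
    a_onehot = F.one_hot(a, probs.shape[-1]).float()
    with torch.no_grad():
        q_sa = value_net(s, a_onehot)                  # q(s_t, a; w), treated as a constant
    # g(a, theta) = dlog pi(a|s;theta)/dtheta * q(s,a;w); ascend by minimizing the negative.
    loss = -(dist.log_prob(a) * q_sa).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return a
```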
Actor-Critic Method
[Diagram: the environment sends the state $s$ to the policy network (actor), which outputs an action $a$; the environment returns the reward $r$ and the next state. The value network (critic) takes $(s, a)$ and outputs the value $q$.]
Actor-Critic Method: Update Actor
[Diagram: the critic's value $q$ is fed back to the actor, which uses it to update $\boldsymbol{\theta}$.]
Actor-Critic Method: Update Critic
[Diagram: the reward $r$ from the environment is fed to the critic, which uses it (via the TD error) to update $\mathbf{w}$.]
Summary of Algorithm
1. Observe the state $s_t$ and randomly sample $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
2. Perform $a_t$; the environment gives the new state $s_{t+1}$ and reward $r_t$.
3. Randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \boldsymbol{\theta}_t)$. (Do not perform $\tilde{a}_{t+1}$!)
4. Evaluate the value network: $q_t = q(s_t, a_t; \mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}_t)$.
5. Compute the TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$, where $r_t + \gamma \cdot q_{t+1}$ is the TD target.
6. Differentiate the value network: $\mathbf{d}_{w,t} = \left.\frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}_t}$.
7. Update the value network: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \delta_t \cdot \mathbf{d}_{w,t}$.
8. Differentiate the policy network: $\mathbf{d}_{\theta,t} = \left.\frac{\partial \log \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}_t}$.
9. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta,t}$.
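Putting the nine steps together, here is one possible per-transition training step in PyTorch. It is a sketch under the same illustrative assumptions as the earlier snippets (Gymnasium-style environment, categorical actor, one-hot critic input); the terminal-state mask is a standard practical detail that the slides do not mention, and step 9 here uses $q_t$ (the variant without baseline).

```python
import torch
import torch.nn.functional as F

def actor_critic_step(env, policy_net, value_net, actor_opt, critic_opt, s_t, gamma=0.99):
    """One iteration of the nine-step algorithm (illustrative sketch)."""
    # 1. Observe s_t and sample a_t ~ pi(.|s_t; theta_t).
    probs = policy_net(s_t)
    num_actions = probs.shape[-1]
    dist = torch.distributions.Categorical(probs)
    a_t = dist.sample()

    # 2. Perform a_t; the environment returns s_{t+1} and r_t.
    obs, r_t, terminated, truncated, _ = env.step(a_t.item())
    s_next = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)

    # 3. Sample a~_{t+1} ~ pi(.|s_{t+1}; theta_t), but do not perform it.
    with torch.no_grad():
        a_tilde = torch.distributions.Categorical(policy_net(s_next)).sample()

    # 4.-5. Evaluate the critic and compute the TD error delta_t = q_t - (r_t + gamma * q_{t+1}).
    q_t = value_net(s_t, F.one_hot(a_t, num_actions).float())
    with torch.no_grad():
        q_next = value_net(s_next, F.one_hot(a_tilde, num_actions).float())
        mask = 0.0 if terminated else 1.0       # no bootstrapping past a terminal state (practical detail)
        delta_t = q_t - (r_t + gamma * q_next * mask)

    # 6.-7. Critic update: w <- w - alpha * delta_t * dq(s_t, a_t; w)/dw.
    critic_opt.zero_grad()
    (delta_t * q_t).mean().backward()
    critic_opt.step()

    # 8.-9. Actor update: theta <- theta + beta * q_t * dlog pi(a_t|s_t; theta)/dtheta.
    actor_opt.zero_grad()
    (-(dist.log_prob(a_t) * q_t.detach())).mean().backward()
    actor_opt.step()

    return s_next, terminated or truncated
```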
Policy Gradient with Baseline
The nine steps stay the same except for step 9: scale the score function by the TD error $\delta_t$ instead of by $q_t$.
• The TD target $r_t + \gamma \cdot q_{t+1}$ inside $\delta_t$ serves as a baseline.
• Step 9 becomes: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \delta_t \cdot \mathbf{d}_{\theta,t}$.
• A good baseline does not affect the correctness of the policy gradient, but it reduces the variance (see the later slides on the policy gradient with baseline for the derivation).
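In code, only the actor's loss changes relative to the sketch after the algorithm summary (reusing those illustrative names); the score function is scaled by $\delta_t$, which was already computed without gradient in step 5, instead of by $q_t$.

```python
# Step 9 with baseline: scale the score function by delta_t instead of q_t,
# following the slides' convention delta_t = q_t - (r_t + gamma * q_{t+1}).
actor_opt.zero_grad()
(-(dist.log_prob(a_t) * delta_t)).mean().backward()
actor_opt.step()
```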
Summary
Policy Network and Value Network
Definition: State-value function.
• $V_\pi(s) = \sum_a \pi(a \mid s) \cdot Q_\pi(s, a)$.
Function approximation using neural networks:
• Approximate the policy function $\pi(a \mid s)$ by the policy network $\pi(a \mid s; \boldsymbol{\theta})$ (actor).
• Approximate the action-value function $Q_\pi(s, a)$ by the value network $q(s, a; \mathbf{w})$ (critic).
Roles of Actor and Critic
During training:
• The agent is controlled by the policy network (actor): $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta})$.
• The value network (critic) provides the actor with supervision.
After training:
• The agent is still controlled by the policy network (actor): $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta})$.
• The value network (critic) is no longer used.
Training
Update the policy network (actor) by the policy gradient.
• Seek to increase the state value: $V(s; \boldsymbol{\theta}, \mathbf{w}) = \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot q(s, a; \mathbf{w})$.
• Compute the policy gradient: $\frac{\partial V(s; \boldsymbol{\theta}, \mathbf{w})}{\partial \boldsymbol{\theta}} = \mathbb{E}_A\!\left[ \frac{\partial \log \pi(A \mid s; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot q(s, A; \mathbf{w}) \right]$.
• Perform gradient ascent.
Update the value network (critic) by TD learning.
• Predicted action value: $q_t = q(s_t, a_t; \mathbf{w})$.
• TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w})$.
• Gradient: $\frac{\partial\, \frac{1}{2}(q_t - y_t)^2}{\partial \mathbf{w}} = (q_t - y_t) \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.
• Perform gradient descent.
Thank you!
Policy Gradient with Baseline
Definition: Approximated state-value function.
• $V(s; \boldsymbol{\theta}) - b = \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot \big[\,Q_\pi(s, a) - b\,\big]$.
• Here, the baseline $b$ must be independent of $\boldsymbol{\theta}$ and $a$.
Policy gradient: derivative of $V(s; \boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta}$.
• $\frac{\partial V(s; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \frac{\partial\, [V(s; \boldsymbol{\theta}) - b]}{\partial \boldsymbol{\theta}} = \mathbb{E}_{a \sim \pi(\cdot \mid s; \boldsymbol{\theta})}\!\left[ \frac{\partial \log \pi(a \mid s; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot \big[\,Q_\pi(s, a) - b\,\big] \right]$.
• The baseline $b$ does not affect correctness.
• A good baseline $b$ can reduce variance.
• We can use $b = r_t + \gamma \cdot q_{t+1}$ (the TD target) as the baseline.
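The reason the baseline does not affect correctness: since $b$ is independent of $a$ and $\boldsymbol{\theta}$, its contribution to the expectation vanishes. A short standard derivation (not spelled out in the slides):

```latex
% b is independent of a and theta, so its contribution to the policy gradient is zero:
\mathbb{E}_{a \sim \pi(\cdot \mid s;\boldsymbol{\theta})}\!\left[
  \frac{\partial \log \pi(a \mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot b \right]
= b \sum_{a} \pi(a \mid s;\boldsymbol{\theta}) \,
  \frac{\partial \log \pi(a \mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
= b \sum_{a} \frac{\partial \pi(a \mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
= b \, \frac{\partial}{\partial \boldsymbol{\theta}} \sum_{a} \pi(a \mid s;\boldsymbol{\theta})
= b \, \frac{\partial 1}{\partial \boldsymbol{\theta}}
= \mathbf{0}.
```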
Actor-Critic Update (without baseline)
1. Observe the state $s_t$; randomly sample an action $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
2. Perform $a_t$; observe the new state $s_{t+1}$ and reward $r_t$.
3. Randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \boldsymbol{\theta}_t)$. (Do not perform $\tilde{a}_{t+1}$.)
4. Evaluate the value network: $q_t = q(s_t, a_t; \mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}_t)$.
5. Compute the TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$.
6. Differentiate the value network: $\mathbf{d}_{w,t} = \left.\frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}_t}$.
7. Update the value network: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \delta_t \cdot \mathbf{d}_{w,t}$.
8. Differentiate the policy network: $\mathbf{d}_{\theta,t} = \left.\frac{\partial \log \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}_t}$.
9. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta,t}$.
Actor-Critic Update (with baseline)
Steps 1–8 are identical to the update without baseline; only step 9 changes.
9. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \delta_t \cdot \mathbf{d}_{\theta,t}$.
Here the TD target $r_t + \gamma \cdot q_{t+1}$ inside $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$ plays the role of the baseline.
Deterministic Policy Gradient (DPG)
References:
• D. Silver et al. Deterministic Policy Gradient Algorithms. ICML, 2014.
• T. P. Lillicrap et al. Continuous Control with Deep Reinforcement Learning. arXiv:1509.02971, 2015.
Deterministic Policy Gradient (DPG)
• DPG is an actor-critic method.
• The policy network is deterministic: $a = \pi(s; \boldsymbol{\theta})$.
• Train the value network by TD learning.
• Train the policy network to maximize the value $q(s, a; \mathbf{w})$.
[Diagram: the policy network (parameters $\boldsymbol{\theta}$) maps the state $s$ to the action $a = \pi(s; \boldsymbol{\theta})$; the value network (parameters $\mathbf{w}$) takes $s$ and $a$ and outputs the value $q(s, a; \mathbf{w})$.]
Train the Policy Network
• Train the policy network to maximize the value $q(s, a; \mathbf{w})$, where $a = \pi(s; \boldsymbol{\theta})$.
• Gradient: $\frac{\partial\, q(s, \pi(s; \boldsymbol{\theta}); \mathbf{w})}{\partial \boldsymbol{\theta}} = \frac{\partial a}{\partial \boldsymbol{\theta}} \cdot \frac{\partial\, q(s, a; \mathbf{w})}{\partial a}$.
• Update $\boldsymbol{\theta}$ by gradient ascent.
[Diagram: same architecture as above; the gradient flows from $q(s, a; \mathbf{w})$ through the action $a$ back into the policy network.]
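A minimal PyTorch sketch of this DPG actor update, assuming a deterministic `policy_net` that outputs a continuous action and a differentiable critic `value_net(s, a)`; both names are illustrative, and only the actor's parameters are stepped here.

```python
import torch

def dpg_update_actor(policy_net, value_net, actor_opt, s):
    """DPG actor update (sketch): ascend q(s, pi(s; theta); w) with respect to theta."""
    a = policy_net(s)                 # deterministic action a = pi(s; theta)
    q = value_net(s, a)               # q(s, a; w), differentiable with respect to a
    loss = -q.mean()                  # gradient ascent on q = gradient descent on -q
    actor_opt.zero_grad()
    loss.backward()                   # chain rule: dq/dtheta = (da/dtheta) * (dq/da)
    actor_opt.step()                  # only theta changes: actor_opt holds only policy parameters
```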