Actor-Critic Methods
Shusen Wang
[Diagram: spectrum of methods: value-based methods, actor-critic methods, policy-based methods.]
Value Network and Policy Network
State-Value Function Approximation
Definition: State-value function.
• $V_\pi(s) = \sum_a \pi(a \mid s) \cdot Q_\pi(s, a) \approx \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot q(s, a; \mathbf{w})$.
Policy network (actor):
• Use a neural net $\pi(a \mid s; \boldsymbol{\theta})$ to approximate $\pi(a \mid s)$.
• $\boldsymbol{\theta}$: trainable parameters of the neural net.
Value network (critic):
• Use a neural net $q(s, a; \mathbf{w})$ to approximate $Q_\pi(s, a)$.
• $\mathbf{w}$: trainable parameters of the neural net.
Policy Network (Actor): $\pi(a \mid s; \boldsymbol{\theta})$
• Input: the state $s$, e.g., a screenshot of Super Mario.
• Output: a probability distribution over the actions.
• Let $\mathcal{A}$ be the set of all actions, e.g., $\mathcal{A} = \{\text{"left"}, \text{"right"}, \text{"up"}\}$.
• $\sum_{a \in \mathcal{A}} \pi(a \mid s; \boldsymbol{\theta}) = 1$. (That is why we use a softmax activation.)
[Diagram: state $s$ → Conv → Dense → Softmax → action probabilities, e.g., "left": 0.2, "right": 0.1, "up": 0.7.]
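To make the architecture concrete, here is a minimal PyTorch sketch of such an actor, assuming the state is a stack of four 84×84 frames and a three-action game; the class name and layer sizes are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor pi(a | s; theta): maps a state image to action probabilities."""

    def __init__(self, in_channels: int = 4, num_actions: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # An 84x84 input becomes a 32 x 9 x 9 feature map after the two conv layers.
        self.dense = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.dense(self.conv(state))
        # Softmax so the outputs form a probability distribution over the actions.
        return torch.softmax(logits, dim=-1)

# Example: action probabilities for a batch of one dummy 4x84x84 state.
probs = PolicyNetwork()(torch.zeros(1, 4, 84, 84))
```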
Value Network (Critic): $q(s, a; \mathbf{w})$
• Inputs: the state $s$ and the action $a$.
• Output: an approximate action value (a scalar).
[Diagram: state $s$ → Conv → feature; action $a$ → Dense → feature; concatenate → Dense → $q(s, a; \mathbf{w})$ (scalar function value).]
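A matching sketch of the critic, again with illustrative layer sizes; encoding the action as a one-hot vector is one common choice and an assumption here, not something specified in the slides.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Critic q(s, a; w): maps a state image and an action to a scalar value."""

    def __init__(self, in_channels: int = 4, num_actions: int = 3):
        super().__init__()
        # State branch: conv feature extractor (same shape assumptions as the actor).
        self.state_branch = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 128), nn.ReLU(),
        )
        # Action branch: a one-hot action vector passed through a dense layer.
        self.action_branch = nn.Sequential(nn.Linear(num_actions, 32), nn.ReLU())
        # After concatenating the two feature vectors, output a single scalar.
        self.head = nn.Sequential(nn.Linear(128 + 32, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state: torch.Tensor, action_onehot: torch.Tensor) -> torch.Tensor:
        features = torch.cat(
            [self.state_branch(state), self.action_branch(action_onehot)], dim=-1
        )
        return self.head(features).squeeze(-1)  # shape: (batch,)

# Example: approximate value of action 2 in a dummy state.
q_value = ValueNetwork()(torch.zeros(1, 4, 84, 84), torch.eye(3)[[2]])
```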
Actor-Critic Method
[Diagram: the policy network (actor) and the value network (critic) side by side.]
Train the Neural Networks
Train the networks
Definition: State-value function approximated using neural networks.
• $V(s; \boldsymbol{\theta}, \mathbf{w}) = \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot q(s, a; \mathbf{w})$.
Training: Update the parameters $\boldsymbol{\theta}$ and $\mathbf{w}$.
• Update the policy network $\pi(a \mid s; \boldsymbol{\theta})$ to increase the state value $V(s; \boldsymbol{\theta}, \mathbf{w})$.
  • The actor gradually performs better.
  • Supervision is purely from the value network (critic).
• Update the value network $q(s, a; \mathbf{w})$ to better estimate the return.
  • The critic's judgement becomes more accurate.
  • Supervision is purely from the rewards.
Training procedure:
1. Observe the state $s_t$.
2. Randomly sample an action $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
3. Perform $a_t$ and observe the new state $s_{t+1}$ and the reward $r_t$.
4. Update $\mathbf{w}$ (value network) using temporal difference (TD) learning.
5. Update $\boldsymbol{\theta}$ (policy network) using the policy gradient.
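Steps 1–3 amount to one interaction with the environment. A minimal sketch, assuming a Gymnasium-style `env` and the actor sketched earlier; the function name and return convention are illustrative.

```python
import torch

def interact_one_step(env, policy_net, state: torch.Tensor):
    """Steps 1-3: observe s_t, sample a_t ~ pi(.|s_t; theta), act in the environment."""
    probs = policy_net(state.unsqueeze(0)).squeeze(0)          # pi(.|s_t; theta)
    action = torch.distributions.Categorical(probs).sample()   # random sampling of a_t
    # Gymnasium-style step: returns next observation, reward, and termination flags.
    next_obs, reward, terminated, truncated, _ = env.step(action.item())
    return action, reward, next_obs, terminated or truncated
```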
Update the value network $q$ using TD
• Compute $q(s_t, a_t; \mathbf{w}_t)$ and $q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
• TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
• Loss: $L(\mathbf{w}) = \frac{1}{2}\,\big[\,q(s_t, a_t; \mathbf{w}) - y_t\,\big]^2$.
• Gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \left.\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}_t}$.
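A minimal PyTorch sketch of this TD step, assuming a critic and optimizer like those sketched earlier; the function name and signature are illustrative.

```python
import torch

def td_update_critic(value_net, optimizer, s, a, r, s_next, a_next, gamma=0.99):
    """One TD(0) step for the critic q(s, a; w), following the slide above."""
    q_t = value_net(s, a)                              # q(s_t, a_t; w_t)
    with torch.no_grad():                              # the TD target is treated as a constant
        y_t = r + gamma * value_net(s_next, a_next)    # y_t = r_t + gamma * q(s_{t+1}, a_{t+1}; w_t)
    loss = 0.5 * (q_t - y_t).pow(2).mean()             # L(w) = 1/2 * (q - y)^2
    optimizer.zero_grad()
    loss.backward()                                    # dL/dw
    optimizer.step()                                   # gradient descent on w
    return loss.item()
```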
Update the policy network $\pi$ using the policy gradient
Definition: State-value function approximated using neural networks.
• $V(s; \boldsymbol{\theta}, \mathbf{w}) = \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot q(s, a; \mathbf{w})$.
Policy gradient: derivative of $V(s_t; \boldsymbol{\theta}, \mathbf{w})$ w.r.t. $\boldsymbol{\theta}$.
• Let $\mathbf{g}(a, \boldsymbol{\theta}) = \frac{\partial \log \pi(a \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot q(s_t, a; \mathbf{w})$.
• $\frac{\partial V(s_t; \boldsymbol{\theta}, \mathbf{w})}{\partial \boldsymbol{\theta}} = \mathbb{E}_A\big[\,\mathbf{g}(A, \boldsymbol{\theta})\,\big]$.
Algorithm: Update the policy network using the stochastic policy gradient.
• Random sampling: $a \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$. (Thus $\mathbf{g}(a, \boldsymbol{\theta})$ is an unbiased estimate of the policy gradient.)
• Stochastic gradient ascent: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \mathbf{g}(a, \boldsymbol{\theta}_t)$.
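A minimal PyTorch sketch of this stochastic policy-gradient step; `policy_net`, `value_net`, and `optimizer` are illustrative stand-ins for the networks above, and the critic's output is detached so only $\boldsymbol{\theta}$ receives a gradient.

```python
import torch
import torch.nn.functional as F

def policy_gradient_update_actor(policy_net, value_net, optimizer, s):
    """One stochastic policy-gradient (gradient ascent) step for the actor."""
    probs = policy_net(s)                              # pi(.|s_t; theta)
    dist = torch.distributions.Categorical(probs)
    a = dist.sample()                                  # a ~ pi(.|s_t; theta_t)
    a_onehot = F.one_hot(a, probs.shape[-1]).float()
    with torch.no_grad():
        q_sa = value_net(s, a_onehot)                  # q(s_t, a; w), treated as a constant
    # g(a, theta) = dlog pi(a|s;theta)/dtheta * q(s,a;w); ascend by minimizing the negative.
    loss = -(dist.log_prob(a) * q_sa).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return a
```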
Actor-Critic Method
[Diagram: the environment sends the state $s$ to the policy network (actor), which outputs an action $a$; the environment returns the reward $r$ and the next state. The value network (critic) takes $(s, a)$ and outputs the value $q$.]
Actor-Critic Method: Update Actor
[Diagram: the critic's value $q$ is fed back to the actor, which uses it to update $\boldsymbol{\theta}$.]
Actor-Critic Method: Update Critic
[Diagram: the reward $r$ from the environment is fed to the critic, which uses it (via the TD error) to update $\mathbf{w}$.]
Summary of Algorithm
1. Observe the state $s_t$ and randomly sample $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
2. Perform $a_t$; the environment gives the new state $s_{t+1}$ and reward $r_t$.
3. Randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \boldsymbol{\theta}_t)$. (Do not perform $\tilde{a}_{t+1}$!)
4. Evaluate the value network: $q_t = q(s_t, a_t; \mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}_t)$.
5. Compute the TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$, where $r_t + \gamma \cdot q_{t+1}$ is the TD target.
6. Differentiate the value network: $\mathbf{d}_{w,t} = \left.\frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}_t}$.
7. Update the value network: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \delta_t \cdot \mathbf{d}_{w,t}$.
8. Differentiate the policy network: $\mathbf{d}_{\theta,t} = \left.\frac{\partial \log \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}_t}$.
9. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta,t}$.
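Putting the nine steps together, here is one possible per-transition training step in PyTorch. It is a sketch under the same illustrative assumptions as the earlier snippets (Gymnasium-style environment, categorical actor, one-hot critic input); the terminal-state mask is a standard practical detail that the slides do not mention, and step 9 here uses $q_t$ (the variant without baseline).

```python
import torch
import torch.nn.functional as F

def actor_critic_step(env, policy_net, value_net, actor_opt, critic_opt, s_t, gamma=0.99):
    """One iteration of the nine-step algorithm (illustrative sketch)."""
    # 1. Observe s_t and sample a_t ~ pi(.|s_t; theta_t).
    probs = policy_net(s_t)
    num_actions = probs.shape[-1]
    dist = torch.distributions.Categorical(probs)
    a_t = dist.sample()

    # 2. Perform a_t; the environment returns s_{t+1} and r_t.
    obs, r_t, terminated, truncated, _ = env.step(a_t.item())
    s_next = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)

    # 3. Sample a~_{t+1} ~ pi(.|s_{t+1}; theta_t), but do not perform it.
    with torch.no_grad():
        a_tilde = torch.distributions.Categorical(policy_net(s_next)).sample()

    # 4.-5. Evaluate the critic and compute the TD error delta_t = q_t - (r_t + gamma * q_{t+1}).
    q_t = value_net(s_t, F.one_hot(a_t, num_actions).float())
    with torch.no_grad():
        q_next = value_net(s_next, F.one_hot(a_tilde, num_actions).float())
        mask = 0.0 if terminated else 1.0       # no bootstrapping past a terminal state (practical detail)
        delta_t = q_t - (r_t + gamma * q_next * mask)

    # 6.-7. Critic update: w <- w - alpha * delta_t * dq(s_t, a_t; w)/dw.
    critic_opt.zero_grad()
    (delta_t * q_t).mean().backward()
    critic_opt.step()

    # 8.-9. Actor update: theta <- theta + beta * q_t * dlog pi(a_t|s_t; theta)/dtheta.
    actor_opt.zero_grad()
    (-(dist.log_prob(a_t) * q_t.detach())).mean().backward()
    actor_opt.step()

    return s_next, terminated or truncated
```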
Policy Gradient with Baseline
The nine steps stay the same except for step 9: scale the score function by the TD error $\delta_t$ instead of by $q_t$.
• The TD target $r_t + \gamma \cdot q_{t+1}$ inside $\delta_t$ serves as a baseline.
• Step 9 becomes: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \delta_t \cdot \mathbf{d}_{\theta,t}$.
• A good baseline does not affect the correctness of the policy gradient, but it reduces the variance (see the later slides on the policy gradient with baseline for the derivation).
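In code, only the actor's loss changes relative to the sketch after the algorithm summary (reusing those illustrative names); the score function is scaled by $\delta_t$, which was already computed without gradient in step 5, instead of by $q_t$.

```python
# Step 9 with baseline: scale the score function by delta_t instead of q_t,
# following the slides' convention delta_t = q_t - (r_t + gamma * q_{t+1}).
actor_opt.zero_grad()
(-(dist.log_prob(a_t) * delta_t)).mean().backward()
actor_opt.step()
```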
Summary
Policy Network and Value Network
Definition: State-value function.
• $V_\pi(s) = \sum_a \pi(a \mid s) \cdot Q_\pi(s, a)$.
Function approximation using neural networks:
• Approximate the policy function $\pi(a \mid s)$ by the policy network $\pi(a \mid s; \boldsymbol{\theta})$ (actor).
• Approximate the action-value function $Q_\pi(s, a)$ by the value network $q(s, a; \mathbf{w})$ (critic).
Roles of Actor and Critic
During training:
• The agent is controlled by the policy network (actor): $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta})$.
• The value network (critic) provides the actor with supervision.
After training:
• The agent is still controlled by the policy network (actor): $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta})$.
• The value network (critic) is no longer used.
Training
Update the policy network (actor) by the policy gradient.
• Seek to increase the state value: $V(s; \boldsymbol{\theta}, \mathbf{w}) = \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot q(s, a; \mathbf{w})$.
• Compute the policy gradient: $\frac{\partial V(s; \boldsymbol{\theta}, \mathbf{w})}{\partial \boldsymbol{\theta}} = \mathbb{E}_A\!\left[ \frac{\partial \log \pi(A \mid s; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot q(s, A; \mathbf{w}) \right]$.
• Perform gradient ascent.
Update the value network (critic) by TD learning.
• Predicted action value: $q_t = q(s_t, a_t; \mathbf{w})$.
• TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w})$.
• Gradient: $\frac{\partial\, \frac{1}{2}(q_t - y_t)^2}{\partial \mathbf{w}} = (q_t - y_t) \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.
• Perform gradient descent.
Thank you!
Policy Gradient with Baseline
Definition: Approximated state-value function.
• $V(s; \boldsymbol{\theta}) - b = \sum_a \pi(a \mid s; \boldsymbol{\theta}) \cdot \big[\,Q_\pi(s, a) - b\,\big]$.
• Here, the baseline $b$ must be independent of $\boldsymbol{\theta}$ and $a$.
Policy gradient: derivative of $V(s; \boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta}$.
• $\frac{\partial V(s; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \frac{\partial\, [V(s; \boldsymbol{\theta}) - b]}{\partial \boldsymbol{\theta}} = \mathbb{E}_{a \sim \pi(\cdot \mid s; \boldsymbol{\theta})}\!\left[ \frac{\partial \log \pi(a \mid s; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot \big[\,Q_\pi(s, a) - b\,\big] \right]$.
• The baseline $b$ does not affect correctness.
• A good baseline $b$ can reduce variance.
• We can use $b = r_t + \gamma \cdot q_{t+1}$ (the TD target) as the baseline.
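The reason the baseline does not affect correctness: since $b$ is independent of $a$ and $\boldsymbol{\theta}$, its contribution to the expectation vanishes. A short standard derivation (not spelled out in the slides):

```latex
% b is independent of a and theta, so its contribution to the policy gradient is zero:
\mathbb{E}_{a \sim \pi(\cdot \mid s;\boldsymbol{\theta})}\!\left[
  \frac{\partial \log \pi(a \mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot b \right]
= b \sum_{a} \pi(a \mid s;\boldsymbol{\theta}) \,
  \frac{\partial \log \pi(a \mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
= b \sum_{a} \frac{\partial \pi(a \mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
= b \, \frac{\partial}{\partial \boldsymbol{\theta}} \sum_{a} \pi(a \mid s;\boldsymbol{\theta})
= b \, \frac{\partial 1}{\partial \boldsymbol{\theta}}
= \mathbf{0}.
```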
Actor-Critic Update (without baseline)
1. Observe the state $s_t$; randomly sample an action $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
2. Perform $a_t$; observe the new state $s_{t+1}$ and reward $r_t$.
3. Randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \boldsymbol{\theta}_t)$. (Do not perform $\tilde{a}_{t+1}$.)
4. Evaluate the value network: $q_t = q(s_t, a_t; \mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}_t)$.
5. Compute the TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$.
6. Differentiate the value network: $\mathbf{d}_{w,t} = \left.\frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}_t}$.
7. Update the value network: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \delta_t \cdot \mathbf{d}_{w,t}$.
8. Differentiate the policy network: $\mathbf{d}_{\theta,t} = \left.\frac{\partial \log \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}_t}$.
9. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta,t}$.
Actor-Critic Update (with baseline)
Steps 1–8 are identical to the update without baseline; only step 9 changes.
9. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \delta_t \cdot \mathbf{d}_{\theta,t}$.
Here the TD target $r_t + \gamma \cdot q_{t+1}$ inside $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$ plays the role of the baseline.
Deterministic Policy Gradient (DPG)
References:
• D. Silver et al. Deterministic Policy Gradient Algorithms. ICML, 2014.
• T. P. Lillicrap et al. Continuous Control with Deep Reinforcement Learning. arXiv:1509.02971, 2015.
Deterministic Policy Gradient (DPG)
• DPG is an actor-critic method.
• The policy network is deterministic: $a = \pi(s; \boldsymbol{\theta})$.
• Train the value network by TD learning.
• Train the policy network to maximize the value $q(s, a; \mathbf{w})$.
[Diagram: the policy network (parameters $\boldsymbol{\theta}$) maps the state $s$ to the action $a = \pi(s; \boldsymbol{\theta})$; the value network (parameters $\mathbf{w}$) takes $s$ and $a$ and outputs the value $q(s, a; \mathbf{w})$.]
Train the Policy Network
• Train the policy network to maximize the value $q(s, a; \mathbf{w})$, where $a = \pi(s; \boldsymbol{\theta})$.
• Gradient: $\frac{\partial\, q(s, \pi(s; \boldsymbol{\theta}); \mathbf{w})}{\partial \boldsymbol{\theta}} = \frac{\partial a}{\partial \boldsymbol{\theta}} \cdot \frac{\partial\, q(s, a; \mathbf{w})}{\partial a}$.
• Update $\boldsymbol{\theta}$ by gradient ascent.
[Diagram: same architecture as above; the gradient flows from $q(s, a; \mathbf{w})$ through the action $a$ back into the policy network.]
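A minimal PyTorch sketch of this DPG actor update, assuming a deterministic `policy_net` that outputs a continuous action and a differentiable critic `value_net(s, a)`; both names are illustrative, and only the actor's parameters are stepped here.

```python
import torch

def dpg_update_actor(policy_net, value_net, actor_opt, s):
    """DPG actor update (sketch): ascend q(s, pi(s; theta); w) with respect to theta."""
    a = policy_net(s)                 # deterministic action a = pi(s; theta)
    q = value_net(s, a)               # q(s, a; w), differentiable with respect to a
    loss = -q.mean()                  # gradient ascent on q = gradient descent on -q
    actor_opt.zero_grad()
    loss.backward()                   # chain rule: dq/dtheta = (da/dtheta) * (dq/da)
    actor_opt.step()                  # only theta changes: actor_opt holds only policy parameters
```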