ZosoV/rl_comparison

Minimal RL: Fundamental Reinforcement Learning Algorithms Comparison

State: In Progress

Over time, I intend to implement well-known RL algorithms and compare how they work. I'm going to focus on the model-free domain first, but eventually I would like to cover the whole map below.

NOTE: Where possible, I intend to write only small implementations of the algorithms, based on other sources on the Internet. I don't intend to write heavily detailed implementations. This repository is only for understanding.

| Algorithm | Type | Policy Formula | Algorithm Key Points | Other Formulas |
| --- | --- | --- | --- | --- |
| Q-Learning | Value-based | $\pi(s) = \arg\max_a Q(s, a)$ | Learns a Q-function for action-value estimation using the Bellman equation.<br>Updates Q-values via:<br>$Q(s, a) \leftarrow Q(s, a) + \alpha [R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)]$ | TD Error:<br>$\delta = R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)$ |
| Deep Q-Learning (DQN) | Value-based | $\pi(s) = \arg\max_a Q_\theta(s, a)$ | Extension of Q-learning using deep neural networks for Q-function approximation.<br>Implements experience replay and fixed Q-targets (the target network parameters $\theta^-$). | Loss Function:<br>$L(\theta) = \mathbb{E} \left[ \left( R(s, a) + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]$<br>$\theta = \arg\min_\theta L(\theta)$ |
| REINFORCE | Policy-based | $\pi_\theta(a \mid s)$ | Directly learns a stochastic policy using full episode returns.<br>The policy update is proportional to the gradient of the log-probability of the taken actions, weighted by the return. | Policy Gradient:<br>$J\left(\pi_\theta\right)=\underset{\tau \sim \pi_\theta}{\mathrm{E}}[R(\tau)]$<br>$\nabla_\theta J\left(\pi_\theta\right)=\underset{\tau \sim \pi_\theta}{\mathrm{E}}\left[\sum\limits_{t=0}^T \Phi_t \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) \right]$<br>$\theta=\arg \underset{\theta}{\max} \quad J\left(\pi_\theta\right)$<br>The weight $\Phi_t$ can be:<br>$\Phi_t = R(\tau)$<br>$\Phi_t=\sum\limits_{t^{\prime}=t}^T R\left(s_{t^{\prime}}, a_{t^{\prime}}, s_{t^{\prime}+1}\right)$<br>$\Phi_t=\sum\limits_{t^{\prime}=t}^T R\left(s_{t^{\prime}}, a_{t^{\prime}}, s_{t^{\prime}+1}\right)-b\left(s_t\right)$<br>Others:<br>$\Phi_t = Q^{\pi_\theta}(s_t,a_t)$<br>$\Phi_t = A^{\pi_\theta}(s_t,a_t)$ |
| Advantage Actor-Critic (A2C) | Actor-Critic | Actor: $\pi_\theta(a \mid s)$<br>Critic: $V_\phi(s)$ | Combines value-based and policy-based methods, with an actor that selects actions and a critic that evaluates them.<br>The actor updates the policy based on the critic's feedback.<br>The critic evaluates the actor's behavior under the current policy.<br>Uses the advantage function to reduce the variance of the update. | $\nabla_\theta J\left(\pi_\theta\right)=\underset{\tau \sim \pi_\theta}{\mathrm{E}}\left[\sum\limits_{t=0}^T \Phi_t \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) \right]$<br>$\theta=\arg \underset{\theta}{\max} \quad J\left(\pi_\theta\right)$<br>$\phi_k=\arg \underset{\phi}{\min} \underset{s_t, \hat{R}_t \sim \pi_k}{\mathrm{E}}\left[\left(V_\phi\left(s_t\right)-\hat{R}_t\right)^2 \right]$<br>$\Phi_t=\sum\limits_{t^{\prime}=t}^T R\left(s_{t^{\prime}}, a_{t^{\prime}}, s_{t^{\prime}+1}\right)-V_\phi\left(s_t\right)$<br>can be seen as an estimate of the Advantage Function:<br>$A(s, a) = Q(s, a) - V(s)$ |
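
To make the Q-Learning entry above concrete, here is a minimal tabular sketch of the update rule and the TD error. The function names, the list-of-lists Q-table, the toy sizes, and the `done` flag (which drops the bootstrap term on terminal transitions) are assumptions for illustration and are not taken from this repository.

```python
def q_learning_update(Q, s, a, r, s_next, done=False, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error.

    Q is a list of lists indexed as Q[state][action] (an assumed layout).
    """
    # Target: R(s, a) + gamma * max_a' Q(s', a'); no bootstrap if s' is terminal.
    target = r if done else r + gamma * max(Q[s_next])
    td_error = target - Q[s][a]      # delta in the table above
    Q[s][a] += alpha * td_error      # Q(s, a) <- Q(s, a) + alpha * delta
    return td_error


def greedy_action(Q, s):
    """Greedy policy pi(s) = argmax_a Q(s, a)."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


# Hypothetical toy problem: 3 states, 2 actions, all Q-values start at zero.
Q = [[0.0, 0.0] for _ in range(3)]
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

After this single update, the TD error equals the reward (all Q-values were zero), and action 1 becomes the greedy choice in state 0. In practice the agent would also need an exploration strategy such as epsilon-greedy, which is left out here.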

NOTE: $\hat{R}_t$ can be modeled with the infinite discounted return or finite undiscounted return.
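
As a rough illustration of the note above, the sketch below computes the reward-to-go $\hat{R}_t$ for one finished episode, and the weight $\Phi_t = \hat{R}_t - V_\phi(s_t)$ used as an advantage estimate. The function names and the plain-list inputs are assumptions. With `gamma=1.0` it gives the finite undiscounted return; with `gamma < 1` it gives the discounted return truncated at the end of the episode, which only approximates the infinite-horizon case.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return R_hat_t = sum_{t' >= t} gamma^(t' - t) * r_t' for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each return reuses the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantage_estimates(rewards, values, gamma=1.0):
    """Phi_t = R_hat_t - V(s_t), with values[t] standing in for V_phi(s_t)."""
    return [g - v for g, v in zip(rewards_to_go(rewards, gamma), values)]


# Hypothetical 3-step episode with a reward of 1 at every step.
rewards = [1.0, 1.0, 1.0]
undiscounted = rewards_to_go(rewards)              # [3.0, 2.0, 1.0]
discounted = rewards_to_go(rewards, gamma=0.5)     # [1.75, 1.5, 1.0]
adv = advantage_estimates(rewards, values=[2.0, 2.0, 2.0])  # [1.0, 0.0, -1.0]
```

The same `rewards_to_go` output can serve as the regression target for the critic in the A2C row and as the weight $\Phi_t$ in the REINFORCE row.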

TODOS

  • Complete note 001_DQN
  • Write a short summary of 003 REINFORCE
  • Code A2C
  • Code REINFORCE
  • Review TRPO

Paper Reviews

Extra Sources

About

I intend to compare the functionality of two RL algorithms: DQN (value-based) and Actor-Critic (policy-based).
