Minimal RL: Fundamental Reinforcement Learning Algorithms Comparison

State: In Progress

Over time, I intend to implement well-known RL algorithms and compare how they work. I'm focusing on the model-free domain first, but I would eventually like to cover the whole of the following map.

NOTE: Where possible, I intend to write only small implementations of the algorithms, based on other sources on the Internet. I don't intend to write heavily detailed implementations. This repository is for understanding only.

Algorithm	Type	Policy Formula	Algorithm Key Points	Other Formulas
Q-Learning	Value-based	$\pi(s) = \arg\max_a Q(s, a)$	Learns a Q-function for action-value estimation using the Bellman equation. Updates Q-values via: $Q(s, a) \leftarrow Q(s, a) + \alpha [R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)]$.	TD Error: $\delta = R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)$
Deep Q-Learning (DQN)	Value-based	$\pi(s) = \arg\max_a Q_\theta(s, a)$	Extension of Q-learning that uses a deep neural network to approximate the Q-function. Uses experience replay and fixed Q-targets (a target network with parameters $\theta^-$). Loss Function: $L(\theta) = \mathbb{E} \left[ \left( R(s, a) + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]$ $\theta = \arg\min_\theta L(\theta)$	Loss Function: $L(\theta)$
REINFORCE	Policy-based	$\pi_\theta(a \mid s)$	Directly learns a stochastic policy using full episode returns. The policy update is proportional to the gradient of the log-probability of the actions taken, weighted by the return. Objective: $J\left(\pi_\theta\right)=\underset{\tau \sim \pi_\theta}{\mathrm{E}}[R(\tau)]$ Policy Gradient: $\nabla_\theta J\left(\pi_\theta\right)=\underset{\tau \sim \pi_\theta}{\mathrm{E}}\left[\sum\limits_{t=0}^T \Phi_t \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) \right]$ $\theta=\arg \underset{\theta}{\max} \quad J\left(\pi_\theta\right)$	The weight $\Phi_t$ can be: $\Phi_t = R(\tau)$ $\Phi_t=\sum\limits_{t^{\prime}=t}^T R\left(s_{t^{\prime}}, a_{t^{\prime}}, s_{t^{\prime}+1}\right)$ $\Phi_t=\sum\limits_{t^{\prime}=t}^T R\left(s_{t^{\prime}}, a_{t^{\prime}}, s_{t^{\prime}+1}\right)-b\left(s_t\right)$ others: $\Phi_t = Q^{\pi_\theta}(s_t,a_t)$ $\Phi_t = A^{\pi_\theta}(s_t,a_t)$
Advantage Actor-Critic (A2C)	Actor-Critic	Actor: $\pi_\theta(a \mid s)$ Critic: $V_\phi(s)$	Combines value-based and policy-based methods, with an actor that selects actions and a critic that evaluates them. The actor updates the policy using the critic's estimates. The critic evaluates states under the current policy. Uses the advantage function to reduce the variance of the update. $\nabla_\theta J\left(\pi_\theta\right)=\underset{\tau \sim \pi_\theta}{\mathrm{E}}\left[\sum\limits_{t=0}^T \Phi_t \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) \right]$ $\theta=\arg \underset{\theta}{\max} \quad J\left(\pi_\theta\right)$ $\phi_k=\arg \underset{\phi}{\min} \underset{s_t, \hat{R}_t \sim \pi_k}{\mathrm{E}}\left[\left(V_\phi\left(s_t\right)-\hat{R}_t\right)^2 \right]$.	$\Phi_t=\sum\limits_{t^{\prime}=t}^T R\left(s_{t^{\prime}}, a_{t^{\prime}}, s_{t^{\prime}+1}\right)-V_\phi\left(s_t\right)$ can be seen as an estimate of the Advantage Function: $A(s, a) = Q(s, a) - V(s)$
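
The Q-Learning row of the table can be illustrated with a small tabular sketch. This is only an illustration under assumptions, not the notebook's implementation: the function names `q_learning_update` and `greedy_action`, the dictionary-based Q-table, and the `done` flag (which disables bootstrapping at terminal states) are all hypothetical choices that do not come from the table.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error.

    Q is a mapping from (state, action) pairs to values (hypothetical layout).
    """
    # max_a' Q(s', a'); assumed to be 0 when s_next is terminal (done=True).
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    # TD error: delta = r + gamma * max_a' Q(s', a') - Q(s, a)
    td_error = r + gamma * best_next - Q[(s, a)]
    # Q(s, a) <- Q(s, a) + alpha * delta
    Q[(s, a)] += alpha * td_error
    return td_error


def greedy_action(Q, s, actions):
    """Greedy policy pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])


# Usage: a single update on an all-zero Q-table.
Q = defaultdict(float)
actions = [0, 1]
delta = q_learning_update(Q, s="s0", a=1, r=1.0, s_next="s1",
                          actions=actions, alpha=0.5, gamma=0.9)
```

With an all-zero table the bootstrap term vanishes, so the TD error equals the reward and `Q[("s0", 1)]` moves halfway toward it (because `alpha=0.5`).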

NOTE: $\hat{R}_t$ can be modeled as either the infinite-horizon discounted return or the finite-horizon undiscounted return.
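
As a rough sketch of how $\hat{R}_t$ and the weight $\Phi_t$ might be computed from one recorded episode, the snippet below builds the reward-to-go and then subtracts a critic estimate. The names `rewards_to_go` and `advantage_estimates` are hypothetical, and `values` stands in for the outputs of $V_\phi(s_t)$. Setting `gamma=1.0` gives the finite undiscounted return; `gamma < 1` on a finite episode only approximates the infinite discounted return by truncation.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return R_hat_t = sum_{t'>=t} gamma^(t'-t) * r_t' for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantage_estimates(rewards, values, gamma=1.0):
    """Phi_t = R_hat_t - V(s_t), an estimate of the advantage A(s_t, a_t)."""
    return [g - v for g, v in zip(rewards_to_go(rewards, gamma), values)]


# Usage on a three-step episode with made-up rewards and value estimates.
episode_rewards = [1.0, 1.0, 1.0]
undiscounted = rewards_to_go(episode_rewards)             # gamma = 1.0
discounted = rewards_to_go(episode_rewards, gamma=0.5)
phi = advantage_estimates(episode_rewards, values=[2.0, 2.0, 0.5])
```

Using the reward-to-go instead of the full-episode return $R(\tau)$ removes rewards collected before step $t$ from the weight, and subtracting the value estimate is the variance-reduction step that the A2C row describes.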

TODOS

Paper Reviews

Q-Learning:
DQN: Code | Summary | Playing Atari with Deep Reinforcement Learning | There is another version in Nature.
REINFORCE: Intro to Policy Optimization | Benchmarking Deep Reinforcement Learning for Continuous Control
A2C / A3C: Summary | Asynchronous Methods for Deep Reinforcement Learning
