Deep Reinforcement Learning
Deep reinforcement learning (deep RL) combines the representational power of deep neural networks with the decision-making framework of reinforcement learning. Rather than learning from labeled data, a deep RL agent learns by interacting with an environment: taking actions, observing outcomes, and adjusting its policy to maximize cumulative reward.
This paradigm has produced some of the most striking demonstrations of AI capability, from defeating world-champion Go players and top chess engines to controlling plasma in nuclear fusion reactors. It has also become a central research direction at the intersection of AI and neuroscience, given its connections to reward learning in biological brains.
The Reinforcement Learning Framework
At each timestep $t$, an agent observes state $s_t$, selects action $a_t$ according to its policy $\pi(a_t \mid s_t)$, receives reward $r_t$, and transitions to state $s_{t+1}$. The goal is to find a policy that maximizes the expected discounted return:
$$J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right]$$
where $\gamma \in [0, 1)$ is the discount factor weighting immediate versus future rewards.
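As a small illustration of the discounted return, the sketch below sums a finite list of rewards with a discount factor. The function name `discounted_return` and the example numbers are assumptions made for illustration, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Work backwards so that each step applies exactly one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

A smaller discount factor makes the agent more short-sighted: with `gamma=0.0` only the first reward counts.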
The two central quantities are:
- Value function $V^\pi(s)$ — the expected return from state $s$ under policy $\pi$.
- Action-value function $Q^\pi(s, a)$ — the expected return from taking action $a$ in state $s$, then following $\pi$.
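The two quantities are closely related. Assuming a discrete action set, and writing $\pi(a \mid s)$ for the policy, $r_t$ for the reward, and $\gamma$ for the discount factor (notation assumed here), the value of a state is the policy-weighted average of its action values, and an action value is the immediate reward plus the discounted value of the next state:

$$
V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a),
\qquad
Q^\pi(s, a) = \mathbb{E}\left[ r_t + \gamma V^\pi(s_{t+1}) \mid s_t = s,\ a_t = a \right]
$$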
Temporal Difference Learning
Temporal difference (TD) methods learn value functions by bootstrapping: updating estimates based on other estimates rather than waiting for the episode to end. The TD(0) update is:
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right]$$
The bracketed term is the TD error: the difference between the bootstrapped target $r_t + \gamma V(s_{t+1})$ and the current prediction $V(s_t)$, scaled by the learning rate $\alpha$. Remarkably, this signal closely matches the firing pattern of midbrain dopamine neurons, providing a direct link between RL theory and neuroscience (see Neuroscience-Inspired AI).
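To make the TD(0) update concrete, here is a minimal tabular sketch on a hypothetical five-state random walk, in the style of the classic textbook example. The environment, the function name `td0_random_walk`, and all hyperparameters are assumptions chosen for illustration.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.05, gamma=1.0, n_states=5, seed=0):
    """Estimate state values with TD(0) under a uniformly random policy.

    The walk starts in the middle state and moves left or right with equal
    probability. Leaving on the right gives reward 1, on the left reward 0.
    """
    rng = random.Random(seed)
    V = [0.5] * n_states  # initial value estimates for the non-terminal states
    for _ in range(episodes):
        s = n_states // 2
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:                      # terminated on the left
                reward, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:            # terminated on the right
                reward, v_next, done = 1.0, 0.0, True
            else:
                reward, v_next, done = 0.0, V[s_next], False
            td_error = reward + gamma * v_next - V[s]  # target minus prediction
            V[s] += alpha * td_error
            if done:
                break
            s = s_next
    return V


# The true values for this walk are 1/6, 2/6, ..., 5/6; the estimates
# should land close to them, with some noise from the constant step size.
print([round(v, 2) for v in td0_random_walk()])
```

Note that each update uses only the next state's current estimate, so learning proceeds step by step without waiting for the episode to finish.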
Key Algorithms
Deep Q-Networks (DQN)
DQN (Mnih et al., 2015) was the first algorithm to demonstrate human-level performance on Atari games directly from pixels. Key innovations:
- Neural network function approximation — A CNN parameterizes the Q-function, mapping raw pixel observations to Q-values for each action.
- Experience replay — Transitions are stored in a replay buffer and sampled randomly to break temporal correlations during training.
- Target network — A periodically updated copy of the Q-network provides stable regression targets.
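Two of these ingredients, experience replay and target-network regression targets, can be sketched without a deep learning library. In the sketch below the class `ReplayBuffer`, the function `dqn_targets`, and the callable `target_q` (standing in for the frozen target network) are hypothetical names, and the neural network itself is omitted.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.buffer), batch_size)


def dqn_targets(batch, target_q, gamma=0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') for each transition.

    `target_q(state)` is assumed to return a list of Q-values, one per action.
    Terminal transitions use the reward alone, with no bootstrapping.
    """
    targets = []
    for _, _, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        targets.append(reward + bootstrap)
    return targets


buffer = ReplayBuffer(capacity=1000)
buffer.add(state=0, action=1, reward=1.0, next_state=1, done=False)
buffer.add(state=1, action=0, reward=0.0, next_state=2, done=True)
batch = buffer.sample(2)
print(dqn_targets(batch, target_q=lambda s: [0.0, 1.0]))
```

In a full agent, the online Q-network would be regressed toward these targets, and `target_q` would be refreshed with a copy of the online network every fixed number of steps.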
Policy Gradient Methods
Instead of learning a value function and deriving a policy, policy gradient methods directly optimize the policy parameters $\theta$ using the gradient of expected return:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]$$
where $G_t$ is the return following timestep $t$. REINFORCE is the simplest policy gradient algorithm. Variance is reduced by subtracting a baseline (often the value function), yielding the advantage $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
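A minimal sketch of REINFORCE with a baseline follows, on a hypothetical two-armed Bernoulli bandit (a one-step problem, so the return is just the reward). The softmax policy, the running-average baseline, and every name and hyperparameter here are assumptions for illustration.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means=(0.2, 0.8), steps=5000, lr=0.1, seed=0):
    """Learn a softmax policy over arms using the score-function gradient."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_means)  # policy parameters: one preference per arm
    baseline = 0.0                  # running average of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_means[action] else 0.0
        advantage = reward - baseline  # baseline reduces gradient variance
        for a in range(len(theta)):
            # d/d theta_a of log pi(action) for a softmax policy
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += lr * advantage * grad_log_pi
        baseline += (reward - baseline) / t
    return softmax(theta)


# The policy should end up strongly preferring the better arm (index 1).
print(reinforce_bandit())
```

Subtracting the baseline does not bias the gradient, because the expected value of the score function under the policy is zero.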
Actor-Critic Methods
Actor-critic architectures maintain both a policy (actor) and a value function (critic). The critic evaluates the actor’s actions, providing a low-variance training signal. Modern variants include:
- A3C (Asynchronous Advantage Actor-Critic) and A2C, its synchronous variant — Multiple workers explore in parallel, decorrelating experience.
- PPO (Proximal Policy Optimization) — Clips the policy update ratio to prevent destructively large steps; widely used for its stability and simplicity.
- SAC (Soft Actor-Critic) — Maximizes both return and entropy, encouraging exploration and robustness in continuous control.
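The clipping idea behind PPO can be sketched in a few lines. The function below evaluates the clipped surrogate objective for a batch, given probability ratios $\pi_{\text{new}}(a \mid s) / \pi_{\text{old}}(a \mid s)$ and advantage estimates. The name `ppo_clip_objective` and the value `clip_eps=0.2` are illustrative assumptions, and a real implementation would compute this on tensors and differentiate through it.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum makes the objective a pessimistic bound, so the
        # policy gains nothing from pushing the ratio outside the clip range
        # in the direction the advantage favors.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)


# Positive advantage, ratio above the range: the gain is capped at 1.2 * 1.0.
print(ppo_clip_objective([1.5], [1.0]))
# Negative advantage, ratio above the range: the full penalty -1.5 is kept.
print(ppo_clip_objective([1.5], [-1.0]))
```

The asymmetry in the two examples is the point of the design: improvements are capped, while moves that make things worse are penalized in full.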
Model-Based RL
Model-free methods learn directly from interaction; model-based methods additionally learn a model of the environment’s dynamics, which enables planning and can improve sample efficiency:
- Dyna-Q — Augments Q-learning with simulated experience from a learned model.
- MuZero (DeepMind) — Learns a latent dynamics model and plans using Monte Carlo Tree Search, achieving superhuman performance across board games and Atari without knowing the game rules.
- Dreamer / DreamerV3 — Learns a world model in a latent space and trains a policy entirely through imagination, achieving strong performance on visual control tasks.
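Of the methods listed, Dyna-Q is simple enough to sketch in tabular form. The example below assumes a hypothetical deterministic corridor in which action 0 moves left, action 1 moves right, and reaching the right-most state gives reward 1. The environment, names, and hyperparameters are all illustrative choices.

```python
import random


def dyna_q(n_states=6, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q: Q-learning plus replayed updates from a learned model."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    model = {}  # learned model: (state, action) -> (reward, next_state, done)

    def step(s, a):
        # True environment dynamics, hidden from the agent.
        s_next = max(0, s - 1) if a == 0 else s + 1
        done = s_next == n_states - 1
        return (1.0 if done else 0.0), s_next, done

    def update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon or Q[s][0] == Q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            r, s_next, done = step(s, a)
            update(s, a, r, s_next, done)      # learn from real experience
            model[(s, a)] = (r, s_next, done)  # record the observed transition
            for _ in range(planning_steps):    # learn from simulated experience
                (ps, pa), (pr, ps_next, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps_next, pdone)
            s = s_next
    return Q


Q = dyna_q()
# After training, moving right should look better than moving left everywhere.
print([("right" if q[1] > q[0] else "left") for q in Q[:-1]])
```

Each real step is followed by `planning_steps` extra updates drawn from the model, which is where the sample-efficiency gain comes from: value information propagates without further environment interaction.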
Exploration vs. Exploitation
A fundamental challenge: the agent must balance exploiting its current knowledge to collect reward with exploring to discover potentially better strategies. Common approaches:
- ε-greedy — With probability ε take a random action; otherwise act greedily.
- Intrinsic motivation — Add a reward bonus for visiting novel states (e.g., based on prediction error or count-based exploration).
- Thompson sampling / posterior sampling — Maintain uncertainty over the Q-function and sample from it to guide exploration.
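The simplest of these strategies, ε-greedy, fits in a few lines. In this sketch `q_values` is assumed to be a list of action-value estimates for the current state, and the function name is a hypothetical choice.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    best = max(q_values)
    # Exploit, breaking ties uniformly at random among the best actions.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


# With epsilon = 0 the choice is purely greedy, so this selects action 1.
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))
```

In practice ε is often annealed from a large value toward a small one, so that the agent explores heavily early on and exploits more as its estimates improve.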
Applications
- Game playing — AlphaGo, AlphaZero, MuZero, OpenAI Five (Dota 2)
- Robotics — Manipulation, locomotion, dexterous hand control
- Science — Plasma control in tokamak fusion reactors (DeepMind with EPFL’s Swiss Plasma Center), protein structure prediction auxiliary tasks
- LLM alignment — RLHF commonly uses PPO to train language models to follow human preferences
- Drug discovery and materials science — Molecular design as sequential decision-making
Key Concepts
- Markov Decision Process (MDP) — The mathematical framework underlying RL: states, actions, transition probabilities, and rewards.
- On-policy vs. Off-policy — On-policy methods (PPO, A2C) learn about the policy being followed; off-policy methods (DQN, SAC) can learn from data generated by any policy.
- Sample Efficiency — The amount of environment interaction required to learn a good policy. Model-based and off-policy methods tend to be more sample efficient.
- Reward Shaping — Modifying the reward signal to guide learning, with care taken not to change the optimal policy.
- Sim-to-Real Transfer — Training in simulation and deploying in the real world; often relies on techniques such as domain randomization to bridge the gap.
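For the reward shaping entry above, one well-known way to guide learning without changing the optimal policy is potential-based shaping (Ng, Harada, and Russell, 1999), which adds $\gamma \Phi(s') - \Phi(s)$ to the reward for some potential function $\Phi$ over states. The sketch below assumes a hypothetical one-dimensional task with a goal at position 10, and treats the potential of terminal states as zero, a common convention for episodic tasks.

```python
def shaped_reward(reward, potential, state, next_state, gamma, done=False):
    """Add the potential-based shaping term gamma * phi(s') - phi(s)."""
    next_potential = 0.0 if done else potential(next_state)
    return reward + gamma * next_potential - potential(state)


def negative_distance_to_goal(state, goal=10):
    # Hypothetical potential: higher (less negative) when closer to the goal.
    return -abs(goal - state)


# Moving from position 3 to 4 gets a bonus of 1 on top of the base reward 0.
print(shaped_reward(0.0, negative_distance_to_goal, 3, 4, gamma=1.0))
```

Because the shaping terms telescope along any trajectory, they change the return of each path from a given state by an amount that depends only on that state, which is why the ranking of policies is preserved.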
Resources
Reinforcement Learning: An Introduction
Sutton & Barto — the canonical RL textbook, freely available online
incompleteideas.net
Spinning Up in Deep RL
OpenAI's practical introduction to deep RL with code, algorithms, and exercises
spinningup.openai.com
HuggingFace Deep RL Course
Hands-on deep RL course with Google Colab notebooks, from Q-learning to PPO
huggingface.co
Playing Atari with Deep Reinforcement Learning
Mnih et al., 2013 — the original DQN paper
arxiv.org