Reinforcement learning is a paradigm that aims to model the trial-and-error learning process required in many problems where explicit instructive signals are not available. It has roots in operations research, behavioral psychology, and AI. The goal of the course is to introduce the basic mathematical foundations of reinforcement learning, as well as to highlight some of the recent directions of research.
The tables below list the course materials for Week 0 to Week 12. Each topic has both a YouTube link and a VideoKen link.
Week 0 - Preparatory Material
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Probability tutorial - 1 | | |
| Probability tutorial - 2 | | |
| Linear algebra tutorial - 1 | | |
| Linear algebra tutorial - 2 | | |

Assignment 0
Solution 0
Week 1 - Introduction & Immediate RL

| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Introduction to RL | | |
| RL framework and applications | | |
| Introduction to immediate RL | | |
| Bandit optimalities | | |
| Value function based methods | | |

Assignment 1
Solution 1
Week 2 - Bandit Algorithms
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| UCB 1 | | |
| Concentration bounds | | |
| UCB 1 Theorem | | |
| PAC bounds | | |
| Median elimination | | |
| Thompson sampling | | |

Assignment 2
Solution 2
Additional Reads
Auer, P.; Cesa-Bianchi, N.; Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem.
Auer, P.; Ortner, R. 2010. UCB Revisited: Improved Regret Bounds for the Stochastic Multi-Armed Bandit Problem.
Even-Dar, E.; Mannor, S.; Mansour, Y. 2006. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems.
Tutorial on OFUL (Szepesvari, C.) Part 1 | Part 2 | Part 3
Week 3 - Policy Gradient Methods & Introduction to Full RL
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Policy search | | |
| REINFORCE | | |
| Contextual bandits | | |
| Full RL introduction | | |
| Returns, value functions & MDPs | | |

Assignment 3
Solution 3
Additional Reads
Notes on REINFORCE algorithm
Week 4 - MDPs & Bellman Equations

| Topic | YouTube | VideoKen |
| --- | --- | --- |
| MDP modelling | | |
| Bellman equation | | |
| Bellman optimality equation | | |
| Cauchy sequence & Green's equation | | |
| Banach fixed point theorem | | |
| Convergence proof | | |

Assignment 4
Solution 4
Week 5 - Dynamic Programming & Monte Carlo Methods
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| LPI convergence | | |
| Value iteration | | |
| Policy iteration | | |
| Dynamic programming | | |
| Monte Carlo | | |
| Control in Monte Carlo | | |

Assignment 5
Solution 5
Week 6 - Monte Carlo & Temporal Difference Methods
Week 7 - Eligibility Traces
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Eligibility traces | | |
| Backward view of eligibility traces | | |
| Eligibility trace control | | |
| Thompson sampling recap | | |

Assignment 7
Solution 7
Week 8 - Function Approximation
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Function approximation | | |
| Linear parameterization | | |
| State aggregation methods | | |
| Function approximation & eligibility traces | | |
| LSTD & LSTDQ | | |
| LSPI & Fitted Q | | |

Assignment 8
Solution 8
Week 9 - DQN, Fitted Q & Policy Gradient Approaches
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| DQN & Fitted Q-iteration | | |
| Policy gradient approach | | |
| Actor critic & REINFORCE | | |
| REINFORCE (cont'd) | | |
| Policy gradient with function approximation | | |

Assignment 9
Solution 9
Additional Reads
Notes on Policy Gradient Algorithms
Week 10 - Hierarchical Reinforcement Learning
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Hierarchical reinforcement learning | | |
| Types of optimality | | |
| Semi-Markov decision processes | | |
| Options | | |
| Learning with options | | |
| Hierarchical abstract machines | | |

Assignment 10
Solution 10
Additional Reads
Andrew G. Barto and Sridhar Mahadevan. 2003. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems 13, 1-2 (January 2003), 41-77. DOI: https://doi.org/10.1023/A:1022140919877
Week 11 - Hierarchical RL: MAXQ
Additional Reads
Dietterich, T. G. 2000. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition.
Week 12 - POMDPs
Additional Reads
POMDP Tutorial
Tutorial on Predictive State Representations (Singh, S. P.)