Reinforcement learning is a paradigm that aims to model the trial-and-error learning process required in many problems where explicit instructive signals are not available. It has roots in operations research, behavioral psychology, and AI. The goal of the course is to introduce the basic mathematical foundations of reinforcement learning, as well as to highlight some of the recent directions of research.
The tables below list the course materials for Week 0 to Week 12. Each topic has both a YouTube link and a VideoKen link.
Week 0 - Preparatory Material
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Probability tutorial - 1 | | |
| Probability tutorial - 2 | | |
| Linear algebra tutorial - 1 | | |
| Linear algebra tutorial - 2 | | |

Assignment 0
Solution 0
Week 1 - Introduction & Immediate RL

| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Introduction to RL | | |
| RL framework and applications | | |
| Introduction to immediate RL | | |
| Bandit optimalities | | |
| Value function based methods | | |

Assignment 1
Solution 1
Week 2 - Bandit Algorithms
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| UCB 1 | | |
| Concentration bounds | | |
| UCB 1 Theorem | | |
| PAC bounds | | |
| Median elimination | | |
| Thompson sampling | | |

Assignment 2
Solution 2
Additional Reads
Auer, P.; Cesa-Bianchi, N.; Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem.
Auer, P.; Ortner, R. 2010. UCB Revisited: Improved Regret Bounds for the Stochastic Multi-Armed Bandit Problem.
Even-Dar, E.; Mannor, S.; Mansour, Y. 2006. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems.
Tutorial on OFUL (Szepesvari, C.) Part 1 | Part 2 | Part 3
Week 3 - Policy Gradient Methods & Introduction to Full RL
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Policy search | | |
| REINFORCE | | |
| Contextual bandits | | |
| Full RL introduction | | |
| Returns, value functions & MDPs | | |

Assignment 3
Solution 3
Additional Reads
Notes on REINFORCE algorithm
Week 4 - MDPs & Bellman Equations

| Topic | YouTube | VideoKen |
| --- | --- | --- |
| MDP modelling | | |
| Bellman equation | | |
| Bellman optimality equation | | |
| Cauchy sequence & Green's equation | | |
| Banach fixed point theorem | | |
| Convergence proof | | |

Assignment 4
Solution 4
Week 5 - Dynamic Programming & Monte Carlo Methods
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| LPI convergence | | |
| Value iteration | | |
| Policy iteration | | |
| Dynamic programming | | |
| Monte Carlo | | |
| Control in Monte Carlo | | |

Assignment 5
Solution 5
Week 6 - Monte Carlo & Temporal Difference Methods
Week 7 - Eligibility Traces
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Eligibility traces | | |
| Backward view of eligibility traces | | |
| Eligibility trace control | | |
| Thompson sampling recap | | |

Assignment 7
Solution 7
Week 8 - Function Approximation
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Function approximation | | |
| Linear parameterization | | |
| State aggregation methods | | |
| Function approximation & eligibility traces | | |
| LSTD & LSTDQ | | |
| LSPI & Fitted Q | | |

Assignment 8
Solution 8
Week 9 - DQN, Fitted Q & Policy Gradient Approaches
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| DQN & Fitted Q-iteration | | |
| Policy gradient approach | | |
| Actor critic & REINFORCE | | |
| REINFORCE (cont'd) | | |
| Policy gradient with function approximation | | |

Assignment 9
Solution 9
Additional Reads
Notes on Policy Gradient Algorithms
Week 10 - Hierarchical Reinforcement Learning
| Topic | YouTube | VideoKen |
| --- | --- | --- |
| Hierarchical reinforcement learning | | |
| Types of optimality | | |
| Semi-Markov decision processes | | |
| Options | | |
| Learning with options | | |
| Hierarchical abstract machines | | |

Assignment 10
Solution 10
Additional Reads
Andrew G. Barto and Sridhar Mahadevan. 2003. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems 13, 1-2 (January 2003), 41-77. DOI: https://doi.org/10.1023/A:1022140919877
Week 11 - Hierarchical RL: MAXQ
Additional Reads
Dietterich, T. G. 2000. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition.
Week 12 - POMDPs
Additional Reads
POMDP Tutorial
Tutorial on Predictive State Representations (Singh, S. P.)