Math of Intelligence : Temporal Difference Learning
Monte Carlo:
Monte Carlo methods wait until the end of the episode to update the state value function:

$$V(S_t) \leftarrow V(S_t) + \alpha \big[ G_t - V(S_t) \big] \qquad (1)$$

where $G_t$ is the actual return following time $t$.
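A minimal sketch of this update in Python. The dict-based value table, the `(state, reward)` episode format, and the `alpha`/`gamma` defaults are illustrative assumptions, not part of the notes:

```python
from collections import defaultdict

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo update: V(S_t) += alpha * (G_t - V(S_t)).

    episode: list of (state, reward) pairs, where reward is R_{t+1}
    received after leaving that state. Updates can only happen after
    the episode ends, because G_t needs all future rewards.
    """
    G = 0.0
    # Walk the episode backwards to accumulate the return G_t.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V

# Example: a tiny 3-step episode with a reward of 1 at the end.
V = defaultdict(float)
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]
print(dict(mc_update(V, episode)))
```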
TD Learning:
The return can be written recursively as

$$G_t = R_{t+1} + \gamma \, G_{t+1}$$

Also, $V(S_{t+1})$ is our current estimate of the expected return from $S_{t+1}$, so $G_t$ can be approximated by $R_{t+1} + \gamma V(S_{t+1})$.

So, eq. (1) becomes

$$V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big] \qquad (2)$$
With this equation, $V(S_t)$ can be updated as soon as $R_{t+1}$ is received: after taking an action at time $t$, we observe the reward $R_{t+1}$ and the next state $S_{t+1}$, which is all eq. (2) needs. This is called TD(0), or one-step TD, because it is a special case of TD($\lambda$) where $\lambda = 0$ (equivalently, of $n$-step TD where $n = 1$).
In the TD method, the quantity in brackets is called the TD error:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
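A minimal TD(0) prediction sketch, assuming a dict-based value table and a single observed transition `(s, r, s_next, done)`; the names and the `alpha`/`gamma` defaults are illustrative:

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=1.0):
    """One-step TD (TD(0)) update for the state value function.

    td_error = R_{t+1} + gamma * V(S_{t+1}) - V(S_t); the update can be
    applied as soon as R_{t+1} and S_{t+1} are observed, with no need
    to wait for the end of the episode.
    """
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error

# Example: a single transition s0 -> s1 with reward 1.
V = defaultdict(float)
print(td0_update(V, "s0", 1.0, "s1", done=False), dict(V))
```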
TD method convergence proof? - TODO
Which one converges faster? TD or MC? How do we formalize this question? - TODO
Similarly, for the action value function:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]$$
SARSA:
On-policy TD control. The next action is picked by the current policy with $\epsilon$-greedy exploration, and $Q$ is learned from the actions that policy actually takes: the update above uses $Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is the action selected in $S_{t+1}$.
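A minimal SARSA step in Python, under the same illustrative assumptions (dict-based Q table, hypothetical state and action names); the key point is that the target uses the action the $\epsilon$-greedy policy actually picked:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps=0.1):
    """Behaviour policy: explore with probability eps, else act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=1.0):
    """On-policy target: uses Q(S_{t+1}, A_{t+1}) for the action the
    eps-greedy policy actually selected in the next state."""
    target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example: one transition; the next action comes from the same policy.
Q, actions = defaultdict(float), ["left", "right"]
a_next = epsilon_greedy(Q, "s1", actions)
sarsa_update(Q, "s0", "left", 1.0, "s1", a_next, done=False)
print(dict(Q))
```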
Q-Learning:
Off-policy TD control. The next action to execute is still picked $\epsilon$-greedily, but the update target takes the maximum Q-value over the actions available in the next state, so the learned $Q$ does not depend on the behaviour policy.

So,

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big]$$
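A minimal Q-learning step under the same illustrative assumptions; the only change from the SARSA sketch is the max over next-state actions in the target:

```python
from collections import defaultdict

def q_learning_update(Q, actions, s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    """Off-policy target: uses max_a' Q(S_{t+1}, a'), regardless of which
    action the behaviour policy will actually take next."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example: one transition with two actions available in the next state.
Q = defaultdict(float)
q_learning_update(Q, ["left", "right"], "s0", "left", 1.0, "s1", done=False)
print(dict(Q))
```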
References:
Richard S. Sutton, Andrew G. Barto - Reinforcement Learning: An Introduction