Math of Intelligence: Dynamic Programming for Markov Decision Processes
Start from a random policy.
Since this is a model-based algorithm, we know the model of the environment (its transition probabilities and rewards), using which we can first calculate the state-value function and then the action-value function. Then we can update the policy to pick, in each state, the action that maximizes the action value in that state.
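Written out, the two value functions and the policy update might look as follows. The notation is an assumption, not taken from the original: P(s'|s,a) for the known transition probabilities, R(s,a) for the expected reward, and γ for a discount factor.

```latex
% State-value function of policy \pi (Bellman expectation equation)
v_\pi(s) = \sum_{a} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_\pi(s') \Big]

% Action-value function, obtained from the state values
q_\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_\pi(s')

% Greedy policy improvement
\pi'(s) = \arg\max_{a} q_\pi(s,a)
```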
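Repeat the evaluation and improvement steps until the policy stops changing; each round yields a policy at least as good as the previous one.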
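The loop described here is commonly called policy iteration. Below is a minimal sketch of it in Python with NumPy, not a definitive implementation. The array layout (`P[s, a, s']` for transition probabilities, `R[s, a]` for expected rewards), the discount factor `gamma`, the function names, and the two-state toy MDP are all assumptions made for illustration.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Estimate the state values of a deterministic policy.

    P: array (S, A, S) of transition probabilities (assumed layout).
    R: array (S, A) of expected rewards (assumed layout).
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    v = np.zeros(n_states)
    while True:
        # Bellman expectation backup under the current policy.
        v_new = R[idx, policy] + gamma * P[idx, policy] @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def policy_iteration(P, R, gamma=0.9, seed=0):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    rng = np.random.default_rng(seed)
    policy = rng.integers(n_actions, size=n_states)  # random starting policy
    while True:
        v = policy_evaluation(P, R, policy, gamma)
        q = R + gamma * P @ v               # action values, shape (S, A)
        new_policy = np.argmax(q, axis=1)   # greedy improvement
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

# Hypothetical two-state, two-action MDP used only to exercise the code.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0  # state 0, action 1: move to state 1
P[1, 0, 1] = 1.0  # state 1, action 0: stay in state 1
P[1, 1, 0] = 1.0  # state 1, action 1: move to state 0
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

policy, v = policy_iteration(P, R)
print(policy, v)
```

In this toy problem the best behaviour is to move to state 1 and stay there, so the returned policy should choose action 1 in state 0 and action 0 in state 1. Ties between actions are broken by `np.argmax` picking the first maximum, which is a simplification a fuller implementation may want to handle more carefully.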