Math Of Intelligence : Dynamic Programming

Start with a random policy.

Since this is a model-based algorithm, we know the environment's dynamics (the transition probabilities and rewards), using which we can calculate first the state value function and then the action value function. We can then update our policy to pick, in each state, the action that maximizes the action value in that state.
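Written out, the evaluate-then-improve step looks roughly like the following. The notation is an assumption, not taken from the original: P(s' | s, a) is the known transition probability, R(s, a, s') the known reward, gamma the discount factor, and pi the current policy.

$$
\begin{aligned}
v_\pi(s) &= \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, v_\pi(s') \bigr] \\
q_\pi(s, a) &= \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, v_\pi(s') \bigr] \\
\pi'(s) &= \arg\max_{a}\; q_\pi(s, a)
\end{aligned}
$$

The first line evaluates the current policy, the second turns state values into action values using the model, and the third is the greedy policy update.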

Continue iterating, evaluating the new policy and then improving it again, until the policy stops changing.
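As a rough illustration, here is a minimal sketch of this evaluate-and-improve loop (usually called policy iteration) in plain Python. The three-state MDP, the discount factor, and all function names are made up for the example. They are assumptions, not part of the original material.

```python
import random

GAMMA = 0.9  # discount factor (assumed value for this toy example)

# Hypothetical toy model: P[state][action] = [(probability, next_state, reward), ...]
# States 0 and 1 are ordinary states; state 2 is absorbing with zero reward.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def q_value(P, v, s, a, gamma=GAMMA):
    """Action value of taking action a in state s, given state values v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def evaluate_policy(P, policy, gamma=GAMMA, tol=1e-8):
    """Iteratively compute the state values of a deterministic policy."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = q_value(P, v, s, policy[s], gamma)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


def improve_policy(P, v, gamma=GAMMA):
    """Greedy policy: in each state, pick the action with the highest action value."""
    return {s: max(P[s], key=lambda a: q_value(P, v, s, a, gamma)) for s in P}


def policy_iteration(P, gamma=GAMMA, seed=0):
    """Alternate evaluation and improvement until the policy stops changing."""
    rng = random.Random(seed)
    policy = {s: rng.choice(list(P[s])) for s in P}  # start from a random policy
    while True:
        v = evaluate_policy(P, policy, gamma)
        new_policy = improve_policy(P, v, gamma)
        if new_policy == policy:
            return policy, v
        policy = new_policy


policy, v = policy_iteration(P)
print("policy:", policy)
print("values:", v)
```

In this toy model the loop settles on action 1 in states 0 and 1, since that is the only path to the reward. Ties (as in the absorbing state) are broken by taking the first action, which keeps the stopping test well defined.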