Posts

  • Math of Intelligence : Temporal Difference Learning

    Math Of Intelligence : Temporal Difference Learning

    Monte Carlo:

    Monte Carlo methods wait until the end of an episode to update the state value function for a state:

    \begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] \end{equation}

    where $G_t$ is the actual return following time $t$ and $\alpha$ is the step-size parameter.
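
    As a minimal sketch, here is an every-visit Monte Carlo value update in Python, assuming a tabular value function stored in a dict and an episode given as (state, reward) pairs:

    ```python
    from collections import defaultdict

    def mc_update(episode, V, alpha=0.1, gamma=1.0):
        """Every-visit Monte Carlo update, applied only after the episode has ended.

        episode: list of (state, reward) pairs, where reward is the reward
        received after leaving that state.
        """
        G = 0.0
        # Walk backwards so G accumulates the discounted return that followed each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])
        return V

    # Toy usage: V = defaultdict(float); mc_update([("s0", 0.0), ("s1", 1.0)], V)
    ```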

    TD Learning:

    Also, the return can be written recursively as $G_t = R_{t+1} + \gamma G_{t+1}$, and TD learning approximates $G_{t+1}$ by the current estimate $V(S_{t+1})$.

    So, eq. (1) becomes

    \begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] \end{equation}

    With this equation, $V(S_t)$ can be updated as soon as $R_{t+1}$ and $S_{t+1}$ are observed. This is called TD(0) or one-step TD Learning.

    After taking an action at time $t$, we know the value of $R_{t+1}$ as the reward received, so we can update the state value of the state $S_t$ based on the above equation.

    This is called TD(0) or one-step TD. It is called so because it is a special case of $n$-step TD where $n = 1$ (equivalently, of TD($\lambda$) with $\lambda = 0$).

    In the TD method, the value in brackets is called the TD error:

    \begin{equation} \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \end{equation}

    TD method convergence proof? - TODO
    Which one converges faster? TD or MC? How do we formalize this question? - TODO
    Similarly, for the action value function:

    \begin{equation} Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] \end{equation}
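
    A rough sketch of tabular TD(0) in Python, assuming an environment object with `reset()` and `step(action)` returning `(next_state, reward, done)`, and a fixed policy function:

    ```python
    from collections import defaultdict

    def td0_episode(env, policy, V, alpha=0.1, gamma=0.99):
        """Run one episode and apply the TD(0) update after every step."""
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: reward plus discounted current estimate of the next state.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # (target - V[state]) is the TD error
            state = next_state
        return V

    # Example setup: V = defaultdict(float); then call td0_episode(env, policy, V) per episode.
    ```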

    SARSA :

    On-policy TD control. The next action is picked based on the current policy with $\epsilon$-greedy exploration. $Q$ is learned from the actions actually taken by the current policy.
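
    A minimal SARSA sketch under the same assumed environment interface, with an assumed `epsilon_greedy` helper for action selection:

    ```python
    import random
    from collections import defaultdict

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """Pick a random action with probability epsilon, else the greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """On-policy control: the bootstrap uses the action the policy will actually take next."""
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
        return Q

    # Example setup: Q = defaultdict(float); actions = [0, 1]; run sarsa_episode(env, Q, actions) per episode.
    ```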

    Q-Learning:

    Off-policy TD control. The action to execute is still picked $\epsilon$-greedily, but the update target uses the maximum Q-value over the next state's actions, so the learned $Q$ value does not depend on the policy being followed.

    So,

    \begin{equation} Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right] \end{equation}
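
    For contrast, a sketch of the Q-learning update, reusing the assumed environment interface and `epsilon_greedy` helper from the SARSA sketch above; only the bootstrap target changes:

    ```python
    def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Off-policy: behave epsilon-greedily, but bootstrap from max_a Q(next_state, a)."""
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
        return Q
    ```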

    References:

    Richard S. Sutton, Andrew G. Barto - Reinforcement Learning: An Introduction

  • Math of Intelligence : Dynamic Programming for Markov Decision Process

    Math Of Intelligence : Dynamic Programming

    For a random (equiprobable) policy, every action is equally likely in every state:

    \begin{equation} \pi(a \mid s) = \frac{1}{|\mathcal{A}(s)|} \end{equation}

    Since this is a model-based algorithm, we know the transition probabilities $P(s', r \mid s, a)$, using which we can calculate first the state value function $V_\pi(s)$ and then the action value function $Q_\pi(s, a)$. Then we can update our policy to take, in each state, the action that maximizes the value of that state, i.e. $\pi'(s) = \arg\max_a Q_\pi(s, a)$.

    Continue iterating for better policies.
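
    A compact policy-iteration sketch in Python; the two-state transition and reward tables below are made-up values purely for illustration:

    ```python
    import numpy as np

    # Toy MDP: 2 states, 2 actions. P[s, a, s'] and R[s, a] are illustrative values.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.8, 0.2], [0.1, 0.9]]])
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.9
    n_states, n_actions = R.shape

    policy = np.zeros(n_states, dtype=int)  # start with an arbitrary deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly (small state space).
        P_pi = P[np.arange(n_states), policy]   # shape (S, S)
        R_pi = R[np.arange(n_states), policy]   # shape (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. Q(s, a) = R(s, a) + gamma * sum_s' P V(s').
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy

    print("policy:", policy, "values:", V)
    ```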

  • Math of Intelligence : Markov Decision Process

    Math Of Intelligence : Markov Decision Process

    What is a Markov Decision Process?

    A Markov Decision Process consists of 5 elements: $S$, $A$, $P$, $R$ and $\gamma$.

    $S$ : set of states

    $A$ : set of actions

    $R$ : reward function

    $P$ : transition probability function $P(s', r \mid s, a)$

    $\gamma$ : discount factor

    The states of an MDP have the Markov property: $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t]$.

    It means that the future depends only on the current state and not on the history of all previous states.

    Bellman Equations

    $V(s)$ is the state value function. It describes the expected return given the current state $s$, and $Q(s, a)$ is the action value function, which describes the expected return given the current state $s$ and the action $a$ that the agent takes from state $s$:

    \begin{equation} V(s) = E\left[ G_t \mid S_t = s \right] \end{equation}

    \begin{equation} Q(s, a) = E\left[ G_t \mid S_t = s, A_t = a \right] \end{equation}

    Here, $G_t$ is the return at time $t$, that is, the discounted sum of rewards that we will get after time $t$. So, $G_t$ can be represented as

    \begin{equation} G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma G_{t+1} \end{equation}

    where $\gamma$ is the discount factor.

    Now, Eq. (1) becomes:

    \begin{equation} V(s) = E\left[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s \right] \end{equation}

    Similarly, for the Q-value,

    \begin{equation} Q(s, a) = E\left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] \end{equation}

    Bellman Expectation Equations:

    \begin{equation} V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma V_\pi(s') \right] \end{equation}

    \begin{equation} Q_\pi(s, a) = \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s', a') \right] \end{equation}

    Bellman Optimality Equations

    Let's find out the optimal values for the state value and action value functions:

    \begin{equation} V_*(s) = \max_{a} \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma V_*(s') \right] \end{equation}

    \begin{equation} Q_*(s, a) = \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma \max_{a'} Q_*(s', a') \right] \end{equation}
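
    As a quick numerical illustration, the sketch below repeatedly applies the Bellman optimality backup (value iteration) to a small MDP; the transition and reward arrays are made-up values:

    ```python
    import numpy as np

    # Made-up 3-state, 2-action MDP: P[s, a, s'] transition probs, R[s, a] expected rewards.
    P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
                  [[0.0, 0.9, 0.1], [0.5, 0.5, 0.0]],
                  [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
    R = np.array([[0.0, 1.0],
                  [0.5, 0.0],
                  [0.0, 0.0]])
    gamma = 0.9

    V = np.zeros(3)
    for _ in range(1000):
        # Bellman optimality backup: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s,a) V(s') ]
        Q = R + gamma * (P @ V)     # shape (states, actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new

    print("V*:", V, "greedy policy:", Q.argmax(axis=1))
    ```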

  • Math of Intelligence : Logistic Regression

    Math Of Intelligence : Logistic Regression

    Here, we will be figuring out the math for a binary logistic classifier.

    Logistic Regression is similar to Linear Regression, but instead of a real-valued output $y$, the output will be either 0 or 1, since we need to classify the input into one of 2 categories.

    In the linear regression post, we have defined our hypothesis function as:

    \begin{equation} h(x) = \theta_0 + \theta_1 x \end{equation}

    Now, we can also have multiple input features, i.e. $x_1, x_2, x_3$ and so on, so in that case our hypothesis function becomes:

    \begin{equation} h(X) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \end{equation}

    We have added $x_0 = 1$ with $\theta_0$ for simplification. Now, the hypothesis function can be expressed as a combination of just 2 vectors, $\theta$ and $X$:

    \begin{equation} h(X) = \theta^T X \end{equation}

    Still, the output of this function will be an unbounded real value, so we'll apply an activation function to squash it into the range (0, 1), which can then be thresholded to a 0 or 1 prediction. We'll use the sigmoid function for this purpose. TODO: Explore other activation functions

    \begin{equation} g(z) = \frac{1}{1+e^{-z}} \end{equation}

    \begin{equation} h(X) = g(\theta^TX) = \frac{1}{1+e^{-\theta^TX}} \end{equation}
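
    A small NumPy sketch of this hypothesis; the example feature matrix and parameter vector are arbitrary illustrative values:

    ```python
    import numpy as np

    def sigmoid(z):
        """g(z) = 1 / (1 + e^{-z})"""
        return 1.0 / (1.0 + np.exp(-z))

    def hypothesis(theta, X):
        """h(X) = g(theta^T X); X includes the bias feature x0 = 1."""
        return sigmoid(X @ theta)

    # Toy example: 3 samples, 2 features plus the bias column of ones.
    X = np.array([[1.0, 0.5, 1.2],
                  [1.0, -1.0, 0.3],
                  [1.0, 2.0, -0.7]])
    theta = np.array([0.1, 0.8, -0.5])
    print(hypothesis(theta, X))  # probabilities in (0, 1)
    ```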

    The most commonly used loss function for logistic regression is log-loss (or cross-entropy). TODO: Why log-loss? Explore other loss functions.

    So, the loss function for $m$ training examples is:

    \begin{equation} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h(x^{(i)})\right) \right] \end{equation}

    which can also be represented as:

    \begin{equation} J(\theta) = -\frac{1}{m} \left[ y^T \log h(X) + (1 - y)^T \log\left(1 - h(X)\right) \right] \end{equation}

    Now, similar to linear regression, we need to find out the value of $\theta$ that minimizes the loss. We can again use gradient descent for that. TODO: Explore other methods to minimize the loss function.

    \begin{equation} \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \end{equation}

    where $\alpha$ is the learning rate.

    From (8), we get that we need to find out $\frac{\partial J(\theta)}{\partial \theta_j}$ to derive the gradient descent rule. Let's start by working with just one training example.

    $\frac{\partial J(\theta)}{\partial \theta_j}$ can be broken down as follows:

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial J(\theta)}{\partial h(x)} \cdot \frac{\partial h(x)}{\partial \theta_j} \end{equation}

    Calculating $\frac{\partial J(\theta)}{\partial h(x)}$ first:

    Using the chain rule of derivatives,

    \begin{equation} \frac{\partial J(\theta)}{\partial h(x)} = -\frac{y}{h(x)} + \frac{1 - y}{1 - h(x)} \end{equation}

    Now, calculating $\frac{\partial h(x)}{\partial \theta_j}$,

    Again, using the chain rule (with $h(x) = g(\theta^T x)$ and $g'(z) = g(z)\left(1 - g(z)\right)$),

    \begin{equation} \frac{\partial h(x)}{\partial (\theta^T x)} = h(x)\left(1 - h(x)\right) \end{equation}

    \begin{equation} \frac{\partial (\theta^T x)}{\partial \theta_j} = x_j \quad \Rightarrow \quad \frac{\partial h(x)}{\partial \theta_j} = h(x)\left(1 - h(x)\right) x_j \end{equation}

    Finally, combining (10), (11), (12), we get

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_j} = \left( -\frac{y}{h(x)} + \frac{1 - y}{1 - h(x)} \right) h(x)\left(1 - h(x)\right) x_j = \left( h(x) - y \right) x_j \end{equation}

    Plugging this back in (8), the update rule for one training example becomes:

    \begin{equation} \theta_j := \theta_j - \alpha \left( h(x) - y \right) x_j \end{equation}
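
    Putting the pieces together, a short sketch of batch gradient descent for logistic regression; the toy dataset, learning rate, and iteration count are arbitrary choices for illustration:

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, y, alpha=0.1, iters=1000):
        """Batch gradient descent using grad_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iters):
            h = sigmoid(X @ theta)       # predictions for all m examples
            grad = X.T @ (h - y) / m     # vectorized form of (h(x) - y) * x_j
            theta -= alpha * grad
        return theta

    # Toy dataset: bias column plus one feature; labels roughly separable at x = 0.
    X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    theta = train_logistic(X, y)
    print(theta, sigmoid(X @ theta))
    ```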

  • Math of Intelligence : Linear Regression

    Math Of Intelligence : Linear Regression

    Let $x$ be the input feature and $y$ be the output that we are interested in.

    For linear regression, we need a hypothesis function that predicts $y$, given the input feature $x$.

    Let us assume that $y$ is linearly dependent on $x$, so our hypothesis function is:

    \begin{equation} h_\theta(x) = \theta_0 + \theta_1 x \end{equation}

    Here, the $\theta_i$'s are the parameters (or weights). To simplify the notation, we will drop the $\theta$ in the subscript of $h_\theta(x)$ and write it simply as $h(x)$.

    Now, we need to find a way to measure the error between our predicted output $h(x)$ and the actual value $y$ for all our training examples.

    One way to measure this error is the ordinary least squares method. TODO: Explore other cost functions

    So, the cost function (or loss function)* according to the ordinary least squares method will be as follows:

    \begin{equation} J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 \end{equation}

    *there’s some debate about whether they are the same or not but for now we’ll assume they are the same

    On expanding $h(x^{(i)})$, we get

    \begin{equation} J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2 \end{equation}

    Our objective is to find the values of $\theta_0$ and $\theta_1$ that minimize the loss function.

    One way to do this is by using the gradient descent method. TODO: Explore other methods to find the global minima of a function

    \begin{equation} \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \end{equation}

    In this method, we first initialize $\theta$ randomly and then update it according to the above rule, coming closer to the minimum with each update.

    Here, $\alpha$ is the learning rate.

    Hence, in order to update $\theta_j$, we need to find out the partial derivative of $J(\theta)$ w.r.t. $\theta_j$. In our case, $j = 0$ and $1$.

    w.r.t. $\theta_0$:

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_0} = \frac{\partial}{\partial \theta_0} \frac{1}{2} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2 \end{equation}

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_0} = \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \end{equation}

    w.r.t. $\theta_1$:

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_1} = \frac{\partial}{\partial \theta_1} \frac{1}{2} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2 \end{equation}

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_1} = \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)} \end{equation}

    Combining equations (4) and (6), as well as (4) and (8), we get:

    \begin{equation} \theta_0 := \theta_0 - \alpha \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \end{equation}

    \begin{equation} \theta_1 := \theta_1 - \alpha \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)} \end{equation}

    The above equations can be used to update the weights and hence improve the hypothesis function with every iteration over the training examples.
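
    A short sketch of these update rules in code, assuming the batch form above; the toy data and hyperparameters are arbitrary:

    ```python
    import numpy as np

    def train_linear(x, y, alpha=0.01, iters=2000):
        """Batch gradient descent for h(x) = theta0 + theta1 * x."""
        theta0, theta1 = 0.0, 0.0
        for _ in range(iters):
            h = theta0 + theta1 * x
            # Gradients from equations (6) and (8): sums of residuals (times x for theta1).
            grad0 = np.sum(h - y)
            grad1 = np.sum((h - y) * x)
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
        return theta0, theta1

    # Toy data generated roughly from y = 2x + 1.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
    print(train_linear(x, y))  # should come out close to (1, 2)
    ```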

    References:

    1. https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf