Posts

  • Math of Intelligence : Temporal Difference Learning

    Math Of Intelligence : Temporal Difference Learning

    Monte Carlo:

    Monte Carlo methods wait until the end of an episode to update the state value function for a state:

    \begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] \end{equation}

    where $G_t$ is the actual return following time $t$ and $\alpha$ is the step-size parameter.
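
    As a minimal sketch, here is an every-visit Monte Carlo value update in Python, assuming a tabular value function stored in a dict and an episode given as (state, reward) pairs:

    ```python
    from collections import defaultdict

    def mc_update(episode, V, alpha=0.1, gamma=1.0):
        """Every-visit Monte Carlo update, applied only after the episode has ended.

        episode: list of (state, reward) pairs, where reward is the reward
        received after leaving that state.
        """
        G = 0.0
        # Walk backwards so G accumulates the discounted return that followed each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])
        return V

    # Toy usage: V = defaultdict(float); mc_update([("s0", 0.0), ("s1", 1.0)], V)
    ```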

    TD Learning:

    Also, the return can be written recursively as $G_t = R_{t+1} + \gamma G_{t+1}$, and TD learning approximates $G_{t+1}$ by the current estimate $V(S_{t+1})$.

    So, eq. (1) becomes

    \begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] \end{equation}

    With this equation, $V(S_t)$ can be updated as soon as $R_{t+1}$ and $S_{t+1}$ are observed. This is called TD(0) or one-step TD Learning.

    After taking an action at time $t$, we know the value of $R_{t+1}$ as the reward received, so we can update the state value of the state $S_t$ based on the above equation.

    This is called TD(0) or one-step TD. It is called so because it is a special case of $n$-step TD where $n = 1$ (equivalently, of TD($\lambda$) with $\lambda = 0$).

    In the TD method, the value in brackets is called the TD error:

    \begin{equation} \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \end{equation}

    TD method convergence proof? - TODO
    Which one converges faster? TD or MC? How do we formalize this question? - TODO
    Similarly, for the action value function:

    \begin{equation} Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] \end{equation}
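
    A rough sketch of tabular TD(0) in Python, assuming an environment object with `reset()` and `step(action)` returning `(next_state, reward, done)`, and a fixed policy function:

    ```python
    from collections import defaultdict

    def td0_episode(env, policy, V, alpha=0.1, gamma=0.99):
        """Run one episode and apply the TD(0) update after every step."""
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: reward plus discounted current estimate of the next state.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # (target - V[state]) is the TD error
            state = next_state
        return V

    # Example setup: V = defaultdict(float); then call td0_episode(env, policy, V) per episode.
    ```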

    SARSA :

    On-policy TD control. The next action is picked based on the current policy with $\epsilon$-greedy exploration. $Q$ is learned from the actions actually taken by the current policy.
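
    A minimal SARSA sketch under the same assumed environment interface, with an assumed `epsilon_greedy` helper for action selection:

    ```python
    import random
    from collections import defaultdict

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """Pick a random action with probability epsilon, else the greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """On-policy control: the bootstrap uses the action the policy will actually take next."""
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
        return Q

    # Example setup: Q = defaultdict(float); actions = [0, 1]; run sarsa_episode(env, Q, actions) per episode.
    ```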

    Q-Learning:

    Off-policy TD control. The action to execute is still picked $\epsilon$-greedily, but the update target uses the maximum Q-value over the next state's actions, so the learned $Q$ value does not depend on the policy being followed.

    So,

    \begin{equation} Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right] \end{equation}
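
    For contrast, a sketch of the Q-learning update, reusing the assumed environment interface and `epsilon_greedy` helper from the SARSA sketch above; only the bootstrap target changes:

    ```python
    def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Off-policy: behave epsilon-greedily, but bootstrap from max_a Q(next_state, a)."""
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
        return Q
    ```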

    References:

    Richard S. Sutton, Andrew G. Barto - Reinforcement Learning: An Introduction

  • Math of Intelligence : Dynamic Programming for Markov Decision Process

    Math Of Intelligence : Dynamic Programming

    For a random (equiprobable) policy, every action is equally likely in every state:

    \begin{equation} \pi(a \mid s) = \frac{1}{|\mathcal{A}(s)|} \end{equation}

    Since this is a model-based algorithm, we know the transition probabilities $P(s', r \mid s, a)$, using which we can calculate first the state value function $V_\pi(s)$ and then the action value function $Q_\pi(s, a)$. Then we can update our policy to take, in each state, the action that maximizes the value of that state, i.e. $\pi'(s) = \arg\max_a Q_\pi(s, a)$.

    Continue iterating for better policies.
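
    A compact policy-iteration sketch in Python; the two-state transition and reward tables below are made-up values purely for illustration:

    ```python
    import numpy as np

    # Toy MDP: 2 states, 2 actions. P[s, a, s'] and R[s, a] are illustrative values.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.8, 0.2], [0.1, 0.9]]])
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.9
    n_states, n_actions = R.shape

    policy = np.zeros(n_states, dtype=int)  # start with an arbitrary deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly (small state space).
        P_pi = P[np.arange(n_states), policy]   # shape (S, S)
        R_pi = R[np.arange(n_states), policy]   # shape (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. Q(s, a) = R(s, a) + gamma * sum_s' P V(s').
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy

    print("policy:", policy, "values:", V)
    ```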

  • Math of Intelligence : Markov Decision Process

    Math Of Intelligence : Markov Decision Process

    What is a Markov Decision Process?

    A Markov Decision Process consists of 5 elements: $S$, $A$, $P$, $R$ and $\gamma$.

    $S$ : set of states

    $A$ : set of actions

    $R$ : reward function

    $P$ : transition probability function $P(s', r \mid s, a)$

    $\gamma$ : discount factor

    The states of an MDP have the Markov property: $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t]$.

    It means that the future depends only on the current state and not on the history of all previous states.

    Bellman Equations

    $V(s)$ is the state value function. It describes the expected return given the current state $s$, and $Q(s, a)$ is the action value function, which describes the expected return given the current state $s$ and the action $a$ that the agent takes from state $s$:

    \begin{equation} V(s) = E\left[ G_t \mid S_t = s \right] \end{equation}

    \begin{equation} Q(s, a) = E\left[ G_t \mid S_t = s, A_t = a \right] \end{equation}

    Here, $G_t$ is the return at time $t$, that is, the discounted sum of rewards that we will get after time $t$. So, $G_t$ can be represented as

    \begin{equation} G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma G_{t+1} \end{equation}

    where $\gamma$ is the discount factor.

    Now, Eq. (1) becomes:

    \begin{equation} V(s) = E\left[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s \right] \end{equation}

    Similarly, for the Q-value,

    \begin{equation} Q(s, a) = E\left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] \end{equation}

    Bellman Expectation Equations:

    \begin{equation} V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma V_\pi(s') \right] \end{equation}

    \begin{equation} Q_\pi(s, a) = \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s', a') \right] \end{equation}

    Bellman Optimality Equations

    Let's find out the optimal values for the state value and action value functions:

    \begin{equation} V_*(s) = \max_{a} \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma V_*(s') \right] \end{equation}

    \begin{equation} Q_*(s, a) = \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma \max_{a'} Q_*(s', a') \right] \end{equation}
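
    As a quick numerical illustration, the sketch below repeatedly applies the Bellman optimality backup (value iteration) to a small MDP; the transition and reward arrays are made-up values:

    ```python
    import numpy as np

    # Made-up 3-state, 2-action MDP: P[s, a, s'] transition probs, R[s, a] expected rewards.
    P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
                  [[0.0, 0.9, 0.1], [0.5, 0.5, 0.0]],
                  [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
    R = np.array([[0.0, 1.0],
                  [0.5, 0.0],
                  [0.0, 0.0]])
    gamma = 0.9

    V = np.zeros(3)
    for _ in range(1000):
        # Bellman optimality backup: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s,a) V(s') ]
        Q = R + gamma * (P @ V)     # shape (states, actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new

    print("V*:", V, "greedy policy:", Q.argmax(axis=1))
    ```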

  • Math of Intelligence : Logistic Regression

    Math Of Intelligence : Logistic Regression

    Here, we will be figuring out the math for a binary logistic classifier.

    Logistic Regression is similar to Linear Regression, but instead of a real-valued output $y$, the output will be either 0 or 1, since we need to classify the input into one of 2 categories.

    In the linear regression post, we have defined our hypothesis function as:

    \begin{equation} h(x) = \theta_0 + \theta_1 x \end{equation}

    Now, we can also have multiple input features, i.e. $x_1, x_2, x_3$ and so on, so in that case our hypothesis function becomes:

    \begin{equation} h(X) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \end{equation}

    We have added $x_0 = 1$ with $\theta_0$ for simplification. Now, the hypothesis function can be expressed as a combination of just 2 vectors, $\theta$ and $X$:

    \begin{equation} h(X) = \theta^T X \end{equation}

    Still, the output of this function will be an unbounded real value, so we'll apply an activation function to squash it into the range (0, 1), which can then be thresholded to a 0 or 1 prediction. We'll use the sigmoid function for this purpose. TODO: Explore other activation functions

    \begin{equation} g(z) = \frac{1}{1+e^{-z}} \end{equation}

    \begin{equation} h(X) = g(\theta^TX) = \frac{1}{1+e^{-\theta^TX}} \end{equation}
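
    A small NumPy sketch of this hypothesis; the example feature matrix and parameter vector are arbitrary illustrative values:

    ```python
    import numpy as np

    def sigmoid(z):
        """g(z) = 1 / (1 + e^{-z})"""
        return 1.0 / (1.0 + np.exp(-z))

    def hypothesis(theta, X):
        """h(X) = g(theta^T X); X includes the bias feature x0 = 1."""
        return sigmoid(X @ theta)

    # Toy example: 3 samples, 2 features plus the bias column of ones.
    X = np.array([[1.0, 0.5, 1.2],
                  [1.0, -1.0, 0.3],
                  [1.0, 2.0, -0.7]])
    theta = np.array([0.1, 0.8, -0.5])
    print(hypothesis(theta, X))  # probabilities in (0, 1)
    ```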

    The most commonly used loss function for logistic regression is log-loss (or cross-entropy). TODO: Why log-loss? Explore other loss functions.

    So, the loss function for $m$ training examples is:

    \begin{equation} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h(x^{(i)})\right) \right] \end{equation}

    which can also be represented as:

    \begin{equation} J(\theta) = -\frac{1}{m} \left[ y^T \log h(X) + (1 - y)^T \log\left(1 - h(X)\right) \right] \end{equation}

    Now, similar to linear regression, we need to find out the value of $\theta$ that minimizes the loss. We can again use gradient descent for that. TODO: Explore other methods to minimize the loss function.

    \begin{equation} \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \end{equation}

    where $\alpha$ is the learning rate.

    From (8), we get that we need to find out $\frac{\partial J(\theta)}{\partial \theta_j}$ to derive the gradient descent rule. Let's start by working with just one training example.

    $\frac{\partial J(\theta)}{\partial \theta_j}$ can be broken down as follows:

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial J(\theta)}{\partial h(x)} \cdot \frac{\partial h(x)}{\partial \theta_j} \end{equation}

    Calculating $\frac{\partial J(\theta)}{\partial h(x)}$ first:

    Using the chain rule of derivatives,

    \begin{equation} \frac{\partial J(\theta)}{\partial h(x)} = -\frac{y}{h(x)} + \frac{1 - y}{1 - h(x)} \end{equation}

    Now, calculating $\frac{\partial h(x)}{\partial \theta_j}$,

    Again, using the chain rule (with $h(x) = g(\theta^T x)$ and $g'(z) = g(z)\left(1 - g(z)\right)$),

    \begin{equation} \frac{\partial h(x)}{\partial (\theta^T x)} = h(x)\left(1 - h(x)\right) \end{equation}

    \begin{equation} \frac{\partial (\theta^T x)}{\partial \theta_j} = x_j \quad \Rightarrow \quad \frac{\partial h(x)}{\partial \theta_j} = h(x)\left(1 - h(x)\right) x_j \end{equation}

    Finally, combining (10), (11), (12), we get

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_j} = \left( -\frac{y}{h(x)} + \frac{1 - y}{1 - h(x)} \right) h(x)\left(1 - h(x)\right) x_j = \left( h(x) - y \right) x_j \end{equation}

    Plugging this back in (8), the update rule for one training example becomes:

    \begin{equation} \theta_j := \theta_j - \alpha \left( h(x) - y \right) x_j \end{equation}
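
    Putting the pieces together, a short sketch of batch gradient descent for logistic regression; the toy dataset, learning rate, and iteration count are arbitrary choices for illustration:

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, y, alpha=0.1, iters=1000):
        """Batch gradient descent using grad_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iters):
            h = sigmoid(X @ theta)       # predictions for all m examples
            grad = X.T @ (h - y) / m     # vectorized form of (h(x) - y) * x_j
            theta -= alpha * grad
        return theta

    # Toy dataset: bias column plus one feature; labels roughly separable at x = 0.
    X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    theta = train_logistic(X, y)
    print(theta, sigmoid(X @ theta))
    ```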

  • Math of Intelligence : Linear Regression

    Math Of Intelligence : Linear Regression

    Let $x$ be the input feature and $y$ be the output that we are interested in.

    For linear regression, we need a hypothesis function that predicts $y$, given the input feature $x$.

    Let us assume that $y$ is linearly dependent on $x$, so our hypothesis function is:

    \begin{equation} h_\theta(x) = \theta_0 + \theta_1 x \end{equation}

    Here, the $\theta_i$'s are the parameters (or weights). To simplify the notation, we will drop the $\theta$ in the subscript of $h_\theta(x)$ and write it simply as $h(x)$.

    Now, we need to find a way to measure the error between our predicted output $h(x)$ and the actual value $y$ for all our training examples.

    One way to measure this error is the ordinary least squares method. TODO: Explore other cost functions

    So, the cost function (or loss function)* according to the ordinary least squares method will be as follows:

    \begin{equation} J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 \end{equation}

    *there’s some debate about whether they are the same or not but for now we’ll assume they are the same

    On expanding $h(x^{(i)})$, we get

    \begin{equation} J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2 \end{equation}

    Our objective is to find the values of $\theta_0$ and $\theta_1$ that minimize the loss function.

    One way to do this is by using the gradient descent method. TODO: Explore other methods to find the global minima of a function

    \begin{equation} \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \end{equation}

    In this method, we first initialize $\theta$ randomly and then update it according to the above rule, coming closer to the minimum with each update.

    Here, $\alpha$ is the learning rate.

    Hence, in order to update $\theta_j$, we need to find out the partial derivative of $J(\theta)$ w.r.t. $\theta_j$. In our case, $j = 0$ and $1$.

    w.r.t. $\theta_0$:

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_0} = \frac{\partial}{\partial \theta_0} \frac{1}{2} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2 \end{equation}

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_0} = \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \end{equation}

    w.r.t. $\theta_1$:

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_1} = \frac{\partial}{\partial \theta_1} \frac{1}{2} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2 \end{equation}

    \begin{equation} \frac{\partial J(\theta)}{\partial \theta_1} = \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)} \end{equation}

    Combining equations (4) and (6), as well as (4) and (8), we get:

    \begin{equation} \theta_0 := \theta_0 - \alpha \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \end{equation}

    \begin{equation} \theta_1 := \theta_1 - \alpha \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)} \end{equation}

    The above equations can be used to update the weights and hence improve the hypothesis function with every iteration over the training examples.
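
    A short sketch of these update rules in code, assuming the batch form above; the toy data and hyperparameters are arbitrary:

    ```python
    import numpy as np

    def train_linear(x, y, alpha=0.01, iters=2000):
        """Batch gradient descent for h(x) = theta0 + theta1 * x."""
        theta0, theta1 = 0.0, 0.0
        for _ in range(iters):
            h = theta0 + theta1 * x
            # Gradients from equations (6) and (8): sums of residuals (times x for theta1).
            grad0 = np.sum(h - y)
            grad1 = np.sum((h - y) * x)
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
        return theta0, theta1

    # Toy data generated roughly from y = 2x + 1.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
    print(train_linear(x, y))  # should come out close to (1, 2)
    ```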

    References:

    1. https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf