Math Of Intelligence: Logistic Regression

Here, we will be figuring out the math for a binary logistic classifier.

Logistic Regression is similar to Linear Regression, but instead of a real-valued output, the prediction is either 0 or 1, since we need to classify each input into one of two categories.

In the linear regression post, we defined our hypothesis function as:

\begin{equation} h(x) = \theta_0 + \theta_1x \tag{1} \end{equation}

Now, we can also have multiple input features, i.e. $x_1, x_2, \ldots, x_n$, so in that case our hypothesis function becomes:

\begin{equation} h(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n \tag{2} \end{equation}

We have added $x_0 = 1$ with $\theta_0$ for simplification. Now, the hypothesis function can be expressed as a combination of just 2 vectors: $\theta = [\theta_0, \theta_1, \ldots, \theta_n]$ and $X = [x_0, x_1, \ldots, x_n]$:

\begin{equation} h(X) = \theta^TX \tag{3} \end{equation}
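As a quick sanity check on the vectorized form, here's a tiny NumPy sketch (the values and variable names are made up for illustration) showing that the explicit weighted sum and $\theta^TX$ agree:

```python
import numpy as np

# Made-up values; x_0 is fixed to 1 so that theta_0 acts as the bias term.
theta = np.array([0.5, -1.2, 2.0])   # [theta_0, theta_1, theta_2]
x = np.array([1.0, 3.0, 0.7])        # [x_0, x_1, x_2]

# Explicit weighted sum: theta_0*x_0 + theta_1*x_1 + theta_2*x_2
explicit = sum(t * xi for t, xi in zip(theta, x))

# Vectorized form: theta^T X
vectorized = theta @ x

print(explicit, vectorized)   # both print the same real-valued output
```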

Still, the output of this function is an unbounded real value, so we'll apply an activation function to squash it into the range $(0, 1)$, which we can read as the probability of the positive class and threshold at 0.5 to get a 0/1 prediction. We'll use the sigmoid function for this purpose. TODO: Explore other activation functions

\begin{equation} g(z) = \frac{1}{1+e^{-z}} \tag{4} \end{equation}

\begin{equation} h(X) = g(\theta^TX) = \frac{1}{1+e^{-\theta^TX}} \tag{5} \end{equation}
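A minimal NumPy sketch of the sigmoid and the resulting hypothesis (function names and data are my own, not from any reference implementation); note that the output is a probability strictly between 0 and 1, which we threshold at 0.5 to get a class label:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h(X) = g(theta^T X); X has a leading column of ones for the bias term."""
    return sigmoid(X @ theta)

# Made-up data: 3 examples, 2 features plus the bias column x_0 = 1.
X = np.array([[1.0, 2.0, 0.5],
              [1.0, -1.0, 3.0],
              [1.0, 0.2, 0.2]])
theta = np.array([0.1, 0.8, -0.5])

probs = hypothesis(theta, X)
labels = (probs >= 0.5).astype(int)   # threshold to get 0/1 predictions
print(probs, labels)
```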

The most commonly used loss function for logistic regression is log-loss (or cross-entropy). TODO: Why log-loss? Explore other loss functions.

So, the loss function for $m$ training examples is:

\begin{equation} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\left(h(x^{(i)}), y^{(i)}\right), \quad \text{where} \quad \mathrm{Cost}\left(h(x), y\right) = \begin{cases} -\log\left(h(x)\right) & \text{if } y = 1 \\ -\log\left(1-h(x)\right) & \text{if } y = 0 \end{cases} \tag{6} \end{equation}

which, since $y$ is always either 0 or 1, can also be written in a single expression as:

\begin{equation} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h(x^{(i)})\right)\right] \tag{7} \end{equation}
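For illustration, here's one way the loss above could be computed with NumPy (the `log_loss` name and the clipping constant are my own choices); the clip simply guards against $\log(0)$:

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Cross-entropy averaged over the m examples:
    -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Made-up labels and predicted probabilities for 4 examples.
y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
print(log_loss(y, y_hat))   # lower is better; perfect predictions give 0
```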

Now, similar to linear regression, we need to find the value of $\theta$ that minimizes the loss. We can again use gradient descent for that. TODO: Explore other methods to minimize the loss function.

\begin{equation} \theta_j := \theta_j - \alpha\frac{\partial J(\theta)}{\partial \theta_j} \tag{8} \end{equation}

where $\alpha$ is the learning rate.

From (8), we see that we need to find $\frac{\partial J(\theta)}{\partial \theta_j}$ to derive the gradient descent rule. Let's start by working with just one training example, for which the loss is:

\begin{equation} J(\theta) = -\left[y\log\left(h(X)\right) + \left(1-y\right)\log\left(1-h(X)\right)\right] \tag{9} \end{equation}

$\frac{\partial J(\theta)}{\partial \theta_j}$ can be broken down as follows:

\begin{equation} \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial J(\theta)}{\partial h(X)}\cdot\frac{\partial h(X)}{\partial \theta_j} \tag{10} \end{equation}

Calculating $\frac{\partial J(\theta)}{\partial h(X)}$ first:

Using the chain rule of derivatives,

\begin{equation} \frac{\partial J(\theta)}{\partial h(X)} = -\left[\frac{y}{h(X)} - \frac{1-y}{1-h(X)}\right] = -\frac{y}{h(X)} + \frac{1-y}{1-h(X)} \tag{11} \end{equation}

Now, calculating $\frac{\partial h(X)}{\partial \theta_j}$,

Again, using the chain rule (and the fact that the sigmoid's derivative can be written in terms of itself, $g'(z) = g(z)\left(1-g(z)\right)$),

\begin{equation} \frac{\partial h(X)}{\partial \theta_j} = \frac{\partial g(\theta^TX)}{\partial \theta_j} = g(\theta^TX)\left(1-g(\theta^TX)\right)\frac{\partial (\theta^TX)}{\partial \theta_j} = h(X)\left(1-h(X)\right)x_j \tag{12} \end{equation}
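A quick numerical sanity check of the identity $g'(z) = g(z)\left(1-g(z)\right)$ used above (my own illustration, with an arbitrary point $z$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.37    # arbitrary point
h = 1e-6    # small step for the numerical derivative

numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytical = sigmoid(z) * (1 - sigmoid(z))

print(numerical, analytical)   # the two agree to several decimal places
```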

Finally, combining (10), (11) and (12), we get

\begin{equation} \frac{\partial J(\theta)}{\partial \theta_j} = \left[-\frac{y}{h(X)} + \frac{1-y}{1-h(X)}\right]h(X)\left(1-h(X)\right)x_j = \left[-y\left(1-h(X)\right) + \left(1-y\right)h(X)\right]x_j = \left(h(X) - y\right)x_j \tag{13} \end{equation}
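To double-check the $\left(h(X) - y\right)x_j$ result, here's a finite-difference comparison for a single, made-up training example (all names and values are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_loss(theta, x, y):
    """J(theta) = -[y*log(h) + (1-y)*log(1-h)] for one example."""
    p = sigmoid(x @ theta)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([1.0, 2.0, -0.5])   # x_0 = 1 for the bias
y = 1.0
theta = np.array([0.3, -0.1, 0.7])

# Analytical gradient from the derivation: (h(X) - y) * x_j for each j.
analytical = (sigmoid(x @ theta) - y) * x

# Numerical gradient via central differences, one theta_j at a time.
eps = 1e-6
numerical = np.zeros_like(theta)
for j in range(len(theta)):
    step = np.zeros_like(theta)
    step[j] = eps
    numerical[j] = (single_loss(theta + step, x, y) -
                    single_loss(theta - step, x, y)) / (2 * eps)

print(analytical)
print(numerical)   # the two should match closely
```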

Plugging this back in (8), the gradient descent update rule for a single training example becomes:

\begin{equation} \theta_j := \theta_j - \alpha\left(h(X) - y\right)x_j \tag{14} \end{equation}

For the full set of $m$ training examples, we simply average the per-example gradients from (13) before taking the step.
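Putting everything together, a minimal batch gradient descent sketch on a tiny synthetic dataset (the data, learning rate, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny synthetic dataset: 4 examples, 1 feature, plus the bias column x_0 = 1.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 3.0],
              [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(X.shape[1])
alpha = 0.1            # learning rate
m = X.shape[0]

for _ in range(5000):
    h = sigmoid(X @ theta)           # predictions for all examples
    gradient = X.T @ (h - y) / m     # (1/m) * sum_i (h(x_i) - y_i) * x_i
    theta -= alpha * gradient        # the update rule derived above

print(theta)
print((sigmoid(X @ theta) >= 0.5).astype(int))   # predicted labels should match y
```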