Basics of logistic regression

Fri, Mar 06, 2026

Recap and look ahead

We’ve been working to get to this point where we can understand the basics of logistic regression and that’s what we’ll do today. We’ll apply it when we get back together after Spring break.

Overview

We have data of the form \(\{(x_i,y_i)\}_{i=1}^n\), where

  • Each \(x_i\) is a continuous, numeric variable and
  • Each \(y_i\) is a binary outcome, either \(0\) or \(1\).

We wish to model this data using a random variable \(Y\) that depends on a random variable \(X\). We might say that our data consists of observations of the form \((X,Y)\).

Illustration

In this picture, the data is drawn as dots at the levels \(y=0\) and \(y=1\). The curve represents \(P(Y=1|X)\), i.e. the probability that \(Y=1\) given an \(X\) value. As \(x\) increases, the outcome \(y=1\) becomes more likely, so the probability curve is increasing. The probability \(P\) must stay between zero and one, though, so we have asymptotic behavior.

Reasonable data

We expect this simplest type of logistic regression to be applicable when we have a binary output variable that depends upon an input variable. Examples include,

  • Successful treatment as a function of dosage,
  • Passing grade as a function of study time,
  • Victory in a competition as a function of rating difference.

There are generalizations that allow for more inputs and outputs. For example, predicting the next word after the text typed so far is a classification problem with many inputs and many possible outputs.

The model

We fit the data with a model called the sigmoid model, which can be expressed in terms of parameters \(a\) and \(b\):

\[ f_{a,b}(x) = \sigma(ax+b) = \frac{1}{1+e^{-(ax+b)}}. \]
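As a quick sketch, the model function translates into a couple of lines of NumPy (the names here are my own):

```python
import numpy as np

def sigmoid(x):
    """The sigmoid function: sigma(x) = 1/(1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def model(x, a, b):
    """The sigmoid model f_{a,b}(x) = sigma(a*x + b)."""
    return sigmoid(a * x + b)

# sigma(0) = 1/2, and the outputs stay strictly between 0 and 1.
print(model(0.0, a=1.0, b=0.0))  # 0.5
```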

The logistic task

Like linear regression, our objective is to find values for the parameters \(a\) and \(b\) such that the model function \(\sigma(ax+b)\) provides the “best fit” to given data.

The function \(\sigma(x)\) is called the sigmoid and is defined by \[ \sigma(x) = \frac{1}{1+e^{-x}}. \]

For some reason, the symbol \(\sigma\) was chosen for this function. The fields of probability and statistics are famous for overloading their symbols - none more so than the symbol “P”.

Maximum likelihood

So, what is the “best fit”? When modelling a function that returns a probability and is defined in terms of parameters, it’s common to answer this question with the technique of maximum likelihood.

Given data, the basic idea is to choose the parameters that maximize the likelihood that our model would’ve produced that observed data.

Before jumping into this process for logistic regression, let’s take a look at a simpler example - one which is intuitive enough for us to “know” the answer right away but which also suggests the computational solution that we’ll ultimately apply to the more difficult situation of logistic regression.

Example (for max likelihood)

I have a (potentially) weighted coin that comes up heads with probability \(p\) or tails with probability \(1-p\). Suppose I flip the coin 100 times and it comes up heads 62 times. What is your best guess for the value of \(p\)?

I suppose that the obvious answer is \(p\approx0.62\). Let’s explore a computational approach that yields that exact value.

Max likelihood computation (1)

First, the probability that we get a head on one flip is something; we’ll call it \(p\).

Next, from our discussion of the binomial distribution, we know that the probability that we get 62 heads in 100 flips is

\[ f(p) = \binom{100}{62} \, p^{62}(1-p)^{38}. \]

The idea is to choose \(p\) so that this quantity is maximized.

Max likelihood computation (2)

When dealing with a product of powers that we wish to optimize, it’s common to apply the log first. Since the logarithm is increasing, the maximum of \(\log(f(p))\) occurs at the same place as it does for \(f(p)\). Thus, we compute

\[ \log\left(\binom{100}{62} \, p^{62}(1-p)^{38}\right) = \log \binom{100}{62} + 62\,\log(p) + 38\, \log(1-p) \]

This expression is called the log-likelihood and that is what we will maximize.

Max likelihood computation (3)

Now, to maximize the log-likelihood, we compute the derivative:

\[\begin{align} \frac{d}{dp} (\log(f(p))) &= \frac{d}{dp} \left( \log \binom{100}{62} + 62\,\log(p) + 38\, \log(1-p)\right) \\ &= \frac{62}{p} - \frac{38}{1-p} = \frac{100 p-62}{(p-1) p}. \end{align}\]

This last expression is zero precisely when \(p=0.62\); thus, this is the choice of \(p\) that maximizes the probability that our observed data could actually happen.
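The same answer can be recovered numerically. Here’s a sketch using SciPy to minimize the negative log-likelihood (the binomial coefficient is a constant, so it doesn’t affect where the maximum occurs):

```python
import numpy as np
from scipy.optimize import minimize_scalar

heads, flips = 62, 100

def neg_log_likelihood(p):
    # -(62 log(p) + 38 log(1-p)); the constant log C(100,62) is dropped.
    return -(heads * np.log(p) + (flips - heads) * np.log(1 - p))

# Minimizing the negative log-likelihood maximizes the likelihood.
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # approximately 0.62
```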

Logarithms???

Is it really easier to take logarithms? In this simple example, perhaps not.

In logistic regression, though, the number of terms corresponds to the amount of data. These terms are all less than one so their product is sure to lead to underflow. Thus, this step is really essential.

To make things more complicated, the log-likelihood is typically multiplied by \(-1\) to yield an expression called the “log-loss”. Most statisticians solve the logistic regression problem by minimizing the log loss.

As far as I know, this translation to log-loss is done simply because so much software is already written to minimize functions in this context.

Application to logistic regression

We now apply maximum likelihood to optimize logistic regression. Again, we have data

\[\{(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)\}.\]

We’re assuming that \[ P(y_i = 1 \mid x_i) = \sigma(a x_i + b) \]

A too cute formula

Let’s denote that last probability by \(p_i\), that is \(p_i=\sigma(a x_i + b)\). Thus, \[ P(y_i = 1) = p_i \: \text{ and } \: P(y_i = 0) = 1 - p_i. \] This allows us to write either probability as \[ p_i^{y_i} (1 - p_i)^{1 - y_i}. \] This semi-tricky little formula is true because

  • when \(y_i=1\), we get \(p_i^1(1-p_i)^0 = p_i\) and
  • when \(y_i=0\), we get \(p_i^0(1-p_i)^1 = 1-p_i\).

Both as expected.

Logistic log-likelihood

We now derive the formula for likelihood and log-likelihood. These should be functions of \(a\) and \(b\). The likelihood is \[ L(a,b) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}. \]

We take the logarithm to generate the log-likelihood: \[ \ell(a,b) = \log L(a,b) = \sum_{i=1}^n \left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right). \] These are functions of \(a\) and \(b\) because \(p_i\) is a function of \(a\) and \(b\).

This is the function that we maximize!
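A direct translation of \(\ell(a,b)\) into code might look like the following sketch; the tiny data set here is made up purely for illustration:

```python
import numpy as np

def log_likelihood(a, b, x, y):
    """ell(a,b) = sum of y_i log(p_i) + (1 - y_i) log(1 - p_i), with p_i = sigma(a*x_i + b)."""
    p = 1.0 / (1.0 + np.exp(-(a * x + b)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# A tiny made-up data set: larger x values tend to go with y = 1.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1, 1])

# The log-likelihood is larger (closer to zero) for parameters that fit better.
print(log_likelihood(1.0, 0.5, x, y))
print(log_likelihood(-1.0, 0.0, x, y))  # a negative slope fits this data badly
```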

Example

Let’s take a look at what this looks like in code for an interesting example.

I’ve got a data file with all the games from the Big South regular season this year. The file includes the schools, the scores, and the Massey ratings for each team.

We can read the file, concatenate it with itself where the roles of the first and second teams are switched, and add columns for the outcomes and ratings difference.

The result is on the next slide.

The data

Code
import pandas as pd

# Read the season's games, with a Massey rating for each team.
df = pd.read_csv("https://marksmath.org/data/BigSouthRegularSeasonWithMasseyRatings2026.csv")

# Swap the roles of the two teams; the rename mapping is applied all at once,
# so the columns really do trade places.
df_swapped = df.rename(columns={
    "name1": "name2", "score1": "score2", "rating1": "rating2",
    "name2": "name1", "score2": "score1", "rating2": "rating1"})

# Stack the original and swapped frames so each game appears twice.
df_full = pd.concat([df, df_swapped], ignore_index=True)
df_full["rating_difference"] = df_full["rating1"] - df_full["rating2"]
df_full["win"] = (df_full["score1"] > df_full["score2"]).astype(int)
print(f"Total number of rows: {len(df_full)}")
df_full.sample(4, random_state=1)
Total number of rows: 144
date name1 score1 rating1 name2 score2 rating2 rating_difference win
94 2026-01-17 Gardner_Webb 55 -15.555556 Presbyterian 92 -0.666667 -14.888889 0
91 2026-01-14 High_Point 75 13.000000 Winthrop 92 5.333333 7.666667 0
84 2026-01-10 SC_Upstate 50 -5.666667 Winthrop 71 5.333333 -11.000000 0
48 2026-02-12 UNC_Asheville 79 0.166667 Longwood 74 2.777778 -2.611111 1

Quick check

Looks like we’ve got 144 rows. Does that make sense?

There are 9 teams in the Big South so there are \[ \binom{9}{2} = \frac{9!}{7!\times2!} = 36 \text{ pairs of schools.} \] Each pair of teams plays twice and each game appears twice for a total of \[ 2\times2\times36 = 144 \text{ rows.} \]
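This count is easy to verify in a couple of lines of Python:

```python
from math import comb

pairs = comb(9, 2)    # 36 pairs of schools
rows = pairs * 2 * 2  # each pair plays 2 games and each game appears in 2 rows
print(pairs, rows)    # 36 144
```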

A dot plot

Here’s a look at just the data. You can hover over the points to see what’s what and who’s who.

Objective

Our objective is to fit a function of the form \[ f(x) = \frac{1}{1+e^{-(ax+b)}} \] to this data. We expect that \(b=0\) due to the symmetry of the data, but we’ll leave that up to the code.

To fit the function, we find the minimum of \[ \text{Loss}(a) = -\sum_{i=1}^n \left(y_i \log\left(\frac{1}{1+e^{-a x_i}}\right) + (1 - y_i)\log\left(1 - \frac{1}{1+e^{-a x_i}}\right)\right). \]

Optimization illustration

Here’s a plot of that function. It looks like the min is around \(a=0.2.\)

Doing it

Here’s how to fit the model with scikit-learn:

from sklearn.linear_model import LogisticRegression
X = df_full[["rating_difference"]]
y = df_full["win"]
model = LogisticRegression()
model.fit(X, y)
df_full["predicted_win_prob"] = model.predict_proba(X)[:, 1]

print("Intercept:", model.intercept_[0])
print("Coefficient:", model.coef_[0][0])
Intercept: 0.0
Coefficient: 0.198269477600324

Interpretation

To be clear, the

Coefficient: 0.198269477600324

indicates that \(a=0.198269\) and \(b\), of course, is zero. Thus our model function is \[ f(x) = \frac{1}{1+e^{-0.198269 x}}. \] We can use that function to make probabilistic predictions for new games, once we know the Massey rating difference.
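For instance, here’s the model’s prediction for a hypothetical game in which one team’s Massey rating is 10 points higher than its opponent’s (using the coefficient printed above):

```python
import numpy as np

a = 0.198269  # fitted coefficient from the model above; b = 0

def win_probability(rating_difference):
    """Predicted probability of winning, given the Massey rating difference."""
    return 1.0 / (1.0 + np.exp(-a * rating_difference))

# A team rated 10 points higher than its opponent:
print(win_probability(10.0))  # about 0.879

# Evenly matched teams get probability 1/2, as the symmetry suggests.
print(win_probability(0.0))   # 0.5
```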

Final plot

Here’s another graph of the data that includes the model function:

The tournament

We can apply this to make predictions for actual games - like the Big South Tournament that’s going on right now!

Comments

Here are some comments to (hopefully) clarify a few points, like

  • How does the optimization happen?
  • Why use the “sigmoid”?
  • How does this generalize?

Optimization

When studying linear regression, we studied the optimization procedure itself quite closely.

  • We began by using partial derivatives to see how a linear system naturally occurs, and we solved small examples by hand.
  • After a deep dive into linear algebra, we expressed the same problem in terms of orthogonal projection to yield a system called the normal equations, which we could again solve in small cases.

Logistic regression is more complicated and, in particular, leads to non-linear optimization problems. These are typically solved using gradient descent or some other iterative algorithm. We don’t do this by hand.
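As a rough sketch of what such an iterative solver does (scikit-learn’s actual solvers are more sophisticated), here is plain gradient ascent on the log-likelihood, using the fact that \(\partial\ell/\partial a = \sum_i (y_i - p_i)x_i\) and \(\partial\ell/\partial b = \sum_i (y_i - p_i)\). The data set is made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny made-up, non-separable data set.
x = np.array([-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])

a, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    p = sigmoid(a * x + b)
    # Gradient of the log-likelihood with respect to a and b:
    grad_a = np.sum((y - p) * x)
    grad_b = np.sum(y - p)
    # Step uphill to increase the log-likelihood.
    a += learning_rate * grad_a
    b += learning_rate * grad_b

print(a, b)
```

Each step moves \((a,b)\) a little in the direction of steepest increase of \(\ell\); in practice a library handles the step size and stopping criteria for us.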

Why the sigmoid?

The sigmoid at least looks like a reasonable CDF but, other than that, it seems to have come a bit out of nowhere. Why not use the CDF of a normal distribution?

Linear regression is known to work well when the residuals are normally distributed about the predicted value. Thus, there will be values both above and below the curve in the neighborhood of any point.

That condition is not satisfied in logistic regression; rather, the curve is below the data to the right and above it to the left.

Log-odds

Given a random event with probability \(p\), the expression \(p/(1-p)\) is called the odds for that event. Thus, \(\log(p/(1-p))\) is called the log-odds.

Logistic regression can be shown to work well when the log-odds are approximately linear. That is, \[ \log \frac{p}{1 - p} = a x + b. \] Solving for \(p\), we get the sigmoid-based expression \[ p = \frac{1}{1 + e^{-(a x + b)}}. \]
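It’s easy to check numerically that the sigmoid’s log-odds really are linear (a quick sanity check, not part of the derivation):

```python
import numpy as np

a, b = 0.5, -1.0
x = np.linspace(-3, 3, 7)

# Probabilities from the sigmoid model p = sigma(a*x + b).
p = 1.0 / (1.0 + np.exp(-(a * x + b)))

# log(p / (1 - p)) should recover the linear expression a*x + b.
log_odds = np.log(p / (1 - p))
print(np.allclose(log_odds, a * x + b))  # True
```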

Generalizations

On Monday after break we’ll talk about how to

  • Take more inputs into account,
  • deal with more classes,
  • and more!