import {synthetic_logistic_pic} from './components/interactive_pics.js';
viewof redraw = Inputs.button("Redraw")
synthetic_logistic_pic(redraw)

Fri, Mar 06, 2026
We’ve been working to get to this point where we can understand the basics of logistic regression and that’s what we’ll do today. We’ll apply it when we get back together after Spring break.
We have data of the form \(\{(x_i,y_i)\}_{i=1}^n\), where each \(x_i\) is a numerical input and each \(y_i\in\{0,1\}\) is a binary outcome.
We wish to model this data using a random variable \(Y\) that depends on a random variable \(X\). We might say that our data consists of observations of the form \((X,Y)\).
import {synthetic_logistic_pic} from './components/interactive_pics.js';
viewof redraw = Inputs.button("Redraw")
synthetic_logistic_pic(redraw)

In this picture, the data is drawn as dots at the levels \(y=0\) and \(y=1\). The curve represents \(P(Y=1|X)\), i.e. the probability that \(Y=1\) given an \(X\) value. As \(x\) increases, the outcome \(y=1\) becomes more likely, so the probability curve is increasing. The probability \(P\) must stay between zero and one, though, so we have asymptotic behavior.
We expect this simplest type of logistic regression to be applicable when we have a binary output variable that depends upon an input variable. Examples include:
There are generalizations that allow for more inputs and outputs. For example, predicting what the next word might be after I've typed part of a phrase.
We fit the data with a model called the sigmoid model, which can be expressed in terms of parameters \(a\) and \(b\):
\[ f_{a,b}(x) = \sigma(ax+b) = \frac{1}{1+e^{-(ax+b)}}. \]
Like linear regression, our objective is to find values for the parameters \(a\) and \(b\) such that the model function \(\sigma(ax+b)\) provides the “best fit” to given data.
The function \(\sigma(x)\) is called the sigmoid and is defined by \[ \sigma(x) = \frac{1}{1+e^{-x}}. \]
For some reason, the symbol \(\sigma\) was chosen for this function. The fields of probability and statistics are famous for overloading their symbols - none more so than the symbol “P”.
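As a quick sanity check (a minimal Python sketch, not part of the lecture code), the sigmoid maps \(0\) to \(1/2\) and satisfies the symmetry \(\sigma(-x)=1-\sigma(x)\):

```python
import math

def sigmoid(x):
    # The logistic sigmoid: 1 / (1 + e^(-x))
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))                # 0.5
print(sigmoid(3) + sigmoid(-3))  # 1.0, since sigmoid(-x) = 1 - sigmoid(x)
```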
So, what is the “best fit”? When modelling a function that returns a probability and is defined in terms of parameters, it’s common to answer this question with the technique of maximum likelihood.
Given data, the basic idea is to choose the parameters that maximize the likelihood that our model would’ve produced that observed data.
Before jumping into this process for logistic regression, let’s take a look at a simpler example - one which is intuitive enough for us to “know” the answer right away but which also suggests the computational solution that we’ll ultimately apply to the more difficult situation of logistic regression.
I have a (potentially) weighted coin that comes up heads with probability \(p\) or tails with probability \(1-p\). Suppose I flip the coin 100 times and it comes up heads 62 times. What is your best guess for the value of \(p\)?
I suppose that the obvious answer is \(p\approx0.62\). Let’s explore a computational approach that yields that exact value.
First, the probability that we get a head on one flip is something; we’ll call it \(p\).
Next, from our discussion of the binomial distribution, we know that the probability that we get 62 heads in 100 flips is
\[ f(p) = \binom{100}{62} \, p^{62}(1-p)^{38}. \]
The idea is to choose \(p\) so that this quantity is maximized.
When dealing with a product of powers that we wish to optimize, it’s common to apply the log first. Since the logarithm is increasing, the maximum of \(\log(f(p))\) occurs at the same place as it does for \(f(p)\). Thus, we compute
\[ \log\left(\binom{100}{62} \, p^{62}(1-p)^{38}\right) = \log \binom{100}{62} + 62\,\log(p) + 38\, \log(1-p) \]
This expression is called the log-likelihood and that is what we will maximize.
Now, to maximize the log-likelihood, we compute the derivative:
\[\begin{align} \frac{d}{dp} (\log(f(p))) &= \frac{d}{dp} \left( \log \binom{100}{62} + 62\,\log(p) + 38\, \log(1-p)\right) \\ &= \frac{62}{p} - \frac{38}{1-p} = \frac{100 p-62}{(p-1) p}. \end{align}\]
This last expression is zero precisely when \(p=0.62\); thus, this is the choice of \(p\) that maximizes the probability that our observed data could actually happen.
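We can confirm this numerically. Here’s a sketch (assuming NumPy and SciPy are available) that minimizes the negative log-likelihood over \(p\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(p):
    # Negative log-likelihood for 62 heads in 100 flips.
    # The constant log C(100, 62) is dropped; it doesn't affect the argmax.
    return -(62 * np.log(p) + 38 * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # very close to 0.62
```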
Is it really easier to take logarithms? In this simple example, perhaps not.
In logistic regression, though, the number of terms corresponds to the amount of data. These terms are all less than one so their product is sure to lead to underflow. Thus, this step is really essential.
To make things more complicated, the log-likelihood is typically multiplied by \(-1\) to yield an expression called the “log-loss”. Most statisticians solve the logistic regression problem by minimizing the log loss.
As far as I know, this translation to log-loss is done simply because so much software is already written to minimize functions in this context.
We now apply maximum likelihood to optimize logistic regression. Again, we have data
\[\{(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)\}.\]
We’re assuming that \[ P(y_i = 1 \mid x_i) = \sigma(a x_i + b) \]
Let’s denote that last probability by \(p_i\), that is \(p_i=\sigma(a x_i + b)\). Thus, \[ P(y_i = 1) = p_i \: \text{ and } \: P(y_i = 0) = 1 - p_i. \] This allows us to write either probability as \[ p_i^{y_i} (1 - p_i)^{1 - y_i}. \] This semi-tricky little formula is true because, when \(y_i=1\), it evaluates to \(p_i^1(1-p_i)^0 = p_i\) and, when \(y_i=0\), it evaluates to \(p_i^0(1-p_i)^1 = 1-p_i\).
Both as expected.
We now derive the formula for likelihood and log-likelihood. These should be functions of \(a\) and \(b\). The likelihood is \[ L(a,b) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}. \]
We take the logarithm to generate the log-likelihood: \[ \ell(a,b) = \log L(a,b) = \sum_{i=1}^n \left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right). \] These are functions of \(a\) and \(b\) because \(p_i\) is a function of \(a\) and \(b\).
This is the function that we maximize!
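In code, the log-likelihood is a short function of \(a\) and \(b\). This sketch uses a tiny, made-up data set purely to illustrate the computation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(a, b, x, y):
    # ell(a, b) = sum of y_i log(p_i) + (1 - y_i) log(1 - p_i),
    # where p_i = sigmoid(a * x_i + b).
    p = sigmoid(a * x + b)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data for illustration
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1, 1])
print(log_likelihood(1.0, 0.0, x, y))  # negative, as a log of probabilities must be
```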
Let’s see what this looks like in code for an interesting example.
I’ve got a data file with all the games from the Big South regular season this year. The file includes the schools, the scores, and the Massey ratings for each team.
We can read the file, concatenate it with itself where the roles of the first and second teams are switched, and add columns for the outcomes and ratings difference.
The result is on the next slide.
import pandas as pd
df = pd.read_csv("https://marksmath.org/data/BigSouthRegularSeasonWithMasseyRatings2026.csv")
df_swapped = df.copy()
df_swapped = df_swapped.rename(columns={
"name1": "name2", "score1": "score2", "rating1": "rating2",
"name2": "name1", "score2": "score1", "rating2": "rating1"})
df_full = pd.concat([df, df_swapped], ignore_index=True)
df_full["rating_difference"] = df_full["rating1"] - df_full["rating2"]
df_full["win"] = (df_full["score1"] > df_full["score2"]).astype(int)
print(f"Total number of rows: {len(df_full)}")
df_full.sample(4, random_state=1)

Total number of rows: 144
| | date | name1 | score1 | rating1 | name2 | score2 | rating2 | rating_difference | win |
|---|---|---|---|---|---|---|---|---|---|
| 94 | 2026-01-17 | Gardner_Webb | 55 | -15.555556 | Presbyterian | 92 | -0.666667 | -14.888889 | 0 |
| 91 | 2026-01-14 | High_Point | 75 | 13.000000 | Winthrop | 92 | 5.333333 | 7.666667 | 0 |
| 84 | 2026-01-10 | SC_Upstate | 50 | -5.666667 | Winthrop | 71 | 5.333333 | -11.000000 | 0 |
| 48 | 2026-02-12 | UNC_Asheville | 79 | 0.166667 | Longwood | 74 | 2.777778 | -2.611111 | 1 |
Looks like we’ve got 144 rows. Does that make sense?
There are 9 teams in the Big South so there are \[ \binom{9}{2} = \frac{9!}{7!\times2!} = 36 \text{ pairs of schools.} \] Each pair of teams plays twice and each game appears twice in our table for a total of \[ 2\times2\times36 = 144 \text{ rows.} \]
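A quick arithmetic check of that count in Python:

```python
import math

pairs = math.comb(9, 2)  # 36 pairs of schools
games = 2 * pairs        # each pair plays twice -> 72 games
rows = 2 * games         # each game appears twice in the table -> 144 rows
print(pairs, games, rows)  # 36 72 144
```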
Here’s a look at just the data. You can hover over the points to see what’s what and who’s who.
Our objective is to fit a function of the form \[ f(x) = \frac{1}{1+e^{-(ax+b)}} \] to this data. We expect that \(b=0\), due to the symmetry of the data but we’ll leave that up to the code.
To fit the function, we find the minimum of the log-loss \[ \text{Loss}(a) = -\sum_{i=1}^n \left(y_i \log\left(\frac{1}{1+e^{-ax_i}}\right) + (1 - y_i)\log\left(1 - \frac{1}{1+e^{-ax_i}}\right)\right). \]
Here’s a plot of that function. It looks like the min is around \(a=0.2.\)
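We can reproduce that minimum numerically. The sketch below minimizes the log-loss for the one-parameter model \(\sigma(ax)\); rather than downloading the course data file, it uses synthetic rating differences generated from an assumed true value \(a=0.2\), purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_loss(a, x, y):
    # Negative log-likelihood of the model p_i = sigmoid(a * x_i), with b = 0.
    p = 1 / (1 + np.exp(-a * x))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic stand-in for the rating differences (true a = 0.2).
rng = np.random.default_rng(0)
x = rng.normal(0, 10, size=500)
y = (rng.random(500) < 1 / (1 + np.exp(-0.2 * x))).astype(int)

result = minimize_scalar(log_loss, args=(x, y), bounds=(0.01, 1.0), method="bounded")
print(result.x)  # should land near 0.2
```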
Here’s how to fit the model with SciKit Learn:
from sklearn.linear_model import LogisticRegression
X = df_full[["rating_difference"]]
y = df_full["win"]
model = LogisticRegression()
model.fit(X, y)
df_full["predicted_win_prob"] = model.predict_proba(X)[:, 1]
print("Intercept:", model.intercept_[0])
print("Coefficient:", model.coef_[0][0])

Intercept: 0.0
Coefficient: 0.198269477600324
To be clear, the
Coefficient: 0.198269477600324
indicates that \(a=0.198269\) and \(b\), of course, is zero. Thus our model function is \[ f(x) = \frac{1}{1+e^{-0.198269 x}}. \] We can use that function to make probabilistic predictions for new games, once we know the Massey rating difference.
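For example, a minimal helper (the function name `win_probability` is ours, using the fitted values above) turns a rating difference into a win probability:

```python
import math

def win_probability(rating_difference, a=0.198269, b=0.0):
    # The fitted sigmoid model f(x) = 1 / (1 + e^(-(a x + b))).
    return 1 / (1 + math.exp(-(a * rating_difference + b)))

print(win_probability(0))    # 0.5 -- evenly matched teams
print(win_probability(10))   # about 0.88 -- team rated 10 points higher
```

Note the symmetry: `win_probability(d) + win_probability(-d)` is always one, just as we'd expect when the two teams swap roles.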
Here’s another graph of the data that includes the model function:
We can apply this to make predictions for actual games - like the Big South Tournament that’s going on right now!
When we studied linear regression, we examined the optimization procedure itself quite closely.
Logistic regression is more complicated and, in particular, leads to non-linear optimization problems. These are typically solved using gradient descent or some other iterative algorithm. We don’t do this by hand.
The sigmoid at least looks like a good CDF but, other than that, it came a bit out of nowhere. Why not a CDF for a normal distribution?
Linear regression is known to work well when residuals are normally distributed about the predicted value. Thus, there will be values above and below the curve in the neighborhood of any point.
That condition is not satisfied in logistic regression; rather, the curve is below the data to the right and above it to the left.
Given a random event with probability \(p\), the expression \(p/(1-p)\) is called the odds for that event. Thus, \(\log(p/(1-p))\) is called the log-odds.
Logistic regression can be shown to work well when the log-odds are approximately linear. That is, \[ \log \frac{p}{1 - p} = a x + b. \] Solving for \(p\), we get the sigmoid-based expression \[ p = \frac{1}{1 + e^{-(a x + b)}}. \]
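Spelling out the “solve for \(p\)” step:

\[\begin{align} \frac{p}{1-p} &= e^{ax+b} \\ p &= (1-p)\,e^{ax+b} \\ p\left(1 + e^{ax+b}\right) &= e^{ax+b} \\ p &= \frac{e^{ax+b}}{1+e^{ax+b}} = \frac{1}{1+e^{-(ax+b)}}. \end{align}\]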
On Monday after break we’ll talk about how to
Comments
Here are some comments to (hopefully) clarify a few points, like