import {synthetic_logistic_pic} from '../Lecture15-BasicLogistics/components/interactive_pics.js';
synthetic_logistic_pic()

Mon, Mar 16, 2026
Let’s start today with a recap of what we’ve learned recently, taking a somewhat broader view than usual. We’ll even go back to linear regression to see how it all fits together.
We discussed the general linear regression problem of “solving” \[ A\mathbf{x} \approx \mathbf{b} \] as closely as we can by solving the so-called normal equations \[ A^{\mathsf{T}}A\mathbf{x} = A^{\mathsf{T}}\mathbf{b}. \] As abstract as this sounds, it reduces the approximation problem to a square linear system, which standard numerical routines solve efficiently.
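As a concrete illustration, here’s a small numpy sketch (the data is made up) that solves the normal equations directly and checks the answer against numpy’s built-in least-squares routine:

```python
import numpy as np

# Hypothetical data: fit y ≈ w0 + w1*x through noisy points on a line.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
A = np.column_stack([np.ones_like(x), x])          # design matrix with intercept column
b = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(20)  # noisy observations

# Normal equations: (A^T A) x = A^T b
w_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check against the built-in least-squares solver
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(w_normal, w_lstsq))  # True: both give the same minimizer
```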
In addition, the matrix \(A\) can be quite high-dimensional, allowing for multiple regression with many potential predictor variables.
We’ve seen several examples of this in practice:
In the process, several practical issues and potential solutions arose, as described in our presentation on practical regression. These include
All these same concepts play a role in logistic regression, as well. In fact, they play a role across all kinds of machine learning algorithms.
Let’s remind ourselves how this all looks in the context of linear regression using notation that’s common in machine learning. In that context, we often write the regression problem as \[ X\mathbf{w} \approx \mathbf{y}, \] where
The entries of \(\mathbf{w}\) are sometimes called weights or parameters.
In linear regression, the objective is to find the “best” vector \(\mathbf{w}\) of weights. We quantify “best” in terms of Mean Squared Error or MSE: \[ \mathrm{MSE}(\mathbf{w}) =\frac{1}{n}\|\mathbf{y} - X\mathbf{w}\|^2. \] You should recognize that \[ \|\mathbf{y} - X\mathbf{w}\|^2 = \sum_{i=1}^n (y_i - f_{\mathbf{w}}(x_i))^2, \] where \(f_{\mathbf{w}}(x_i) = \mathbf{x}_i^{\mathsf{T}}\mathbf{w}\) is the model’s prediction for the \(i\)th row of \(X\). That’s the expression we would minimize in the most basic formulation of linear regression.
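In code, the MSE objective is nearly a one-liner; a minimal sketch with made-up numbers:

```python
import numpy as np

def mse(X, y, w):
    """Mean squared error (1/n)*||y - Xw||^2 for weights w."""
    r = y - X @ w      # residual vector
    return (r @ r) / len(y)

# Tiny made-up example: two observations, intercept plus one feature
X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0])
print(mse(X, y, np.array([1.0, 1.0])))  # 0.0 -- this w fits the data exactly
```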
The idea behind regularization is to add a penalty if the coefficients are too large. Thus, we write
\[ \mathrm{MSE}_\lambda(\mathbf{w}) = \frac{1}{n}\|\mathbf{y}-X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2. \]
The symbol \(\lambda\) is the regularization constant; the larger \(\lambda\), the stronger the regularization. This L2-penalized version is known as ridge regression.
The right value of \(\lambda\) to use for a particular problem can be difficult to determine. We often estimate it using cross-validation.
Parameters like \(\lambda\) are typically distinguished from the entries of \(\mathbf{w}\); we refer to \(\lambda\) as a hyperparameter.
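With the L2 penalty the minimizer still has a closed form: setting the gradient of \(\mathrm{MSE}_\lambda\) to zero gives \((X^{\mathsf{T}}X + n\lambda I)\mathbf{w} = X^{\mathsf{T}}\mathbf{y}\). A minimal numpy sketch (the \(n\lambda\) scaling follows the \(1/n\) convention above; other texts scale \(\lambda\) differently):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form minimizer of (1/n)||y - Xw||^2 + lam*||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0])
print(ridge(X, y, 0.0))  # lam = 0 recovers ordinary least squares: [1. 1.]
```

Larger values of `lam` shrink the weight vector toward zero, which is exactly the regularization effect described above.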
We might also consider \[ \mathrm{MSE}_\lambda(\mathbf{w}) = \frac{1}{n}\|\mathbf{y}-X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1. \] The difference between this slide and the last is the use of the L1 norm on the weight vector rather than L2; this version is known as lasso regression. It tends to zero out weights that are less important.
Elastic regularization refers to a mix of L1 and L2. Thus, we might minimize \[ \text{MSE}_{\lambda,\alpha}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y}-X\mathbf{w}\|^2 + \lambda \left(\alpha\|\mathbf{w}\|_1 + (1-\alpha)\|\mathbf{w}\|_2^2\right). \] Note that this offers more flexibility and, potentially, better results. It also introduces a second hyperparameter and more computational complexity.
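There’s no closed form once the L1 term enters, but proximal gradient descent handles it: take a gradient step on the smooth part of the objective, then soft-threshold to apply the L1 penalty. A minimal numpy sketch of this objective (the step size and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrinks entries toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def elastic_net(X, y, lam, alpha, step=0.1, iters=5000):
    """Proximal gradient descent on
       (1/n)||y - Xw||^2 + lam*(alpha*||w||_1 + (1-alpha)*||w||^2)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        # gradient of the smooth part (data term plus L2 penalty)
        grad = (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * (1 - alpha) * w
        # gradient step, then soft-threshold for the L1 penalty
        w = soft_threshold(w - step * grad, step * lam * alpha)
    return w
```

The soft-threshold step is what zeroes out small weights, giving the sparsity associated with the L1 term.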
Let’s take a look at how the corresponding notation progresses in the context of logistic regression.
We’re no longer trying to approximate \((x,y)\) data that lies close to a line. Rather, we’re trying to approximate \((x,y)\) data where each \(y\) is zero or one, like so:
In logistic regression, we don’t predict \(y\) values directly; we predict a probability \(p\) that \(y=1\) given an \(x\) value.
To do so, define the logistic (or sigmoid) function \[ \sigma(z)=\frac{1}{1+e^{-z}}. \] Then, \(p_i=\sigma(\mathbf{x}_i^{\mathsf T}\mathbf{w})\).
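The sigmoid is simple to implement; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1).
    (This naive form can overflow for very negative z; libraries
    use a numerically stabilized version.)"""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                    # 0.5 -- the decision boundary
print(sigmoid(np.array([-5.0, 5.0])))  # values near 0 and near 1
```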
Then, we minimize the log-loss obtained from a maximum likelihood estimate: \[ L(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^n \left[y_i\log p_i + (1-y_i)\log(1-p_i)\right]. \]
With L2 regularization, this formula becomes \[ L_\lambda(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i\log p_i + (1-y_i)\log(1-p_i) \right] + \lambda\|\mathbf{w}\|_2^2. \]
And, with elastic regularization \[ \begin{aligned} L_{\lambda,\alpha}(\mathbf{w}) &= -\frac{1}{n} \sum_{i=1}^n \left[ y_i\log p_i + (1-y_i)\log(1-p_i) \right] \\ &+ \lambda\left(\alpha\|\mathbf{w}\|_1 + (1-\alpha)\|\mathbf{w}\|_2^2\right). \end{aligned} \] We again have hyperparameters that are generally determined with cross-validation.
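The log-loss itself is easy to compute; a minimal sketch (the clipping constant is a common numerical safeguard, not part of the formula):

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Average cross-entropy -(1/n)*sum[y*log(p) + (1-y)*log(1-p)]."""
    p = np.clip(p, eps, 1.0 - eps)   # keep log away from 0 and 1
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# A maximally uncertain prediction (p = 0.5) costs log(2) per observation
print(log_loss(np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # ≈ 0.6931
```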
Let’s try to specify how cross-validation works a bit more clearly. Generally, we have a regularized loss function \[L_{\alpha,\lambda}(\mathbf{w})\] that we’re trying to minimize. For fixed hyperparameters \(\alpha\) and \(\lambda\), there’s a standard procedure to find the optimal weights \(\mathbf{w}\). For example,
These optimization procedures are well established.
The hyperparameters, \(\alpha\) and \(\lambda\), live outside the scope of that minimization technique, though. That’s where cross-validation comes in.
Recall that the loss function is defined using data. Cross-validation proceeds by repeatedly splitting the data into training and validation sets. We
Here’s a reasonably simple way to think about it:
Cross-validation is a computationally intensive procedure used to choose hyperparameters for machine learning algorithms.
It works by breaking the data into smaller parts, repeatedly fitting the algorithm using most of the data, and evaluating the resulting model on the remaining held-out part.
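The procedure above can be sketched in a few lines of numpy; here it picks \(\lambda\) for ridge regression (the data, fold count, and candidate \(\lambda\) grid are all made up for illustration):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge minimizer of (1/n)||y - Xw||^2 + lam*||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + len(y) * lam * np.eye(d), X.T @ y)

def cv_score(X, y, lam, k=5):
    """Average validation MSE over k folds for a given lambda."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)         # everything except this fold
        w = ridge_fit(X[train], y[train], lam)  # fit on the training part
        r = y[fold] - X[fold] @ w               # evaluate on the held-out part
        errs.append(np.mean(r ** 2))
    return np.mean(errs)

# Synthetic data: the third feature is pure noise, so some shrinkage may help
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(60), rng.standard_normal((60, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + 0.5 * rng.standard_normal(60)

lams = [0.0, 0.01, 0.1, 1.0]
best = min(lams, key=lambda lam: cv_score(X, y, lam))
print(best)  # the lambda with the smallest cross-validated error
```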
There’s one more practical trick we’ll learn and apply soon called
feature engineering.
Feature engineering is the process of using existing data to generate new variables that more directly align with the prediction target. Typically, this draws on domain expertise rather than anything obvious in the data itself.
Here’s an overly simple example that, hopefully, gets the idea across.
Suppose we want to develop an algorithm to assess the value of the lots in a city grid. The lots are all rectangular, and we know their length and width. As developers, we know that buyers really care about area. So…
We devise a new feature from the length and width and simply store that in a new column called area. Our expectation is that this new variable is likely to correlate more strongly with the value of the lot.
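In pandas, this kind of engineered feature takes one line; a minimal sketch with made-up lot dimensions:

```python
import pandas as pd

# Hypothetical lot dimensions; the engineered 'area' column is length * width
lots = pd.DataFrame({"length": [50, 40, 100],
                     "width":  [20, 30, 10]})
lots["area"] = lots["length"] * lots["width"]
print(lots["area"].tolist())  # [1000, 1200, 1000]
```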
This new type of feature is sometimes called an interaction feature. Interaction features are quite common in sports analytics. For example, we might compute the rating difference between two teams.
Here are a few more realistic examples of feature engineering.
We’ll see all of this in practice in our Titanic Notebook!
We’ll use these ideas next time when approaching Kaggle’s NCAA Competition!!