Issues in practical regression

Mon, Feb 24, 2025

Last time, we took a look at an actual, real-world regression project and saw some practical issues that arise in that context. Today, we’re going to discuss the mathematics behind the solutions to those issues.

Imports

We’re going to display a lot of code from multiple libraries in this presentation. Let’s go ahead and import all the functionality we’ll need right up front:

import pandas as pd
import numpy as np
import pickle
from numpy.random import random, seed
from scipy.linalg import solve
from scipy.optimize import minimize
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon, Circle
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import HTML, display

Quick review of linear regression

Recall that the expression for the error in linear regression is \[ E(a_0,a_1) = \sum_{i=1}^N \left((a_1\,x_i + a_0) - y_i\right)^2, \] which is exactly equivalent to \[ \left\|\begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_0 \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}\right\|^2. \]

Minimization via orthogonal projection

Now, as \(a_1\) and \(a_0\) vary, the expression on the left \[ \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_0 \end{bmatrix} \] ranges over the two-dimensional subspace of \(\mathbb R^N\) spanned by \[ \begin{bmatrix}x_1&x_2&\cdots&x_N\end{bmatrix}^T \quad \text{and} \quad \begin{bmatrix}1&1&\cdots&1\end{bmatrix}^T. \] To minimize the norm, we should measure the perpendicular distance from \[ \begin{bmatrix}y_1&y_2&\cdots&y_N\end{bmatrix}^T \] to that subspace.

Minimization (cont)

Now write \[ \begin{aligned} X &= \begin{bmatrix} x_1&x_2&\cdots&x_N \\ 1&1&\cdots&1 \end{bmatrix}^T, \\ \mathbf{y} &= \begin{bmatrix}y_1&y_2&\cdots&y_N\end{bmatrix}^T, \text{ and} \\ \mathbf{a} &= \begin{bmatrix} a_1 & a_0 \end{bmatrix}^T. \end{aligned} \] The orthogonality condition states that \(X\mathbf{a} - \mathbf{y}\) should be perpendicular to both columns of \(X\). This can be written as \[ X^T(X\mathbf{a} - \mathbf{y}) = \mathbf{0} \quad \text{ or } \quad X^TX\mathbf{a} = X^T \mathbf{y}. \] This is exactly where the normal equations come from!
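To make the orthogonality condition concrete, here’s a quick numerical check on a tiny made-up data set (a minimal sketch; the numbers are arbitrary). The residual of the normal-equation solution should be perpendicular to both columns of \(X\):

# A tiny made-up data set
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.array([x, np.ones(len(x))]).transpose()          # columns: the x-values and all ones
a = solve(X.transpose().dot(X), X.transpose().dot(y))   # normal equations
X.transpose().dot(X.dot(a) - y)                         # both entries should be essentially zero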

Linear reformulation

We’ll continue to express our algorithms in terms of linear algebra when possible. This will certainly be the case for multiple regression, polynomial regression, and L2 regularization. There are many advantages of this, including

  • Strong theorems on both existence and uniqueness of solutions,
  • Well developed numerical methods for finding solutions, and
  • The potential to exploit sparseness of systems.

L1 regularization will be the first point where we really need non-linear optimization.

Down the road, neural networks lead to non-linear models requiring non-linear optimization. Even then, though, the model can be realized as a sequence of steps, each of which consists of a linear function composed with a non-linear function.
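To give a rough picture of what that means (a minimal sketch with arbitrary weights, not any particular network), a single step might apply a linear map and then a simple non-linear function:

# One "layer": a linear function composed with a non-linear activation.
# The weight matrix W and bias b below are arbitrary placeholders.
W = np.array([[1.0, -0.5], [0.3, 2.0]])
b = np.array([0.1, -0.2])
relu = lambda z: np.maximum(z, 0)       # a simple non-linear function
layer = lambda x: relu(W.dot(x) + b)    # linear map, then non-linearity
layer(np.array([1.0, 2.0]))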

Multiple regression

The housing prices project from the other day illustrates the need to account for multiple inputs in regression. In general, we might have \(N\) data points, each with \(m\) inputs and a single response, that look like \[ \{((x_{i1},x_{i2},\ldots,x_{im}),y_i)\}_{i=1}^N. \] If we have a vector \(\mathbf{a} = \begin{bmatrix} a_0 & a_1 & \cdots & a_m \end{bmatrix}^T\) of \(m+1\) coefficients, then the squared error of our approximation is \[ E(\mathbf{a}) = \sum_{i=1}^N \left(\left(\sum_{j=1}^m a_j x_{ij} + a_0\right) - y_i\right)^2. \]

Matrix formulation

We can reformulate the previous expression as a matrix problem; we just need a few more columns!

\[ \begin{bmatrix} 1 & x_{11} & \cdots & x_{1m} \\ 1 & x_{21} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{Nm} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{bmatrix} \approx \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \]

If we define the matrices that we see above as \(X\), \(\mathbf{a}\), and \(\mathbf{y}\), then we can find the least squares solution using the normal equations: \[X^T X \mathbf{a} = X^T \mathbf{y}.\]

Illustration

Let’s illustrate this with the well known Palmer penguin data set.

Code
penguins = pd.read_csv('https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv')
penguins = penguins[['species', 'sex', 'bill_length_mm', 'flipper_length_mm', 'body_mass_g']]
penguins = penguins.dropna()
print(f"{len(penguins)} observations")
penguins.head()
333 observations
species sex bill_length_mm flipper_length_mm body_mass_g
0 Adelie male 39.1 181.0 3750.0
1 Adelie female 39.5 186.0 3800.0
2 Adelie female 40.3 195.0 3250.0
4 Adelie female 36.7 193.0 3450.0
5 Adelie male 39.3 190.0 3650.0

Now suppose that we’d like to predict body mass using some of the other variables.

Scatter plot

Here’s a scatter plot for the data that relates the body mass to flipper length. There’s also a trendline whose equation you can find via hover.

Code
penguin_plot = px.scatter(penguins, 
  x = 'flipper_length_mm', y = 'body_mass_g',
  width=900, height=400, trendline='ols',
  trendline_color_override='black'
)
penguin_plot.show(config={'displayModeBar': False})

Computing the coefficients

We can compute the coefficients of the trendline directly as follows:

flipper_length = penguins.flipper_length_mm.to_numpy()
X = np.array([
    flipper_length,
    np.ones(len(flipper_length))
]).transpose()
Y = penguins.body_mass_g.to_numpy()
solve(X.transpose().dot(X), X.transpose().dot(Y))
array([   50.15326594, -5872.09268284])

Note that we’ve used exactly the matrix formulation of the least squares problem.
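As a quick sanity check (an aside; it uses nothing beyond the NumPy import above), NumPy’s built-in least squares routine gives the same coefficients:

np.linalg.lstsq(X, Y, rcond=None)[0]   # should match the normal-equation solution above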

Adding another variable

OK, let’s add bill length to bring one more variable into the model:

flipper_length = penguins.flipper_length_mm.to_numpy()
bill_length_mm = penguins.bill_length_mm.to_numpy()
X = np.array([
    np.ones(len(flipper_length)),
    flipper_length,
    bill_length_mm
]).transpose()
Y = penguins.body_mass_g.to_numpy()
multi_coeffs = solve(X.transpose().dot(X), X.transpose().dot(Y))
multi_coeffs
array([-5.83629873e+03,  4.88896918e+01,  4.95860126e+00])

Prediction function

That allows us to define a function to predict body mass in terms of flipper and bill lengths. We can use that to create an approximating plane.

f = lambda x,y: np.dot(
  multi_coeffs[1:], [x,y]) + multi_coeffs[0]
x_min, x_max = penguins.flipper_length_mm.agg(
  ['min', 'max']).to_numpy()
y_min, y_max = penguins.bill_length_mm.agg(
  ['min', 'max']).to_numpy()
xx = [x_min, x_max, x_max, x_min]
yy = [y_min,y_min,y_max,y_max]
zz = [f(x_min,y_min), f(x_max,y_min),
  f(x_max,y_max), f(x_min,y_max)]

A 3D plot

With two predictor variables and one response, we can plot this in 3D:

Code
fig = go.Figure(data=[go.Scatter3d(
    x=penguins.flipper_length_mm,
    y=penguins.bill_length_mm,
    z=penguins.body_mass_g,
    mode='markers',
    marker=dict(size=4,opacity=1)),
    go.Mesh3d(x=xx,y=yy,z=zz,
    opacity=0.5, color='black',
)])
fig.show(config={'displayModeBar': False})

Polynomial regression

Another option to improve model accuracy is to use a different type of function to fit the data. Generally, though, you should have a good reason to believe that some other type of function fits your data better before using it.

In this column of slides, we’ll play with a data set from a 1995 article in the Journal of Statistics Education entitled “Teaching Statistics with Data of Historic Significance.” The article presents data supposedly taken directly from Galileo’s notebooks in his early studies of gravitational motion. Given the context, it probably makes sense to fit a quadratic to the data.


Note that this is still linear regression, because a quadratic is a linear combination of \(1\), \(x\), and \(x^2\).

The data

D = np.array([[ 573, 534, 495, 451, 395, 337, 253],
  [1000,  800,  600,  450,  300,  200,  100]])
X = D[0,:]; Y = D[1,:]
pic = plt.plot(X,Y,'ok')

Fit and plot

We’re going to go through the process of fitting some data D to a polynomial of degree n, displaying the resulting coefficients, and plotting the graph several times, so let’s gather the steps into a function:

def fit_and_plot(D, n):
  X = D[0,:]; Y = D[1,:]
  A = np.array([X**(k) for k in range(n+1)]).transpose()
  coeff = np.array(solve(A.transpose().dot(A), A.transpose().dot(Y)))
  display(HTML(f"Coefficients: {coeff}"))

  f = lambda x: np.dot(coeff, [x**p for p in range(n+1)])
  x = np.linspace(np.min(X),np.max(X),500)
  y = [f(xx) for xx in x]
  plt.plot(x,y)
  plt.plot(X,Y, 'ok')

Linear fit

Here’s how fit_and_plot works at first order:

fit_and_plot(D, 1)
Coefficients: [-713.26188807 2.77907611]

Quadratic fit

And at second order, which works great for this data:

fit_and_plot(D, 2)
Coefficients: [ 6.00053725e+02 -4.00484297e+00 8.18073977e-03]

Overfitting

As your model starts to grow, through either an increase in the number of inputs or an increase in complexity, you run the risk of overfitting. This means that your model matches the peculiarities of the training data in a way that doesn’t extend to general data arising from the same source.

This is a problem that is tremendously important to understand.

In this column of slides, we’ll clearly see overfitting arise in a polynomial fit.

Some data

Here’s some randomly generated data that’s just a uniform random perturbation of purely linear data:

seed(1)
X = np.array([i + random() for i in range(7)])
Y = np.array([i + random() for i in range(7)])
D = np.array([X,Y])

Linear fit

The data should have a pretty good linear fit:

fit_and_plot(D,1)
Coefficients: [-0.13062239 1.11009494]

Higher order?

In general \(N\) data points can be fit exactly with a polynomial of degree \(N-1\). (I suppose this makes sense, since a polynomial of degree \(N-1\) has \(N\) coefficients to choose.)

So… our randomly generated data set has \(7\) points - perhaps we should try to fit it with a sixth degree polynomial??

Let’s see how that goes!

Sextic fit

fit_and_plot(D,len(D[0])-1) 
Coefficients: [ 1.73922986e+01 -6.68694128e+01 7.69602895e+01 -3.90268395e+01 9.95542284e+00 -1.24809441e+00 6.11062780e-02]

Hmm… Even though the polynomial fits the data perfectly, it can vary wildly between the data points.

Moral

  1. Don’t increase model complexity without a sound theoretical basis.
  2. Don’t test your model on training data.

We’ll discuss the test/train strategy and cross-validation shortly.

Regularization

Regularization is a strategy to counter overfitting. It’s based on the observation that the coefficients of a polynomial fit tend to grow as the order of the polynomial grows. Thus, to avoid overfitting, we place a penalty on the size of the coefficients.

In functional terms, we might minimize \[ \|X\mathbf{a} - \mathbf{y}\|^2 + \lambda \|\mathbf{a}\|^2. \]

Note that \(\|\cdot\|\) represents a norm. Common choices are the L2 norm, resulting in ridge regression, and the L1 norm (typically without the square on the penalty), resulting in LASSO regression.

In either case, \(\lambda\) is a parameter; the larger \(\lambda\), the larger the influence of regularization.

Rephrased as a matrix problem

In ridge regression for a given \(\lambda\), the regularized minimization can be achieved by solving the linear system \[ (X^TX + \lambda I)\mathbf{a} = X^T \mathbf{y}. \] This is a special property of the L2 norm; it doesn’t carry over to more general norms, including L1. Furthermore, we often treat \(\lambda\) itself as a parameter to be optimized, which breaks the linearity of the problem.

Thus, in the sample code that follows, we’ll use SciPy’s nonlinear optimizer.

Example

Recall that we’ve already got a list of 7 data points that we generated randomly in the last column of slides, with \(x\)-coordinates:

X: [0.417022 1.72032449 2.00011437 3.30233257 4.14675589 5.09233859 6.18626021]

and \(y\)-coordinates:

Y: [0.34556073 1.39676747 2.53881673 3.41919451 4.6852195 5.20445225 6.87811744]

We also saw that a polynomial of degree six overfit the data badly. Let’s apply the regularized strategy to that data to see the impact.

Example (cont)

Here are the coefficients of the polynomial that minimizes the regularized error:

def F(coeff, x):   # degree-six polynomial with the given coefficients
    return np.dot(coeff, [x**p for p in range(len(D[0]))])
def norm2(coeff):  # sum of squared errors plus an L2 penalty (implicitly lambda = 1)
    return sum((F(coeff, D[0]) - D[1])**2) + sum(coeff**2)
coeff = minimize(norm2, np.ones(len(D[0]))).x
HTML(f"Coefficients: {coeff}")
Coefficients: [ 0.15767196 0.21332694 0.26959746 0.21194146 -0.11663689 0.01896963 -0.00100179]

Of note, they are considerably smaller than the coefficients without regularization.
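Since the penalty in the code above is exactly an L2 penalty with \(\lambda = 1\), the same coefficients can also be obtained directly from the ridge linear system \((A^TA + \lambda I)\mathbf{a} = A^T\mathbf{y}\). Here’s a minimal sketch, where the design matrix A is rebuilt with the same powers that F uses:

lam = 1.0
A = np.array([D[0]**p for p in range(len(D[0]))]).transpose()   # same powers F uses
solve(A.transpose().dot(A) + lam*np.eye(A.shape[1]), A.transpose().dot(D[1]))
# This should agree closely with the minimize-based coefficients above.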

Example (finale)

The resulting fit looks much better, in this case:

plt.plot(D[0], D[1], 'ok')
x = np.linspace(D[0][0], D[0][-1], 500)
y = F(coeff, x)
plt.plot(x,y)

Regularization in multiple regression

We’ve examined regularization in polynomial regression here because it nicely illustrates the effect. Multiple linear regression is used much more commonly in machine learning, though, so it’s important to understand that the same effect happens in that context as well.

Indeed, as the number of inputs goes up, the fit to the training data improves. If the number of inputs reaches the number of observations, then you expect to be able to match the training data exactly.

There’s a balance between accuracy and generality that can be a bit tricky to find.

Variable selection

A large number of input variables can lead to several problems, including

  • Overfitting (as we just saw),
  • increased computation time, and
  • ill-conditioned equations (i.e., nearly singular matrices).

Thus, we have good reason to attempt to reduce the number of variables, where appropriate.

Regularization and variable selection

It turns out that the process of regularization can shed light on variable selection, and L1 regularization is particularly well suited to this task.

The common term LASSO to refer to L1 regularization is, in fact, an acronym standing for

Least Absolute Shrinkage and Selection Operator

In this column of slides, we’ll illustrate how this might work.

Reformulation

In order to understand how variable selection works, it helps to reformulate the regularized minimization problem as follows:

We minimize \[\|X\mathbf{a} - \mathbf{y}\|^2 \text{ subject to } \|\mathbf{a}\| < s,\] where \(s\) is a parameter related to \(\lambda\) from our original formulation.

This reformulation leads to a nice visualization illustrating why this might work.
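For the record (a sketch of the standard Lagrangian correspondence, not a proof), the two formulations are linked as follows: for each constraint level \(s\) there is a multiplier \(\lambda \ge 0\) such that \[ \min_{\|\mathbf a\| \le s} \|X\mathbf a - \mathbf y\|^2 \quad\text{and}\quad \min_{\mathbf a}\; \|X\mathbf a - \mathbf y\|^2 + \lambda\,\|\mathbf a\| \] have the same solution. Squaring the penalty term, as in the L2 formulation earlier, changes the exact relationship between \(s\) and \(\lambda\) but not the overall picture.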

An SSE function to regularize

OK, let’s generate a small data set with a Sum of Squared Errors function (SSE) to work with:

seed(1)
X = np.array([i + random()/4 for i in range(3)])
Y = np.array([i + random()/4 for i in range(3)])
def F(coeff, x): # First order polynomial
    return np.dot(coeff, [x**p for p in range(2)])
def sse(coeff): # Sum squared error
    return sum((F(coeff, X) - Y)**2)

SSE??

The sse function takes a list of two coefficients that define a first order polynomial. For each point in our data set, with \(x\)-coordinates X and \(y\)-coordinates Y, sse computes the squared error of that polynomial; it then returns the sum of those squared errors.

Here’s the sum of squared errors for the polynomial \(f(x) = 2x+1\), for example:

sse([1,2])
np.float64(15.54441908547873)

Minimum of the SSE

We can use the sse function in minimization commands. For example, it has a global minimum around
[x,y] = [-0.07, 1.02]:

xmin, ymin = minimize(sse, [1,1]).x
[xmin,ymin]
[np.float64(-0.07210089840495718), np.float64(1.0204894121311614)]

L2 Min

Adding an L2 penalty moves the minimum to about \((0.185, 0.729)\), where sse itself takes the value \(0.17855\) or so.

xmin2, ymin2 = minimize(
  lambda c: sse(c) + sum(c**2), [1,1]).x
value2 = sse([xmin2,ymin2])
print([xmin2,ymin2])
print(value2)
[np.float64(0.18508541043345267), np.float64(0.7292170433368331)]
0.1785539769184335

L1 Min

We can do the same thing with the L1 norm.

xmin1, ymin1 = minimize(
  lambda c: sse(c) + sum(np.abs(c)), [1,1]).x
value1 = sse([xmin1,ymin1])
print([xmin1,ymin1])
print(value1)
[np.float64(-4.963373725908049e-09), np.float64(0.8768752286598878)]
0.07278180946316129

Note that the \(x\)-coordinate of the minimum is effectively zero.

SSE contours

The contours of the sse function appear to be ellipses that collapse down to the minimum:

Code
SSE = np.vectorize(lambda x, y: sse([x, y]))
x = np.linspace(-1.5,1.5, 100)
y = np.linspace(-1, 2, 100)
X_grid, Y_grid = np.meshgrid(x, y)
Z = SSE(X_grid, Y_grid)

fig, ax = plt.subplots(figsize=(6,6))
contour = ax.contour(
  X_grid, Y_grid, Z, cmap='Blues',
  levels=[-1, value1, value2, 0.4, 0.8])
ax.plot([xmin], [ymin], 'ok')

ax.grid(True)
ax.set_aspect('equal')
plt.axhline(color='black')
plt.axvline(color='black')

serialized_fig = pickle.dumps(fig)
plt.show()

Contours and a ball in L2

Now, the L2 regularized minimum occurs at the point of tangency between a contour and a circle in the L2 norm. If the \(x\) and \(y\) inputs have a common scale, then it certainly seems that the \(y\)-coordinate is the more important of the two.

Code
fig = pickle.loads(serialized_fig)
ax = fig.axes[0]
r = np.sqrt(xmin2**2 + ymin2**2)
circle = Circle((0, 0), 
    radius=r, edgecolor='red', 
    facecolor=(1,0,0,0.2), linewidth=2
)
ax.add_patch(circle)
ax.plot([xmin2], [ymin2], 'ok')
plt.show()

Contours and a ball in L1

Here’s the corresponding picture in the L1 norm. We can see that the \(x\)-coordinate is effectively zero and (presumably) much less important than the \(y\)-coordinate. More importantly, we can see why this happens.

Code
fig = pickle.loads(serialized_fig)
ax = fig.axes[0]

polygon = Polygon([
  (ymin1,0), (0,ymin1), (-ymin1,0), (0,-ymin1)
], edgecolor='red', 
    facecolor=(1,0,0,0.2), linewidth=2)
ax.add_patch(polygon)
ax.plot([xmin1], [ymin1], 'ok')

plt.show()

Regularization in scikit-learn

Recall that our housing data had 80 variables, and it seems reasonable to reduce that number. Scikit-learn can automate the LASSO process and help us with that variable selection.
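As a small illustration of the selection effect (a sketch using the penguin data loaded earlier rather than the housing data, and assuming scikit-learn is available; the penalty alpha=100 is an arbitrary choice):

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Standardize the inputs so the L1 penalty treats them on a common scale.
inputs = StandardScaler().fit_transform(
    penguins[['flipper_length_mm', 'bill_length_mm']])
lasso = Lasso(alpha=100).fit(inputs, penguins.body_mass_g)
lasso.coef_   # with a large enough penalty, weaker coefficients are driven to exactly zero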

Cross-validation

Finally, let’s mention a bit about cross-validation.

Generally, you fit your model to the training data but you evaluate your model with testing data.

In the context of a Kaggle competition, they ultimately evaluate your model using the testing data that they provide. There’s no reason you can’t attempt to evaluate your model first, though.

Strategy

For the rest of this column, let’s refer to the two types of data as

  • labeled data and
  • unlabeled data.

The idea is to fit and evaluate your model using the labeled data. Ultimately, you want to use it on unlabeled data.

To fit and evaluate a model using labeled data, we’ll break it into two parts:

  • Training data and
  • Testing data.

We’ll fit the training data and evaluate using the (labeled) testing data.

Cross-validation

In \(k\)-fold cross-validation, we randomly partition the labeled data into \(k\) pieces. We then set aside \(1\) piece for testing and use the other \(k-1\) pieces for training. This is done \(k\) times using each of the \(k\) pieces for testing.

For \(5\)-fold cross-validation, we run the process five times training with 80% of the data and testing with 20% each time.
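Here’s what one round of that might look like by hand (a sketch assuming scikit-learn, applied to the penguin regression from earlier):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

inputs = penguins[['flipper_length_mm', 'bill_length_mm']].to_numpy()
response = penguins.body_mass_g.to_numpy()

# 5-fold CV: each pass trains on 80% of the labeled data and tests on the held-out 20%.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, test_idx in kfold.split(inputs):
    model = LinearRegression().fit(inputs[train_idx], response[train_idx])
    print(mean_squared_error(response[test_idx], model.predict(inputs[test_idx])))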

Cross-validation in SciKit-Learn

Functions like ElasticNetCV in scikit-learn automate this procedure. A few select hyperparameters (like the regularization strength or the L1 ratio) can be optimized using the cross-validation information. The final fit is then obtained by refitting the full data with those selected hyperparameters.

In addition, information on the cross-validation results is returned in the regression object.
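A rough sketch of that workflow (again with the penguin inputs; the l1_ratio grid here is an arbitrary choice):

from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

features = StandardScaler().fit_transform(
    penguins[['flipper_length_mm', 'bill_length_mm']])
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5)
enet.fit(features, penguins.body_mass_g)
print(enet.alpha_, enet.l1_ratio_)   # hyperparameters chosen by cross-validation
print(enet.mse_path_.shape)          # CV mean squared errors: (l1_ratios, alphas, folds)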

Detecting overfitting

You can detect overfitting in the context of linear regression quite concretely using cross-validation. Suppose we’ve got \(n\) observations and \(p\) predictors. If TRAIN_MSE and TEST_MSE denote the mean squared errors on the training and testing sets, then one can show that, in expectation, \[ \text{TRAIN_MSE} = \frac{n-p-1}{n+p+1} \times \text{TEST_MSE}. \] Thus, a TRAIN_MSE significantly smaller than TEST_MSE indicates overfitting.
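As a rough illustration of the comparison (a sketch assuming scikit-learn’s train_test_split; it illustrates the idea, not the expected-value formula itself):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

predictors = penguins[['flipper_length_mm', 'bill_length_mm']].to_numpy()
target = penguins.body_mass_g.to_numpy()

# Hold out 20% of the labeled data, fit on the rest, and compare the two MSEs.
X_train, X_test, y_train, y_test = train_test_split(
    predictors, target, test_size=0.2, random_state=1)
model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(train_mse, test_mse)   # a train MSE far below the test MSE suggests overfitting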