Last time, we took a look at an actual, real-world regression project and saw some practical issues that arise in that context. Today, we’re going to discuss the mathematics behind the solutions to those issues.
Imports
We’re going to display a lot of code from multiple libraries in this presentation. Let’s go ahead and import all the functionality we’ll need right up front:
import pandas as pd
import numpy as np
import pickle
from numpy.random import random, seed
from scipy.linalg import solve
from scipy.optimize import minimize
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon, Circle
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import HTML, display
Quick review of linear regression
Recall that the expression for the error in linear regression is \[
E(a_0,a_1) = \sum_{i=1}^N \left((a_1\,x_i + a_0) - y_i\right)^2,
\] which is exactly equivalent to \[
\left\|\begin{bmatrix}
x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_0 \end{bmatrix} -
\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_N
\end{bmatrix}\right\|^2.
\]
Minimization via orthogonal projection
Now, as \(a_1\) and \(a_0\) range over all real numbers, the expression on the left, \[
\begin{bmatrix}
x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_0 \end{bmatrix}
\] sweeps out the two-dimensional subspace of \(\mathbb R^N\) spanned by \[
\begin{bmatrix}x_1&x_2&\cdots&x_N\end{bmatrix}^T
\quad \text{and} \quad
\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^T.
\] To minimize the norm, we should measure the perpendicular distance from \[
\begin{bmatrix}y_1&y_2&\cdots&y_N\end{bmatrix}^T
\quad \text{to that subspace.}
\]
Minimization (cont)
Now write \[
\begin{aligned}
X &=
\begin{bmatrix}
x_1&x_2&\cdots&x_N \\
1&1&\cdots&1
\end{bmatrix}^T, \\
\mathbf{y} &= \begin{bmatrix}y_1&y_2&\cdots&y_N\end{bmatrix}^T,
\text{ and} \\
\mathbf{a} &= \begin{bmatrix} a_1 & a_0 \end{bmatrix}^T.
\end{aligned}
\] The orthogonality condition states that \(X\mathbf{a} - \mathbf{y}\) should be perpendicular to both columns of \(X\). This can be written as \[
X^T(X\mathbf{a} - \mathbf{y}) = \mathbf{0}
\quad \text{ or } \quad
X^TX\mathbf{a} = X^T \mathbf{y}.
\] This is exactly where the normal equations come from!
Linear reformulation
We’ll continue to express our algorithms in terms of linear algebra when possible. This will certainly be the case for multiple regression, polynomial regression, and L2 regularization. There are many advantages to this, including
Strong theorems on both existence and uniqueness of solutions,
Well developed numerical methods for finding solutions, and
The potential to exploit sparseness of systems.
L1 regularization will be the first point where we really need non-linear optimization.
Down the road, neural networks lead to non-linear models requiring non-linear optimization. Even then, though, the model can be realized as a sequence of steps, each of which consists of a linear function composed with a non-linear function.
Multiple regression
The housing prices project from the other day illustrates the need to account for multiple inputs in regression. In general, we might have \(N\) data points each with \(m\) inputs and a single response that look like \[
\{((x_{i1},x_{i2},\ldots,x_{im}),y_i)\}_{i=1}^N.
\] If we have a vector \(\mathbf{a} = \begin{bmatrix}
a_m & a_{m-1} & \cdots & a_1 & a_0
\end{bmatrix}^T\) of \(m+1\) coefficients, then our approximation and squared error might look like \[
E(\mathbf{a}) = \sum_{i=1}^N \left(\left(\sum_{j=1}^m a_j x_{ij} + a_0\right) - y_i\right)^2.
\]
Matrix formulation
We can reformulate the previous expression as a matrix problem; we just need a few more columns!
If we define \[
X =
\begin{bmatrix}
x_{1m} & \cdots & x_{11} & 1 \\
x_{2m} & \cdots & x_{21} & 1 \\
\vdots & & \vdots & \vdots \\
x_{Nm} & \cdots & x_{N1} & 1
\end{bmatrix},
\quad
\mathbf{a} = \begin{bmatrix} a_m & \cdots & a_1 & a_0 \end{bmatrix}^T,
\quad \text{and} \quad
\mathbf{y} = \begin{bmatrix} y_1 & y_2 & \cdots & y_N \end{bmatrix}^T,
\] then we can find the least squares solution using the normal equations: \[X^T X \mathbf{a} = X^T \mathbf{y}.\]
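As a quick sketch with made-up data (\(N = 5\) observations, \(m = 2\) inputs, and the columns of \(X\) ordered as \(x_{i1}, x_{i2}, 1\) for convenience), the computation looks like this:
rng = np.random.default_rng(0)
inputs = rng.random((5, 2))                   # hypothetical x_{i1}, x_{i2} values
y = 3 * inputs[:, 0] - 2 * inputs[:, 1] + 1   # a synthetic, exactly linear response
Xmat = np.column_stack([inputs, np.ones(5)])  # append the column of ones
a = solve(Xmat.T @ Xmat, Xmat.T @ y)          # normal equations
print(a)                                      # approximately [3, -2, 1]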
Another option to improve model accuracy is to use a different type of function to fit the data. Generally, though, you should have a good reason to believe that some other type of function fits your data better before using it.
In this column of slides, we’ll play with a data set from a 1995 article in the Journal of Statistics Education entitled “Teaching Statistics with Data of Historic Significance.” The article presents data that is supposedly taken directly from Galileo’s notebooks in his early studies on gravitational motion. Given the context, it probably makes sense to fit a quadratic to the data.
Note that this is still linear regression, because a quadratic is a linear combination of \(1\), \(x\), and \(x^2\).
We’re going to go through the process of fitting some data D to a polynomial of degree n, displaying the resulting coefficients, and plotting the graph several times, so let’s gather the steps in a function:
def fit_and_plot(D, n):
    # Extract the x- and y-coordinates from the data
    X = D[0, :]
    Y = D[1, :]
    # Design matrix with columns 1, x, x^2, ..., x^n
    A = np.array([X**k for k in range(n + 1)]).transpose()
    # Solve the normal equations A^T A coeff = A^T Y
    coeff = np.array(solve(A.transpose().dot(A), A.transpose().dot(Y)))
    display(HTML(f"Coefficients: {coeff}"))
    # Plot the fitted polynomial together with the data
    f = lambda x: np.dot(coeff, [x**p for p in range(n + 1)])
    x = np.linspace(np.min(X), np.max(X), 500)
    y = [f(xx) for xx in x]
    plt.plot(x, y)
    plt.plot(X, Y, 'ok')
Linear fit
Here’s how fit_and_plot works at first order:
fit_and_plot(D, 1)
Coefficients: [-713.26188807 2.77907611]
Quadratic fit
And at second order, which works great for this data:
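fit_and_plot(D, 2)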
As your model starts to grow, through either an increase in the number of inputs or in complexity, you run the risk of overfitting. This means that your model matches the peculiarities of the training data in a way that doesn’t extend to general data arising from the same source.
This is a problem that is tremendously important to understand.
In this column of slides, we’ll clearly see overfitting arise in a polynomial fit.
Some data
Here’s some randomly generated data that’s just a small, uniform random perturbation of purely linear data:
seed(1)
X = np.array([i + random() for i in range(7)])
Y = np.array([i + random() for i in range(7)])
D = np.array([X, Y])
Linear fit
The data should have a pretty good linear fit:
fit_and_plot(D,1)
Coefficients: [-0.13062239 1.11009494]
Higher order?
In general \(N\) data points can be fit exactly with a polynomial of degree \(N-1\). (I suppose this makes sense, since a polynomial of degree \(N-1\) has \(N\) coefficients to choose.)
So… our randomly generated data set has \(7\) points - perhaps we should try to fit it with a sixth degree polynomial??
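fit_and_plot(D, 6)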
Hmm… Even though the polynomial fits the data perfectly, it can vary wildly between the data points.
Moral
Don’t increase model complexity without a sound theoretical basis.
Don’t test your model on training data.
We’ll discuss the test/train strategy and cross-validation shortly.
Regularization
Regularization is a strategy to counter overfitting. It’s based on the observation that the coefficients of a polynomial fit tend to grow as the order of the polynomial grows. Thus, to avoid overfitting, we place a penalty on the size of the coefficients.
In functional terms, we might minimize \[
\|X\mathbf{a} - \mathbf{y}\|^2 + \lambda \|\mathbf{a}\|^2.
\]
Note that \(\|\cdot\|\) represents a norm. Common choices are the L2 norm, resulting in ridge regression, and the L1 norm, resulting in LASSO regression (where the penalty is usually written \(\lambda\|\mathbf{a}\|_1\), without the square).
In either case, \(\lambda\) is a parameter; the larger \(\lambda\), the larger the influence of regularization.
Rephrased as a matrix problem
In ridge regression for a given \(\lambda\), the regularized minimization can be achieved by solving the linear system \[
(X^TX + \lambda I)\mathbf{a} = X^T \mathbf{y}.
\] This is a special property of the L2 norm that doesn’t hold for more general norms, including L1. Furthermore, we often treat \(\lambda\) itself as a parameter to be optimized, so the problem can no longer be solved as a single linear system.
Thus, in the sample code that follows, we’ll use SciPy’s nonlinear optimizer.
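As a minimal sketch (with made-up data and a placeholder value of \(\lambda\)), an L1-regularized fit via scipy.optimize.minimize might look like this:
x = np.array([0.0, 1.0, 2.0, 3.0])  # hypothetical data
y = np.array([0.2, 1.1, 1.8, 3.1])
lam = 0.5                           # placeholder regularization strength
def l1_objective(a):
    # squared error of the line a[1]*x + a[0], plus an L1 penalty on the coefficients
    return np.sum((a[1] * x + a[0] - y)**2) + lam * np.sum(np.abs(a))
# Nelder-Mead is derivative-free, which suits the non-smooth L1 term
result = minimize(l1_objective, np.zeros(2), method="Nelder-Mead")
print(result.x)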
Example
Recall that we’ve already got a list of 7 data points that we generated randomly in the last column of slides with \(x\)-coordinates:
We’ve examined regularization in polynomial regression here because it nicely illustrates the effect. Multiple linear regression is used much more commonly in machine learning, though, so it’s important to understand that the same effect happens in that context as well.
Indeed, as the number of inputs goes up, accuracy on the training data improves. If the number of inputs reaches the number of observations, then you can expect to match the training data exactly.
There’s a balance between accuracy and generality that can be a bit tricky to find.
Variable selection
A large number of input variables can lead to several problems, including
Overfitting (as we just saw),
increased computation time, and
ill-conditioned equations (e.g., nearly singular matrices).
Thus, we have good reason to attempt to reduce the number of variables, where appropriate.
Regularization and variable selection
It turns out that the process of regularization can shed light on variable selection, and L1 regularization is particularly well suited to this task.
The common term LASSO to refer to L1 regularization is, in fact, an acronym standing for
Least Absolute Shrinkage and Selection Operator
In this column of slides, we’ll illustrate how this might work.
Reformulation
In order to understand how variable selection works, it helps to reformulate the regularized minimization problem as follows:
We minimize \[\|X\mathbf{a} - \mathbf{y}\|^2 \text{ subject to } \|\mathbf{a}\| \le s,\] where \(s\) is a parameter related to \(\lambda\) from our original formulation.
This reformulation leads to a nice visualization illustrating why this might work.
An SSE function to regularize
OK, let’s generate a small data set with a Sum of Squared Errors function (SSE) to work with:
seed(1)
X = np.array([i + random()/4 for i in range(3)])
Y = np.array([i + random()/4 for i in range(3)])

def F(coeff, x):
    # First order polynomial
    return np.dot(coeff, [x**p for p in range(2)])

def sse(coeff):
    # Sum squared error
    return sum((F(coeff, X) - Y)**2)
SSE??
The sse function takes a list of two coefficients defining a first order polynomial. For each point in our data set (with \(x\)-coordinates X and \(y\)-coordinates Y), it computes the squared error of that polynomial and then returns the sum of those squared errors.
Here’s the sum of squared errors for the polynomial \(f(x) = 2x+1\), for example:
sse([1,2])
np.float64(15.54441908547873)
Minimum of the SSE
We can use the sse function in minimization routines. For example, it has a global minimum around \([x, y] = [-0.07, 1.02]\), where \(x\) denotes the constant coefficient and \(y\) the linear coefficient.
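One way to check this is with the minimize function we imported up front, applied to the sse defined above:
res = minimize(sse, [0, 0])
print(res.x)  # approximately [-0.07, 1.02]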
Now, the L2 regularized minimum occurs at the point of tangency between a contour and a circle in the L2 norm. If the \(x\) and \(y\) inputs have a common scale, then it certainly seems that the \(y\)-coordinate is the more important of the two.
Here’s the corresponding picture in the L1 norm. We can see that the \(x\)-coordinate is effectively zero and (presumably) much less important than the \(y\)-coordinate. More importantly, we can see why this happens.
Recall that our housing data had 80 variables, and it seems reasonable to reduce that number. Scikit-learn can automate the LASSO process and help us with that variable selection.
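Here’s a minimal sketch of how that might look; X_house and y_house are hypothetical stand-ins for a DataFrame of housing features and the corresponding sale prices:
from sklearn.linear_model import LassoCV
# X_house, y_house: hypothetical housing features and prices
lasso = LassoCV(cv=5).fit(X_house, y_house)
# keep only the variables whose LASSO coefficients are nonzero
selected = [name for name, c in zip(X_house.columns, lasso.coef_) if c != 0]
print(selected)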
Cross-validation
Finally, let’s mention a bit about cross-validation.
Generally, you fit your model to the training data but you evaluate your model with testing data.
In the context of a Kaggle competition, they ultimately evaluate your model using the testing data that they provide. There’s no reason you can’t attempt to evaluate your model yourself first, though.
Strategy
For the rest of this column, let’s refer to the two types of data as
labeled data and
unlabeled data.
The idea is to fit and evaluate your model using the labeled data. Ultimately, you want to use it on unlabeled data.
To fit and evaluate a model using labeled data, we’ll break it into two parts:
Training data and
Testing data.
We’ll fit the training data and evaluate using the (labeled) testing data.
Cross-validation
In \(k\)-fold cross-validation, we randomly partition the labeled data into \(k\) pieces. We then set aside \(1\) piece for testing and use the other \(k-1\) pieces for training. This is done \(k\) times using each of the \(k\) pieces for testing.
For \(5\)-fold cross-validation, we run the process five times training with 80% of the data and testing with 20% each time.
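Here’s a sketch of the partitioning itself, using a hypothetical set of 20 observation indices and \(k = 5\):
indices = np.random.permutation(20)  # shuffle 20 hypothetical observation indices
folds = np.array_split(indices, 5)   # five pieces of (roughly) equal size
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit the model on train_idx and evaluate it on test_idx here
    print(i, len(train_idx), len(test_idx))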
Cross-validation in SciKit-Learn
Functions like ElasticNetCV in scikit-learn automate this procedure. A few hyperparameters (like the regularization strength or the L1 ratio) can be chosen using the cross-validation information. The final fit is then obtained by refitting the full labeled data with those chosen hyperparameters.
In addition, information on the cross-validation results is returned in the regression object.
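A rough sketch of that workflow, again with hypothetical X_data and y_data:
from sklearn.linear_model import ElasticNetCV
# X_data, y_data: hypothetical labeled data
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X_data, y_data)
print(model.alpha_, model.l1_ratio_)  # hyperparameters chosen by cross-validation
print(model.mse_path_)                # cross-validation MSE for each candidate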
Detecting overfitting
You can detect overfitting in the context of linear regression quite concretely using cross-validation. Suppose we’ve got \(n\) observations and \(p\) predictors. If TRAIN_MSE and TEST_MSE denote the mean squared errors of the training and test sets, then one can show that, in expectation, \[
\text{TRAIN\_MSE} = \frac{n-p-1}{n+p+1} \times \text{TEST\_MSE}.
\] Thus, a TRAIN_MSE significantly smaller than this relationship predicts indicates overfitting.
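For example, a comparison along these lines might look like the following sketch, with hypothetical X_data and y_data:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# X_data, y_data: hypothetical labeled data
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(train_mse, test_mse)  # a TRAIN_MSE far below TEST_MSE suggests overfitting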