And application to least squares
Wed, Feb 12, 2025
Today we’re going to discuss orthogonal projection and its application to the least squares problem. This will yield a highly efficient approach to linear regression.
\[ \newcommand{\vect}[1]{\mathbf{#1}} \]
Let’s suppose we’ve got an \(m\)-dimensional subspace \(U\) of \(\mathbb R^n\). Our objective is to find a function \(P:\mathbb R^n \to U\) that represents “orthogonal projection”, whatever that means…
A linear transformation \(P:\mathbb R^n \to \mathbb R^n\) is called a projection if \(P^2=P\). Intuitively, this means that once you apply \(P\) to a vector \(\vect{x}\), applying \(P\) again doesn’t move the result any further: \(P(P(\vect{x})) = P(\vect{x})\).
When we say that \(P\) is not just a projection but an orthogonal projection, we mean that \(P\) maps perpendicularly onto the subspace. Algebraically, \[(P(\vect{x})- \vect{x}) \cdot P(\vect{x}) = 0.\]
Note that we’ll mostly use the bold vector \(\vect{x}\) notation, rather than the vector hat notation today.
All this makes a fair bit more sense, once you see an illustration. And note that, in the picture below, the projection \(P(\vect{x})\) is the point on the plane that is closest to \(\vect{x}\).
Since two vectors determine a plane, we can visualize the projection onto a one-dimensional subspace in a two-dimensional picture:
Since \(P(\vect{x})\) lies in the subspace determined by \(\vect{b}\), we can write \[P(\vect{x}) = \lambda \vect{b}\] for some scalar \(\lambda\). The orthogonality condition then implies that \[ \langle P(\vect{x})-\vect{x}, \vect{b} \rangle = \langle \lambda\vect{b}-\vect{x}, \vect{b} \rangle = 0. \] We can solve that for \(\lambda\) to get \[ \lambda = \frac{\langle \vect{b}, \vect{x} \rangle}{\langle \vect{b}, \vect{b} \rangle} \text{ so that } P(\vect{x}) = \frac{\langle \vect{b}, \vect{x} \rangle}{\langle \vect{b}, \vect{b} \rangle} \vect{b}. \]
Suppose we wish to project \(\mathbf{x}\) onto \(\mathbf{b}\), where \[ \mathbf{b} = \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix} \quad \text{and} \quad \mathbf{x} = \begin{bmatrix} 3 \\ 4 \\ 5 \end{bmatrix}. \] Note that
\[ \langle \vect{b},\vect{x} \rangle = 3 - 4 + 10 = 9 \quad\text{and}\quad \langle \vect{b},\vect{b} \rangle = 1 + 1 + 4 = 6, \quad\text{so}\quad P(\vect{x}) = \frac{9}{6}[1,-1,2]^T = [3/2,-3/2,3]^T. \]
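If you’d like to verify this numerically, here’s a quick sketch in NumPy (the variable names below are just for illustration):

```python
import numpy as np

b = np.array([1.0, -1.0, 2.0])
x = np.array([3.0, 4.0, 5.0])

# P(x) = (<b, x> / <b, b>) * b
proj = (b @ x) / (b @ b) * b
print(proj)   # [ 1.5 -1.5  3. ]
```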
Projection is a linear operation, so we should be able to find a matrix \(B\) with the property that \(P(\vect{x}) = B\vect{x}\). To do so, we’ll need a matrix representation of the inner product, so let’s suppose that it’s just the regular dot product given by \[ \langle \vect{x},\vect{y} \rangle = \mathbf{x}^T\mathbf{y}. \] Then our formula becomes \[\begin{aligned} P(\mathbf{x}) = \frac{\langle \mathbf{b}, \mathbf{x} \rangle}{\langle \mathbf{b}, \mathbf{b} \rangle} \mathbf{b} = \mathbf{b}\frac{\mathbf{b}^T\mathbf{x}}{\mathbf{b}^T\mathbf{b}} = \frac{\mathbf{b}\mathbf{b}^T}{\mathbf{b}^T\mathbf{b}}\mathbf{x} \end{aligned}\] Thus, \[ B = \frac{\mathbf{b}\mathbf{b}^T}{\mathbf{b}^T\mathbf{b}}. \]
Continuing with our working example \[ \mathbf{b} = \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix} \quad \text{and} \quad \mathbf{x} = \begin{bmatrix} 3 \\ 4 \\ 5 \end{bmatrix}, \] note that \(\mathbf{b}^T\mathbf{b} = 6\) and \[ \mathbf{b}\mathbf{b}^T = \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix}\begin{bmatrix} 1 & -1 & 2 \end{bmatrix} =\begin{bmatrix} 1 & -1 & 2 \\ -1 & 1 & -2 \\ 2 & -2 & 4 \end{bmatrix}. \]
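Dividing by \(\mathbf{b}^T\mathbf{b} = 6\) gives \(B\), and multiplying \(B\) by \(\mathbf{x}\) reproduces the projection \([3/2,-3/2,3]^T\) computed earlier. Here’s a short NumPy sketch of the same computation (again, the names are just for illustration):

```python
import numpy as np

b = np.array([[1.0], [-1.0], [2.0]])   # column vector
x = np.array([[3.0], [4.0], [5.0]])

B = (b @ b.T) / (b.T @ b)      # the projection matrix b b^T / (b^T b)
print(B @ x)                   # [[ 1.5] [-1.5] [ 3. ]]
print(np.allclose(B @ B, B))   # True, since projections satisfy P^2 = P
```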
Here’s an image of the projection of the point/vector \(\vect{x}\) onto the line spanned by \(\vect{b}\):
We now turn to the question of projecting a point in \(\mathbb R^n\) onto an \(m\)-dimensional subspace \(U \subset \mathbb R^n\). Generally, the subspace \(U\) is expressed as the span of a collection of basis vectors; thus, we’ll assume that we have vectors \[\vect{b}_1,\vect{b}_2,\ldots,\vect{b}_m\] that span \(U\). Now, if \(\vect{x}\in\mathbb R^n\) and \(P:\mathbb R^n \to U\) is a projection, then we have \[ P(\vect{x}) = \lambda_1\vect{b}_1 + \lambda_2\vect{b}_2 + \cdots + \lambda_m\vect{b}_m. \]
Note the last line of the previous slide \[ P(\vect{x}) = \lambda_1\vect{b}_1 + \lambda_2\vect{b}_2 + \cdots + \lambda_m\vect{b}_m. \] expresses the fact that \(P(\vect{x})\) is a linear combination of the \(m\) basis vectors for the subspace \(U\). Thus, we could write \[P(\vect{x}) = B\vect{\lambda},\] where the columns of \(B\) are exactly the basis vectors of \(U\). Note that \(B\) is \(n\times m\) and that the orthogonality condition translates to \[B^T (\vect{x}-B\vect{\lambda}) = \vect{0}.\]
We’ve arrived at \(B^T (B\vect{\lambda}-\vect{x}) = \vect{0}\), where \(B\) is the \(n\times m\) matrix whose columns are the basis vectors \(\vect{b}_1,\ldots,\vect{b}_m\) of \(U\) and \(\vect{\lambda}\) is the column vector of coefficients \(\lambda_1,\ldots,\lambda_m\).
Of course, this looks a lot like our one-dimensional formula \(\langle \lambda\vect{b}-\vect{x}, \vect{b} \rangle = 0\), if we wrote that as \(\vect{b}^T (\lambda\vect{b}-\vect{x})=0\). In fact, the higher-dimensional formula is exactly a list of \(m\) of these one-dimensional formulae codified into a single matrix equation.
Finally, we can rewrite \(B^T (B\vect{\lambda}-\vect{x}) = \vect{0}\) as \[B^TB\vect{\lambda} = B^T\vect{x}.\] We can solve that equation for \(\lambda\) quite efficiently using row reduction and back-substitution; then \(B\vect{\lambda}\) is the projection of \(\vect{x}\) onto the subspace \(U\).
If we really want the projection matrix, we can invert \(B^TB\), solve for \(\vect{\lambda}\), and write \[B\vect{\lambda} = B(B^TB)^{-1}B^T\vect{x}.\] Thus, \(B(B^TB)^{-1}B^T\) is the projection matrix. While that’s of theoretical interest, it’s rarely done in practice, since explicitly inverting a matrix is much less efficient than solving the linear system directly.
Suppose we’re working in \(\mathbb R^3\) and we’d like to project the vector \(\vect{x}\) onto the two-dimensional subspace spanned by \(\vect{b}_1\) and \(\vect{b}_2\), where \[ \vect{x} = \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix}, \quad \vect{b}_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \text{ and } \vect{b}_2 = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}. \] We do so by solving the normal equations \[ B^T B \vect{\lambda} = B^T \vect{x} \] and then computing \(B\vect{\lambda}\).
Well, \[ B^T B = \begin{bmatrix} 1 & 2 & 3 \\ 2 & -1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 2 & -1 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 14 & 3 \\ 3 & 6 \end{bmatrix} \] and \[ B^T \vect{x} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & -1 & 1 \end{bmatrix} \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = \begin{bmatrix} 32 \\ 9 \end{bmatrix} \]
Thus, we’ve got to solve \[ \begin{bmatrix} 14 & 3 \\ 3 & 6 \end{bmatrix} \begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix} = \begin{bmatrix} 32 \\ 9 \end{bmatrix} \] and we find that \[ \lambda_1 = \frac{11}{5}, \quad \lambda_2 = \frac{2}{5}. \] Thus, finally, \[ P(\vect{x}) = B \vect{\lambda} = \begin{bmatrix} 1 & 2 \\ 2 & -1 \\ 3 & 1 \end{bmatrix} \begin{bmatrix} \frac{11}{5} \\ \frac{2}{5} \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \\ 7 \end{bmatrix} \]
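We can check this answer by handing the normal equations to a linear solver. Here’s a sketch using scipy.linalg.solve, the routine that appears again in the regression example below:

```python
import numpy as np
from scipy.linalg import solve

B = np.array([[1.0, 2.0], [2.0, -1.0], [3.0, 1.0]])
x = np.array([4.0, 5.0, 6.0])

lam = solve(B.T @ B, B.T @ x)   # solve B^T B lambda = B^T x
print(lam)       # [2.2 0.4], i.e. 11/5 and 2/5
print(B @ lam)   # [3. 4. 7.]
```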
Now, let’s recall the basic linear regression problem. We’re given data represented as a list of points:
\[ \{(x_i,y_i)\}_{i=1}^N = \{(x_1,y_1),(x_2,y_2),(x_3,y_3),\ldots,(x_N,y_N)\}. \]
We model that data with a function \(f(x) = ax+b\) and wish to minimize the error as a function of the parameters \(a\) and \(b\):
\[ E(a,b) = \sum_{i=1}^N \left((a\,x_i + b) - y_i\right)^2 \]
Note that the expression for the error \[ E(a,b) = \sum_{i=1}^N \left((a\,x_i + b) - y_i\right)^2 \] is exactly equivalent to \[ \left\|\begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{pmatrix} \begin{bmatrix} a \\ b \end{bmatrix} - \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}\right\|^2. \] Thus, to minimize the error, we should minimize that norm, which is equivalent to orthogonal projection.
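To see that equivalence concretely, here’s a small sketch with made-up data (none of these numbers come from the class data set):

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 2.0, 2.0, 4.0])
a, b = 0.9, 1.1   # arbitrary trial parameters

# Sum-of-squares form of the error
loop_error = sum((a * xi + b - yi) ** 2 for xi, yi in zip(xs, ys))

# Matrix-norm form of the same error
X = np.column_stack([xs, np.ones_like(xs)])
norm_error = np.linalg.norm(X @ np.array([a, b]) - ys) ** 2

print(np.isclose(loop_error, norm_error))   # True
```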
Suppose we wish to find the least squares line approximating the points below. Be sure to examine the code!
To solve this, we’ll set up the matrices \(X\) and \(Y\) together with the vector \(\vect{a}\) defined by \[ X = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix}, \quad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \quad \vect{a} = \begin{bmatrix} a_1 \\ a_0 \end{bmatrix}, \] writing the model now as \(f(x) = a_1x + a_0\) so that \(a_1\) is the slope and \(a_0\) the intercept. We’ll then solve the least squares problem \[ X\vect{a}=Y \text{ by solving } X^TX\vect{a} = X^TY. \]
Here’s the setup of the matrix \(X\) in code:
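A minimal sketch of one way to build such a matrix (the array name xs and the random stand-in data are just for illustration; this is not necessarily the code used in class):

```python
import numpy as np

# Stand-in for the class sample: twenty random x-values in [-10, 10].
xs = np.random.uniform(-10, 10, 20)

# Pair each x-value with a 1 to account for the intercept term a_0.
X = np.column_stack([xs, np.ones_like(xs)])
```

For the data used in class, \(X\) works out to the matrix below.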
[[-1.6595599059485195, 1.0],
[4.4064898688431615, 1.0],
[-9.997712503653101, 1.0],
[-3.9533485473632046, 1.0],
[-7.064882183657739, 1.0],
[-8.153228104624045, 1.0],
[-6.274795772446582, 1.0],
[-3.0887854591390447, 1.0],
[-2.0646505153866013, 1.0],
[0.7763346800671389, 1.0],
[-1.6161097119341044, 1.0],
[3.7043900079351904, 1.0],
[-5.910955005369651, 1.0],
[7.562348727818907, 1.0],
[-9.452248136041476, 1.0],
[3.409350203568044, 1.0],
[-1.6539039526574602, 1.0],
[1.1737965689150336, 1.0],
[-7.192261228095324, 1.0],
[-6.037970218302425, 1.0]]
Now, the scipy.linalg module has a solve function that’s built specifically for linear systems. If the Python variables A and b store the matrix and vector for the mathematical system \(A\vect{x}=\vect{b}\), then we simply call solve(A, b).
Of course, we need to apply this to our system \(X^TX\vect{a} = X^T\vect{Y}\):
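Here’s a sketch of that call, assuming \(X\) is the matrix above and ys holds the corresponding sample y-values (which aren’t listed in these notes):

```python
import numpy as np
from scipy.linalg import solve

# X is the N x 2 matrix above; ys is assumed to hold the sample y-values.
Y = np.asarray(ys)
a = solve(X.T @ X, X.T @ Y)   # solves X^T X a = X^T Y
print(tuple(a))               # with the class data, this yields the pair below
```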
(0.48795039641459975, 0.8608633772379041)
This indicates that our regression line is \(y=0.48795x+0.86086\).
Naturally, we want to see a plot to make sure this works!
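One way to produce such a plot with Matplotlib, continuing with the xs, ys, and coefficient vector a from the sketches above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Scatter the data and overlay the fitted line y = a[0]*x + a[1].
plt.scatter(xs, ys)
xx = np.linspace(min(xs), max(xs), 100)
plt.plot(xx, a[0] * xx + a[1], color='black')
plt.show()
```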
Comments
The condition that \(P^2 = P\) ensures that the subspace we’re projecting onto is fixed: \(P\) leaves vectors that already lie in the subspace unchanged.
The condition that \[(P(\vect{x})- \vect{x}) \cdot P(\vect{x}) = 0\] ensures that the projection is perpendicular.
We’re going to use these conditions to derive a concrete formula for the projection \(P\).