Orthogonal least squares and linear regression

We’ll discuss orthogonal projection at some length in class and it’s also covered in section 6.5 of our text. In this set of notes, we’ll present a few pictures and code to illustrate the ideas.

The code on this page uses the following Python libraries:

import pandas as pd # Data manipulation
import numpy as np  # Fundamental numerics
from scipy.linalg import solve    # Solve linear systems
import matplotlib.pyplot as plt   # Graphics

Linear regression and least squares

When it comes to data \(\{(x_i,y_i)\}\), the least squares problem is to find the line \(f(x)=ax+b\) that best fits that data. That is, we’d like to minimize \[\sum_i (ax_i + b - y_i)^2\] or, from the perspective of linear algebra, we’d like to minimize the squared norm \[ \left\|\left(\begin{matrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{matrix}\right) \left(\begin{matrix} a \\ b \end{matrix} \right) - \left(\begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix}\right)\right\|^2. \] More generally, we’d like to solve \(A\mathbf{x} = \mathbf{y}\) in this least squares sense. As we learn in class, this can be done by solving the normal equations \(A^T A \mathbf{x}=A^T\mathbf{y}\).
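Just to connect the formula to code: here’s a minimal sketch (the data and variable names are made up for illustration) showing that solving the normal equations with scipy’s solve agrees with NumPy’s built-in least squares routine np.linalg.lstsq.

xs = np.array([0.0, 1.0, 2.0, 3.0])          # made-up x values
ys = np.array([0.1, 0.9, 2.1, 2.9])          # made-up y values
A = np.column_stack([xs, np.ones_like(xs)])  # columns: x and 1
x_hat = solve(A.T @ A, A.T @ ys)             # solve the normal equations
x_lstsq, *_ = np.linalg.lstsq(A, ys, rcond=None)
print(np.allclose(x_hat, x_lstsq))           # should print True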

Basic example

Here’s a really small example that we saw in class:

x = [0,1,2]
y = [0,1,1]
plt.plot(x,y, 'ok')

The idea is to draw a line that’s the “best fit” to that data in the least squares sense. To do so, we first set up the matrix of \(x\)-values and ones and solve the normal equations:

X = np.array([x,[1,1,1]]).transpose()
x_hat = solve(X.T @ X, X.T @ y)
a, b = x_hat
print(f"a = {a:.6f}")
print(f"b = {b:.6f}")
a = 0.500000
b = 0.166667
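As a quick check (this step isn’t in the original notes), the residual \(X\hat{\mathbf{x}} - \mathbf{y}\) should be orthogonal to both columns of \(X\); that is exactly what the normal equations enforce.

residual = X @ x_hat - y
print(X.T @ residual)   # both entries should be numerically zero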

And here’s the line:

plt.plot(x,y, 'ok')
plt.plot([-0.2,2.2], [a*(-0.2)+b, a*2.2+b])

Here’s a plot in 3D space of

  • the target \(\mathbf{y}\),
  • the plane spanned by the vectors \([0\:1\:2]^{\mathsf T}\) and \([1\:1\:1]^{\mathsf T}\), and
  • the projection \(\hat{\mathbf{y}}\) of \(\mathbf{y}\) onto that plane.
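The original figure isn’t reproduced here, but a plot along those lines can be generated with matplotlib’s 3D tools. The following is just a sketch, not necessarily the code behind the original picture.

y_hat = X @ x_hat   # the projection of y onto the column space of X
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# The plane spanned by [0,1,2] and [1,1,1]: all points s*[0,1,2] + t*[1,1,1]
s, t = np.meshgrid(np.linspace(-0.5, 1, 10), np.linspace(-0.5, 1, 10))
ax.plot_surface(0*s + t, s + t, 2*s + t, alpha=0.3)
# The target y and its projection y_hat, drawn as arrows from the origin
ax.quiver(0, 0, 0, *y, color='black')
ax.quiver(0, 0, 0, *y_hat, color='blue')
# The residual connecting y to y_hat
ax.plot(*zip(y, y_hat), 'k--')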

Galileo’s ramp

Here’s an example illustrating the fact that “linear” refers to linearity in the unknown coefficients, not to the line we use to fit the data. In fact, we can fit any linear combination of fixed functions. If we use the functions \(1\), \(x\), and \(x^2\), we obtain a quadratic fit to the data.
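As an aside (not from the original notes), NumPy’s np.vander builds exactly this kind of design matrix: for a degree-two fit it produces the columns \(x^2\), \(x\), and \(1\), in decreasing powers.

x_data = np.array([0.0, 1.0, 2.0, 3.0])   # made-up sample inputs
print(np.vander(x_data, 3))               # columns: x**2, x, 1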

This example is based on Galileo’s early seventeenth-century experiments on projectile motion. In one such experiment, he let a ball roll down an inclined ramp, off the edge of a table, and measured the horizontal distance it traveled before hitting the floor. He did this for several release heights and thus obtained a relationship between the height at which he released the ball and the horizontal distance that the ball traveled.

I’ve got the actual data stored on my website! We can load and view it directly with Python like so:

df = pd.read_csv('https://marksmath.org/data/galileos_ramp.csv')
df
      H    D
0  1000  573
1   800  534
2   600  495
3   450  451
4   300  395
5   200  337
6   100  253

The unit of distance is the “punti”, according to the source.

We can also plot the data:

plt.plot(df.D,df.H, '.')

While it rather flips the roles of what we might call the explanatory and response variables, it does look like the release height is related to the distance traveled via a quadratic; that is, \[H = a D^2 + b D + c.\] In order to fit the data, we could solve the least squares problem \[ \left(\begin{matrix} d_1^2 & d_1 & 1 \\ d_2^2 & d_2 & 1 \\ \vdots & \vdots & \vdots \\ d_n^2 & d_n & 1 \end{matrix}\right) \left(\begin{matrix}a \\ b \\ c \end{matrix} \right) \approx \left(\begin{matrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{matrix}\right). \]

Of course, using the normal equations \(A^T A \mathbf{x} = A^T \mathbf{y}\), this boils down to solving the linear system

\[ \left(\begin{matrix} d_1^2 & d_2^2 & \cdots & d_n^2 \\ d_1 & d_2 & \cdots & d_n \\ 1 & 1 & \cdots & 1 \end{matrix}\right) \left(\begin{matrix} d_1^2 & d_1 & 1 \\ d_2^2 & d_2 & 1 \\ \vdots & \vdots & \vdots \\ d_n^2 & d_n & 1 \end{matrix}\right) \left(\begin{matrix}a \\ b \\ c \end{matrix} \right) = \left(\begin{matrix} d_1^2 & d_2^2 & \cdots & d_n^2 \\ d_1 & d_2 & \cdots & d_n \\ 1 & 1 & \cdots & 1 \end{matrix}\right) \left(\begin{matrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{matrix}\right). \]

Here’s how to set up the matrix containing the values of \(D\), that is, the matrix traditionally called \(A\) in the normal equations:

A = np.array([df.D**2, df.D, df.D**0]).transpose()
A
array([[328329,    573,      1],
       [285156,    534,      1],
       [245025,    495,      1],
       [203401,    451,      1],
       [156025,    395,      1],
       [113569,    337,      1],
       [ 64009,    253,      1]])

We now solve the normal equations:

a,b,c = solve(A.transpose().dot(A), A.transpose().dot(df.H))
a,b,c
(np.float64(0.00818073977157367),
 np.float64(-4.0048429739353715),
 np.float64(600.0537247468285))

Thus we expect \[H \approx 0.00818073977 D^2 - 4.00484297 D + 600.05.\] Let’s define and plot that function together with the data:

def f(x): return a*x**2 + b*x + c
xs = np.linspace(250,600)
ys = f(xs)
plt.plot(xs,ys)
plt.plot(df.D,df.H, '.')

Looks pretty good!
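As a sanity check (not part of the original notes), NumPy’s polyfit solves the same least squares problem directly and should return essentially the same coefficients, listed from the highest power down.

print(np.polyfit(df.D, df.H, 2))   # should be close to (a, b, c) above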

A larger example from the CDC

Often, this technique is used in the context of large data sets. Here’s some data from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS):

cdc_data = pd.read_csv("http://marksmath.org/data/cdc.csv")
print(len(cdc_data))
cdc_data.head(12)
20000
      genhlth  exerany  hlthplan  smoke100  height  weight  wtdesire  age  gender
0        good        0         1         0      70     175       175   77       m
1        good        0         1         1      64     125       115   33       f
2        good        1         1         1      60     105       105   49       f
3        good        1         1         0      66     132       124   42       f
4   very good        0         1         0      61     150       130   55       f
5   very good        1         1         0      64     114       114   55       f
6   very good        1         1         0      71     194       185   31       m
7   very good        0         1         0      67     170       160   45       m
8        good        0         1         1      65     150       130   27       f
9        good        1         1         0      70     180       170   44       m
10  excellent        1         1         1      69     186       175   46       m
11       fair        1         1         1      69     168       148   62       m

Let’s grab a sample of size 500 and plot it:

cdc_sample = cdc_data.sample(500, random_state=42)
plt.plot(cdc_sample.height, cdc_sample.weight, 'ok', alpha=0.3)

Finally, let’s find and plot the regression line with the data:

A = np.array([cdc_sample.height, np.ones(500)]).transpose()
a,b = solve(A.transpose().dot(A), A.transpose().dot(cdc_sample.weight))
plt.plot(cdc_sample.height, cdc_sample.weight, 'ok', alpha=0.3)
plt.plot([58,81],[a*58+b, a*81+b])
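As before, we can sanity check the result (this comparison is my addition) against a canned routine; scipy.stats.linregress fits the same line and reports its slope and intercept.

from scipy.stats import linregress
result = linregress(cdc_sample.height, cdc_sample.weight)
print(result.slope, result.intercept)   # should match a and b above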

Problems

Here are a couple of potential quiz-like problems over this material. Recall that we will also have a computer lab involving this stuff on Wednesday.

  1. Explain how solving \(A^{\mathsf T}A \mathbf{x} = A^{\mathsf T} \mathbf{y}\) yields a vector \(\hat{\mathbf{x}}\) for which the residual \(\mathbf{y} - A\hat{\mathbf{x}}\) is orthogonal to the column space of \(A\).

  2. Given the data \[\{(-1,-1), (0,1), (1,1), (3,2)\},\] write down the normal equations for the least squares line fit.