Orthogonal least squares and linear regression

We’ll discuss orthogonal projection at some length in class and it’s also covered in section 6.5 of our text. In this set of notes, we’ll present a few pictures and code to illustrate the ideas.

The code on this page uses the following Python libraries:

import pandas as pd # Data manipulation
import numpy as np  # Fundamental numerics
from scipy.linalg import solve    # Solve linear systems
import matplotlib.pyplot as plt   # Graphics

Linear regression and least squares

When it comes to data \(\{(x_i,y_i)\}\), the least squares problem is to find the line \(f(x)=ax+b\) that best fits that data. That is, we’d like to minimize \[\sum_i (ax_i + b - y_i)^2\] or, from the perspective of linear algebra, we’d like to minimize the squared norm \[ \left\|\left(\begin{matrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{matrix}\right) \left(\begin{matrix} a \\ b \end{matrix} \right) - \left(\begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix}\right)\right\|^2. \] More generally, we’d like to solve \(A\mathbf{x} = \mathbf{y}\) in this least squares sense. As we learn in class, this can be done by solving the normal equations \(A^T A \mathbf{x}=A^T\mathbf{y}\).
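Just to connect the formula to code: here’s a minimal sketch (the data and variable names are made up for illustration) showing that solving the normal equations with scipy’s solve agrees with NumPy’s built-in least squares routine np.linalg.lstsq.

xs = np.array([0.0, 1.0, 2.0, 3.0])          # made-up x values
ys = np.array([0.1, 0.9, 2.1, 2.9])          # made-up y values
A = np.column_stack([xs, np.ones_like(xs)])  # columns: x and 1
x_hat = solve(A.T @ A, A.T @ ys)             # solve the normal equations
x_lstsq, *_ = np.linalg.lstsq(A, ys, rcond=None)
print(np.allclose(x_hat, x_lstsq))           # should print True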

Basic example

Here’s a really small example that we saw in class:

x = [0,1,2]
y = [0,1,1]
plt.plot(x,y, 'ok')

The idea is to draw a line that’s the “best fit” to that data in the least squares sense. To do so, we first set up the matrix of \(x\)-values and ones and solve the normal equations:

X = np.array([x,[1,1,1]]).transpose()
x_hat = solve(X.T @ X, X.T @ y)
a, b = x_hat
print(f"a = {a:.6f}")
print(f"b = {b:.6f}")
a = 0.500000
b = 0.166667
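As a quick check (this step isn’t in the original notes), the residual \(X\hat{\mathbf{x}} - \mathbf{y}\) should be orthogonal to both columns of \(X\); that is exactly what the normal equations enforce.

residual = X @ x_hat - y
print(X.T @ residual)   # both entries should be numerically zero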

And here’s the line:

plt.plot(x,y, 'ok')
plt.plot([-0.2,2.2], [a*(-0.2)+b, a*2.2+b])

Here’s a plot in 3D space of

  • the target \(\mathbf{y}\),
  • the plane spanned by the vectors \([0\:1\:2]^{\mathsf T}\) and \([1\:1\:1]^{\mathsf T}\), and
  • the projection \(\hat{\mathbf{y}}\) of \(\mathbf{y}\) onto that plane.
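The original figure isn’t reproduced here, but a plot along those lines can be generated with matplotlib’s 3D tools. The following is just a sketch, not necessarily the code behind the original picture.

y_hat = X @ x_hat   # the projection of y onto the column space of X
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# The plane spanned by [0,1,2] and [1,1,1]: all points s*[0,1,2] + t*[1,1,1]
s, t = np.meshgrid(np.linspace(-0.5, 1, 10), np.linspace(-0.5, 1, 10))
ax.plot_surface(0*s + t, s + t, 2*s + t, alpha=0.3)
# The target y and its projection y_hat, drawn as arrows from the origin
ax.quiver(0, 0, 0, *y, color='black')
ax.quiver(0, 0, 0, *y_hat, color='blue')
# The residual connecting y to y_hat
ax.plot(*zip(y, y_hat), 'k--')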

Galileo’s ramp

Here’s an example illustrating the fact that “linear” refers to linearity in the unknown coefficients, not to the line we use to fit the data. In fact, we can fit any linear combination of fixed functions. If we use the functions \(1\), \(x\), and \(x^2\), we obtain a quadratic fit to the data.
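As an aside (not from the original notes), NumPy’s np.vander builds exactly this kind of design matrix: for a degree-two fit it produces the columns \(x^2\), \(x\), and \(1\), in decreasing powers.

x_data = np.array([0.0, 1.0, 2.0, 3.0])   # made-up sample inputs
print(np.vander(x_data, 3))               # columns: x**2, x, 1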

This example is based on Galileo’s early seventeenth-century experiments on projectile motion. In one such experiment, he let a ball roll down an inclined ramp, off the edge of a table, and measured the horizontal distance it traveled before hitting the floor. He did this for several release heights and thus obtained a relationship between the height at which he released the ball and the horizontal distance that the ball traveled.

I’ve got the actual data stored on my website! We can load and view it directly with Python like so:

df = pd.read_csv('https://marksmath.org/data/galileos_ramp.csv')
df
      H    D
0  1000  573
1   800  534
2   600  495
3   450  451
4   300  395
5   200  337
6   100  253

The unit of distance is the “punti”, according to the source.

We can also plot the data:

plt.plot(df.D,df.H, '.')

While it rather flips the roles of what we might call the explanatory and response variables, it does look like the release height is related to the distance traveled via a quadratic; that is, \[H = a D^2 + b D + c.\] In order to fit the data, we could solve the least squares problem \[ \left(\begin{matrix} d_1^2 & d_1 & 1 \\ d_2^2 & d_2 & 1 \\ \vdots & \vdots & \vdots \\ d_n^2 & d_n & 1 \end{matrix}\right) \left(\begin{matrix}a \\ b \\ c \end{matrix} \right) \approx \left(\begin{matrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{matrix}\right). \]

Of course, using the normal equations \(A^T A \mathbf{x} = A^T \mathbf{y}\), this boils down to solving the linear system

\[ \left(\begin{matrix} d_1^2 & d_2^2 & \cdots & d_n^2 \\ d_1 & d_2 & \cdots & d_n \\ 1 & 1 & \cdots & 1 \end{matrix}\right) \left(\begin{matrix} d_1^2 & d_1 & 1 \\ d_2^2 & d_2 & 1 \\ \vdots & \vdots & \vdots \\ d_n^2 & d_n & 1 \end{matrix}\right) \left(\begin{matrix}a \\ b \\ c \end{matrix} \right) = \left(\begin{matrix} d_1^2 & d_2^2 & \cdots & d_n^2 \\ d_1 & d_2 & \cdots & d_n \\ 1 & 1 & \cdots & 1 \end{matrix}\right) \left(\begin{matrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{matrix}\right). \]

Here’s how to set up the matrix containing the values of \(D\), that is, the matrix traditionally called \(A\) in the normal equations:

A = np.array([df.D**2, df.D, df.D**0]).transpose()
A
array([[328329,    573,      1],
       [285156,    534,      1],
       [245025,    495,      1],
       [203401,    451,      1],
       [156025,    395,      1],
       [113569,    337,      1],
       [ 64009,    253,      1]])

We now solve the normal equations:

a,b,c = solve(A.transpose().dot(A), A.transpose().dot(df.H))
a,b,c
(np.float64(0.00818073977157367),
 np.float64(-4.0048429739353715),
 np.float64(600.0537247468285))

Thus we expect \[H \approx 0.00818073977 D^2 - 4.00484297 D + 600.05.\] Let’s define and plot that function together with the data:

def f(x): return a*x**2 + b*x + c
xs = np.linspace(250,600)
ys = f(xs)
plt.plot(xs,ys)
plt.plot(df.D,df.H, '.')

Looks pretty good!
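As a sanity check (not part of the original notes), NumPy’s polyfit solves the same least squares problem directly and should return essentially the same coefficients, listed from the highest power down.

print(np.polyfit(df.D, df.H, 2))   # should be close to (a, b, c) above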

A larger example from the CDC

Often, this technique is used in the context of large data sets. Here’s some data from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS):

cdc_data = pd.read_csv("http://marksmath.org/data/cdc.csv")
print(len(cdc_data))
cdc_data.head(12)
20000
      genhlth  exerany  hlthplan  smoke100  height  weight  wtdesire  age  gender
0        good        0         1         0      70     175       175   77       m
1        good        0         1         1      64     125       115   33       f
2        good        1         1         1      60     105       105   49       f
3        good        1         1         0      66     132       124   42       f
4   very good        0         1         0      61     150       130   55       f
5   very good        1         1         0      64     114       114   55       f
6   very good        1         1         0      71     194       185   31       m
7   very good        0         1         0      67     170       160   45       m
8        good        0         1         1      65     150       130   27       f
9        good        1         1         0      70     180       170   44       m
10  excellent        1         1         1      69     186       175   46       m
11       fair        1         1         1      69     168       148   62       m

Let’s grab a sample of size 500 and plot it:

cdc_sample = cdc_data.sample(500, random_state=42)
plt.plot(cdc_sample.height, cdc_sample.weight, 'ok', alpha=0.3)

Finally, let’s find and plot the regression line with the data:

A = np.array([cdc_sample.height, np.ones(500)]).transpose()
a,b = solve(A.transpose().dot(A), A.transpose().dot(cdc_sample.weight))
plt.plot(cdc_sample.height, cdc_sample.weight, 'ok', alpha=0.3)
plt.plot([58,81],[a*58+b, a*81+b])
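As before, we can sanity check the result (this comparison is my addition) against a canned routine; scipy.stats.linregress fits the same line and reports its slope and intercept.

from scipy.stats import linregress
result = linregress(cdc_sample.height, cdc_sample.weight)
print(result.slope, result.intercept)   # should match a and b above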

Problems

Here are a couple of potential quiz-like problems over this material. Recall that we will also have a computer lab involving this stuff on Wednesday.

  1. Explain how solving \(A^{\mathsf T}A \mathbf{x} = A^{\mathsf T} \mathbf{y}\) yields a vector \(\hat{\mathbf{x}}\) for which the residual \(\mathbf{y} - A\hat{\mathbf{x}}\) is orthogonal to the column space of \(A\).

  2. Given the data \[\{(-1,-1), (0,1), (1,1), (3,2)\},\] write down the normal equations for the least squares line fit.