Regression

Last time, we talked about scatter plots of paired data sets and correlation. Today, we'll push that farther into regression.

Another example

To emphasize the fact that we're working with paired data, let's take a look at our book store price example:

In [1]:

Again, this is paired data; each point corresponds to a single book which has two numeric values - it's bkstr.com price and it's amazon.com price.

Key questions

We talked about the correlation $R$ before. Now we'll focus on the regression line $A = 0.68\times B + 3.18$, which is supposedly the "best linear fit" for the data. Specifically, we'll address

  • In what sense is this line the "best linear fit"?
  • When should we use regression?
  • What is the connection between regression and correlation?
  • How can we use the line to make predictions?
  • How can we do all this on the computer?
  • What inferences can we draw from regression?

Least squares

What makes the regression line better than some other line?

Answer: It minimizes the total squared error.

More specifically, if we have observations $$(x_1, y_1),(x_2, y_2),\ldots,(x_n, y_n)$$ and we assume that our model is $Y=aX+b$, then the regression line chooses $a$ and $b$ so that $$(y_1-(ax_1+b))^2 + (y_2-(ax_2+b))^2 + \cdots + (y_n-(ax_n+b))^2$$ is as small as possible.

The terms $y_i-(ax_i+b)$ are the called the residuals; they measure how far off our regression model is at the data points.

Conditions for linear regression

First and foremost, linear regression might be useful if you want to understand the relationship between two numerical variables. Furthermore, the data should satisfy a few more conditions:

  • Linearity:
    The data should look approximately linear. If there is a nonlinear trend, a more advanced regression method should be applied.
  • Nearly normal residuals:
    Generally the residuals must be nearly normal.
  • Constant variability:
    The variability of points around the least squares line remains roughly constant.
  • Independent observations:
    Be particularly cautious with time series data, which are sequential observations. Such data are rarely independent.

Connection with correlation

It seems that correlation should be connected with slope - and it is, via the formula $$m = r\frac{s_y}{s_x}.$$ Furthermore, the regression line should go through the point $(\bar{x},\bar{y})$. These formulae make it relatively easy to compute a regression line for given data.

Example

Suppose that $X$ is a data set with mean $90$ and standard deviation $5$; $Y$ is a data set with mean $74$ and standard deviation $4$. Furthermore, $X$ and $Y$ have a tight correlation of $r=0.85$.

  • What is the regression line connecting $Y$ to $X$?
  • What value of $Y$ does the regression line predict if $X=85$?

Solution

We know the regression line has the form $y=mx+b$. The slope is $$m = r\frac{s_y}{s_x} = 0.85 \times \frac{4}{5} = 0.68.$$ We now know the regression line has the more specific form $y=0.68x+b$. Once we know this, we can plug the means for $X$ and $Y$ to get $b$: $$74 = 0.68 \times 90 + b, $$ so $$b = 74 - 0.68 \times 90 = 12.8.$$ Thus, the final equation of the line is $$y = 0.68x +12.8.$$

To solve the second part, we simply plug $x=85$ into our formula for the line to find the corresponding value of $y$.

Using Python to run a regression test

Let's discuss how we might interpret the following:

In [1]:
import pandas as pd
from scipy.stats import linregress

df = pd.read_csv("https://www.marksmath.org/data/cdc.csv")
sam = df.sample(50, random_state=1)
lr = linregress(sam.height, sam.weight)
lr
Out[1]:
LinregressResult(slope=7.348414179104478, intercept=-325.7588619402986, rvalue=0.5802991747548291, pvalue=9.999372235799102e-06, stderr=1.4885403893397993)

Questions

  • What is the formula relating weight to height?
  • How can we visualize that formula?
  • What does the formula predict for the weight of a man who is 72 inches tall?
  • What hypothesis might we check with this model?
  • Write down the result of the hypothesis test.

Here are the answers to the first couple of questions at least:

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(sam.height, sam.weight, '.')
x1 = 60
x2 = 77
m = lr.slope
b = lr.intercept
y1 = m*x1+b
y2 = m*x2+b
plt.plot([x1,x2],[y1,y2]);

Code for HW

Finally, here's a little code that might simplify the data entry portion of your HW:

In [116]:
value_string = """15.9	9.3
15.8	7
17.4	10.9
16.2	11.1
16.4	11.1
17.6	27.9
19	19
19.9	18
19.1	18.1
19.4	19.7
18.2	27.8
21.8	21.1
21.5	31.9
21	30.2
22.4	30.6
23	26
22	28.8
24.8	30.1
26.5	38.5
26.9	36.1"""

values = [s.split('\t') for s in value_string.split('\n')]
pp = [float(v[0]) for v in values]
mm = [float(v[1]) for v in values]
[pp,mm]
Out[116]:
[[15.9,
  15.8,
  17.4,
  16.2,
  16.4,
  17.6,
  19.0,
  19.9,
  19.1,
  19.4,
  18.2,
  21.8,
  21.5,
  21.0,
  22.4,
  23.0,
  22.0,
  24.8,
  26.5,
  26.9],
 [9.3,
  7.0,
  10.9,
  11.1,
  11.1,
  27.9,
  19.0,
  18.0,
  18.1,
  19.7,
  27.8,
  21.1,
  31.9,
  30.2,
  30.6,
  26.0,
  28.8,
  30.1,
  38.5,
  36.1]]

And here's an online version, in case you need it later: https://sagecell.sagemath.org/?q=uskwtw