# Regression¶

Last time, we talked about scatter plots of paired data sets and correlation. Today, we'll push that farther into regression.

## Another example¶

To emphasize the fact that we're working with paired data, let's take a look at our book store price example:

In [1]:

Again, this is paired data; each point corresponds to a single book which has two numeric values - it's bkstr.com price and it's amazon.com price.

### Key questions¶

We talked about the correlation $R$ before. Now we'll focus on the regression line $A = 0.68\times B + 3.18$, which is supposedly the "best linear fit" for the data. Specifically, we'll address

• In what sense is this line the "best linear fit"?
• When should we use regression?
• What is the connection between regression and correlation?
• How can we use the line to make predictions?
• How can we do all this on the computer?
• What inferences can we draw from regression?

## Least squares¶

What makes the regression line better than some other line?

Answer: It minimizes the total squared error.

More specifically, if we have observations $$(x_1, y_1),(x_2, y_2),\ldots,(x_n, y_n)$$ and we assume that our model is $Y=aX+b$, then the regression line chooses $a$ and $b$ so that $$(y_1-(ax_1+b))^2 + (y_2-(ax_2+b))^2 + \cdots + (y_n-(ax_n+b))^2$$ is as small as possible.

The terms $y_i-(ax_i+b)$ are the called the residuals; they measure how far off our regression model is at the data points.

## Conditions for linear regression¶

First and foremost, linear regression might be useful if you want to understand the relationship between two numerical variables. Furthermore, the data should satisfy a few more conditions:

• Linearity:
The data should look approximately linear. If there is a nonlinear trend, a more advanced regression method should be applied.
• Nearly normal residuals:
Generally the residuals must be nearly normal.
• Constant variability:
The variability of points around the least squares line remains roughly constant.
• Independent observations:
Be particularly cautious with time series data, which are sequential observations. Such data are rarely independent.

## Connection with correlation¶

It seems that correlation should be connected with slope - and it is, via the formula $$m = r\frac{s_y}{s_x}.$$ Furthermore, the regression line should go through the point $(\bar{x},\bar{y})$. These formulae make it relatively easy to compute a regression line for given data.

### Example¶

Suppose that $X$ is a data set with mean $90$ and standard deviation $5$; $Y$ is a data set with mean $74$ and standard deviation $4$. Furthermore, $X$ and $Y$ have a tight correlation of $r=0.85$.

• What is the regression line connecting $Y$ to $X$?
• What value of $Y$ does the regression line predict if $X=85$?

Solution

We know the regression line has the form $y=mx+b$. The slope is $$m = r\frac{s_y}{s_x} = 0.85 \times \frac{4}{5} = 0.68.$$ We now know the regression line has the more specific form $y=0.68x+b$. Once we know this, we can plug the means for $X$ and $Y$ to get $b$: $$74 = 0.68 \times 90 + b,$$ so $$b = 74 - 0.68 \times 90 = 12.8.$$ Thus, the final equation of the line is $$y = 0.68x +12.8.$$

To solve the second part, we simply plug $x=85$ into our formula for the line to find the corresponding value of $y$.

## Using Python to run a regression test¶

Let's discuss how we might interpret the following:

In [1]:
import pandas as pd
from scipy.stats import linregress

sam = df.sample(50, random_state=1)
lr = linregress(sam.height, sam.weight)
lr

Out[1]:
LinregressResult(slope=7.348414179104478, intercept=-325.7588619402986, rvalue=0.5802991747548291, pvalue=9.999372235799102e-06, stderr=1.4885403893397993)

#### Questions¶

• What is the formula relating weight to height?
• How can we visualize that formula?
• What does the formula predict for the weight of a man who is 72 inches tall?
• What hypothesis might we check with this model?
• Write down the result of the hypothesis test.

Here are the answers to the first couple of questions at least:

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(sam.height, sam.weight, '.')
x1 = 60
x2 = 77
m = lr.slope
b = lr.intercept
y1 = m*x1+b
y2 = m*x2+b
plt.plot([x1,x2],[y1,y2]);


### Code for HW¶

Finally, here's a little code that might simplify the data entry portion of your HW:

In [116]:
value_string = """15.9	9.3
15.8	7
17.4	10.9
16.2	11.1
16.4	11.1
17.6	27.9
19	19
19.9	18
19.1	18.1
19.4	19.7
18.2	27.8
21.8	21.1
21.5	31.9
21	30.2
22.4	30.6
23	26
22	28.8
24.8	30.1
26.5	38.5
26.9	36.1"""

values = [s.split('\t') for s in value_string.split('\n')]
pp = [float(v[0]) for v in values]
mm = [float(v[1]) for v in values]
[pp,mm]

Out[116]:
[[15.9,
15.8,
17.4,
16.2,
16.4,
17.6,
19.0,
19.9,
19.1,
19.4,
18.2,
21.8,
21.5,
21.0,
22.4,
23.0,
22.0,
24.8,
26.5,
26.9],
[9.3,
7.0,
10.9,
11.1,
11.1,
27.9,
19.0,
18.0,
18.1,
19.7,
27.8,
21.1,
31.9,
30.2,
30.6,
26.0,
28.8,
30.1,
38.5,
36.1]]

And here's an online version, in case you need it later: https://sagecell.sagemath.org/?q=uskwtw