Last time, we talked about scatter plots of paired data sets and correlation. Today, we'll push further into regression.
To emphasize the fact that we’re working with paired data, let’s take a look at our book store price example:
Again, this is paired data; each point corresponds to a single book, which has two numeric values: its bkstr.com price and its amazon.com price.
We talked about the correlation \(R\) before. Now we’ll focus on the regression line \(A = 0.68\times B + 3.18\), which is supposedly the “best linear fit” for the data. Specifically, we’ll address
What makes the regression line better than some other line?
Answer: It minimizes the total squared error.
More specifically, if we have observations \[(x_1, y_1),(x_2, y_2),\ldots,(x_n, y_n)\] and we assume that our model is \(Y=aX+b\), then the regression line chooses \(a\) and \(b\) so that \[(y_1-(ax_1+b))^2 + (y_2-(ax_2+b))^2 + \cdots + (y_n-(ax_n+b))^2\] is as small as possible.
The terms \(y_i-(ax_i+b)\) are called the residuals; they measure how far off our regression model is at the data points.
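We can see this minimization in action with a little R code. Here's a minimal sketch using a small, made-up data set (not the book prices); the least squares line produced by lm always has a smaller total squared error than any other line we might try:

x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 8.1, 9.8)
fit = lm(y ~ x)               # the least squares line
sum(resid(fit)^2)             # total squared error of the regression line
sum((y - (2.1*x + 0.1))^2)    # a different line; its total squared error is larger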
Often, examining the residuals can be illuminating. Here are several scatter plots together with scatter plots of their residuals. The scatter plot on the left below looks slightly non-linear; the scatter plot of the residuals makes this much clearer.
It can also be illuminating to look at a histogram of the residuals. The histogram below shows the distribution of the residuals for our weight estimate example. Since it appears to be roughly normal, we can apply our knowledge of the normal distribution to quantify our confidence in estimates based on the fit.
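Both kinds of plots are easy to produce in R. Here's a minimal sketch using made-up data (the variable names and numbers are just for illustration):

x = runif(50, 0, 10)
y = x^2 + rnorm(50, sd = 5)          # deliberately non-linear relationship
fit = lm(y ~ x)
plot(x, resid(fit)); abline(h = 0)   # residual plot; the curved pattern stands out
y2 = 3*x + rnorm(50, sd = 2)         # a genuinely linear relationship, by contrast
hist(resid(lm(y2 ~ x)))              # these residuals should look roughly normal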
First and foremost, linear regression might be useful if you want to understand the relationship between two numerical variables. Furthermore, the data should satisfy a few more conditions: the relationship should actually appear to be linear, the residuals should be nearly normal, and the variability of the residuals should be roughly constant across the data.
It seems that correlation should be connected with slope - and it is, via the formula \[m = r\frac{s_y}{s_x}.\] Furthermore, the regression line always passes through the point \((\bar{x},\bar{y})\). These two facts make it relatively easy to compute a regression line for given data.
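Here's a quick sketch of those formulae in R, again with made-up data; the slope and intercept computed from \(r\), \(s_x\), \(s_y\), \(\bar{x}\), and \(\bar{y}\) agree with what lm reports:

x = c(1, 3, 4, 6, 8)
y = c(2, 5, 7, 9, 12)
m = cor(x, y) * sd(y) / sd(x)   # slope from the correlation and standard deviations
b = mean(y) - m * mean(x)       # the line passes through (x-bar, y-bar)
c(m, b)
coef(lm(y ~ x))                 # the same numbers from R's built-in fit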
Suppose that \(X\) is a data set with mean \(90\) and standard deviation \(5\); \(Y\) is a data set with mean \(74\) and standard deviation \(4\). Furthermore, \(X\) and \(Y\) have a tight correlation of \(r=0.85\). Find the equation of the regression line for predicting \(Y\) from \(X\), and use it to estimate the value of \(Y\) corresponding to \(x=85\).
Solution
We know the regression line has the form \(y=mx+b\). The slope is \[m = r\frac{s_y}{s_x} = 0.85 \times \frac{4}{5} = 0.68.\] We now know the regression line has the more specific form \(y=0.68x+b\). Once we know this, we can plug in the means of \(X\) and \(Y\) to get \(b\): \[74 = 0.68 \times 90 + b, \] so \[b = 74 - 0.68 \times 90 = 12.8.\] Thus, the final equation of the line is \[y = 0.68x +12.8.\]
To solve the second part, we simply plug \(x=85\) into our formula for the line to find the corresponding value of \(y\): \[y = 0.68 \times 85 + 12.8 = 70.6.\]
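If we wanted to, we could also let R do the arithmetic; this sketch just re-does the computation above:

m = 0.85 * 4/5    # slope: r * s_y / s_x
b = 74 - m * 90   # intercept, using the point (90, 74)
m * 85 + b        # predicted y when x = 85, namely 70.6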
Let’s discuss how we might interpret the following:
set.seed(1)
# Read the full CDC data set and restrict attention to the men.
cdc = read.csv("https://www.marksmath.org/data/cdc.csv")
men = subset(cdc, gender=='m')
# Take a random sample of 50 men and regress weight against height.
subset = men[sample(1:length(men$height),50),]
cdc_fit = lm(subset$weight~subset$height)
summary(cdc_fit)
##
## Call:
## lm(formula = subset$weight ~ subset$height)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -73.120 -23.716  -8.848  17.896  93.392
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -122.182    144.958  -0.843   0.4035
## subset$height    4.504      2.043   2.205   0.0323 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.3 on 48 degrees of freedom
## Multiple R-squared: 0.09198, Adjusted R-squared: 0.07307
## F-statistic: 4.862 on 1 and 48 DF, p-value: 0.03227
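One thing we can read off immediately is the fitted line itself. Assuming the model object cdc_fit from above, coef extracts the two coefficients:

coef(cdc_fit)   # intercept -122.182 and slope 4.504, i.e. weight is about 4.504*height - 122.182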