Last time, we explored scatter plots and correlation. Today, we’ll use a numerical tool called linear regression to quantify the patterns we see in a scatter plot. While correlation can help us see whether there’s a linear relationship, regression can help us postulate a formula for that relationship.

Our real world examples revisited

Hurricane prediction

This scatter plot relating error in hurricane prediction to the year the prediction was made uses the same data we used before. In addition to the scatter plot and the correlation coefficient \(R\), we see a formula for a line and a graph of that line superimposed on the data.

The correlation of about \(-0.83\) is a quantitative assessment of the relationship between the variables.

The formula \(E=-8.33Y+16870.37\) yields an estimate of the error \(E\) in terms of the year \(Y\).

Height vs weight

Here’s another look at our CDC data relating height and weight.

Again, the correlation of about \(0.47\) is a quantitative assessment of the relationship between the variables.

The formula \(W=5.38H-190.21\) yields an estimate of the weight \(W\) in terms of the height \(H\).

Galileo’s ramp

Even though the correlation is high for Galileo’s ramp, a quadratic fit is much better than a linear fit.

A little mathematics

There are a few mathematical details you should know to understand linear regression.

Equations of lines

The graph of an equation of the form \(y=ax+b\) is a line with slope \(a\) and \(y\)-intercept \(b\). Here’s a simple example.

We typically say that \(x\) is the independent variable and that \(y\) is the dependent variable. In stat speak, these are the explanatory and response variables, respectively.

Given an equation like this, it’s easy to compute the dependent variable in terms of the independent. If \(y=2x+1\) and \(x=3\), then \(y = 2\times3+1 = 7\).
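This kind of evaluation is easy to sketch in Python (the helper function `line` is just for illustration, not part of any library):

```python
def line(a, b, x):
    """Evaluate the line y = a*x + b at the point x."""
    return a * x + b

# y = 2x + 1 evaluated at x = 3
print(line(2, 1, 3))  # → 7
```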

Prediction

In the context of working with data, the regression line can be used to predict a value of the response variable at a value of the explanatory variable that does not appear in the data or that lies outside its range.

Example

The heights in our CDC data are all rounded to inches. What does the model predict should be the height of a man who is 5’10 1/2’’ tall?

Solution: The formula for the regression line is \(w = 5.38h - 190.21\). When the height \(h\) is \(70.5\), we get a weight of about \(w \approx 189\) pounds.
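As a quick check on the arithmetic, using the regression coefficients quoted above:

```python
# Regression line from the CDC height/weight fit: w = 5.38*h - 190.21
slope, intercept = 5.38, -190.21

h = 70.5  # 5'10 1/2" expressed in inches
w = slope * h + intercept
print(round(w))  # → 189
```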

Note that the formula is reasonably accurate within the range of the observed data but should not be trusted far outside it.

Question:

What does our hurricane forecast model predict for our error in 2020?
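For reference, here is the computation, plugging \(Y=2020\) into the formula \(E=-8.33Y+16870.37\) from above (the exact value depends on how the published coefficients were rounded):

```python
# Hurricane-error model: E = -8.33*Y + 16870.37
slope, intercept = -8.33, 16870.37

year = 2020
error = slope * year + intercept
print(round(error, 2))  # → 43.77
```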

Least squares

What makes the regression line better than some other line? It minimizes the total squared error.

More specifically, if we have observations \[(x_1, y_1),(x_2, y_2),\ldots,(x_n, y_n)\] and we assume that our model is \(Y=aX+b\), then the regression line chooses \(a\) and \(b\) so that \[(y_1-(ax_1+b))^2 + (y_2-(ax_2+b))^2 + \cdots + (y_n-(ax_n+b))^2\] is as small as possible.
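This minimization can be illustrated numerically; `np.polyfit` with degree 1 computes exactly this least-squares line, and perturbing its coefficients can only increase the total squared error (the data below is made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Least-squares slope a and intercept b
a, b = np.polyfit(x, y, deg=1)

def sse(a, b):
    """Total squared error of the line y = a*x + b on the data."""
    return np.sum((y - (a * x + b)) ** 2)

# Any other line has at least as large a total squared error
print(sse(a, b) <= sse(a + 0.1, b))  # → True
print(sse(a, b) <= sse(a, b + 0.5))  # → True
```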

Residuals

The terms \(y_i-(ax_i+b)\) are called the residuals; they measure how far off our regression model is at the data points.

Often, examining the residuals of a fit can be illuminating. Here are several scatter plots together with scatter plots of their residuals. The scatter plot on the left below looks slightly non-linear; the scatter plot of the residuals makes this much clearer.

It can also be illuminating to look at a histogram of the residuals. The histogram below shows the distribution of the residuals for our weight estimate example. Since it appears to be normal, we can apply our knowledge of the normal distribution to gain confidence in our estimates.
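Residuals are easy to compute once the line is fit. One useful fact: for a least-squares line, the residuals always sum to (numerically) zero. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 * x + 5 + rng.normal(0, 1, size=x.size)  # roughly linear data with noise

a, b = np.polyfit(x, y, deg=1)
residuals = y - (a * x + b)

# Least-squares residuals sum to zero (up to floating-point error)
print(abs(residuals.sum()) < 1e-8)  # → True
```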

When should we use linear regression?

We will learn quite a few techniques to analyze data this semester. It’s always a good idea to think carefully about which technique you want to use and whether it’s genuinely applicable.

Conditions for linear regression:

First and foremost, linear regression might be useful if you want to understand the relationship between two numerical variables. Furthermore, the data should satisfy a few more conditions:

  • Linearity:
    The data should look approximately linear. If there is a nonlinear trend, a more advanced regression method should be applied.
  • Nearly normal residuals:
    Generally the residuals must be nearly normal.
  • Constant variability:
    The variability of points around the least squares line remains roughly constant.
  • Independent observations:
    Be particularly cautious with time series data, which are sequential observations. Such data are rarely independent.

Connection with correlation

It seems that correlation should be connected with slope, and it is, via the formula \[m = r\frac{s_y}{s_x},\] where \(s_x\) and \(s_y\) are the standard deviations of the two variables. Furthermore, the regression line always passes through the point \((\bar{x},\bar{y})\). These formulae make it relatively easy to compute a regression line for given data.

Example

Suppose that \(X\) is a data set with mean \(90\) and standard deviation \(5\); \(Y\) is a data set with mean \(74\) and standard deviation \(4\). Furthermore, \(X\) and \(Y\) have a strong correlation of \(r=0.85\).

  • What is the regression line connecting \(Y\) to \(X\)?
  • What value of \(Y\) does the regression line predict if \(X=85\)?

Solution

We know the regression line has the form \(y=mx+b\). The slope is \[m = r\frac{s_y}{s_x} = 0.85 \times \frac{4}{5} = 0.68.\] We now know the regression line has the more specific form \(y=0.68x+b\). Once we know this, we can plug the means for \(X\) and \(Y\) to get \(b\): \[74 = 0.68 \times 90 + b, \] so \[b = 74 - 0.68 \times 90 = 12.8.\] Thus, the final equation of the line is \[y = 0.68x +12.8.\]

To solve the second part, we simply plug \(x=85\) into our formula for the line to find the corresponding value \(y = 0.68 \times 85 + 12.8 = 70.6\).
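The same computation can be checked in a few lines of Python:

```python
# Summary statistics given in the example
mean_x, sd_x = 90, 5
mean_y, sd_y = 74, 4
r = 0.85

m = r * sd_y / sd_x        # slope: m = r * s_y / s_x
b = mean_y - m * mean_x    # intercept: line passes through (mean_x, mean_y)
y_at_85 = m * 85 + b       # prediction at x = 85

print(round(m, 2), round(b, 2), round(y_at_85, 2))  # → 0.68 12.8 70.6
```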