Overview

Regression is a technique for determining if there is a relationship among variables and, if so, estimating what that relationship might be.

In the simplest case, we might hope that the relationship is linear; thus, a response variable \(Y\) might be related to an explanatory variable \(X\) via \[Y = a \times X + b.\] Such a model is called a linear model.

In statistics, of course, \(X\) and \(Y\) are random variables and we don’t expect the model to match the data exactly; rather, we ask how confident we are in the model given the errors that we see.

An introductory example

Here’s an example lifted straight from our textbook, which is based on a 1995 paper entitled “Morphological variation among columns of the mountain brushtail possum” that appeared in the Australian Journal of Zoology.

The paper provides data on several morphological features of possums from down under, such as head length, total length, and tail length. This data was measured from 104 possums. I’ve stored it on my website for easy access:

df = read.csv('https://www.marksmath.org/data/possum.txt', sep='\t')
df$totalL = 10*df$totalL # Express in mm, like the others
head(df)
##   site pop sex age headL skullW totalL tailL
## 1    1 Vic   m   8  94.1   60.4    890  36.0
## 2    1 Vic   f   6  92.5   57.6    915  36.5
## 3    1 Vic   f   6  94.0   60.0    955  39.0
## 4    1 Vic   f   6  93.2   57.1    920  38.0
## 5    1 Vic   f   2  91.5   56.3    855  36.0
## 6    1 Vic   f   1  93.1   54.8    905  35.5

Let’s generate a scatter plot of head length (in mm) vs total length (now also in mm, after the conversion above).

plot(df$totalL, df$headL)

Not surprisingly, the head length generally increases with overall length, though there is certainly some variation.

We might try to fit the data with a linear model using R’s lm command:

fit = lm(headL ~ totalL, data=df)
fit
## 
## Call:
## lm(formula = headL ~ totalL, data = df)
## 
## Coefficients:
## (Intercept)       totalL  
##    42.70979      0.05729

The output above indicates that the line \(y=0.05729\times x + 42.7098\) might be a semi, kinda, almost OK fit to the data. R makes it easy to add this line to the plot.

plot(df$totalL, df$headL)
abline(fit)

While it can’t pass through all the data points, that line is the best possible line in a quantitative sense known as “least squares”.

There are a number of statistical measures of whether the relationship is close to linear or not. Perhaps the simplest is called “correlation”.

cor(df$totalL, df$headL)
## [1] 0.6910937

Illustrations

Here are a few scatter plots of randomly generated data to illustrate the ideas; a sketch of code for generating data like this appears after the list of pictures.

A perfect linear relationship

A close to linear relationship

A close to, but negative, linear relationship

A nonlinear relationship

No relationship
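The exact code behind these pictures isn’t included here, but here’s a minimal sketch of how data like this might be generated and plotted in R:

# A sketch of randomly generated data illustrating each kind of relationship
set.seed(1)
x = runif(50, 0, 10)
par(mfrow = c(2, 3))
plot(x, 2*x + 1, main = 'Perfect linear')
plot(x, 2*x + 1 + rnorm(50), main = 'Close to linear')
plot(x, -2*x + 1 + rnorm(50), main = 'Negative, close to linear')
plot(x, (x - 5)^2 + rnorm(50), main = 'Nonlinear')
plot(x, rnorm(50), main = 'No relationship')
par(mfrow = c(1, 1))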

Conditions for linear regression:

Linearity: the data should show a roughly linear trend.
Nearly normal residuals: the residuals should be approximately normally distributed.
Constant variability: the variability of the points about the line should be roughly constant.
Independence: the observations should be independent of one another.

Correlation

Correlation always takes values between \(-1\) and \(1\) and describes the strength of the linear relationship between two variables.

We can compute the correlation for observations \((x_1, y_1)\), \((x_2, y_2)\), …, \((x_n, y_n)\) using the formula \[ R = \frac{1}{n-1}\sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y} \] where \(\bar{x}\), \(\bar{y}\), \(s_x\), and \(s_y\) are the sample means and standard deviations for each variable.
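R’s cor command computes this for us, but it’s not hard to apply the formula directly. Here’s a quick check using the possum data from above (this assumes df still refers to that data frame):

# Compute the correlation from the formula and compare with R's built-in cor
x = df$totalL
y = df$headL
n = length(x)
sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (n - 1)
cor(x, y)  # should agree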

Residuals

One way to think about observed data is as: \[\text{Data} = \text{Model} + \text{Residual}.\] Thus, the residual is the difference between your observation and what you expect to see.

Sometimes, plotting the residuals themselves can be illuminating.
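For example, here’s a residual plot for the possum fit from above; if a line is a reasonable model, the residuals should scatter about zero with no obvious pattern.

# Residuals of the possum fit, plotted against total length
plot(df$totalL, fit$residuals)
abline(h = 0)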

The least squares fit

If we have observations \[(x_1, y_1),(x_2, y_2),\ldots,(x_n, y_n)\] and we assume that our model is \(Y=aX+b\), then we can attempt to find \(a\) and \(b\) so that \[(y_1-(ax_1+b))^2 + (y_2-(ax_2+b))^2 + \cdots + (y_n-(ax_n+b))^2\] is as small as possible.

Put another way, we can minimize the sum of the squares of the residuals.
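In fact, the minimizing values can be written down in closed form: the slope is \(a = R\,s_y/s_x\) and the intercept is \(b = \bar{y} - a\bar{x}\). Here’s a quick check against lm using the possum data:

# Closed-form least squares coefficients for the possum data
x = df$totalL
y = df$headL
a = cor(x, y) * sd(y) / sd(x)  # slope
b = mean(y) - a * mean(x)      # intercept
c(b, a)
coef(lm(headL ~ totalL, data = df))  # should agree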

Summary statistics for a fit

Recall the possum data frame df defined above. Let’s take a closer look at the fit defined by the linear model.

fit = lm(headL ~ totalL, data=df)
summary(fit)
## 
## Call:
## lm(formula = headL ~ totalL, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1877 -1.5340 -0.3345  1.2788  7.3968 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.709793   5.172814   8.257 5.66e-13 ***
## totalL       0.057290   0.005933   9.657 4.68e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.595 on 102 degrees of freedom
## Multiple R-squared:  0.4776, Adjusted R-squared:  0.4725 
## F-statistic: 93.26 on 1 and 102 DF,  p-value: 4.681e-16

Well, there’s a lot going on there. Here’s a little rundown.

Residuals

The summary of the fit has several pieces of information on the residuals. There are 104 of these, one for each possum in our data set. The Residuals line in the output gives us their quantiles: the minimum, first quartile, median, third quartile, and maximum.

We might actually be interested in a QQ-plot of the residuals because they’re supposed to be normal for regression to work. We can access the residuals via fit$residuals.

qqpic = qqnorm(fit$residuals)

Well, that looks pretty linear. In fact, we can compute a correlation associated with that picture.

cor(qqpic$x,qqpic$y)
## [1] 0.9876536

Outstanding!

Note that the summary above also reports a residual standard error of 2.595, which is close to (though not exactly the same as) the standard deviation of the residuals.

sd(fit$residuals)
## [1] 2.582693
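The small discrepancy arises because the residual standard error divides the sum of squared residuals by the degrees of freedom \(n-2 = 102\), while sd divides by \(n-1\):

# Residual standard error computed directly: sqrt(SSE / (n - 2))
sqrt(sum(fit$residuals^2) / (length(fit$residuals) - 2))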

Coefficients

The coefficients are simply the values of \(a\) and \(b\) in the least squares fit, \(Y=aX+b\).

They, however, are random variables themselves; if we performed the same process with a different sample, we’d expect to get a similar but slightly different result.

Thus, there is a standard error associated with each coefficient.
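R stores the estimates and their standard errors in the fit; we can pull them out (and turn the standard errors into confidence intervals) like so:

# The coefficient table: estimates, standard errors, t values, and p values
summary(fit)$coefficients
# 95% confidence intervals for the coefficients
confint(fit)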

We can use the coefficients to make predictions. The coefficients indicate that the head length \(h\) (in mm) should be related to the total length \(t\) (also in mm) via \[h(t) = 0.05729\,t + 42.70979.\] For example, the head length of a possum of total length 1000 mm (that is, 100 cm) could be expected to be about \[0.05729\times 1000 + 42.70979 \approx 100 \text{ mm}.\]
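R’s predict command performs this computation directly from the fit:

# Predicted head length (in mm) for a possum with total length 1000 mm
predict(fit, newdata = data.frame(totalL = 1000))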

Another example

Here’s another, more condensed example that’s also taken from our text. The data is a list of 50 freshmen from Elmhurst College with information concerning family income and financial aid received. We might expect a relationship. As usual, we can grab the data off of our website:

df = read.csv('https://www.marksmath.org/data/elmhurst.txt', sep = '\t')
head(df)
##   family_income gift_aid price_paid
## 1        92.922    21.72      14.28
## 2         0.250    27.47       8.53
## 3        53.092    27.75      14.25
## 4        50.200    27.22       8.78
## 5       137.613    18.00      24.00
## 6        47.957    18.52      23.48

Here’s a summary of a fit of the data:

fit = lm(gift_aid ~ family_income, data = df)
summary(fit)
## 
## Call:
## lm(formula = gift_aid ~ family_income, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1128  -3.6234  -0.2161   3.1587  11.5707 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   24.31933    1.29145  18.831  < 2e-16 ***
## family_income -0.04307    0.01081  -3.985 0.000229 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.783 on 48 degrees of freedom
## Multiple R-squared:  0.2486, Adjusted R-squared:  0.2329 
## F-statistic: 15.88 on 1 and 48 DF,  p-value: 0.0002289

And a plot:

plot(df$family_income, df$gift_aid)
abline(fit)

The summary indicates that the data is modeled by \[Y = 24.31933 - 0.04307\,X,\] where the family income \(X\) and the gift aid \(Y\) are both measured in thousands of dollars. Thus, we might expect a family making \(\$100{,}000\) a year to receive about \[24.31933 - 0.04307\times 100 \approx 20.012,\] or roughly \(\$20{,}012\), in aid.
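Again, R’s predict command does this computation (in thousands of dollars) for us:

# Predicted gift aid, in thousands of dollars, for a family income of $100,000
predict(fit, newdata = data.frame(family_income = 100))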