Regression is a technique for determining if there is a relationship among variables and, if so, estimating what that relationship might be.
In the simplest case, we might hope that the relationship is linear; thus, a response variable \(Y\) might be related to an explanatory variable \(X\) via \[Y = a \times X + b.\] Such a model is called a linear model.
In statistics, of course, \(X\) and \(Y\) are random variables and we don’t expect the model to match the data exactly; rather, we ask how confident we are in the model given the errors that we see.
Here’s an example lifted straight from our textbook, which is based on a 1995 paper entitled “Morphological variation among columns of the mountain brushtail possum” that appeared in the Australian Journal of Zoology.
The paper provides data on several morphological features of possums from down under, such as head length, total length, and tail length. These measurements were taken from 104 possums. I’ve stored the data on my website for easy access:
df = read.csv('https://www.marksmath.org/data/possum.txt', sep='\t')
df$totalL = 10*df$totalL # Express in mm, like the others
head(df)
## site pop sex age headL skullW totalL tailL
## 1 1 Vic m 8 94.1 60.4 890 36.0
## 2 1 Vic f 6 92.5 57.6 915 36.5
## 3 1 Vic f 6 94.0 60.0 955 39.0
## 4 1 Vic f 6 93.2 57.1 920 38.0
## 5 1 Vic f 2 91.5 56.3 855 36.0
## 6 1 Vic f 1 93.1 54.8 905 35.5
Let’s generate a scatter plot of head length vs total length (both in mm, after the conversion above).
plot(df$totalL, df$headL)
Not surprisingly, the head length generally increases with overall length, though there is certainly some variation.
We might try to fit the data with a linear model using R’s lm command:
fit = lm(headL ~ totalL, data=df)
fit
##
## Call:
## lm(formula = headL ~ totalL, data = df)
##
## Coefficients:
## (Intercept) totalL
## 42.70979 0.05729
The output above indicates that the line \(y=0.05729\times x + 42.7098\) might be a semi, kinda, almost OK fit to the data. R makes it easy to add this line to the plot.
plot(df$totalL, df$headL)
abline(fit)
While it can’t pass through all the data points, that line is the best possible line in a quantitative sense known as “least squares”.
There are a number of statistical measures of whether the relationship is close to linear or not. Perhaps the simplest is called “correlation”.
cor(df$totalL, df$headL)
## [1] 0.6910937
Here are a few scatter plots of randomly generated data to illustrate the ideas.
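One way to produce plots like these is to simulate pairs with a prescribed correlation; the sample size and the particular correlations below are arbitrary choices, just for illustration.
set.seed(1)
par(mfrow = c(1, 3))  # three plots side by side
for (r in c(0.2, 0.6, 0.95)) {
  x = rnorm(100)
  y = r*x + sqrt(1 - r^2)*rnorm(100)  # cor(x, y) should be close to r
  plot(x, y, main = paste('r =', r))
}
par(mfrow = c(1, 1))  # restore the default layout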
Correlation always takes values between \(-1\) and \(1\) and describes the strength of the linear relationship between two variables.
We can compute the correlation for observations \((x_1, y_1)\), \((x_2, y_2)\), …, \((x_n, y_n)\) using the formula \[ R = \frac{1}{n-1}\sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y} \] where \(\bar{x}\), \(\bar{y}\), \(s_x\), and \(s_y\) are the sample means and standard deviations for each variable.
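As a sanity check, we can implement this formula directly and compare with R’s built-in cor; here x and y are the possum lengths from above.
x = df$totalL
y = df$headL
n = length(x)
sum((x - mean(x))/sd(x) * (y - mean(y))/sd(y)) / (n - 1)  # should agree with cor(x, y)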
One way to think about observed data is as: \[\text{Data} = \text{Model} + \text{Residual}.\] Thus, the residual is the difference between your observation and what you expect to see.
Sometimes, plotting the residuals of a scatter plot can be illuminating.
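For example, the residuals of our possum fit are just the observed head lengths minus the fitted values, and we can plot them against total length.
plot(df$totalL, fit$residuals)  # fit$residuals equals df$headL - fit$fitted.values
abline(h = 0)  # a reasonable fit leaves residuals scattered evenly about zero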
If we have observations \[(x_1, y_1),(x_2, y_2),\ldots,(x_n, y_n)\] and we assume that our model is \(Y=aX+b\), then we can attempt to find \(a\) and \(b\) so that \[(y_1-(ax_1+b))^2 + (y_2-(ax_2+b))^2 + \cdots + (y_n-(ax_n+b))^2\] is as small as possible.
Put another way, we can minimize the sum of the squares of the residuals.
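Minimizing that sum of squares with a little calculus yields closed-form expressions for \(a\) and \(b\); here’s a quick sketch checking them against lm using the possum data.
x = df$totalL
y = df$headL
a = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b = mean(y) - a*mean(x)
c(b, a)  # should match the Intercept and totalL coefficients above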
Recall the data frame involving possums, df, defined above. Let’s take a closer look at the fit defined by the linear model.
fit = lm(headL ~ totalL, data=df)
summary(fit)
##
## Call:
## lm(formula = headL ~ totalL, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1877 -1.5340 -0.3345 1.2788 7.3968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.709793 5.172814 8.257 5.66e-13 ***
## totalL 0.057290 0.005933 9.657 4.68e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.595 on 102 degrees of freedom
## Multiple R-squared: 0.4776, Adjusted R-squared: 0.4725
## F-statistic: 93.26 on 1 and 102 DF, p-value: 4.681e-16
Well, there’s a lot going on there. Here’s a little rundown.
The summary of the fit has several pieces of information on the residuals. There are 104 of these - one for each possum in our data set. The Residuals line in the output gives us information on their quantiles.
We might actually be interested in a QQ-plot of the residuals, because they’re supposed to be approximately normal for the inference behind regression to work. We can access the residuals via fit$residuals.
qqpic = qqnorm(fit$residuals)
Well, that looks pretty linear. In fact, we can compute a correlation associated with that picture.
cor(qqpic$x,qqpic$y)
## [1] 0.9876536
Outstanding!
Note that the summary also reports a residual standard error, which is essentially the standard deviation of the residuals; it differs slightly from sd’s value because it divides by \(n-2\) (the degrees of freedom) rather than \(n-1\).
sd(fit$residuals)
## [1] 2.582693
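We can recover the summary’s residual standard error of 2.595 by dividing by the degrees of freedom instead:
sqrt(sum(fit$residuals^2) / (104 - 2))  # n - 2 in the denominator, not n - 1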
The coefficients are simply the values of \(a\) and \(b\) in the least squares fit, \(Y=aX+b\).
They, however, are random variables themselves; if we performed the same process with a different sample, we’d expect to get a similar but slightly different result.
Thus, there is a standard error associated with each coefficient.
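For instance, a 95% confidence interval for the slope can be built from the reported estimate and standard error, or obtained directly from R’s confint function.
0.057290 + c(-1, 1) * qt(0.975, 102) * 0.005933  # estimate ± t* × standard error
confint(fit, 'totalL')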
We can use the coefficients to make predictions. The coefficients indicate that the head length \(h\) should be related to the total length \(t\) (both in mm) via \[h(t) = 0.05729\,t + 42.70979.\] For example, the head length of a possum of total length 1000 mm could be expected to be about \[0.05729\times 1000 + 42.70979 \approx 100.\]
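Rather than plugging in by hand, we can let R’s predict function do the computation; it should return approximately that same 100.
predict(fit, newdata = data.frame(totalL = 1000))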
Here’s another, more condensed example that’s also taken from our text. The data is a list of 50 freshmen from Elmhurst College with information concerning family income and financial aid received. We might expect a relationship. As usual, we can grab the data off of our website:
df = read.csv('https://www.marksmath.org/data/elmhurst.txt', sep = '\t')
head(df)
## family_income gift_aid price_paid
## 1 92.922 21.72 14.28
## 2 0.250 27.47 8.53
## 3 53.092 27.75 14.25
## 4 50.200 27.22 8.78
## 5 137.613 18.00 24.00
## 6 47.957 18.52 23.48
Here’s a summary of a fit of the data:
fit = lm(gift_aid ~ family_income, data = df)
summary(fit)
##
## Call:
## lm(formula = gift_aid ~ family_income, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1128 -3.6234 -0.2161 3.1587 11.5707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.31933 1.29145 18.831 < 2e-16 ***
## family_income -0.04307 0.01081 -3.985 0.000229 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.783 on 48 degrees of freedom
## Multiple R-squared: 0.2486, Adjusted R-squared: 0.2329
## F-statistic: 15.88 on 1 and 48 DF, p-value: 0.0002289
And a plot:
plot(df$family_income, df$gift_aid)
abline(fit)
The summary indicates that the data is modeled by \[Y = 24.31933 - 0.04307\,X,\] where both variables are measured in thousands of dollars. Thus, we might expect a family making \(\$100,000\) a year to receive \[Y = 24.31933 - 0.04307\times 100 \approx 20.012,\] or about \(\$20,012\), in aid.
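As before, R’s predict function automates the computation (remembering that family_income is in thousands of dollars):
predict(fit, newdata = data.frame(family_income = 100))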