Last time, we talked about scatter plots of paired data sets and correlation. Today, we'll push further into regression.
To emphasize the fact that we’re working with paired data, let’s take a look at our book store price example:
Again, this is paired data; each point corresponds to a single book, which has two numeric values: its bkstr.com price and its amazon.com price.
We talked about the correlation \(R\) before. Now we’ll focus on the regression line \(A = 0.68\times B + 3.18\), which is supposedly the “best linear fit” for the data. Specifically, we’ll address
What makes the regression line better than some other line?
Answer: It minimizes the total squared error.
More specifically, if we have observations \[(x_1, y_1),(x_2, y_2),\ldots,(x_n, y_n)\] and we assume that our model is \(Y=aX+b\), then the regression line chooses \(a\) and \(b\) so that \[(y_1-(ax_1+b))^2 + (y_2-(ax_2+b))^2 + \cdots + (y_n-(ax_n+b))^2\] is as small as possible.
The terms \(y_i-(ax_i+b)\) are called the residuals; they measure how far off our regression model is at the data points.
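We can see this minimization in action with a little R code. Here's a minimal sketch using a small, made-up data set (not the book prices); the least squares line produced by lm always has a smaller total squared error than any other line we might try:

x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 8.1, 9.8)
fit = lm(y ~ x)               # the least squares line
sum(resid(fit)^2)             # total squared error of the regression line
sum((y - (2.1*x + 0.1))^2)    # a different line; its total squared error is larger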
Often, examining the residuals can be illuminating. Here are several scatter plots together with scatter plots of their residuals. The scatter plot on the left below looks slightly non-linear; the scatter plot of the residuals makes this much clearer.
It can also be illuminating to look at a histogram of the residuals. The histogram below shows the distribution of the residuals for our weight estimate example. Since it appears to be roughly normal, we can apply our knowledge of the normal distribution to quantify our confidence in estimates based on the fit.
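Both kinds of plots are easy to produce in R. Here's a minimal sketch using made-up data (the variable names and numbers are just for illustration):

x = runif(50, 0, 10)
y = x^2 + rnorm(50, sd = 5)          # deliberately non-linear relationship
fit = lm(y ~ x)
plot(x, resid(fit)); abline(h = 0)   # residual plot; the curved pattern stands out
y2 = 3*x + rnorm(50, sd = 2)         # a genuinely linear relationship, by contrast
hist(resid(lm(y2 ~ x)))              # these residuals should look roughly normal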
First and foremost, linear regression might be useful if you want to understand the relationship between two numerical variables. Furthermore, the data should satisfy a few more conditions: the relationship should actually appear to be linear, the residuals should be nearly normal, and the variability of the residuals should be roughly constant across the data.
It seems that correlation should be connected with slope - and it is, via the formula \[m = r\frac{s_y}{s_x}.\] Furthermore, the regression line always passes through the point \((\bar{x},\bar{y})\). These two facts make it relatively easy to compute a regression line for given data.
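Here's a quick sketch of those formulae in R, again with made-up data; the slope and intercept computed from \(r\), \(s_x\), \(s_y\), \(\bar{x}\), and \(\bar{y}\) agree with what lm reports:

x = c(1, 3, 4, 6, 8)
y = c(2, 5, 7, 9, 12)
m = cor(x, y) * sd(y) / sd(x)   # slope from the correlation and standard deviations
b = mean(y) - m * mean(x)       # the line passes through (x-bar, y-bar)
c(m, b)
coef(lm(y ~ x))                 # the same numbers from R's built-in fit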
Suppose that \(X\) is a data set with mean \(90\) and standard deviation \(5\); \(Y\) is a data set with mean \(74\) and standard deviation \(4\). Furthermore, \(X\) and \(Y\) have a tight correlation of \(r=0.85\). Find the equation of the regression line for predicting \(Y\) from \(X\), and use it to estimate the value of \(Y\) corresponding to \(x=85\).
Solution
We know the regression line has the form \(y=mx+b\). The slope is \[m = r\frac{s_y}{s_x} = 0.85 \times \frac{4}{5} = 0.68.\] We now know the regression line has the more specific form \(y=0.68x+b\). Once we know this, we can plug in the means of \(X\) and \(Y\) to get \(b\): \[74 = 0.68 \times 90 + b, \] so \[b = 74 - 0.68 \times 90 = 12.8.\] Thus, the final equation of the line is \[y = 0.68x +12.8.\]
To solve the second part, we simply plug \(x=85\) into our formula for the line to find the corresponding value of \(y\): \[y = 0.68 \times 85 + 12.8 = 70.6.\]
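If we wanted to, we could also let R do the arithmetic; this sketch just re-does the computation above:

m = 0.85 * 4/5    # slope: r * s_y / s_x
b = 74 - m * 90   # intercept, using the point (90, 74)
m * 85 + b        # predicted y when x = 85, namely 70.6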
Let’s discuss how we might interpret the following:
set.seed(1)
# Read the full CDC data set and restrict attention to the men.
cdc = read.csv("https://www.marksmath.org/data/cdc.csv")
men = subset(cdc, gender=='m')
# Take a random sample of 50 men and regress weight against height.
subset = men[sample(1:length(men$height),50),]
cdc_fit = lm(subset$weight~subset$height)
summary(cdc_fit)
##
## Call:
## lm(formula = subset$weight ~ subset$height)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -73.120 -23.716  -8.848  17.896  93.392
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -122.182    144.958  -0.843   0.4035
## subset$height    4.504      2.043   2.205   0.0323 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.3 on 48 degrees of freedom
## Multiple R-squared: 0.09198, Adjusted R-squared: 0.07307
## F-statistic: 4.862 on 1 and 48 DF, p-value: 0.03227
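One thing we can read off immediately is the fitted line itself. Assuming the model object cdc_fit from above, coef extracts the two coefficients:

coef(cdc_fit)   # intercept -122.182 and slope 4.504, i.e. weight is about 4.504*height - 122.182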