Sometimes, two numerical variables have a noticeable relationship. Regression is a tool to help find and test the strength of that relationship. In the simplest case, the relationship might be linear, so we apply regression to a linear model.
This material is detailed in Chapter 8 of our text.
Linear Models
One of the simplest types of relationships between two variables is a linear relationship - say, \[Y = aX+b.\] In statistics, \(X\) and \(Y\) are typically random variables so we might ask questions like:
Is there really a linear relationship between \(X\) and \(Y\)?
If so, what are the values of the parameters \(a\) and \(b\)?
What kind of predictions can we make using this relationship?
How confident can we be in those predictions?
Possum data from the text
I’ve got biometric data on 104 possums that’s used in our textbook. Here are the first 12 rows of that data:
| site | pop | sex | age | headL | skullW | totalL | tailL |
|---|---|---|---|---|---|---|---|
| 1 | Vic | m | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 |
| 1 | Vic | f | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 |
| 1 | Vic | f | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 |
| 1 | Vic | f | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 |
| 1 | Vic | f | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 |
| 1 | Vic | f | 1.0 | 93.1 | 54.8 | 90.5 | 35.5 |
| 1 | Vic | m | 2.0 | 95.3 | 58.2 | 89.5 | 36.0 |
| 1 | Vic | f | 6.0 | 94.8 | 57.6 | 91.0 | 37.0 |
| 1 | Vic | f | 9.0 | 93.4 | 56.3 | 91.5 | 37.0 |
| 1 | Vic | f | 6.0 | 91.8 | 58.0 | 89.5 | 37.5 |
| 1 | Vic | f | 9.0 | 93.3 | 57.2 | 89.5 | 39.0 |
| 1 | Vic | f | 5.0 | 94.9 | 55.6 | 92.0 | 35.5 |
A scatter plot
Here’s a scatter plot of that data relating head length to total length for those possums.
The regression line
Here’s the so-called “regression line” that models the data. We’ll need a little facility at dealing with formulae of the form \(y=ax+b\) to be able to understand and apply regression lines.
Statistics in college football
Here’s one other cute example illustrating how we might apply linear regression.
Correlation
Correlation, denoted \(r\), measures the strength of the linear relationship between two variables. The correlation is always between −1 and +1, and
A number close to +1 indicates a strong, positive linear relationship,
A number close to −1 indicates a strong, negative linear relationship,
A number close to 0 indicates a weak linear relationship.
You often see the quantity \(r^2\), which can be interpreted as the proportion of the variance of a dependent variable that’s explained by an explanatory variable.
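To see the relationship between \(r\) and \(r^2\) numerically, here’s a small sketch using NumPy’s corrcoef function; the data points are made up purely for illustration.

```python
import numpy as np

# Toy data (invented for illustration): y is roughly linear in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is r, the correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
print(r)     # close to +1: a strong, positive linear relationship
print(r**2)  # proportion of the variance of y explained by x
```

Squaring \(r\) by hand like this is exactly the R-squared value that regression software reports for a simple linear model.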
Interpreting correlation
You can get a sense of how correlation works using the toy below.
Lines in the plane
Linear regression is all about approximating clusters of points with lines. To use a bit more precise terminology, we might say
Linear regression concerns modeling data with linear equations.
As such, it’s important that we understand the algebraic form of these equations and how to use them to do a few relatively simple operations.
Algebraic form
A line is the graph of an equation of the form \(y=ax+b\).
The defining characteristic of such a graph is that its slope is constant, i.e. if
\[y_1=ax_1+b \text{ and } y_2 = ax_2+b\]
so that the points \((x_1,y_1)\) and \((x_2,y_2)\) are on the line, then
\[\frac{y_2-y_1}{x_2-x_1} = \frac{(ax_2+b)-(ax_1+b)}{x_2-x_1} = a.\]
While that might seem complicated, it ultimately makes it easy to plot the line.
Plotting a line
We can plot a line simply by plugging in a couple of points.
For example, to graph \(y=\frac{1}{2}x+1\), plug in \(x=0\) to get \(y=1\) - that’s one point! Then plug in \(x=2\) to get \(y=2\) - that’s another point!
Finally, draw the line through both of those.
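The recipe also works in reverse: given any two points on a line, the slope is the rise over the run and the intercept follows by solving \(y_1 = ax_1 + b\) for \(b\). Here’s a small Python sketch of that computation; the helper name `line_through` is ours, not something from the text.

```python
def line_through(p, q):
    """Return (a, b) so that y = a*x + b passes through points p and q."""
    (x1, y1), (x2, y2) = p, q
    a = (y2 - y1) / (x2 - x1)  # slope: rise over run
    b = y1 - a * x1            # solve y1 = a*x1 + b for the intercept
    return a, b

# The example above: (0, 1) and (2, 2) lie on y = (1/2)x + 1
print(line_through((0, 1), (2, 2)))  # → (0.5, 1.0)
```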
An interactive plot
Here’s a fun tool to see how the coefficients \(a\) and \(b\) affect the graph of \(y=ax+b\).
Using a regression line
In statistics, the formula for a regression line will often be generated via software; we need to be able to use that formula and interpret the results. Thus it will be important for you to be able to plug a value like \(x=1.234\) into a line like
\[y=-5.8335x+0.8408\]
to get a value. To do so, simply plug that value of \(x\) in to get the corresponding value of \(y\):
\[-5.8335\times1.234+0.8408 = −6.357739.\]
Of course, this is done on your calculator or computer.
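If your computer of choice happens to speak Python, the arithmetic is a one-liner:

```python
# Evaluate the regression line y = -5.8335x + 0.8408 at x = 1.234
a, b = -5.8335, 0.8408
x = 1.234
y = a * x + b
print(y)  # → -6.357739 (up to floating point rounding)
```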
Possums revisited
The possum example comes right from section 8.1 of our text. I’ve downloaded their data and placed it on my webspace so that we can access it easily using Python’s Pandas library like so:
Code
```python
import pandas as pd

possums = pd.read_csv('https://www.marksmath.org/data/possum.txt', sep='\t')
possums
```
|   | site | pop | sex | age | headL | skullW | totalL | tailL |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Vic | m | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 |
| 1 | 1 | Vic | f | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 |
| 2 | 1 | Vic | f | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 |
| 3 | 1 | Vic | f | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 |
| 4 | 1 | Vic | f | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99 | 7 | other | m | 1.0 | 89.5 | 56.0 | 81.5 | 36.5 |
| 100 | 7 | other | m | 1.0 | 88.6 | 54.7 | 82.5 | 39.0 |
| 101 | 7 | other | f | 6.0 | 92.4 | 55.0 | 89.0 | 38.0 |
| 102 | 7 | other | m | 4.0 | 91.5 | 55.2 | 82.5 | 36.5 |
| 103 | 7 | other | f | 3.0 | 93.6 | 59.9 | 89.0 | 40.0 |

104 rows × 8 columns
The scatter plot
There’s another library called Plotly that makes it easy to generate a scatter plot of the data together with the regression line:
Code
```python
import plotly.express as px

possum_scatter2 = px.scatter(
    possums, x='headL', y='totalL',
    hover_data=['site', 'pop', 'sex', 'age'],
    trendline='ols', trendline_color_override='black',
    width=800, height=500
)
possum_scatter2.show(config={'displayModeBar': False})
```
Regression analysis
We need more than just a groovy plot; we need to run a regression analysis to get specific information about the data including:
A formula of the form \(y=ax+b\) for the regression line,
The correlation \(r\) indicating the strength of linear relationship between the variables, and
The \(p\)-value indicating how likely it is that there’s no linear relationship between the variables.
We’re going to use Python’s statsmodels library to do this type of analysis.
Running a regression analysis
Here are the results of the regression analysis relating total possum length to head length. Note that you can see the complete code that generates this by pressing the “▶ Code” button.
Code
```python
# Import the statsmodels and pandas libraries:
import statsmodels.api as sm
import pandas as pd

# Read the data:
possums = pd.read_csv('https://www.marksmath.org/data/possum.txt', sep='\t')

# Define the dependent variable Y and independent variable X:
Y = possums['totalL']
X = possums['headL']
X = sm.add_constant(X)

# Set up and fit the Ordinary Least Squares model
model = sm.OLS(Y, X)
results = model.fit()

# Display the results
results.summary(slim=True)
```
OLS Regression Results
| | |
|---|---|
| Dep. Variable: | totalL |
| Model: | OLS |
| No. Observations: | 104 |
| Covariance Type: | nonrobust |
| R-squared: | 0.478 |
| Adj. R-squared: | 0.472 |
| F-statistic: | 93.26 |
| Prob (F-statistic): | 4.68e-16 |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 9.8882 | 8.000 | 1.236 | 0.219 | -5.980 | 25.757 |
| headL | 0.8337 | 0.086 | 9.657 | 0.000 | 0.662 | 1.005 |
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.42e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Interpreting the regression results
There are really just a couple of items in that table that we need to worry about:
The coefficients indicated by the variable name headL and the constant const, and
The squared correlation \(r^2\) indicated by R-squared.
It’s probably worth mentioning that the \(p\)-value indicated by Prob (F-statistic) is used for running hypothesis tests. We won’t worry about that in our shortened semester, though.
The coefficients
Our regression line has the form \(y=ax+b\) or, adapted more precisely to this example,
\[\mathtt{totalL} = a \times \mathtt{headL} + b.\]
Note that \(a\) and \(b\) are precisely the headL and const values in the coef column in the lower left corner of the regression results:
One major use of regression is for prediction. For example, the regression analysis for the possum data might indicate that a possum with a head length of 101 mm should have an overall length of just over 94 cm, since (as indicated by the red dot)
\[0.8337 \times 101 + 9.8882 = 94.0919.\]
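A quick way to check this prediction in Python, with the coefficients read off the regression table (the function name `predict_totalL` is ours, just for illustration):

```python
# Coefficients from the regression table: slope (headL) and intercept (const)
a, b = 0.8337, 9.8882

def predict_totalL(headL):
    """Predicted total length (cm) from head length (mm)."""
    return a * headL + b

print(predict_totalL(101))  # just over 94 cm
```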
Correlation
The correlation can be inferred from the value of \(r^2\) returned in the table.
Since the table indicates that \(r^2 = 0.478\), I guess that
\[|r| = \sqrt{0.478} \approx 0.691375.\]
Note that you do need to consider the possibility that \(r\) is negative, though it clearly is not in this case. Either way, \(|r|\) indicates the strength of the linear relationship between the variables, which is pretty strong in this case.
What is the regression line?
Given a scatter plot, it seems that there are a lot of lines that might, more or less, match the data. Which one should be the so-called best fit?
Theory
Well, I guess we need some details and a metric - i.e. a concrete measure of what best means.
We’ve got a list of data points \[\{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}.\] The idea is to choose values of \(a\) and \(b\) that minimize \[\sum_{i=1}^n (y_i - (ax_i+b))^2.\] Why?? Because \(y_i - (ax_i+b)\) represents the difference between the actual value of the data point and the value that’s predicted by the linear model. We square those differences (to make them all positive) and add them up to get a measure of how far off the estimate is. We want to minimize that total squared error.
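We can check this recipe numerically. Setting the partial derivatives of the total squared error with respect to \(a\) and \(b\) to zero yields a \(2\times2\) linear system (the so-called normal equations). Here’s a NumPy sketch on made-up data, cross-checked against np.polyfit, which minimizes the same quantity:

```python
import numpy as np

# A few made-up data points that are roughly linear
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 5.1, 7.2, 8.8])

# Minimizing the total squared error leads to the normal equations,
# a 2x2 linear system for the slope a and intercept b:
n = len(x)
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n        ]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a, b = np.linalg.solve(A, rhs)
print(a, b)  # least squares slope and intercept

# np.polyfit(x, y, 1) minimizes the same sum of squares and agrees:
a2, b2 = np.polyfit(x, y, 1)
print(a2, b2)
```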
Illustration
Fiddle with the sliders to select slope and intercept of the red line \(y=mx+c\) and minimize the value of \(T\) shown in the lower right. You should find that the red line needs to match the green regression line to achieve that minimum.
Another example
We should take a look at one more example, so I figure it might as well be college football stats from 2014!
So, we’re going to
Download the data,
Take a look at it as a table and as a scatter plot, and
Run a regression analysis to
Find out how well total points and winning percentage are correlated, and
Predict our winning percentage if we score 500 points for the season.
The data
I’ve got the (abridged) data on my webspace, of course:
Code
```python
import pandas as pd

cfb2014 = pd.read_csv('https://marksmath.org/data/cfb_stats_2014.csv')
cfb2014.head()
```
|   | Team | Wins | Losses | WL% | TEAM Total Points | TEAM Total Offense Yards | TEAM Total Offense Yards / Play | OPP Scoring Points/Game | OPP Rushing Yards / Attempt | Turnover margin |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cincinnati | 9 | 4 | 0.692308 | 442 | 5982 | 6.33 | 27.2 | 4.77 | 2 |
| 1 | Connecticut | 2 | 10 | 0.166667 | 186 | 3315 | 4.50 | 29.8 | 4.23 | -13 |
| 2 | East Carolina | 8 | 5 | 0.615385 | 466 | 6929 | 6.48 | 25.8 | 3.32 | -4 |
| 3 | Houston | 8 | 5 | 0.615385 | 387 | 5383 | 5.70 | 20.6 | 3.70 | 8 |
| 4 | Memphis | 10 | 3 | 0.769231 | 471 | 5552 | 5.49 | 19.5 | 3.40 | 11 |
The scatter plot
Here’s a look at the scatter plot and regression line:
Code
```python
import plotly.express as px

cfb_scatter2 = px.scatter(
    cfb2014, x='TEAM Total Points', y='WL%',
    hover_name="Team",
    trendline='ols', trendline_color_override='black',
    width=800, height=500
)
cfb_scatter2.show(config={'displayModeBar': False})
```
Analysis
Here’s the regression analysis:
Code
```python
# Import the statsmodels library:
import statsmodels.api as sm

# Define the dependent variable Y and independent variable X:
Y = cfb2014['WL%']
X = cfb2014['TEAM Total Points']
X = sm.add_constant(X)

# Set up and fit the Ordinary Least Squares model
model = sm.OLS(Y, X)
results = model.fit()

# Display the results
results.summary(slim=True)
```
OLS Regression Results
| | |
|---|---|
| Dep. Variable: | WL% |
| Model: | OLS |
| No. Observations: | 128 |
| Covariance Type: | nonrobust |
| R-squared: | 0.697 |
| Adj. R-squared: | 0.695 |
| F-statistic: | 290.3 |
| Prob (F-statistic): | 1.69e-34 |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0924 | 0.037 | -2.474 | 0.015 | -0.166 | -0.018 |
| TEAM Total Points | 0.0016 | 9.59e-05 | 17.039 | 0.000 | 0.001 | 0.002 |
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.4e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Correlation
The regression analysis indicates that \(r^2 = 0.697\).
I guess that means that \(r=\sqrt{0.697} \approx 0.834865\), which indicates a very strong linear relationship!
Prediction
In the lower left portion of the results table, we see:
| | coef |
|---|---|
| const | -0.0924 |
| Points | 0.0016 |
Thus, the formula predicting WL% from Points is \[\text{WL%} = 0.0016\times\text{Points} - 0.0924.\]
To determine our expected win-loss percentage if we score 500 points throughout the season, we simply plug in 500 for Points to get
\[0.0016\times500 - 0.0924 = 0.7076\]
for a win-loss percentage of just over 70%.
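Checking that arithmetic in Python, using the const value -0.0924 straight from the regression table:

```python
# Coefficients from the football regression: slope for Points, intercept
a, b = 0.0016, -0.0924

# Predicted win-loss percentage for a 500-point season
print(a * 500 + b)  # about 0.7076, i.e. just over 70%
```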
Copy-paste
Let’s take a look at the online HW in MyOpenMath - particularly, problem #3 which asks you to compute correlation for data stored in an HTML table. I’ve set up a Colab notebook to help with that.