Sometimes, two numerical variables have a noticeable relationship. Regression is a tool to help find and test the strength of that relationship. In the simplest case, the relationship might be linear, so we apply regression to a linear model.
This material is detailed in Chapter 8 of our text.
Linear Models
One of the simplest types of relationships between two variables is a linear relationship - say, \[Y = aX+b.\] In statistics, \(X\) and \(Y\) are typically random variables so we might ask questions like:
Is there really a linear relationship between \(X\) and \(Y\)?
If so, what are the values of the parameters \(a\) and \(b\)?
What kind of predictions can we make using this relationship?
How confident can we be in those predictions?
Possum data from the text
I’ve got biometric data on 104 possums that’s used in our textbook. Here are the first 12 rows of that data:
| site | pop | sex | age | headL | skullW | totalL | tailL |
|---|---|---|---|---|---|---|---|
| 1 | Vic | m | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 |
| 1 | Vic | f | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 |
| 1 | Vic | f | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 |
| 1 | Vic | f | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 |
| 1 | Vic | f | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 |
| 1 | Vic | f | 1.0 | 93.1 | 54.8 | 90.5 | 35.5 |
| 1 | Vic | m | 2.0 | 95.3 | 58.2 | 89.5 | 36.0 |
| 1 | Vic | f | 6.0 | 94.8 | 57.6 | 91.0 | 37.0 |
| 1 | Vic | f | 9.0 | 93.4 | 56.3 | 91.5 | 37.0 |
| 1 | Vic | f | 6.0 | 91.8 | 58.0 | 89.5 | 37.5 |
| 1 | Vic | f | 9.0 | 93.3 | 57.2 | 89.5 | 39.0 |
| 1 | Vic | f | 5.0 | 94.9 | 55.6 | 92.0 | 35.5 |
A scatter plot
Here’s a scatter plot of that data relating head length to total length for those possums.
The regression line
Here’s the so-called “regression line” that models the data. We’ll need a little facility at dealing with formulae of the form \(y=ax+b\) to be able to understand and apply regression lines.
Statistics in college football
Here’s one other cute example illustrating how we might apply linear regression.
Correlation
Correlation, denoted \(r\), measures the strength of the linear relationship between two variables. The correlation is always between −1 and +1, and
A number close to +1 indicates a strong, positive linear relationship,
A number close to −1 indicates a strong, negative linear relationship,
A number close to 0 indicates a weak linear relationship.
You often see the quantity \(r^2\), which can be interpreted as the proportion of the variance of a dependent variable that’s explained by an explanatory variable.
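To see the relationship between \(r\) and \(r^2\) numerically, here’s a small sketch using NumPy’s corrcoef function; the data points are made up purely for illustration.

```python
import numpy as np

# Toy data (invented for illustration): y is roughly linear in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is r, the correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
print(r)     # close to +1: a strong, positive linear relationship
print(r**2)  # proportion of the variance of y explained by x
```

Squaring \(r\) by hand like this is exactly the R-squared value that regression software reports for a simple linear model.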
Interpreting correlation
You can get a sense of how correlation works using the toy below.
Lines in the plane
Linear regression is all about approximating clusters of points with lines. To use a bit more precise terminology, we might say
Linear regression concerns modeling data with linear equations.
As such, it’s important that we understand the algebraic form of these equations and how to use them to do a few relatively simple operations.
Algebraic form
A line is the graph of an equation of the form \(y=ax+b\).
The defining characteristic of such a graph is that its slope is constant, i.e. if
\[y_1=ax_1+b \text{ and } y_2 = ax_2+b\]
so that the points \((x_1,y_1)\) and \((x_2,y_2)\) are on the line, then
\[\frac{y_2-y_1}{x_2-x_1} = \frac{(ax_2+b)-(ax_1+b)}{x_2-x_1} = a.\]
While that might seem complicated, it ultimately makes it easy to plot the line.
Plotting a line
We can plot a line simply by plugging in a couple of points.
For example, to graph \(y=\frac{1}{2}x+1\), plug in \(x=0\) to get \(y=1\) - that’s one point! Then plug in \(x=2\) to get \(y=2\) - that’s another point!
Finally, draw the line through both of those.
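The recipe also works in reverse: given any two points on a line, the slope is the rise over the run and the intercept follows by solving \(y_1 = ax_1 + b\) for \(b\). Here’s a small Python sketch of that computation; the helper name `line_through` is ours, not something from the text.

```python
def line_through(p, q):
    """Return (a, b) so that y = a*x + b passes through points p and q."""
    (x1, y1), (x2, y2) = p, q
    a = (y2 - y1) / (x2 - x1)  # slope: rise over run
    b = y1 - a * x1            # solve y1 = a*x1 + b for the intercept
    return a, b

# The example above: (0, 1) and (2, 2) lie on y = (1/2)x + 1
print(line_through((0, 1), (2, 2)))  # → (0.5, 1.0)
```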
An interactive plot
Here’s a fun tool to see how the coefficients \(a\) and \(b\) affect the graph of \(y=ax+b\).
Using a regression line
In statistics, the formula for a regression line will often be generated via software; we need to be able to use that formula and interpret the results. Thus it will be important for you to be able to plug a value like \(x=1.234\) into a line like
\[y=-5.8335x+0.8408\]
to get a value. To do so, simply plug that value of \(x\) in to get the corresponding value of \(y\):
\[-5.8335\times1.234+0.8408 = −6.357739.\]
Of course, this is done on your calculator or computer.
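If your computer of choice happens to speak Python, the arithmetic is a one-liner:

```python
# Evaluate the regression line y = -5.8335x + 0.8408 at x = 1.234
a, b = -5.8335, 0.8408
x = 1.234
y = a * x + b
print(y)  # → -6.357739 (up to floating point rounding)
```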
Possums revisited
The possum example comes right from section 8.1 of our text. I’ve downloaded their data and placed it on my webspace so that we can access it easily using Python’s Pandas library like so:
Code
```python
import pandas as pd

possums = pd.read_csv('https://www.marksmath.org/data/possum.txt', sep='\t')
possums
```
|   | site | pop | sex | age | headL | skullW | totalL | tailL |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Vic | m | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 |
| 1 | 1 | Vic | f | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 |
| 2 | 1 | Vic | f | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 |
| 3 | 1 | Vic | f | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 |
| 4 | 1 | Vic | f | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99 | 7 | other | m | 1.0 | 89.5 | 56.0 | 81.5 | 36.5 |
| 100 | 7 | other | m | 1.0 | 88.6 | 54.7 | 82.5 | 39.0 |
| 101 | 7 | other | f | 6.0 | 92.4 | 55.0 | 89.0 | 38.0 |
| 102 | 7 | other | m | 4.0 | 91.5 | 55.2 | 82.5 | 36.5 |
| 103 | 7 | other | f | 3.0 | 93.6 | 59.9 | 89.0 | 40.0 |

104 rows × 8 columns
The scatter plot
There’s another library called Plotly that makes it easy to generate a scatter plot of the data together with the regression line:
Code
```python
import plotly.express as px

possum_scatter2 = px.scatter(
    possums, x='headL', y='totalL',
    hover_data=['site', 'pop', 'sex', 'age'],
    trendline='ols', trendline_color_override='black',
    width=800, height=500
)
possum_scatter2.show(config={'displayModeBar': False})
```
Regression analysis
We need more than just a groovy plot; we need to run a regression analysis to get specific information about the data including:
A formula of the form \(y=ax+b\) for the regression line,
The correlation \(r\) indicating the strength of linear relationship between the variables, and
The \(p\)-value indicating how likely it is that there’s no linear relationship between the variables.
We’re going to use Python’s statsmodels library to do this type of analysis.
Running a regression analysis
Here are the results of the regression analysis relating total possum length to head length. Note that you can see the complete code that generates this by pressing the “▶ Code” button.
Code
```python
# Import the statsmodels and pandas libraries:
import statsmodels.api as sm
import pandas as pd

# Read the data:
possums = pd.read_csv('https://www.marksmath.org/data/possum.txt', sep='\t')

# Define the dependent variable Y and independent variable X:
Y = possums['totalL']
X = possums['headL']
X = sm.add_constant(X)

# Set up and fit the Ordinary Least Squares model
model = sm.OLS(Y, X)
results = model.fit()

# Display the results
results.summary(slim=True)
```
OLS Regression Results
| | |
|---|---|
| Dep. Variable: | totalL |
| Model: | OLS |
| No. Observations: | 104 |
| Covariance Type: | nonrobust |
| R-squared: | 0.478 |
| Adj. R-squared: | 0.472 |
| F-statistic: | 93.26 |
| Prob (F-statistic): | 4.68e-16 |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 9.8882 | 8.000 | 1.236 | 0.219 | -5.980 | 25.757 |
| headL | 0.8337 | 0.086 | 9.657 | 0.000 | 0.662 | 1.005 |
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.42e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Interpreting the regression results
There are really just a couple of items in that table that we need to worry about:
The coefficients indicated by the variable name headL and the constant const, and
The squared correlation \(r^2\) indicated by R-squared.
It’s probably worth mentioning that the \(p\)-value indicated by Prob (F-statistic) is used for running hypothesis tests. We won’t worry about that in our shortened semester, though.
The coefficients
Our regression line has the form \(y=ax+b\) or, adapted more precisely to this example,
\[\mathtt{totalL} = a \times \mathtt{headL} + b.\]
Note that \(a\) and \(b\) are precisely the headL and const values in the coef column in the lower left corner of the regression results:
One major use of regression is for prediction. For example, the regression analysis for the possum data might indicate that a possum with a head length of 101 mm should have an overall length of just over 94 cm, since (as indicated by the red dot)
\[0.8337 \times 101 + 9.8882 = 94.0919.\]
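A quick way to check this prediction in Python, with the coefficients read off the regression table (the function name `predict_totalL` is ours, just for illustration):

```python
# Coefficients from the regression table: slope (headL) and intercept (const)
a, b = 0.8337, 9.8882

def predict_totalL(headL):
    """Predicted total length (cm) from head length (mm)."""
    return a * headL + b

print(predict_totalL(101))  # just over 94 cm
```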
Correlation
The correlation can be inferred from the value of \(r^2\) returned in the table.
Since the table indicates that \(r^2 = 0.478\), I guess that
\[|r| = \sqrt{0.478} \approx 0.691375.\]
Note that you do need to consider the possibility that \(r\) is negative, though it clearly is not in this case. Either way, \(|r|\) indicates the strength of the linear relationship between the variables, which is pretty strong in this case.
What is the regression line?
Given a scatter plot, it seems that there are a lot of lines that might, more or less, match the data. Which one should be the so-called best fit?
Theory
Well, I guess we need some details and a metric - i.e. a concrete measure of what best means.
We’ve got a list of data points \[\{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}.\] The idea is to choose values of \(a\) and \(b\) that minimize \[\sum_{i=1}^n (y_i - (ax_i+b))^2.\] Why?? Because \(y_i - (ax_i+b)\) represents the difference between the actual value of the data point and the value that’s predicted by the linear model. We square those differences (to make them all positive) and add them up to get a measure of how far off the estimate is. We want to minimize that total squared error.
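We can check this recipe numerically. Setting the partial derivatives of the total squared error with respect to \(a\) and \(b\) to zero yields a \(2\times2\) linear system (the so-called normal equations). Here’s a NumPy sketch on made-up data, cross-checked against np.polyfit, which minimizes the same quantity:

```python
import numpy as np

# A few made-up data points that are roughly linear
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 5.1, 7.2, 8.8])

# Minimizing the total squared error leads to the normal equations,
# a 2x2 linear system for the slope a and intercept b:
n = len(x)
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n        ]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a, b = np.linalg.solve(A, rhs)
print(a, b)  # least squares slope and intercept

# np.polyfit(x, y, 1) minimizes the same sum of squares and agrees:
a2, b2 = np.polyfit(x, y, 1)
print(a2, b2)
```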
Illustration
Fiddle with the sliders to select slope and intercept of the red line \(y=mx+c\) and minimize the value of \(T\) shown in the lower right. You should find that the red line needs to match the green regression line to achieve that minimum.
Another example
We should take a look at one more example, so I figure it might as well be college football stats from 2014!
So, we’re going to
Download the data,
Take a look at it as a table and as a scatter plot, and
Run a regression analysis to
Find out how well total points and winning percentage are correlated, and
Predict our winning percentage if we score 500 points for the season.
The data
I’ve got the (abridged) data on my webspace, of course:
Code
```python
import pandas as pd

cfb2014 = pd.read_csv('https://marksmath.org/data/cfb_stats_2014.csv')
cfb2014.head()
```
|   | Team | Wins | Losses | WL% | TEAM Total Points | TEAM Total Offense Yards | TEAM Total Offense Yards / Play | OPP Scoring Points/Game | OPP Rushing Yards / Attempt | Turnover margin |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cincinnati | 9 | 4 | 0.692308 | 442 | 5982 | 6.33 | 27.2 | 4.77 | 2 |
| 1 | Connecticut | 2 | 10 | 0.166667 | 186 | 3315 | 4.50 | 29.8 | 4.23 | -13 |
| 2 | East Carolina | 8 | 5 | 0.615385 | 466 | 6929 | 6.48 | 25.8 | 3.32 | -4 |
| 3 | Houston | 8 | 5 | 0.615385 | 387 | 5383 | 5.70 | 20.6 | 3.70 | 8 |
| 4 | Memphis | 10 | 3 | 0.769231 | 471 | 5552 | 5.49 | 19.5 | 3.40 | 11 |
The scatter plot
Here’s a look at the scatter plot and regression line:
Code
```python
import plotly.express as px

cfb_scatter2 = px.scatter(
    cfb2014, x='TEAM Total Points', y='WL%',
    hover_name="Team",
    trendline='ols', trendline_color_override='black',
    width=800, height=500
)
cfb_scatter2.show(config={'displayModeBar': False})
```
Analysis
Here’s the regression analysis:
Code
```python
# Import the statsmodels library:
import statsmodels.api as sm

# Define the dependent variable Y and independent variable X:
Y = cfb2014['WL%']
X = cfb2014['TEAM Total Points']
X = sm.add_constant(X)

# Set up and fit the Ordinary Least Squares model
model = sm.OLS(Y, X)
results = model.fit()

# Display the results
results.summary(slim=True)
```
OLS Regression Results
| | |
|---|---|
| Dep. Variable: | WL% |
| Model: | OLS |
| No. Observations: | 128 |
| Covariance Type: | nonrobust |
| R-squared: | 0.697 |
| Adj. R-squared: | 0.695 |
| F-statistic: | 290.3 |
| Prob (F-statistic): | 1.69e-34 |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0924 | 0.037 | -2.474 | 0.015 | -0.166 | -0.018 |
| TEAM Total Points | 0.0016 | 9.59e-05 | 17.039 | 0.000 | 0.001 | 0.002 |
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.4e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Correlation
The regression analysis indicates that \(r^2 = 0.697\).
I guess that means that \(r=\sqrt{0.697} \approx 0.834865\), which indicates a very strong linear relationship!
Prediction
In the lower left portion of the results table, we see:
| | coef |
|---|---|
| const | -0.0924 |
| Points | 0.0016 |
Thus, the formula predicting WL% from Points is \[\text{WL%} = 0.0016\times\text{Points} - 0.0924.\]
To determine our expected win-loss percentage if we score 500 points throughout the season, we simply plug in 500 for Points to get
\[0.0016\times500 - 0.0924 = 0.7076\]
for a win-loss percentage of just over 70%.
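Checking that arithmetic in Python, using the const value -0.0924 straight from the regression table:

```python
# Coefficients from the football regression: slope for Points, intercept
a, b = 0.0016, -0.0924

# Predicted win-loss percentage for a 500-point season
print(a * 500 + b)  # about 0.7076, i.e. just over 70%
```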
Copy-paste
Let’s take a look at the online HW in MyOpenMath - particularly, problem #3 which asks you to compute correlation for data stored in an HTML table. I’ve set up a Colab notebook to help with that.