Sometimes, two data sets are related. Regression is a tool that helps us find and test the strength of that relationship. In the simplest case, the relationship might be linear, so we apply regression to a linear model.
This material is detailed in Chapter 8 of our text.
One of the simplest types of relationships between two variables is a linear relationship - say, $$Y = aX+b.$$ In statistics, $X$ and $Y$ are typically random variables so we might ask questions like: how do we estimate the coefficients $a$ and $b$ from data, and how strong is the resulting relationship?
Here's the plot of an example right from our text that relates the head length to total length for 104 possums.
Here's the so-called "regression line" that models the data. We might infer from this line that a possum with a head length of 101 mm would have a total length of about 94 cm.
A line is the graph of an equation of the form $y=ax+b$.
The defining characteristic of such a graph is that its slope is constant, i.e. if $y_1=ax_1+b$ and $y_2 = ax_2+b$, then
$$\frac{\Delta y}{\Delta x} = \frac{y_2-y_1}{x_2-x_1} = \frac{(ax_2+b)-(ax_1+b)}{x_2-x_1} = \frac{a(x_2-x_1)}{x_2-x_1} = a.$$ While that might seem complicated, it ultimately makes it easy to plot the line.
We can plot a line simply by plugging in a couple of points.
For example, to graph $y=2x+1$, plug in $x=0$ to get $y=1$ - that's one point!
Then plug in $x=1$ to get $y=3$ - that's another point!
Draw the line through both of those.
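If you'd like to see this in code, here's a minimal sketch (using matplotlib, which isn't part of the text's examples) that plots $y=2x+1$ through exactly those two points:

import matplotlib.pyplot as plt

# Two points on the line y = 2x + 1
xs = [0, 1]
ys = [2*x + 1 for x in xs]  # gives y = 1 and y = 3

# Draw the line through both points
plt.plot(xs, ys, marker='o')
plt.show()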
Here's a fun tool to see how the coefficients $a$ and $b$ affect the graph of $y=ax+b$.
In statistics, the formula for a line will often be generated via software; you just need to interpret it. Thus it will be important for you to be able to plug a value like $x=1.234$ into a line like $y=-5.8335x+0.8408$ to get a value. In this case: $$-5.8335\times1.234+0.8408 = -6.357739.$$
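Of course, Python itself works fine as a calculator for this sort of thing:

# Plug x = 1.234 into y = -5.8335x + 0.8408;
# this should print approximately -6.357739.
print(-5.8335*1.234 + 0.8408)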
The possum example comes right from section 8.1 of our text. I've got the data stored on my website so we can read it in and run a regression analysis like so:
import pandas as pd
from scipy.stats import linregress

# Read the tab-separated possum data from the course website
df = pd.read_csv('https://www.marksmath.org/data/possum.txt', sep='\t')

# Regress total length against head length
regression = linregress(df.headL, df.totalL)
regression
LinregressResult(slope=0.8336697990278819, intercept=9.888233331751707, rvalue=0.6910936973935056, pvalue=4.680578654379419e-16, stderr=0.08632851506979797)
There are five features in this output. Of immediate importance are the slope and intercept, which describe the regression line. Those are basically the coefficients $a$ and $b$ we've been talking about. Thus, the line $$y = 0.8336697990278819x + 9.888233331751707$$ should be a reasonably good fit to the data. If we plug $x=101$ in, we should get about $y=94$ out:
0.8336697990278819*101 + 9.888233331751707
94.08888303356777
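To see how well that line fits, we might overlay it on a scatter plot of the data. Here's a sketch, assuming the df and regression objects from above:

import numpy as np
import matplotlib.pyplot as plt

# Scatter plot of the data
plt.scatter(df.headL, df.totalL, alpha=0.5)

# Overlay the regression line y = slope*x + intercept
xs = np.linspace(df.headL.min(), df.headL.max(), 2)
plt.plot(xs, regression.slope*xs + regression.intercept, color='red')

plt.xlabel('head length (mm)')
plt.ylabel('total length (cm)')
plt.show()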
Let's take one more look at the output of that regression:
regression
LinregressResult(slope=0.8336697990278819, intercept=9.888233331751707, rvalue=0.6910936973935056, pvalue=4.680578654379419e-16, stderr=0.08632851506979797)
In addition to slope and intercept, we see:

- pvalue, which is a $p$-value for a hypothesis test that we'll talk about soon,
- stderr, which is the standard error of the estimated slope and is used to compute that $p$-value, and
- rvalue, which is the correlation between the two variables - a measure of the strength of the linear relationship between them.
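In fact, assuming the df from above, we can check that rvalue agrees with the correlation computed directly:

import numpy as np

# Pearson correlation between head length and total length;
# should agree with the rvalue reported by linregress (about 0.691).
np.corrcoef(df.headL, df.totalL)[0, 1]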
Here are a couple more examples - one fun and one not.
Correlating wins and losses to stats in College Football