Sometimes, two data sets are related. Regression is a tool to help find and test the strength of that relationship. In the simplest case, the relationship might be linear, so we apply regression to a linear model.
This material is detailed in Chapter 8 of our text.
One of the simplest types of relationships between two variables is a linear relationship - say, Y=aX+b.
Here's the plot of an example right from our text that relates the head length to total length for 104 possums.
Here's the so-called "regression line" that models the data. We might infer from this line that a possum with a head length of 101 mm would have a total length of about 94 cm.
A line is the graph of an equation of the form y=ax+b.
The defining characteristic of such a graph is that its slope is constant, i.e. if $y_1 = ax_1 + b$ and $y_2 = ax_2 + b$, then
$$\frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{(ax_2 + b) - (ax_1 + b)}{x_2 - x_1} = \frac{a(x_2 - x_1)}{x_2 - x_1} = a.$$
While that might seem complicated, it ultimately makes it easy to plot the line.
We can plot a line simply by plugging in a couple of points.
For example, to graph y=2x+1, plug in x=0 to get y=1 - that's one point!
Then plug in x=1 to get y=3 - that's another point!
Draw the line through both of those.
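The two-point method above is easy to sketch in code. Here's a minimal example (assuming matplotlib is available; the file name is just an illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt

# The line y = 2x + 1, evaluated at two convenient x values.
def line(x):
    return 2 * x + 1

xs = [0, 1]
ys = [line(x) for x in xs]  # the two points: (0, 1) and (1, 3)

# Drawing a straight segment through the two points plots the line.
plt.plot(xs, ys, marker="o")
plt.savefig("line.png")
```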
Here's a fun tool to see how the coefficients a and b affect the graph of y=ax+b.
In statistics, the formula for a line will often be generated via software; you just need to interpret it. Thus it will be important for you to be able to plug a value like x=1.234 in to a line like y=−5.8335x+0.8408 to get a value. In this case: −5.8335×1.234+0.8408=−6.357739.
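That same plug-in is just ordinary arithmetic in Python (the rounding is only to tidy the floating point display):

```python
# Evaluate y = -5.8335 x + 0.8408 at x = 1.234.
a, b = -5.8335, 0.8408
x = 1.234
y = a * x + b
print(round(y, 6))  # -6.357739
```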
The possum example comes right from section 8.1 of our text. I've got the data stored on my website so we can read it in and run a regression analysis like so:
import pandas as pd
from scipy.stats import linregress
df = pd.read_csv('https://www.marksmath.org/data/possum.txt', sep='\t')
regression = linregress(df.headL, df.totalL)
regression
LinregressResult(slope=0.8336697990278819, intercept=9.888233331751707, rvalue=0.6910936973935056, pvalue=4.680578654379419e-16, stderr=0.08632851506979797)
There are five features in this output. Of immediate importance are the slope and intercept, which describe the regression line. Those are basically the coefficients a and b we've been talking about. Thus, the line y=0.8336697990278819x+9.888233331751707 should be a reasonably good fit to the data. If we plug x=101 in, we should get about y=94 out:
0.8336697990278819*101 + 9.888233331751707
94.08888303356777
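If we're making several predictions, it's handy to wrap this in a small function. This sketch hard-codes the slope and intercept reported above so it's self-contained; the function name is my own:

```python
# Coefficients from the regression output shown above.
slope = 0.8336697990278819
intercept = 9.888233331751707

def predict_total_length(head_length_mm):
    # y = slope * x + intercept, the regression line
    return slope * head_length_mm + intercept

print(predict_total_length(101))  # about 94.09
```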
Let's take one more look at the output of that regression:
regression
LinregressResult(slope=0.8336697990278819, intercept=9.888233331751707, rvalue=0.6910936973935056, pvalue=4.680578654379419e-16, stderr=0.08632851506979797)
In addition to slope and intercept, we see:
- pvalue, which is a p-value for a hypothesis test that we'll talk about soon,
- stderr, which seems like it should be related to the p-value, and
- rvalue - ???

The "rvalue" is the correlation between the two variables, which is a measure of the strength of the linear relationship between them.
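To see that rvalue really is the correlation, we can compare it against the Pearson correlation computed by numpy on a small made-up dataset (the numbers below are arbitrary illustrations, not the possum data):

```python
import numpy as np
from scipy.stats import linregress

# Toy data with a rough linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.3, 3.8, 5.1])

reg = linregress(x, y)
r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient

# The rvalue reported by linregress matches the correlation.
print(abs(reg.rvalue - r) < 1e-12)
```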
Here are a couple more examples - one fun and one not.
Correlating wins and losses to stats in College Football