Sometimes, two numerical variables are related. In the simplest case, we might hope that the relationship is linear; thus, a response variable \(Y\) might be related to an explanatory variable \(X\) via \[Y = a \times X + b.\] Such a model is called a linear model.
In statistics, of course, \(X\) and \(Y\) are random variables and we don’t expect the model to match the data exactly; rather, we ask how confident we are in the model given the errors that we see.
A first step in understanding the type of data we deal with in regression is to try to visualize it. The basic visualization for this purpose is called a scatter plot.
Here’s an example lifted straight from our textbook, which is based on a 1995 paper entitled “Morphological variation among populations of the mountain brushtail possum” which appeared in the Australian Journal of Zoology.
The paper provides data on several morphological features of possums from down under, such as head length, total length, and tail length. This data was measured from 104 possums. Here’s a scatter plot of head length (in mm) vs total length (in cm).
Note that (as we might expect) longer possums generally have longer heads. The correlation is a quantitative measure of this relationship that we’ll talk about in a bit.
Taken from our CDC data.
Taken from Galileo’s Gravity and Motion Experiments.
The correlation is actually quite misleading in this example because a quadratic relationship is more appropriate.
Here are a few scatter plots of randomly generated data to further illustrate the ideas.
There are a few ways to quantify the relationships we see. Today, we’ll learn about correlation. We’ll talk about another technique, called regression, later.
When it comes to quantifying relationships, we’ll generally assume that we’re working with a list of data points that looks like \[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).\]
The basic formula for the correlation of our list of data points is
\[ R = \frac{1}{n-1}\sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y} = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}z_{y_i}, \] where \(\bar{x}\), \(\bar{y}\), \(s_x\), and \(s_y\) are the sample means and standard deviations for each variable.
\(R\) is sometimes called *the coefficient of correlation* and \(R^2\) is sometimes called *the coefficient of determination*. The interpretation is that \(R^2\) represents the proportion of the variance determined by the model. This is because, after a little algebra, it turns out that: \[R^2 = \frac{\text{Variance explained by the model}}{\text{Total variance}}.\]
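To see this identity in action, here’s a short sketch (with made-up data, not from the text) that fits the least-squares line and checks that \(R^2\) matches the explained variance divided by the total variance. It uses the fact, which we’ll meet again with regression, that the least-squares slope is \(R\, s_y / s_x\).

```python
# Illustration (made-up data): R^2 equals explained variance / total variance
# for the least-squares line.
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]   # hypothetical data for illustration
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation via the formula above: average product of z-scores.
R = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (n - 1)

# Least-squares line: slope a = R * s_y / s_x, intercept b = y_bar - a * x_bar.
a = R * s_y / s_x
b = y_bar - a * x_bar
y_hat = [a * xi + b for xi in x]

explained = sum((yh - y_bar) ** 2 for yh in y_hat)  # variance explained by the model
total = sum((yi - y_bar) ** 2 for yi in y)          # total variance

print(round(R ** 2, 6), round(explained / total, 6))  # the two agree
```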
Let’s suppose our list of \(x\) values and our list of \(y\) values are
\[ \begin{align} x &= 3,4,6,9 \\ y &= 1,6,9,10 \end{align} \] Then, their averages are \[ \begin{align} \bar{x} &= \frac{3+4+6+9}{4} = 5.5, \text{ and } \\ \bar{y} &= \frac{1+6+9+10}{4} = 6.5 \end{align} \] and their standard deviations are \[ \begin{align} s_x &= \sqrt{\frac{(3-5.5)^2+(4-5.5)^2+(6-5.5)^2+(9-5.5)^2}{3}} = 2.65, \text{ and } \\ s_y &= \sqrt{\frac{(1-6.5)^2+(6-6.5)^2+(9-6.5)^2+(10-6.5)^2}{3}} = 4.04. \end{align} \] Thus, the correlation is \[ R = \frac{1}{3}\left( \left(\frac{3-5.5}{2.65}\right)\left(\frac{1-6.5}{4.04}\right) + \left(\frac{4-5.5}{2.65}\right)\left(\frac{6-6.5}{4.04}\right) + \left(\frac{6-5.5}{2.65}\right)\left(\frac{9-6.5}{4.04}\right) + \left(\frac{9-5.5}{2.65}\right)\left(\frac{10-6.5}{4.04}\right)\right) = 0.87 \]
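The arithmetic above can be checked with a few lines of Python using the standard-library `statistics` module:

```python
# Check of the worked correlation example using Python's statistics module.
from statistics import mean, stdev

x = [3, 4, 6, 9]
y = [1, 6, 9, 10]
n = len(x)

x_bar, y_bar = mean(x), mean(y)  # 5.5 and 6.5
s_x, s_y = stdev(x), stdev(y)    # about 2.65 and 4.04

# R = (1/(n-1)) * (sum of the products of the z-scores)
R = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (n - 1)

print(round(R, 2))  # 0.87
```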
In spite of its importance, it should be understood that correlation does not imply causation.
There are a few mathematical details you should know to understand how we model a scatter plot that looks linear.
The graph of an equation of the form \(y=ax+b\) is a line with slope \(a\) and \(y\)-intercept \(b\). Here’s a simple example.
We typically say that \(x\) is the independent variable and that \(y\) is the dependent variable. In stat speak, these are the explanatory and response variables.
Given an equation like this, it’s easy to compute the dependent variable in terms of the independent. If \(y=2x+1\) and \(x=3\), then \(y = 2\times3+1 = 7\).
In the context of working with data, the regression line can be used to predict values for the response variable given a value of the explanatory variable that is missing or outside the range of the given data.
This scatter plot illustrates the average error in hurricane prediction vs the year in which those predictions were made. The plot reveals the fact that predictions are generally improving.
The correlation of about \(-0.83\) indicates a strong negative relationship between the variables.
The regression line \(E = -8.33 \times Y + 16870.37\) states what the relationship is. What does the model predict for our error in 2020?
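Plugging \(Y = 2020\) into the regression line answers the question directly:

```python
# Predicted average error in 2020 from the regression line E = -8.33*Y + 16870.37.
Y = 2020
E = -8.33 * Y + 16870.37
print(round(E, 2))  # 43.77
```

So the model predicts an average error of about 43.77 in 2020.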