Overview

Sometimes, two numerical variables are related. In the simplest case, we might hope that the relationship is linear; thus, a response variable \(Y\) might be related to an explanatory variable \(X\) via \[Y = a \times X + b.\] Such a model is called a linear model.

In statistics, of course, \(X\) and \(Y\) are random variables and we don’t expect the model to match the data exactly; rather, we ask how confident we are in the model given the errors that we see.

Scatter plots

A first step in understanding the type of data we deal with in regression is to try to visualize it. The basic visualization for this purpose is called a scatter plot.

Our textbook’s possums

Here’s an example, lifted straight from our textbook, that is based on a 1995 paper entitled “Morphological variation among columns of the mountain brushtail possum”, which appeared in the Australian Journal of Zoology.

The paper provides data on several morphological features of possums from down under, such as head length, total length, and tail length, measured on 104 possums. Here’s a scatter plot of head length (in mm) vs total length (in cm).

Note that (as we might expect) longer possums generally have longer heads. The correlation is a quantitative measure of this relationship that we’ll talk about in a bit.

Height vs weight

Taken from our CDC data.

Galileo’s ramp

Taken from Galileo’s Gravity and Motion Experiments.

The correlation is actually quite misleading in this example because a quadratic relationship is more appropriate.

Random illustrations

Here are a few scatter plots of randomly generated data to further illustrate the ideas.

A perfect linear relationship

A close to linear relationship

A close to linear, but negative, relationship

A nonlinear relationship

No relationship
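
The plots themselves aren’t reproduced here, but data of each type is easy to generate. Here’s a minimal sketch in Python; the particular functions and noise levels are arbitrary choices meant only to mimic the shapes described above, not the ones behind the original figures.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)

# One y-series per panel; the formulas are illustrative choices.
examples = {
    "perfect linear": 2 * x + 1,
    "close to linear": 2 * x + 1 + rng.normal(scale=0.2, size=100),
    "close to linear, negative": -2 * x + 1 + rng.normal(scale=0.2, size=100),
    "nonlinear": np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=100),
    "no relationship": rng.normal(size=100),
}

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, (title, y) in zip(axes, examples.items()):
    ax.scatter(x, y, s=10)
    ax.set_title(title)
plt.show()
```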

Quantifying the relationship

There are a few ways to quantify the relationships we see. Today, we’ll learn about correlation. We’ll talk about another technique, called regression, later.

When it comes to quantifying relationships, we’ll generally assume that we’re working with a list of data points that looks like \[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).\]

Correlation

The basic formula for the correlation of our list of data points is

\[ R = \frac{1}{n-1}\sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y} = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}z_{y_i}, \] where \(\bar{x}\), \(\bar{y}\), \(s_x\), and \(s_y\) are the sample means and standard deviations for each variable.
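
The formula translates directly into code. Here’s a quick sketch, with made-up data, that computes \(R\) from the \(z\)-scores and checks it against numpy’s built-in routine:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)

# z-scores, using the sample standard deviation (ddof=1).
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# R = (1/(n-1)) * sum of the products of the z-scores.
R = np.sum(zx * zy) / (len(x) - 1)

print(R)                        # close to +1 for this data
print(np.corrcoef(x, y)[0, 1])  # agrees with numpy's built-in
```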

Comments

  • The correlation is always between \(-1\) and \(+1\) and
    • A number close to \(+1\) indicates a strong, positive linear relationship,
    • A number close to \(-1\) indicates a strong, negative linear relationship,
    • A number close to \(0\) indicates a weak linear relationship.
  • The simplified version, with the \(z_{x_i}z_{y_i}\), emphasizes that we’re really just multiplying the \(z\)-scores together and adding the results.
    • The correlation tends to be larger when the signs of the \(z\)-scores agree.
    • The correlation tends to be smaller when the signs of the \(z\)-scores disagree.
  • The idea is illustrated in the figure below where
    • The green points contribute positively to the correlation and
    • The red points contribute negatively to the correlation
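
A figure along these lines can be generated with a few lines of Python; the data below is randomly generated, so the picture will differ in its details from the original, but the coloring rule is the point:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.7, size=200)

zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Points where the z-scores share a sign push R up (green);
# points where the signs disagree pull R down (red).
colors = np.where(zx * zy > 0, "green", "red")
plt.scatter(x, y, c=colors, alpha=0.6)
plt.axvline(x.mean(), linestyle="--", color="gray")
plt.axhline(y.mean(), linestyle="--", color="gray")
plt.show()
```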

Interpretation of \(R^2\)

\(R\) is sometimes called *the coefficient of correlation* and \(R^2\) is sometimes called *the coefficient of determination*. The interpretation is that \(R^2\) represents the proportion of the total variance that is explained by the model. This is because, after a little algebra, it turns out that: \[R^2 = \frac{\text{Variance explained by the model}}{\text{Total variance}}.\]
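
We can check this identity numerically by fitting a least-squares line and comparing the variance ratio to \(R^2\). A small sketch, again with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(size=100)

# Fit the least-squares line y ≈ a*x + b.
a, b = np.polyfit(x, y, 1)
y_hat = a * x + b

# Proportion of the total variance explained by the model.
ss_res = np.sum((y - y_hat) ** 2)     # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
explained = 1 - ss_res / ss_tot

R = np.corrcoef(x, y)[0, 1]
print(explained, R**2)  # the two agree (up to floating point)
```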

A short computation

Let’s suppose our list of \(x\) values and our list of \(y\) values are

\[ \begin{align} x &= 3,4,6,9 \\ y &= 1,6,9,10 \end{align} \] Then, their averages are \[ \begin{align} \bar{x} &= \frac{3+4+6+9}{4} = 5.5, \text{ and } \\ \bar{y} &= \frac{1+6+9+10}{4} = 6.5 \end{align} \] and their standard deviations are \[ \begin{align} s_x &= \sqrt{\frac{(3-5.5)^2+(4-5.5)^2+(6-5.5)^2+(9-5.5)^2}{3}} = 2.65, \text{ and } \\ s_y &= \sqrt{\frac{(1-6.5)^2+(6-6.5)^2+(9-6.5)^2+(10-6.5)^2}{3}} = 4.04. \end{align} \] Thus, the correlation is \[ R = \frac{1}{3}\left( \left(\frac{3-5.5}{2.65}\right)\left(\frac{1-6.5}{4.04}\right) + \left(\frac{4-5.5}{2.65}\right)\left(\frac{6-6.5}{4.04}\right) + \left(\frac{6-5.5}{2.65}\right)\left(\frac{9-6.5}{4.04}\right) + \left(\frac{9-5.5}{2.65}\right)\left(\frac{10-6.5}{4.04}\right)\right) = 0.87. \]
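
As a sanity check, here’s the same computation done in Python:

```python
import numpy as np

x = np.array([3, 4, 6, 9], dtype=float)
y = np.array([1, 6, 9, 10], dtype=float)

print(x.mean(), y.mean())            # 5.5, 6.5
print(x.std(ddof=1), y.std(ddof=1))  # ≈ 2.65, ≈ 4.04

R = np.corrcoef(x, y)[0, 1]
print(R)                             # ≈ 0.87
```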

Correlation vs causation

In spite of its importance, it should be understood that correlation does not imply causation.

A little mathematics

There are a few mathematical details you should know to understand how we model a scatter plot that looks linear.

Equations of lines

The graph of an equation of the form \(y=ax+b\) is a line with slope \(a\) and \(y\)-intercept \(b\). Here’s a simple example.

We typically say that \(x\) is the independent variable and that \(y\) is the dependent variable. In stat speak, \(x\) is the explanatory variable and \(y\) is the response variable.

Given an equation like this, it’s easy to compute the dependent variable in terms of the independent. If \(y=2x+1\) and \(x=3\), then \(y = 2\times3+1 = 7\).

Prediction

In the context of working with data, the regression line can be used to predict the value of the response variable at a value of the explanatory variable where we have no observation, whether that value lies inside the range of the data (interpolation) or beyond it (extrapolation).
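
In practice, this just means fitting the line and evaluating it at the new \(x\) value. A minimal sketch, with hypothetical data:

```python
import numpy as np

# Hypothetical observations of an explanatory and a response variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the regression line y ≈ a*x + b.
a, b = np.polyfit(x, y, 1)

# Predict the response at a value of x we never observed.
x_new = 3.5
print(a * x_new + b)
```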

Example 1 - Hurricane prediction

This scatter plot illustrates the average error in hurricane prediction vs the year in which those predictions were made. The plot reveals that predictions are generally improving.

The correlation of about \(-0.83\) indicates a strong negative relationship between the variables.

The regression line \(E = -8.33 \times Y + 16870.37\) quantifies the relationship. What does the model predict for our error in 2020?
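
Plugging \(Y = 2020\) into the line is straightforward arithmetic; here’s a quick check, using a hypothetical helper function (note that 2020 presumably lies beyond the years in the data, so this is an extrapolation and should be treated with some caution):

```python
# Regression line read off the example: E = -8.33 * Y + 16870.37.
# predicted_error is a made-up name, not from the source.
def predicted_error(year):
    return -8.33 * year + 16870.37

print(predicted_error(2020))  # ≈ 43.8
```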