Scatter plots and correlation

Scatter plots

Often, we would like to know whether two variables are related. A scatter plot is simply a plot of the data as points, one point for each \((x, y)\) pair. This simple geometric tool can help us assess relationships.

Illustrations using actual data

Hurricane prediction

This timely scatter plot, taken from our text, shows the average error in hurricane prediction against the year in which the predictions were made. The plot reveals that predictions are generally improving.

The correlation of about \(-0.83\) is a quantitative assessment of the relationship between the variables.

Height vs weight

Taken from our CDC data.

Galileo’s ramp

Taken from Galileo’s Gravity and Motion Experiments.

The correlation is actually quite misleading in this example because a quadratic relationship is more appropriate.
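To see how a correlation can look strong even when a straight line is the wrong model, here is a minimal sketch using hypothetical data (not Galileo's measurements). The values of `y` are exactly the squares of the values of `x`, and the helper `correlation` is a name introduced here for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: y is exactly x squared, so the relationship is
# perfectly quadratic rather than linear.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(round(r, 3))
```

The result is well above 0.9, yet no straight line passes through these points. A large correlation alone does not tell us that a linear description is the appropriate one.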

Random illustrations

Here are a few scatter plots of randomly generated data to further illustrate the ideas.

A perfect linear relationship

A close to linear relationship

A close to linear, but negative, relationship

A nonlinear relationship

No relationship
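The five situations listed above can be imitated with a short simulation. This is only a sketch: the slopes, noise level, sample size, and seed are assumptions chosen for illustration, not the values used to draw the plots in these notes.

```python
import random
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

random.seed(1)  # fixed seed so the run is repeatable
xs = [random.uniform(-1, 1) for _ in range(500)]
noise = [random.gauss(0, 0.2) for _ in xs]

# Each entry pairs the same x values with a different kind of y.
examples = {
    "perfect linear": [2 * x + 1 for x in xs],
    "close to linear": [2 * x + 1 + e for x, e in zip(xs, noise)],
    "close to linear, negative": [-2 * x + 1 + e for x, e in zip(xs, noise)],
    "nonlinear": [x ** 2 for x in xs],
    "no relationship": [random.gauss(0, 1) for _ in xs],
}

results = {name: correlation(xs, ys) for name, ys in examples.items()}
for name, r in results.items():
    print(f"{name:28s} {r:+.2f}")
```

In this sketch the nonlinear example is a parabola centered at zero. Its correlation comes out near zero even though \(y\) is completely determined by \(x\), which is a reminder that correlation measures only linear association.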

Quantifying the relationship

There are a few ways to quantify the relationships we see. Today, we’ll learn about correlation. We’ll talk about another technique, called regression, later.

When it comes to quantifying relationships, we’ll generally assume that we’re working with a list of data points that looks like \[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).\]

Correlation

The basic formula for the correlation of our list of data points is

\[ R = \frac{1}{n-1}\sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y} = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}z_{y_i}, \] where \(\bar{x}\), \(\bar{y}\), \(s_x\), and \(s_y\) are the sample means and standard deviations for each variable.

Comments

The correlation is always between \(-1\) and \(+1\) and
- A number close to \(+1\) indicates a strong, positive linear relationship,
- A number close to \(-1\) indicates a strong, negative linear relationship,
- A number close to \(0\) indicates a weak linear relationship.
The simplified version, with the \(z_{x_i}z_{y_i}\), emphasizes that we’re really just multiplying the \(z\)-scores together and adding the results.
- A point whose \(z\)-scores have the same sign contributes a positive product, which pushes the correlation up.
- A point whose \(z\)-scores have opposite signs contributes a negative product, which pushes the correlation down.
The idea is illustrated in the figure below where
- The green points contribute positively to the correlation and
- The red points contribute negatively to the correlation
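The point-by-point view described above can be sketched in a few lines. The data points here are hypothetical, and the labels "green" and "red" simply mirror the coloring used in the figure.

```python
from statistics import mean, stdev

# Hypothetical data points chosen for illustration.
points = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 6), (6, 5)]
xs = [x for x, _ in points]
ys = [y for _, y in points]
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)

contributions = []
for x, y in points:
    z_x = (x - x_bar) / s_x
    z_y = (y - y_bar) / s_y
    product = z_x * z_y
    if product > 0:
        color = "green"    # z-scores agree in sign
    elif product < 0:
        color = "red"      # z-scores disagree in sign
    else:
        color = "neutral"  # a z-score of zero contributes nothing
    contributions.append((x, y, product, color))

# The correlation is the sum of the products divided by n - 1.
r = sum(product for _, _, product, _ in contributions) / (len(points) - 1)

for x, y, product, color in contributions:
    print(f"({x}, {y})  z_x * z_y = {product:+.3f}  {color}")
print(f"R = {r:.2f}")
```

Four of the six points are labeled green and two red, so the positive products outweigh the negative ones and the correlation comes out positive.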

A short computation

Let’s suppose our list of \(x\) values and our list of \(y\) values are

\[ \begin{align} x &= 3,4,6,9 \\ y &= 1,6,9,10 \end{align} \]

Then, their averages are

\[ \begin{align} \bar{x} &= \frac{3+4+6+9}{4} = 5.5, \text{ and } \\ \bar{y} &= \frac{1+6+9+10}{4} = 6.5 \end{align} \]

and their standard deviations are

\[ \begin{align} s_x &= \sqrt{\frac{(3-5.5)^2+(4-5.5)^2+(6-5.5)^2+(9-5.5)^2}{3}} \approx 2.65, \text{ and } \\ s_y &= \sqrt{\frac{(1-6.5)^2+(6-6.5)^2+(9-6.5)^2+(10-6.5)^2}{3}} \approx 4.04. \end{align} \]

Thus, the correlation is

\[ R = \frac{1}{3}\left( \left(\frac{3-5.5}{2.65}\right)\left(\frac{1-6.5}{4.04}\right) + \left(\frac{4-5.5}{2.65}\right)\left(\frac{6-6.5}{4.04}\right) + \left(\frac{6-5.5}{2.65}\right)\left(\frac{9-6.5}{4.04}\right) + \left(\frac{9-5.5}{2.65}\right)\left(\frac{10-6.5}{4.04}\right)\right) \approx 0.87. \]
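The arithmetic above can be checked with a few lines of Python. This sketch uses only the standard library; `statistics.stdev` computes the sample standard deviation, dividing by \(n-1\), which matches the formula used in these notes.

```python
from statistics import mean, stdev

x = [3, 4, 6, 9]
y = [1, 6, 9, 10]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations
n = len(x)

# Sum the products of the z-scores and divide by n - 1.
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (n - 1)

print(x_bar, y_bar)                  # 5.5 6.5
print(round(s_x, 2), round(s_y, 2))  # 2.65 4.04
print(round(r, 2))                   # 0.87
```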

Correlation vs causation

Despite its importance, correlation does not imply causation.

Lurking variables

A lurking or confounding variable is an unspecified variable that might be related to the variables under study. In both of the following examples, scatter plots and correlation computations show a strong relationship between the variables:

- Number of smartphones in a country and overall public health
- Public funding of education in a school district and exam scores

Are there any lurking variables?
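One way to build intuition is to simulate a lurking variable. In the sketch below, a hidden variable `z` drives both `x` and `y`, and neither of the two depends directly on the other. The interpretation of `z` as something like national wealth is an assumption made for illustration, as are the noise levels and the seed.

```python
import random
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((a - x_bar) / s_x * (b - y_bar) / s_y
               for a, b in zip(xs, ys)) / (n - 1)

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical lurking variable (for example, national wealth).
z = [random.gauss(0, 1) for _ in range(500)]

# x and y each depend on z plus independent noise; x never appears in
# the formula for y, and y never appears in the formula for x.
x = [zi + random.gauss(0, 0.5) for zi in z]  # e.g. smartphones per person
y = [zi + random.gauss(0, 0.5) for zi in z]  # e.g. a public health index

r = correlation(x, y)
print(round(r, 2))
```

The correlation between `x` and `y` is strong even though, by construction, changing one would have no effect on the other.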

Explanatory and response variables

If a researcher suspects genuine causation, they might designate one variable as explanatory and one as response. Let’s try to identify the natural explanatory and response variables in the following situations:

- Number of hours studying a week and GPA
- GPA and number of hours studying a week
- Light level and reading comprehension
- Per capita income and percentage of population with college degrees