Often, we would like to know whether two variables are related. A scatter plot is simply a plot of point pairs. This simple geometric tool can help us assess relationships.
This timely scatter plot, taken from our text, shows the average error in hurricane prediction versus the year in which those predictions were made. The plot reveals that predictions are generally improving.
The correlation of about \(-0.83\) is a quantitative assessment of the relationship between the variables.
Taken from our CDC data.
Taken from Galileo’s Gravity and Motion Experiments.
The correlation is actually quite misleading in this example because a quadratic relationship is more appropriate.
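To see how a single correlation number can hide curvature, here is a minimal Python sketch. The data are made up for illustration (they are not Galileo's measurements), and the helper name `pearson_r` is hypothetical. It computes the usual sample correlation for points that lie exactly on a parabola.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation of the paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is exactly x squared, so the points lie on a parabola.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(f"correlation = {r:.3f}")
```

The correlation here comes out close to 1 (roughly 0.98), which suggests a strong linear trend even though the relationship is exactly quadratic. Looking at the scatter plot, not just the number, is what reveals the curve.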
Here are a few scatter plots of randomly generated data to further illustrate the ideas.
There are a few ways to quantify the relationships we see. Today, we’ll learn about correlation. We’ll talk about another technique, called regression, later.
When it comes to quantifying relationships, we’ll generally assume that we’re working with a list of data points that looks like \[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).\]
The basic formula for the correlation of our list of data points is
\[ R = \frac{1}{n-1}\sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y} = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}z_{y_i}, \] where \(\bar{x}\), \(\bar{y}\), \(s_x\), and \(s_y\) are the sample means and standard deviations for each variable.
Let’s suppose our list of \(x\) values and our list of \(y\) values are
\[ \begin{align} x &= 3,4,6,9 \\ y &= 1,6,9,10 \end{align} \] Then, their averages are \[ \begin{align} \bar{x} &= \frac{3+4+6+9}{4} = 5.5, \text{ and } \\ \bar{y} &= \frac{1+6+9+10}{4} = 6.5 \end{align} \] and their standard deviations are \[ \begin{align} s_x &= \sqrt{\frac{(3-5.5)^2+(4-5.5)^2+(6-5.5)^2+(9-5.5)^2}{3}} \approx 2.65, \text{ and } \\ s_y &= \sqrt{\frac{(1-6.5)^2+(6-6.5)^2+(9-6.5)^2+(10-6.5)^2}{3}} \approx 4.04. \end{align} \] Thus, the correlation is \[ R = \frac{1}{3}\left( \left(\frac{3-5.5}{2.65}\right)\left(\frac{1-6.5}{4.04}\right) + \left(\frac{4-5.5}{2.65}\right)\left(\frac{6-6.5}{4.04}\right) + \left(\frac{6-5.5}{2.65}\right)\left(\frac{9-6.5}{4.04}\right) + \left(\frac{9-5.5}{2.65}\right)\left(\frac{10-6.5}{4.04}\right)\right) \approx 0.87. \]
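As a check on the arithmetic, the computation can be sketched in Python using only the standard library. The function name `correlation` is our own choice for illustration. It follows the formula for \(R\) directly: standardize each value, multiply the paired \(z\)-scores, and divide the sum by \(n-1\).

```python
import statistics

def correlation(xs, ys):
    """Correlation R computed as the sum of products of z-scores over n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

x = [3, 4, 6, 9]
y = [1, 6, 9, 10]
print(round(correlation(x, y), 2))  # prints 0.87
```

Because the code carries full precision for \(s_x\) and \(s_y\) and does not round them to two decimals, the unrounded result can differ slightly from a hand computation, but it agrees to two decimal places here.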
In spite of its importance, it should be understood that correlation does not imply causation.
A lurking or confounding variable is an unspecified variable that might be related to the variables under study. In both of the following examples, scatter plots and correlation computations show a strong relationship between the variables:
Are there any lurking variables?
If a researcher suspects genuine causation, they might designate one variable as explanatory and one as response. Let’s try to identify the natural explanatory and response variables in the following situations: