Sometimes, two numerical variables are related. In the simplest case, we might hope that the relationship is linear; thus, a response variable \(Y\) might be related to an explanatory variable \(X\) via \[Y = a \times X + b.\] Such a model is called a *linear model*.

In statistics, of course, \(X\) and \(Y\) are random variables and we don’t expect the model to match the data exactly; rather, we ask how confident we are in the model given the errors that we see.

A first step in understanding the type of data we deal with in regression is to try to visualize it. The basic visualization for this purpose is called a *scatter plot*.

Here’s an example lifted straight from our textbook, which is based on a 1995 paper entitled “Morphological variation among populations of the mountain brushtail possum” that appeared in the *Australian Journal of Zoology*.

The paper provides data on several morphological features of possums from down under, such as head length, total length, and tail length. These measurements were taken on 104 possums. Here’s a scatter plot of head length (in mm) vs total length (in cm).

Note that (as we might expect) longer possums generally have longer heads. The correlation is a quantitative measure of this relationship that we’ll talk about in a bit.

Taken from our CDC data.