Very commonly, we wish to compare two data sets. To do so, we’ll apply the same hypothesis test that we’ve used before but with a mean and standard deviation specially chosen to match the problem at hand. Depending upon the degrees of freedom, we might choose to compute the \(p\)-value using a \(t\)-distribution or a normal distribution.

A concrete example

As an example, we can compare the running speeds of two age groups as I happen to have the results from the 2015 Peachtree Road Race stored on my webspace. Suppose, for example, that we wish to explore the question: Are men between the ages of 35 and 40 generally faster than men between the ages of 45 and 50?

To answer this question, we'll read in our entire data set and grab a random sample of size 100 from each group of interest. While we're at it, let's compute the mean of each group.

# Read the full 2015 Peachtree Road Race results and subset the two groups
df = read.csv('https://www.marksmath.org/data/peach_tree2015.csv')
men = subset(df, Gender == 'M') # note the double == for comparison
young = subset(men, 35 <= Age & Age < 40)
old = subset(men, 45 <= Age & Age < 50)

set.seed(1) # For reproducibility
young_times = sample(young$Net.Time, 100)
old_times = sample(old$Net.Time, 100)

mu_young = mean(young_times)
mu_old = mean(old_times)
c(mu_young,mu_old)
## [1] 73.41334 78.38337

As expected, the average of the younger times is less than the average of the older times. What can we infer, though, about the general population from this small sample?

Comparing two means with the t-test

The fabulous t.test command provides the ability to compare two means directly. If we have two groups with means \(\mu_1\) and \(\mu_2\), the logical hypothesis statements are something like

  • \(H_0: \mu_1 = \mu_2\)
  • \(H_A: \mu_1 \neq \mu_2\).

In our case, if \(\mu_1\) represents the average of the younger times and \(\mu_2\) represents the average of the older times, we might be more interested in the one-sided hypotheses

  • \(H_0: \mu_1 = \mu_2\)
  • \(H_A: \mu_1 < \mu_2\).

We can run this like so:

t.test(young_times, old_times, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  young_times and old_times
## t = -1.7147, df = 191.39, p-value = 0.04401
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.1792582
## sample estimates:
## mean of x mean of y 
##  73.41334  78.38337

The p-value of 0.04401 is less than 0.05; thus, at the 95% confidence level, we reject the null hypothesis and conclude that there is a genuine difference.

Behind the computation

The ideas behind the computations that produce the above \(t\)-test are quite similar to the hypothesis tests we’ve done before. The major new ingredients are formulae to combine the measures of the two original datasets into one. To do so, suppose that

  • \(D_1\) and \(D_2\) are two sets of sample observations,
  • with \(n_1\) and \(n_2\) elements respectively,
  • the mean of \(D_1\) is \(\bar{x}_1\) and the mean of \(D_2\) is \(\bar{x}_2\), and
  • the standard deviation of \(D_1\) is \(\sigma_1\) and the standard deviation of \(D_2\) is \(\sigma_2\).

Then, we analyze the difference of the two means using a \(t\)-test with

  • Mean \(\bar{x}_1 - \bar{x}_2\),
  • Standard error \[\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\]
  • and we use the minimum of \(n_1-1\) and \(n_2-1\) as the degrees of freedom.

In this setup, the expression \[\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\] is often called the test statistic.
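To see these formulae in action, here's a sketch that recomputes the running-time test by hand, using the sample objects from before. The test statistic agrees with the t.test output above; only the simpler degrees-of-freedom rule differs from Welch's method, so the \(p\)-value comes out slightly different.

n1 = length(young_times)  # 100
n2 = length(old_times)    # 100
se = sqrt(sd(young_times)^2/n1 + sd(old_times)^2/n2)
t_stat = (mean(young_times) - mean(old_times))/se
t_stat  # matches the -1.7147 reported by t.test
pt(t_stat, min(n1 - 1, n2 - 1))  # one-sided p-value with min(n1-1, n2-1) = 99 degrees of freedom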

We should point out that these same techniques work for large sample sizes; indeed, when the samples are so large that the degrees of freedom exceed 30, we can simply use the normal distribution rather than the \(t\)-distribution. This works because the normal distribution is a limit of \(t\)-distributions. Thus, the t.test command works regardless of sample size.
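To illustrate, we can compare the two tail probabilities directly for the test above; at nearly 200 degrees of freedom, the \(t\)-distribution and the normal distribution are almost indistinguishable:

# Tail probability of t = -1.7147 under the t-distribution with the
# Welch degrees of freedom versus under the standard normal
c(pt(-1.7147, 191.39), pnorm(-1.7147))
# both values are about 0.04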

Example

Sam thinks that there is a difference in quality of life between rural and urban living. He collects information from obituaries in newspapers from urban and rural towns in Idaho to see if there is a difference in life expectancy. A sample of 4 people from rural towns gives a life expectancy of \(\bar{x}_r = 72\) years with a standard deviation of \(\sigma_r=6.99\) years. A sample of 6 people from larger towns gives \(\bar{x}_u=81.9\) years and \(\sigma_u=5.64\) years. Does this provide evidence that people living in rural Idaho communities have a shorter life expectancy than those in more urban communities to a \(95\%\) level of confidence?

Solution: Our hypothesis test looks like

  • \(H_0: \mu_u = \mu_r\)
  • \(H_A: \mu_u \neq \mu_r.\)

There are three degrees of freedom and the test statistic is

\[\frac{\bar{x}_r - \bar{x}_u}{\sqrt{\frac{\sigma_r^2}{n_r} + \frac{\sigma_u^2}{n_u}}} = \frac{72-81.9}{\sqrt{\frac{6.99^2}{4}+\frac{5.64^2}{6}}} = -2.36543.\]
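If you'd like to check the arithmetic, a quick computation in R:

(72 - 81.9)/sqrt(6.99^2/4 + 5.64^2/6)  # about -2.36543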

We have a two-sided alternative hypothesis; thus, we compute the \(p\)-value via

2*pt(-2.36543, 3)
## [1] 0.09891221

Since this \(p\)-value is larger than \(0.05\), we cannot reject the null hypothesis.

Alternatively, we could use a table. If we look at our \(t\)-table, we see something that looks like so:

           one tail   0.100   0.050   0.025   0.010   0.005
           two tails  0.200   0.100   0.050   0.020   0.010
df   1                 3.08    6.31   12.71   31.82   63.66
     2                 1.89    2.92    4.30    6.96    9.92
     3                 1.64    2.35    3.18    4.54    5.84
     4                 1.53    2.13    2.78    3.75    4.60

The entries in this table are critical \(t^*\) values. Each column corresponds to a common choice of significance level, labeled both as a one-tailed and as a two-tailed probability, and the rows correspond to degrees of freedom. Thus, you can figure out where a given \(t\)-score lies relative to your critical value.

Look in the row where \(df=3\). As we move from left to right along this row, the corresponding \(p\)-values decrease. We are interested in the column where the two-tailed probability equals 0.05. The corresponding \(t^*\) value in our row and column is 3.18. Since the absolute value of our \(t\)-score, 2.365, is less than that, the \(p\)-value must be larger than \(0.05\); thus, we (again) fail to reject the null hypothesis.
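Rather than reading a table, we could also ask R for the critical value directly; qt is the inverse of pt, so the two-tailed critical value at the 95% level with 3 degrees of freedom is

qt(0.975, 3)  # about 3.18, in agreement with the table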

Comparing two sample proportions

The ideas behind the comparison of two sample proportions are very similar to the ideas behind the comparison of two sample means. We've just got to figure out the correct formulation and parameters to use in our \(t\)-test.

A problem

Let’s illustrate the ideas in the context of a problem. It’s widely believed that Trump’s support among men is stronger than his support among women. Let’s use some data to test this.

According to a recent Reuters poll, Trump's approval rating stands at 40%, but there appears to be a difference between the views of men and the views of women. Among the 1009 men surveyed, 44% approve of Trump; among the 1266 women surveyed, only 36% approve.

Does this data support our conjecture that Trump’s support among men is higher than that among women to a 95% level of confidence?

Solution: Let’s first clearly state our hypotheses. Let’s suppose that \(p_m\) represents the proportion of men who support Trump and \(p_w\) represent the proportion of women who support Trump. Our hypothesis test can be written

  • \(H_0: p_m = p_w\) or, put another way, \(p_m-p_w = 0\)
  • \(H_A: p_m > p_w\) or, put another way, \(p_m-p_w > 0\).

The point behind the reformulation to compare with zero is that it gives us just one number that we can apply a standard \(t\)-test to. Now, we have measured proportions of \(\hat{p}_m = 0.44\) and \(\hat{p}_w = 0.36\). Thus, we want to run our test with \[\hat{p} = \hat{p}_m - \hat{p}_w = 0.44 - 0.36 = 0.08.\] We want just one standard error as well, which we get by adding the variances in the two samples. That is, \[SE = \sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}} = \sqrt{\frac{0.44 \times 0.56}{1009} + \frac{0.36 \times 0.64}{1266}} \approx 0.02064.\]

Of course, I computed this with R:

pm = 0.44
pw = 0.36
se = sqrt( pm*(1-pm)/1009 + pw*(1-pw)/1266 )
se
## [1] 0.02064444

We can now compute our test statistic \[T = \frac{\hat{p}_m - \hat{p}_w}{SE}\] via

t = 0.08/se
t
## [1] 3.875136
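Since the sample sizes are quite large, we can treat this test statistic as approximately standard normal and compute a one-sided \(p\)-value, reusing the t computed above:

pnorm(t, lower.tail = FALSE)  # roughly 5e-05, far smaller than 0.05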

With this very large test statistic and correspondingly tiny \(p\)-value, we can reject the null hypothesis and conclude with confidence that Trump's support among men is higher than his support among women.

Alternative formulae

We have learned the most basic version of Student's \(t\)-test. There are a number of variations that involve slightly different formulae. R's t.test command, for example, uses Welch's \(t\)-test, which has a much more complicated formula for the degrees of freedom parameter.
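For reference, Welch's degrees-of-freedom formula, in the notation from before, is \[\nu = \frac{\left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)^2}{\frac{\left(\sigma_1^2/n_1\right)^2}{n_1-1} + \frac{\left(\sigma_2^2/n_2\right)^2}{n_2-1}}.\] For our running data, this works out to the 191.39 degrees of freedom reported in the t.test output above.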

Importantly, MyOpenMath uses the following formula for computing standard error for proportions:

\[SE = \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}},\] where \[\bar{p} = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1+n_2}.\]

Thus, \(\bar{p}\) is simply a weighted average of the two measured proportions \(\hat{p}_1\) and \(\hat{p}_2\). In the Trump approval poll example, we’d get:

\[\bar{p} = \frac{0.44 \times 1009 + 0.36 \times 1266}{1009+1266} = 0.395481\] so that \[SE = \sqrt{\frac{0.395\times0.605}{1009} + \frac{0.395\times0.605}{1266}}=0.02063.\]
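A quick R check of this alternative computation, reusing the numbers from the poll:

p_bar = (0.44*1009 + 0.36*1266)/(1009 + 1266)  # about 0.3955
sqrt( p_bar*(1 - p_bar)/1009 + p_bar*(1 - p_bar)/1266 )  # about 0.02063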

Note that our prior computation of the standard error yielded \(0.02064\), so the choice between the two formulae rarely makes much difference.