# Midterm review

Next Wednesday is the official middle of the term, so next Monday we'll have an eight-to-ten-problem MyOpenMath assessment before I figure out midterm grades. This week, our main objective is to get you ready!

## Data

We started the semester looking at data. You should understand the difference between

• Numerical data and
• Categorical data.

In fact, we stated that one simple characterization of statistics is

> the study of how to collect, analyze, and draw conclusions from data.

### Computations

Jumping ahead a bit, we've more recently been doing computations with data. You should understand the distinction between

• A mean for numerical data and
• A proportion for categorical data,

as well as the concept of standard deviation for both of those. We'll review some of that a bit more in depth later.
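To make the distinction concrete, here is a minimal sketch using Python's standard `statistics` module. The heights and handedness values below are made up purely for illustration; they are not our class data.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical numerical data: heights in feet (made up for illustration)
heights = [5.5, 6.0, 5.75, 5.25, 6.0]
xbar = mean(heights)   # sample mean
s = stdev(heights)     # sample standard deviation

# Hypothetical categorical data: handedness (made up for illustration)
hands = ["right", "left", "right", "right", "right"]
phat = hands.count("left") / len(hands)   # sample proportion of "left"
sd_p = sqrt(phat * (1 - phat))            # standard deviation that goes with a proportion

print(xbar, s, phat, sd_p)
```

Note that the numerical data is summarized by averaging the values themselves, while the categorical data is summarized by counting how often one category occurs.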

### Studies

Fundamentally, data comes from studies. You should understand the difference between

• Experiments and
• Observational studies

### Tables

Data is often presented in a table. You should be able to look at a table and identify types of variables.

import pandas as pd
df = pd.read_csv('https://marksmath.org/data/100MeterTimes.csv')
df.tail(6)

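| | year | place | athlete | Country | time |
|---|---|---|---|---|---|
| 39 | 2013 | 1 | Usain Bolt | Jamaica | 9.77 |
| 40 | 2013 | 2 | Justin Gatlin | USA | 9.85 |
| 41 | 2013 | 3 | Nesta Carter | Jamaica | 9.95 |
| 42 | 2015 | 1 | Usain Bolt | Jamaica | 9.79 |
| 43 | 2015 | 2 | Justin Gatlin | USA | 9.80 |
| 44 | 2015 | 3 | Trayvon Bromell | USA | 9.92 |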
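One way to get a first guess at variable types is to ask pandas how each column is stored. The sketch below assumes pandas is installed and rebuilds a few rows of the table by hand (the name `df_small` is just for this illustration) so that it runs without downloading anything.

```python
import pandas as pd

# A few rows copied from the table above, so the sketch runs offline
df_small = pd.DataFrame({
    "year": [2013, 2013, 2015],
    "place": [1, 2, 1],
    "athlete": ["Usain Bolt", "Justin Gatlin", "Usain Bolt"],
    "Country": ["Jamaica", "USA", "Jamaica"],
    "time": [9.77, 9.85, 9.79],
})

# Split the columns into those stored as numbers and everything else
numeric_cols = [c for c in df_small.columns
                if pd.api.types.is_numeric_dtype(df_small[c])]
other_cols = [c for c in df_small.columns if c not in numeric_cols]

print(numeric_cols)
print(other_cols)
```

The storage type is only a starting point. For example, `place` is stored as an integer, but it behaves more like an ordered category than a measurement, so you still need to think about what each column means.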

## The normal distribution

We model data with distributions - the most important of which is certainly the normal distribution.

You should be able to do fundamental computations with the normal distribution - not only because this is useful in and of itself but also because, as we've seen, confidence intervals and hypothesis tests are built on top of these ideas.

### Example

Suppose that scores on an exam are normally distributed with a mean of 70 and a standard deviation of 8. What percentage of scores are less than 80?

Solution: The $Z$-score for 80 is

Z = (80-70)/8
Z

1.25

Looking this up in a normal table we find that just over $89\%$ of scores are less than 80.
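If you'd like to check the table lookup by machine, here is a small sketch using `NormalDist` from Python's standard `statistics` module. This is just one option, not necessarily the tool we use in class.

```python
from statistics import NormalDist

# Exam scores: normal with mean 70 and standard deviation 8
exam = NormalDist(mu=70, sigma=8)
below_80 = exam.cdf(80)          # proportion of scores below 80

# Same computation via the Z-score and the standard normal
z = (80 - 70) / 8
same = NormalDist().cdf(z)

print(round(below_80, 4))
```

Both routes give the same proportion, which is the point of standardizing with a $Z$-score.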

### Sampling distributions

You should understand that the normal distribution arises when aggregation operations (such as a sum or mean) are applied to a sampling process - regardless of whether the underlying data is normally distributed or not.

You should understand that the normal distribution arises as the result of a limiting process so that the approximation is only valid for a large enough sample size.

You should also be able to compute means and standard deviations that arise from this process. Note that a standard deviation that arises from a sampling process is often called the standard error.
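A quick simulation can illustrate these ideas. The sketch below repeatedly samples from a strongly skewed (exponential) distribution and compares the spread of the sample means with the predicted standard error $\sigma/\sqrt{n}$. The seed, sample size, and number of trials are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)   # arbitrary seed, only for reproducibility

n = 50           # sample size (chosen for illustration)
trials = 2000    # number of repeated samples (chosen for illustration)

# Exponential data with rate 1 is skewed; its mean and standard deviation are both 1
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

observed_se = stdev(sample_means)   # spread of the simulated sample means
predicted_se = 1 / sqrt(n)          # sigma / sqrt(n) with sigma = 1

print(observed_se, predicted_se)
```

The two numbers should be close, and a histogram of `sample_means` should look roughly bell shaped even though the underlying data is not.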

### An example with numerical data

The 39 responses to our beginning of class survey yielded an average height of 5.83 feet with a standard deviation of 0.398 feet.

• Use this sample to write down a $90\%$ confidence interval for average height.
• Can you think of any reasons that our sample might be suspect?

#### Solution

Recall that confidence intervals for means have the form

$$[\bar{x}-z^* \sigma/\sqrt{n}, \bar{x}+z^* \sigma/\sqrt{n}].$$

We just need to identify each term. Most of the terms are given in the problem:

• $\bar{x} = 5.83$
• $\sigma = 0.398$
• $n=39$

The remaining term is the $z^*$ multiplier, which we can read off of our normal table or find with our new calculator page: $$z^* \approx 1.6449.$$ Thus, our interval is approximately [5.7252, 5.9348].
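As a check on the arithmetic, the same interval can be sketched in Python. This uses `NormalDist.inv_cdf` from the standard library to find $z^*$, which is an alternative to the table or calculator page. The result should agree with the interval above to about three decimal places.

```python
from math import sqrt
from statistics import NormalDist

xbar, sigma, n = 5.83, 0.398, 39   # values given in the problem
level = 0.90                       # confidence level

# A 90% interval leaves 5% in each tail, so z* is the 95th percentile
z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
se = sigma / sqrt(n)               # standard error of the mean

lower = xbar - z_star * se
upper = xbar + z_star * se
print(round(z_star, 4), round(lower, 4), round(upper, 4))
```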

### An example with categorical data

Of the 39 responses to our beginning of class survey, 2 folks were left handed.

• Use this sample to write down a $98\%$ confidence interval for the proportion of people who are left handed.
• Can you think of any reasons that our sample might be suspect?

#### Solution

Confidence intervals still have the same form but the way we compute the standard error is different. We have

$$\left[\hat{p}-z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right],$$

where

• $\hat{p} = 2/39 \approx 0.051282$
• $n=39$
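• $z^* \approx 2.3263$ from our new calculator.

Thus, we get the following interval, where the second form cuts the interval off at 0 because a proportion can't be negative:

$$[-0.0309, 0.1334] \text{ or } [0,0.1334].$$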
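Here is a comparable sketch for the proportion interval, again using the standard library's `NormalDist` as a stand-in for the calculator. The result should land close to the interval above; small differences in the last digit come from how $z^*$ is rounded.

```python
from math import sqrt
from statistics import NormalDist

n = 39
phat = 2 / n        # sample proportion of left-handed respondents
level = 0.98        # confidence level

# A 98% interval leaves 1% in each tail, so z* is the 99th percentile
z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
se = sqrt(phat * (1 - phat) / n)   # standard error for a proportion

lower = phat - z_star * se
upper = phat + z_star * se
lower_clipped = max(0.0, lower)    # a proportion can't be negative

print(round(lower, 4), round(upper, 4), lower_clipped)
```

The negative lower endpoint is a hint that the normal approximation is being stretched here: with only 2 left-handed respondents, the sample is quite small for this method.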

## Hypothesis tests

Finally, we've discussed hypothesis tests. Since it's so recent, let's just jump into an example:

Suppose we wish to test the claim that the proportion of men who own cats is larger than 25% at the 0.005 significance level based on a sample of 85 men, 35 of whom were cat owners.

First off, this is a one-sided hypothesis test that could be written

• $H_0: p=0.25$
• $H_A: p>0.25.$

### Computations

The $Z$-score can always be computed with a calculator. Of course, you can do it with Python as well:

from numpy import sqrt
phat = 35/85
p = 0.25
n = 85
se = sqrt(p*(1-p)/n)
(phat-p)/se

3.444233600968322

This is a really big $Z$-score so we'd certainly reject the null. The online HW wants a $p$-value to 4 decimal places and our new calculator tells us $$P(Z<3.4442336) = 0.999713659.$$ Thus our $p$-value is $1-0.999713659$ or $0.0003$ to four decimal places. Since $0.0003 < 0.005$, we reject the null hypothesis at the 0.005 significance level.
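The whole test can also be sketched end to end in Python. This version uses the standard library's `NormalDist` for the tail probability, as a stand-in for the calculator. The names `p0` and `alpha` are just labels chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

n = 85
phat = 35 / n     # observed sample proportion
p0 = 0.25         # proportion assumed under the null hypothesis
alpha = 0.005     # significance level

se = sqrt(p0 * (1 - p0) / n)        # standard error computed under the null
z = (phat - p0) / se                # test statistic
p_value = 1 - NormalDist().cdf(z)   # one-sided: area to the right of z

reject = p_value < alpha
print(round(z, 4), round(p_value, 4), reject)
```

Because the alternative is $p>0.25$, the $p$-value is the area to the right of the $Z$-score; a left-tailed or two-sided alternative would use a different tail computation.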