Midterm review¶

Next Wednesday is the official middle of the term, so next Monday we'll have an eight-to-ten-problem MyOpenMath assessment before I figure out midterm grades. This week, our main objective is to get you ready!

Data¶

We started the semester looking at data. You should understand the difference between

Numerical data and
Categorical data.

In fact, we stated that one simple characterization of statistics is

the study of how to collect, analyze, and draw conclusions from data.

Computations¶

Jumping ahead a bit, we've more recently been doing computations with data. You should understand the distinction between

A mean for numerical data and
a proportion for categorical data,

as well as the concept of standard deviation for both of those. We'll review some of that a bit more in depth later.

Studies¶

Fundamentally, data comes from studies. You should understand the difference between

Experiments and
Observational studies

Tables¶

Data is often presented in a table. You should be able to look at a table and identify types of variables.

import pandas as pd
df = pd.read_csv('https://marksmath.org/data/100MeterTimes.csv')
df.tail(6)

	year	place	athlete	Country	time
39	2013	1	Usain Bolt	Jamaica	9.77
40	2013	2	Justin Gatlin	USA	9.85
41	2013	3	Nesta Carter	Jamaica	9.95
42	2015	1	Usain Bolt	Jamaica	9.79
43	2015	2	Justin Gatlin	USA	9.80
44	2015	3	Trayvon Bromell	USA	9.92
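As a rough sketch of how you might check variable types in code, the block below rebuilds just the six rows shown above by hand (so it does not need the CSV download) and asks pandas which columns are stored as numbers. The name `stored_as_number` is made up for this illustration. It also computes one summary of each kind: a mean for a numerical column and a proportion for a categorical one.

```python
import pandas as pd

# The six rows displayed above, typed in by hand for illustration.
df = pd.DataFrame({
    "year": [2013, 2013, 2013, 2015, 2015, 2015],
    "place": [1, 2, 3, 1, 2, 3],
    "athlete": ["Usain Bolt", "Justin Gatlin", "Nesta Carter",
                "Usain Bolt", "Justin Gatlin", "Trayvon Bromell"],
    "Country": ["Jamaica", "USA", "Jamaica", "Jamaica", "USA", "USA"],
    "time": [9.77, 9.85, 9.95, 9.79, 9.80, 9.92],
})

# Which columns does pandas store as numbers?
stored_as_number = {
    col: pd.api.types.is_numeric_dtype(df[col]) for col in df.columns
}
print(stored_as_number)

# A mean summarizes a numerical variable ...
mean_time = df["time"].mean()
# ... while a proportion summarizes a categorical variable.
prop_usa = (df["Country"] == "USA").mean()
print(round(mean_time, 4), prop_usa)
```

Keep in mind that how a column is stored is not the whole story: `place` is stored as a number, but you would usually treat it as an ordered category rather than something to average.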

The normal distribution¶

We model data with distributions - the most important of which is certainly the normal distribution.

You should be able to do fundamental computations with the normal distribution - not only because this is useful in and of itself but also because, as we've seen, confidence intervals and hypothesis tests are built on top of these ideas.

Example¶

Suppose that scores on an exam are normally distributed with a mean of 70 and a standard deviation of 8. What percentage of scores are less than 80?

Solution: The $Z$-score for 80 is

Z = (80-70)/8
Z

1.25

Looking this up in a normal table, we find that just over $89\%$ (about $89.4\%$) of scores are less than 80.
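If you'd like to double check the table lookup, here is a minimal sketch using `NormalDist` from Python's standard `statistics` module. This is just one way to do it and is not the calculator page we use in class; the variable names are chosen for this illustration.

```python
from statistics import NormalDist

# Exam scores: normal with mean 70 and standard deviation 8.
exam = NormalDist(mu=70, sigma=8)

# Option 1: ask for the area to the left of 80 directly.
p_direct = exam.cdf(80)

# Option 2: convert to a Z-score and use the standard normal,
# which is what the normal table does.
z = (80 - 70) / 8
p_standard = NormalDist().cdf(z)

print(z, round(p_direct, 4), round(p_standard, 4))
```

Both approaches give the same area, which is the whole point of standardizing.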

Sampling distributions¶

You should understand that the normal distribution arises when aggregation operations (such as a sum or mean) are applied to a sampling process - regardless of whether the underlying data is normally distributed or not.

You should understand that the normal distribution arises as the result of a limiting process so that the approximation is only valid for a large enough sample size.

You should also be able to compute means and standard deviations that arise from this process. Note that a standard deviation that arises from a sampling process is often called the standard error.
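The following simulation is a small sketch of these ideas, not something from our class materials. It assumes a population that is far from normal (an exponential distribution with mean 1 and standard deviation 1, chosen just for illustration), draws many samples of size 50, and compares the spread of the sample means to the standard error $\sigma/\sqrt{n}$.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 50                   # size of each sample
num_samples = 2000       # how many samples we draw

# Each entry is the mean of one sample of size n drawn from a
# skewed exponential population (mean 1, standard deviation 1).
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

center = mean(sample_means)        # should be close to the population mean, 1
observed_se = stdev(sample_means)  # spread of the sample means
theoretical_se = 1 / sqrt(n)       # sigma / sqrt(n) with sigma = 1

print(round(center, 3), round(observed_se, 3), round(theoretical_se, 3))
```

A histogram of `sample_means` should look roughly bell shaped even though the underlying data is strongly skewed.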

An example with numerical data¶

The 39 responses to our beginning of class survey yielded an average height of 5.83 feet with a standard deviation of 0.398 feet.

Use this sample to write down a $90\%$ confidence interval for average height.
Can you think of any reasons that our sample might be suspect?

Solution¶

Recall that confidence intervals for means have the form

$$[\bar{x}-z^* \sigma/\sqrt{n}, \bar{x}+z^* \sigma/\sqrt{n}].$$

We just need to identify each term. Most of the terms are given in the problem:

$\bar{x} = 5.83$
$\sigma = 0.398$
$n=39$

We can read the last term off of our normal table to find the $z^*$ multiplier or we can use our new calculator page to find that $$z^* \approx 1.64342123.$$ Thus, our interval is [5.72526, 5.9347].
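The same computation can be sketched in Python using the standard library's `NormalDist` in place of the table or calculator page. The multiplier computed this way is about 1.645, which differs slightly from the calculator value quoted above, so the endpoints may disagree in the fourth decimal place. The variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

xbar = 5.83    # sample mean height in feet
s = 0.398      # sample standard deviation in feet
n = 39         # number of survey responses
level = 0.90   # confidence level

# For a 90% interval we leave 5% in each tail, so we need the
# point with 95% of the standard normal area to its left.
z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)

se = s / sqrt(n)   # standard error of the mean
lower = xbar - z_star * se
upper = xbar + z_star * se

print(round(z_star, 4), round(lower, 4), round(upper, 4))
```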

An example with categorical data¶

Of the 39 responses to our beginning-of-class survey, 2 folks were left handed.

Use this sample to write down a $98\%$ confidence interval for the proportion of people who are left handed.
Can you think of any reasons that our sample might be suspect?

Solution¶

Confidence intervals still have the same form but the way we compute the standard error is different. We have

$$\left[\hat{p}-z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right],$$

where

$\hat{p} = 2/39 \approx 0.051282$
$n=39$
$z^* = 2.3213$ from our new calculator.

Thus, we get

$$[-0.0307, 0.1333] \text{ or } [0,0.1333].$$
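Here is a sketch of the same computation in Python, again using the standard library's `NormalDist` rather than our calculator page. The multiplier it produces is about 2.326, so the endpoints may differ from those above in the fourth decimal place. The variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

successes = 2   # left-handed respondents
n = 39          # number of survey responses
level = 0.98    # confidence level

p_hat = successes / n

# A 98% interval leaves 1% in each tail.
z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)

# Standard error for a proportion.
se = sqrt(p_hat * (1 - p_hat) / n)

lower = p_hat - z_star * se
upper = p_hat + z_star * se

# A proportion cannot be negative, so we clip the left endpoint at 0.
clipped_lower = max(0.0, lower)

print(round(lower, 4), round(upper, 4), clipped_lower)
```

The negative left endpoint is a hint that, with only 2 left-handed respondents, the normal approximation is rough here.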

Hypothesis tests¶

Finally, we've discussed hypothesis tests. Since it's so recent, let's just jump into an example:

Suppose we wish to test the claim that the proportion of men who own cats is larger than 25% at the 0.005 significance level based on a sample of 85 men, 35 of whom were cat owners.

First off, this is a one-sided hypothesis test that could be written

$H_0: p=0.25$
$H_A: p>0.25.$

Computations¶

The $Z$-score can always be computed with a calculator. Of course, you can do it with Python as well:

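import numpy as np

phat = 35/85   # sample proportion
p = 0.25       # proportion assumed by the null hypothesis
n = 85         # sample size
se = np.sqrt(p*(1-p)/n)   # standard error computed under the null
(phat-p)/se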

3.444233600968322

This is a really big $Z$-score, so we'd certainly reject the null. The online HW wants a $p$-value to 4 digits and our new calculator tells us $$P(Z<3.4442336) = 0.999713659.$$ Thus, our $p$-value is $P(Z>3.4442336) = 1-0.999713659$, or $0.0003$ to four decimal places. Since this is less than the $0.005$ significance level, we reject the null hypothesis.
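The whole test can be sketched in a few lines of Python. This version uses `NormalDist` from the standard library for the tail area instead of our calculator page, and the variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

successes = 35   # cat owners in the sample
n = 85           # men sampled
p0 = 0.25        # proportion under the null hypothesis
alpha = 0.005    # significance level

p_hat = successes / n
se = sqrt(p0 * (1 - p0) / n)   # standard error computed under the null
z = (p_hat - p0) / se

# One-sided test (H_A: p > 0.25), so the p-value is the
# area to the right of z under the standard normal curve.
p_value = 1 - NormalDist().cdf(z)

reject = p_value < alpha
print(round(z, 4), round(p_value, 4), reject)
```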