Midterm review

Next Wednesday is the official middle of the term, so next Monday we'll have an eight-to-ten-problem MyOpenMath assessment before I figure out midterm grades. This week, our main objective is to get you ready!

Data

We started the semester looking at data. You should understand the difference between

  • Numerical data and
  • Categorical data.

In fact, we stated that one simple characterization of statistics is

the study of how to collect, analyze, and draw conclusions from data.

Computations

Jumping ahead a bit, we've more recently been doing computations with data. You should understand the distinction between

  • a mean for numerical data and
  • a proportion for categorical data,

as well as the concept of standard deviation for both of those. We'll review some of that a bit more in depth later.
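To make the distinction concrete, here is a minimal sketch using only Python's standard library. The two small data sets are made up purely for illustration (they are not from our class survey), and the standard deviation shown for the categorical variable is the usual $\sqrt{\hat{p}(1-\hat{p})}$ for a yes/no variable.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical numerical data: heights in feet (made up for illustration).
heights = [5.5, 5.9, 6.1, 5.7, 5.8]

# Hypothetical categorical data: handedness (made up for illustration).
hands = ["right", "left", "right", "right", "right"]

# Numerical variable: summarize with a mean and a sample standard deviation.
x_bar = mean(heights)
s = stdev(heights)

# Categorical variable: summarize with a proportion.
p_hat = hands.count("left") / len(hands)

# Standard deviation of a yes/no variable with proportion p_hat.
sd_p = sqrt(p_hat * (1 - p_hat))

print(x_bar, s)
print(p_hat, sd_p)
```

The point is only that the summary we compute depends on the type of variable: we average numbers, but we count categories.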

Studies

Fundamentally, data comes from studies. You should understand the difference between

  • Experiments and
  • Observational studies.

Tables

Data is often presented in a table. You should be able to look at a table and identify types of variables.

import pandas as pd
df = pd.read_csv('https://marksmath.org/data/100MeterTimes.csv')
df.tail(6)
    year  place          athlete  Country  time
39  2013      1       Usain Bolt  Jamaica  9.77
40  2013      2    Justin Gatlin      USA  9.85
41  2013      3     Nesta Carter  Jamaica  9.95
42  2015      1       Usain Bolt  Jamaica  9.79
43  2015      2    Justin Gatlin      USA  9.80
44  2015      3  Trayvon Bromell      USA  9.92
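One way to get a first hint about variable types is to ask pandas how each column is stored. The sketch below retypes the six rows shown above by hand (so it does not depend on the download) and splits the columns into numeric and non-numeric storage. The names `rows`, `numeric_cols`, and `text_cols` are just illustrative.

```python
import pandas as pd

# The six rows displayed above, typed in by hand so no download is needed.
rows = [
    (2013, 1, "Usain Bolt", "Jamaica", 9.77),
    (2013, 2, "Justin Gatlin", "USA", 9.85),
    (2013, 3, "Nesta Carter", "Jamaica", 9.95),
    (2015, 1, "Usain Bolt", "Jamaica", 9.79),
    (2015, 2, "Justin Gatlin", "USA", 9.80),
    (2015, 3, "Trayvon Bromell", "USA", 9.92),
]
df = pd.DataFrame(rows, columns=["year", "place", "athlete", "Country", "time"])

# Columns stored as numbers versus columns stored as text.
numeric_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
text_cols = [c for c in df.columns if c not in numeric_cols]

print(numeric_cols)
print(text_cols)
```

Storage type is not the whole story. Here `athlete` and `Country` are clearly categorical and `time` is clearly numerical, but `place` is stored as an integer even though it behaves more like an ordered category, so the final call still requires looking at what the variable means.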

The normal distribution

We model data with distributions - the most important of which is certainly the normal distribution.

You should be able to do fundamental computations with the normal distribution, not only because this is useful in and of itself but also because, as we've seen, confidence intervals and hypothesis tests are built on top of these ideas.

Example

Suppose that scores on an exam are normally distributed with a mean of 70 and a standard deviation of 8. What percentage of scores are less than 80?

Solution: The $Z$-score for 80 is

Z = (80-70)/8
Z
1.25

Looking this up in a normal table we find that just over $89\%$ of scores are less than 80.
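If you'd like to check a table lookup like this numerically, here is a minimal sketch using `NormalDist` from Python's standard `statistics` module. Using this class is my choice for illustration; a normal table or calculator works just as well.

```python
from statistics import NormalDist

# Exam scores modeled as normal with mean 70 and standard deviation 8.
scores = NormalDist(mu=70, sigma=8)

# Z-score for a score of 80.
z = (80 - scores.mean) / scores.stdev

# Proportion of scores below 80: the area under the curve left of 80.
below_80 = scores.cdf(80)

print(z)
print(round(below_80, 4))
```

The value of `below_80` is the same area that the normal table reports for $Z=1.25$.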

Sampling distributions

You should understand that the normal distribution arises when aggregation operations (such as a sum or mean) are applied to a sampling process - regardless of whether the underlying data is normally distributed or not.

You should understand that the normal distribution arises as the result of a limiting process so that the approximation is only valid for a large enough sample size.

You should also be able to compute means and standard deviations that arise from this process. Note that a standard deviation that arises from a sampling process is often called the standard error.
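A quick simulation can make these ideas concrete. The sketch below is only an illustration under assumptions I've picked: the underlying data is exponential with rate 1 (strongly skewed, with mean 1 and standard deviation 1), each sample has size 50, and we draw 5000 samples. It compares the observed spread of the sample means with the standard error $\sigma/\sqrt{n}$.

```python
import random
from statistics import mean, stdev

random.seed(1)   # fixed seed so the run is repeatable

n = 50           # size of each sample (an arbitrary choice)
trials = 5000    # number of samples drawn (an arbitrary choice)

# Exponential data with rate 1 is far from normal;
# its mean and standard deviation are both 1.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(trials)
]

predicted_se = 1 / n ** 0.5          # sigma / sqrt(n) with sigma = 1
observed_se = stdev(sample_means)    # spread of the simulated sample means

print(mean(sample_means))
print(observed_se, predicted_se)
```

The sample means cluster around the population mean of 1, and their standard deviation should land close to the predicted standard error, even though the individual data values are not normally distributed.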

An example with numerical data

The 39 responses to our beginning of class survey yielded an average height of 5.83 feet with a standard deviation of 0.398 feet.

  • Use this sample to write down a $90\%$ confidence interval for average height.
  • Can you think of any reasons that our sample might be suspect?

Solution

Recall that confidence intervals for means have the form

$$[\bar{x}-z^* \sigma/\sqrt{n}, \bar{x}+z^* \sigma/\sqrt{n}].$$

We just need to identify each term. Most of the terms are given in the problem:

  • $\bar{x} = 5.83$
  • $\sigma = 0.398$
  • $n=39$

The last term is the $z^*$ multiplier. We can read it off of our normal table or we can use our new calculator page to find that $$z^* \approx 1.64342123.$$ Thus, our interval is [5.72526, 5.9347].
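The same computation can be sketched in Python. This version gets the multiplier from `NormalDist().inv_cdf` in the standard `statistics` module rather than from a table or our calculator page, which is an assumption on my part. It reports a multiplier of about 1.645, so its endpoints should agree with the interval above to roughly three decimal places.

```python
from math import sqrt
from statistics import NormalDist

x_bar = 5.83    # sample mean height in feet
s = 0.398       # sample standard deviation in feet
n = 39          # number of responses
level = 0.90    # confidence level

# z* leaves (1 - level)/2 in each tail of the standard normal.
z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)

se = s / sqrt(n)        # standard error of the mean
margin = z_star * se    # half-width of the interval
interval = (x_bar - margin, x_bar + margin)

print(round(z_star, 4))
print([round(end, 4) for end in interval])
```

Changing `level` to another value such as 0.95 or 0.98 only changes the multiplier; the rest of the computation stays the same.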

An example with categorical data

Of the 39 responses to our beginning-of-class survey, 2 folks were left-handed.

  • Use this sample to write down a $98\%$ confidence interval for the proportion of people who are left-handed.
  • Can you think of any reasons that our sample might be suspect?

Solution

Confidence intervals still have the same form but the way we compute the standard error is different. We have

$$\left[\hat{p}-z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right],$$

where

  • $\hat{p} = 2/39 \approx 0.051282$
  • $n=39$
  • $z^* = 2.3213$ from our new calculator.

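Thus, we get the interval below. Since a proportion can't be negative, we can also replace the lower endpoint with 0: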

$$[-0.0307, 0.1333] \text{ or } [0,0.1333].$$
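Here is a minimal Python sketch of the same computation, again taking the multiplier from `NormalDist().inv_cdf` in the standard library (my choice, not our calculator page). Its multiplier may differ slightly in the third decimal place from a table or calculator value, so expect the endpoints to match the interval above only to about three decimal places.

```python
from math import sqrt
from statistics import NormalDist

left_handed = 2   # number of left-handed responses
n = 39            # number of responses
level = 0.98      # confidence level

p_hat = left_handed / n

# z* leaves (1 - level)/2 in each tail of the standard normal.
z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)

# Standard error for a sample proportion.
se = sqrt(p_hat * (1 - p_hat) / n)

lower = p_hat - z_star * se
upper = p_hat + z_star * se

# A proportion must lie between 0 and 1, so clip the endpoints.
clipped = (max(0.0, lower), min(1.0, upper))

print(round(lower, 4), round(upper, 4))
print(tuple(round(end, 4) for end in clipped))
```

The negative lower endpoint is a sign that the sample has very few left-handed responses, which is worth keeping in mind when thinking about whether the sample is suspect.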

Hypothesis tests

Finally, we've discussed hypothesis tests. Since it's so recent, let's just jump into an example:

Suppose we wish to test the claim that the proportion of men who own cats is larger than 25% at the 0.005 significance level based on a sample of 85 men, 35 of whom were cat owners.

First off, this is a one-sided hypothesis test that could be written

  • $H_0: p=0.25$
  • $H_A: p>0.25.$

Computations

The $Z$-score can always be computed with a calculator. Of course, you can do it with Python as well:

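from numpy import sqrt
phat = 35/85   # sample proportion
p = 0.25       # proportion assumed by the null hypothesis
n = 85         # sample size
se = sqrt(p*(1-p)/n)   # standard error under the null hypothesis
(phat-p)/se
3.444233600968322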

This is a really big $Z$-score, so we expect to reject the null. The online HW wants a $p$-value to 4 digits and our new calculator tells us $$P(Z<3.4442336) = 0.999713659.$$ Thus our $p$-value is $P(Z>3.4442336) = 1-0.999713659$, or $0.0003$ to four digits. Since $0.0003 < 0.005$, we reject the null hypothesis at the 0.005 significance level.
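For reference, the whole test can be sketched in one place using only Python's standard library. The use of `NormalDist` for the tail area is my assumption; our calculator page gives the same kind of number.

```python
from math import sqrt
from statistics import NormalDist

n = 85            # men sampled
cat_owners = 35   # cat owners in the sample
p0 = 0.25         # proportion under the null hypothesis
alpha = 0.005     # significance level

p_hat = cat_owners / n

# Standard error computed under the null hypothesis.
se = sqrt(p0 * (1 - p0) / n)

# Test statistic.
z = (p_hat - p0) / se

# One-sided test (H_A: p > p0), so the p-value is the right-tail area.
p_value = 1 - NormalDist().cdf(z)

reject = p_value < alpha

print(round(z, 4), round(p_value, 4), reject)
```

Note that the standard error uses the null value `p0` rather than `p_hat`, since the test asks how surprising the sample would be if the null hypothesis were true.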