Next Wednesday is the official middle of the term, so next Monday we'll have an eight-to-ten-problem MyOpenMath assessment before I figure out midterm grades. This week, our main objective is to get you ready!
We started the semester looking at data. You should understand the difference between
In fact, we stated that one simple characterization of statistics is
the study of how to collect, analyze, and draw conclusions from data.
Jumping ahead a bit, we've more recently been doing computations with data. You should understand the distinction between
as well as the concept of standard deviation for both of those. We'll review some of that a bit more in depth later.
Fundamentally, data comes from studies. You should understand the difference between
Data is often presented in a table. You should be able to look at a table and identify types of variables.
import pandas as pd
df = pd.read_csv('https://marksmath.org/data/100MeterTimes.csv')
df.tail(6)
| | year | place | athlete | Country | time |
|---|---|---|---|---|---|
| 39 | 2013 | 1 | Usain Bolt | Jamaica | 9.77 |
| 40 | 2013 | 2 | Justin Gatlin | USA | 9.85 |
| 41 | 2013 | 3 | Nesta Carter | Jamaica | 9.95 |
| 42 | 2015 | 1 | Usain Bolt | Jamaica | 9.79 |
| 43 | 2015 | 2 | Justin Gatlin | USA | 9.80 |
| 44 | 2015 | 3 | Trayvon Bromell | USA | 9.92 |
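As a rough aid for identifying variable types, here is a minimal sketch that re-enters the six displayed rows by hand (so no download is needed) and asks pandas which columns it stores as numbers. The names `df_tail`, `numeric_cols`, and `text_cols` are made up for this illustration. Keep in mind that storage type is only a hint: `place` is stored as an integer but behaves like an ordered category.

```python
import pandas as pd

# The six rows displayed above, typed in by hand for illustration.
rows = [
    (2013, 1, "Usain Bolt", "Jamaica", 9.77),
    (2013, 2, "Justin Gatlin", "USA", 9.85),
    (2013, 3, "Nesta Carter", "Jamaica", 9.95),
    (2015, 1, "Usain Bolt", "Jamaica", 9.79),
    (2015, 2, "Justin Gatlin", "USA", 9.80),
    (2015, 3, "Trayvon Bromell", "USA", 9.92),
]
df_tail = pd.DataFrame(
    rows, columns=["year", "place", "athlete", "Country", "time"]
)

# Split the columns by how pandas stores them: numbers versus text.
numeric_cols = [
    c for c in df_tail.columns if pd.api.types.is_numeric_dtype(df_tail[c])
]
text_cols = [c for c in df_tail.columns if c not in numeric_cols]

print("stored as numbers:", numeric_cols)
print("stored as text:", text_cols)
```

The text columns (`athlete`, `Country`) are categorical, while `time` is a genuinely numerical measurement.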
We model data with distributions - the most important of which is certainly the normal distribution.
You should be able to do fundamental computations with the normal distribution, not only because this is useful in and of itself but also because, as we've seen, confidence intervals and hypothesis tests are built on top of these ideas.
Suppose that scores on an exam are normally distributed with a mean of 70 and a standard deviation of 8. What percentage of scores are less than 80?
Solution: The $Z$-score for 80 is
Z = (80-70)/8
Z
1.25
Looking this up in a normal table we find that just over $89\%$ of scores are less than 80.
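If you'd rather check the table lookup with code, here is one possible sketch using `NormalDist` from Python's standard `statistics` module. The variable names are just for illustration. It computes the same probability two ways: directly from the exam-score distribution and from the standard normal via the $Z$-score.

```python
from statistics import NormalDist

mean, sd = 70, 8
exam_scores = NormalDist(mu=mean, sigma=sd)

# Z-score for a score of 80, as in the solution above.
z = (80 - mean) / sd

# Proportion of scores below 80, computed directly ...
below_80 = exam_scores.cdf(80)
# ... and again from the standard normal using the Z-score.
same_via_z = NormalDist().cdf(z)

print(z, round(below_80, 4))
```

Both routes agree, which is exactly why standardizing to a $Z$-score lets a single normal table serve every normal distribution.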
You should understand that the normal distribution arises when aggregation operations (such as a sum or mean) are applied to a sampling process - regardless of whether the underlying data is normally distributed or not.
You should understand that the normal distribution arises as the result of a limiting process so that the approximation is only valid for a large enough sample size.
You should also be able to compute means and standard deviations that arise from this process. Note that a standard deviation that arises from a sampling process is often called the standard error.
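To see these ideas in action, here is a small simulation sketch. It assumes, purely for illustration, that the underlying data follows an exponential distribution (strongly skewed, with mean 1 and standard deviation 1). The sample size of 50 and the 5000 repetitions are also arbitrary choices. The spread of the sample means should land close to $\sigma/\sqrt{n}$, the standard error.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

n = 50              # size of each sample (assumed for illustration)
num_samples = 5000  # number of repeated samples (assumed for illustration)

# Each sample is drawn from an exponential distribution with rate 1,
# which has mean 1 and standard deviation 1 and is far from normal.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)  # observed standard error
predicted_se = 1 / n ** 0.5                   # sigma / sqrt(n) with sigma = 1

print(mean_of_means, sd_of_means, predicted_se)
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual draws do not.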
The 39 responses to our beginning of class survey yielded an average height of 5.83 feet with a standard deviation of 0.398 feet.
Recall that confidence intervals for means have the form
$$[\bar{x}-z^* \sigma/\sqrt{n}, \bar{x}+z^* \sigma/\sqrt{n}].$$
We just need to identify each term. Most of the terms are given in the problem:
We can read the last term off of our normal table to find the $z^*$ multiplier or we can use our new calculator page to find that $$z^* \approx 1.64342123.$$ Thus, our interval is [5.72526, 5.9347].
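The arithmetic behind that interval can be sketched in a few lines of Python. This uses the $z^*$ value quoted above as-is and treats the sample standard deviation as a stand-in for $\sigma$. The variable names are chosen for illustration only.

```python
from math import sqrt

x_bar = 5.83         # sample mean height in feet
s = 0.398            # sample standard deviation, standing in for sigma
n = 39               # number of survey responses
z_star = 1.64342123  # multiplier quoted in the text

se = s / sqrt(n)     # standard error of the mean
margin = z_star * se
lo, hi = x_bar - margin, x_bar + margin

print(lo, hi)
```

Up to rounding, `lo` and `hi` should reproduce the endpoints of the interval stated above.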
Of the 39 responses to our beginning of class survey, 2 folks were left-handed.
Confidence intervals still have the same form but the way we compute the standard error is different. We have
$$\left[\hat{p}-z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right],$$
where
Thus, we get
$$[-0.0307, 0.1333] \text{ or } [0,0.1333].$$
Finally, we've discussed hypothesis tests. Since it's so recent, let's just jump into an example:
Suppose we wish to test the claim that the proportion of men who own cats is larger than 25% at the 0.005 significance level based on a sample of 85 men, 35 of whom were cat owners.
First off, this is a one-sided hypothesis test that could be written
$$H_0: p = 0.25 \quad \text{versus} \quad H_A: p > 0.25.$$
The $Z$-score can always be computed with a calculator. Of course, you can do it with Python as well:
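from numpy import sqrt
phat = 35/85  # sample proportion
p = 0.25      # proportion under the null hypothesis
n = 85        # sample size
se = sqrt(p*(1-p)/n)  # standard error computed under the null
(phat-p)/se   # Z-score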
3.444233600968322
This is a really big $Z$-score, so we'd certainly reject the null. The online HW wants a $p$-value to 4 digits, and our new calculator tells us $$P(Z<3.4442336) = 0.999713659.$$ Thus our $p$-value is $1-0.999713659$, or $0.0003$ to four digits, which is well below the 0.005 significance level.
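For anyone who wants to carry the whole test through in Python instead of the calculator page, here is one possible sketch. It uses `NormalDist` from the standard `statistics` module for the normal probability. The names `p0`, `alpha`, and `reject_null` are introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

phat = 35 / 85  # sample proportion of cat owners
p0 = 0.25       # proportion assumed under the null hypothesis
n = 85          # sample size
alpha = 0.005   # significance level

se = sqrt(p0 * (1 - p0) / n)  # standard error under the null
z = (phat - p0) / se          # test statistic

# One-sided test: the p-value is the area to the right of z.
p_value = 1 - NormalDist().cdf(z)
reject_null = p_value < alpha

print(z, round(p_value, 4), reject_null)
```

Because the alternative is "larger than", only the right tail counts. A two-sided test would double this tail area.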