(10 pts)
Recall that I’ve got a CSV file on our website that contains data from the CDC on the health of 20,000 U.S. adults. You can grab and display a small sample of that data as follows:
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
sample = df.sample(7)
sample
ID | genhlth | exerany | hlthplan | smoke100 | height | weight | wtdesire | age | gender |
---|---|---|---|---|---|---|---|---|---|
10651 | excellent | 1 | 1 | 0 | 64 | 118 | 118 | 76 | f |
2042 | excellent | 1 | 1 | 0 | 62 | 165 | 165 | 88 | m |
8669 | excellent | 0 | 1 | 0 | 73 | 165 | 180 | 26 | m |
1115 | fair | 0 | 1 | 1 | 67 | 180 | 160 | 80 | m |
13903 | good | 0 | 1 | 1 | 71 | 170 | 170 | 64 | m |
11964 | excellent | 1 | 1 | 1 | 68 | 160 | 150 | 36 | f |
11073 | very good | 0 | 1 | 1 | 64 | 125 | 125 | 21 | f |
Note that this is a random sample, so your result will have the same columns but different rows.
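If you want the same sample every time you run the code, pandas lets you pass a `random_state` to `sample`. Here is a minimal sketch of that idea. The `toy` data frame and the value 42 are made up for illustration, so that the example runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for the CDC data (not the real file).
toy = pd.DataFrame({"height": [64, 62, 73, 67, 71, 68, 64, 70, 66, 69]})

# Fixing random_state makes the draw repeatable: both calls pick the same rows.
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)

# Without random_state, two calls would generally pick different rows.
print(first.equals(second))
```

Both draws should contain exactly the same rows, which is what makes a seeded sample checkable by someone else.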
Now, it’s easy to compute the mean of the heights of the whole dataset as follows:
df.height.mean()
# Out:
# 67.1829
You can do the same thing for the sample:
sample.height.mean()
# Out:
# 67.0
We can also compute the standard deviation of the heights for the whole dataset:
df.height.std()
# Out:
# 4.1259
Let’s use this information to test the process of finding confidence intervals!
The problem
Seed the random number generator with the integer produced by the following code, after replacing 'mark' with your own name:
my_name = 'mark'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
Then adapt the code above to draw a random sample of size 100 and use it to produce a 90% confidence interval for the mean height of the whole dataset. Report your results in a reply below and clearly indicate whether or not your interval contains the actual mean.
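Here is one way the whole process might look; treat it as a sketch rather than the required solution. It makes a few assumptions that are not part of the problem statement: the seed is passed through `random_state`, it is reduced modulo 2**32 because NumPy's legacy seeding only accepts integers below that (long names can exceed it), and the interval is a z-interval built from the standard deviation of the whole dataset. A t-interval based on the sample's own standard deviation would be a reasonable alternative. The helper names `name_to_seed` and `z_interval` are hypothetical, and `heights` is synthetic data standing in for `df.height`.

```python
import math
from statistics import NormalDist

import numpy as np
import pandas as pd


def name_to_seed(name):
    # Same formula as in the problem, reduced so that NumPy accepts it as a seed.
    return sum(10**i * ord(c) for i, c in enumerate(name)) % 2**32


def z_interval(sample, sigma, confidence=0.90):
    """Interval for a mean, assuming the population standard deviation is known."""
    n = len(sample)
    # Critical value: about 1.645 for 90% confidence.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sigma / math.sqrt(n)
    center = sample.mean()
    return center - margin, center + margin


# Synthetic stand-in for df.height (assumption: NOT the real CDC data).
rng = np.random.default_rng(0)
heights = pd.Series(rng.normal(67, 4, size=20_000))

seed = name_to_seed("mark")
sample = heights.sample(100, random_state=seed)
low, high = z_interval(sample, sigma=heights.std())

print("seed:", seed)
print("interval:", (low, high))
print("contains the actual mean:", low <= heights.mean() <= high)
```

With the real data you would replace `heights` with `df.height`. Keep in mind that a 90% interval is expected to miss the actual mean about one time in ten, so a miss does not by itself mean you made a mistake.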