An archive of the questions from Mark's Fall 2018 Stat 225.

A confidence interval for heights

Mark

(10 pts)

Recall that I’ve got a CSV file on our website that contains data from the CDC on the health of 20,000 U.S. adults. You can grab and display a small sample of that data as follows:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
sample = df.sample(7)
sample
ID genhlth exerany hlthplan smoke100 height weight wtdesire age gender
10651 excellent 1 1 0 64 118 118 76 f
2042 excellent 1 1 0 62 165 165 88 m
8669 excellent 0 1 0 73 165 180 26 m
1115 fair 0 1 1 67 180 160 80 m
13903 good 0 1 1 71 170 170 64 m
11964 excellent 1 1 1 68 160 150 36 f
11073 very good 0 1 1 64 125 125 21 f

Note that this is a random sample; thus, your result will have the same structure but be a different sample.

Now, it’s easy to compute the mean of the heights of the whole dataset as follows:

df.height.mean()
# Out:
# 67.1829

You can do the same thing for the sample:

sample.height.mean()
# Out:
# 67.0

We can also compute the standard deviation of the whole group:

df.height.std()
# Out:
# 4.1259

Let’s use this information to test the process of finding confidence intervals!

The problem

Use your name to seed the random number generator using the output of the following code:

my_name = 'mark'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
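The expression above weights each character's code point by a power of ten according to its position in the name. Here is a minimal sketch that wraps it in a function; the name `name_seed` is hypothetical and not part of the assignment.

```python
def name_seed(name):
    """Sum of ord(c) * 10**i over the characters of name, with i counted from 0."""
    return sum(10**i * ord(c) for i, c in enumerate(name))

# 'ben' -> 98*1 + 101*10 + 110*100
print(name_seed('ben'))  # 12108
```

The result depends on capitalization, since `ord('B')` and `ord('b')` differ, so 'Ben' and 'ben' produce different seeds.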

Then run the code above to obtain a random sample of size 100 and then produce a 90% confidence interval for the mean of the heights of the whole data set. Report back your results in an answer below and clearly indicate if your interval contains the actual mean or not.
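For reference, the interval being asked for has the form (sample mean) ± z* × (sample standard deviation)/√n. Below is a rough sketch of that computation, under a few assumptions: it uses `statistics.NormalDist` from the Python standard library in place of `scipy.stats.norm`, the function name `mean_confidence_interval` is hypothetical, and the heights are made-up values for illustration rather than rows from cdc.csv.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, level=0.90):
    """Normal-based confidence interval for a mean, computed from one sample."""
    n = len(values)
    # z* leaves (1 - level)/2 in each tail: the 0.95 quantile for a 90% interval
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    m = mean(values)
    se = stdev(values) / sqrt(n)  # sample standard deviation over sqrt(n)
    return m - z_star * se, m + z_star * se

# Made-up heights in inches, for illustration only (not from cdc.csv)
heights = [64, 62, 73, 67, 71, 68, 64, 70, 66, 69]
lo, hi = mean_confidence_interval(heights)
print(lo, hi)
```

With a pandas column you would pass something like `list(sample['height'])`. Note that `statistics.stdev` and the pandas `.std()` method both divide by n-1. The normal multiplier is more appropriate for a sample of 100, as in the problem, than for the ten values used here.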

joshua

First I find the data…

import pandas as pd
df =  pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then I seed it.

my_name = 'joshua'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

Output #:10986716

Then I get a sample to work with.

from numpy.random import seed
seed(10986716)
sample = df.sample(100)

Then I got the correct multiplier (Z*)

from scipy.stats import norm
zz = norm.ppf(0.95)
zz
Output#:1.6448536269514722

Then I get a confidence interval

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]
Output#:[66.55059674931451, 68.04940325068549]

The actual mean is 67.18, which is in my interval.

audrey

First, I think I’ll read in my data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then, I’ll seed my random number generator:

my_name = 'audrey'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

# Output: 13235267

and grab a sample:

from numpy.random import seed
seed(13235267)
sample = df.sample(100)

Now, the correct z^* multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

# Out: 1.6448536269514722

Thus, my confidence interval is:

m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]
# Out: 
# [67.68235919920785, 68.95764080079213]

Note that the actual mean of the whole population of size 20000 is 67.18, which is not in my interval!!! :frowning:
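A miss like this is expected some of the time: a 90% confidence procedure should fail to capture the true mean for roughly one sample in ten. Here is a small simulation sketch of that idea, assuming a hypothetical stand-in population of 20,000 normally distributed heights (not the CDC data) and the standard library in place of pandas and scipy.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

rng = random.Random(2018)  # arbitrary seed, only for repeatability
# Hypothetical stand-in population (not the CDC data)
population = [rng.gauss(67, 4) for _ in range(20000)]
true_mean = mean(population)
z_star = NormalDist().inv_cdf(0.95)  # multiplier for a 90% interval

trials = 1000
hits = 0
for _ in range(trials):
    sample = rng.sample(population, 100)
    m = mean(sample)
    se = stdev(sample) / sqrt(100)
    # Count the intervals that contain the true population mean
    if m - z_star * se <= true_mean <= m + z_star * se:
        hits += 1

coverage = hits / trials
print(coverage)  # expected to land near 0.90
```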

megan

I read in my data

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then I computed the seed for my random number generator

my_name = 'megan'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

#output = 1208419

and pulled a sample from it

from numpy.random import seed 
seed(1208419)
sample = df.sample(100)

The correct z^* multiplier is

from scipy.stats import norm
zz = norm.ppf(0.95)

#output = 1.6448536269514722

So my confidence interval is

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

#output = [66.543499622813, 67.836500377187]

Note that the actual mean of the whole population of size 20000 is 67.18, which is in my interval

vscala

Grabbing data and libraries:

from numpy.random import seed
from scipy.stats import norm
import numpy as np
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Seeding data based on my name:

seed(sum([10**i*ord(c) for i,c in enumerate("vincent")]))

Setting variables: sample size, confidence level, sample set, etc.

size = 100
accuracy = .90
spl = df.sample(size)
z = norm.ppf(accuracy + ((1-accuracy)/2.0))
sm = spl['height'].mean()
ss = spl['height'].std()/np.sqrt(size)

Finally, calculating the interval for the mean at 90% confidence:

[sm-z*ss, sm+z*ss]

This gives the output [66.11328245700338, 67.62671754299663], which contains the actual mean of 67.18.
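The multiplier for a general confidence level comes from putting half of the leftover probability in each tail, which is why a 90% interval uses the 0.95 quantile. Here is a minimal sketch, assuming `statistics.NormalDist` from the standard library as a stand-in for `scipy.stats.norm`; the function name `z_star` is hypothetical.

```python
from statistics import NormalDist

def z_star(level):
    """Multiplier z* such that the central `level` of a standard normal lies within +/- z*."""
    return NormalDist().inv_cdf(level + (1 - level) / 2)

z = z_star(0.90)
# Check: the area between -z and z should equal the confidence level
area = NormalDist().cdf(z) - NormalDist().cdf(-z)
print(z, area)
```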

dennis

First:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Second, seed RNG:

my_name = 'dennis'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
output:  12672110

Third, grab sample:

from numpy.random import seed
seed(12672110)
sample = df.sample(100)
sample
17232 17233 good 0 0 1 72 185 160 21 m
14669 14670 good 1 1 1 68 187 150 74 f
4029 4030 good 1 1 1 72 250 200 40 m
12757 12758 excellent 0 1 1 67 150 140 26 f
1095 1096 very good 1 1 0 67 145 130 41 f
15009 15010 good 1 0 0 62 178 160 50 f
3871 3872 excellent 1 1 0 68 150 140 50 f

z* multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz = 1.6448536269514722

verify:

norm.cdf(zz)-norm.cdf(-zz)
0.8999999999999999

To calculate confidence interval:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/np.sqrt(100)
[m-zz*se,m+zz*se]
output: [66.49505498168163, 67.78494501831837]

actual population (20,000) mean = 67.18 => within interval.

btucker

Seed random number generator:

my_name = 'ben'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
# Out: 12108

Next, import data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Grab a sample of size 100:

from numpy.random import seed
seed(12108)
sample = df.sample(100)
sample

The z* multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

# Out: 1.6448536269514722

So my confidence interval is:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/np.sqrt(100)
[m-zz*se,m+zz*se]
# Out: approximately [67.5355, 69.1445]

The actual mean of the population is 67.18, and not in my interval.

john

First I read in my data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then I seeded my random number generator:

my_name = 'john'
my_seed = sum([10**i*ord(c) for i,c in enumerate(my_name)])
#Output: 121616

Grabbed a sample:

from numpy.random import seed
seed(my_seed)
sample = df.sample(100)

The correct z* multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz
#Output: 1.6448536269514722

So my confidence interval is:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/np.sqrt(100)
[m-zz*se,m+zz*se]
#Output: approximately [66.45, 67.69]

The actual mean of the population is 67.18, which is in my interval.
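A note on the standard error: the divisor is the square root of the sample size (100 here), not of the 20,000 rows the sample was drawn from. Here is a small sketch of the difference, using the population standard deviation 4.1259 quoted in the problem statement as a stand-in for a sample standard deviation.

```python
from math import sqrt

s = 4.1259  # stand-in value; a real sample's std() will differ somewhat
se_sample = s / sqrt(100)    # n is the sample size
se_wrong = s / sqrt(20000)   # dividing by the population size instead
# The second value is too small by a factor of sqrt(200), about 14
print(se_sample, se_wrong)
```

Dividing by the population size makes the interval far too narrow, so it will miss the true mean much more often than 10% of the time.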

Garrett

First I will generate a seed for my name:

my_name = 'Garrett'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

which gives: 128736441

I now will add my data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
sample = df.sample(7)
sample

I will now grab a sample of my data:

from numpy.random import seed
seed(128736441)
sample = df.sample(100)
sample

Now, the correct z∗ multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

Out[8]: 1.6448536269514722

Thus my confidence interval is:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

[66.96344910710829, 68.33655089289172]

Note that the actual mean of the whole population of size 20000 is 67.18, which is in my interval

Tripp

First I read my data

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then I made my RNG

my_name = 'Tripp'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
seed number = 1243724

Pulled a sample from it

from numpy.random import seed 
seed(1243724)
sample = df.sample(100)
sample

z* multiplier

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

1.6448536269514722

Now I have a confidence interval:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

[66.27885468640896, 67.76114531359103]

Note the actual mean of the population is 67.18, which is in my interval.

Rebecca

I’m able to tell the computer to go find the data and seed the random generator thingie and, I think?, grab a sample:

my_name='rebecca'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
from numpy.random import seed
seed(108001924)
sample = df.sample(100)

Okay, so then I can ask for the sample mean…

sample.height.mean()

which is 66.95

and the sample standard deviation

sample.height.std()

which is 4.384027482771633

For some reason, we’ve agreed to find the z^* multiplier thusly:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

which spits out 1.6448536269514722.

Apparently, I can find my confidence interval by doing this:

m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

Which, in turn, spat out

[66.22889164943082, 67.67110835056918]

The actual mean, 67.18, does fall in that interval.

goodmorning

First I got my data from the CDC data set:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then, in order to get my seed number, I enumerated my name:

my_name = 'oscar'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

which gave me a result of 1248161.
Then I generated a sample with my seed:

from numpy.random import seed
seed(1248161)
sample = df.sample(100)

We then get the z* value of 1.6448536269514722 from the normal distribution:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

Using the z* value I calculate my interval:

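# Note: df is the full data set, so this interval is centered on the population mean.
# Use sample['height'] in both lines for an interval based on the sample.
m = df['height'].mean()
se = df['height'].std()/10
[m-zz*se,m+zz*se]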

which gives me
[66.50424091290787, 67.86155908709213]

The actual mean of 67.18 is within my interval.

mac

Get the data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Enumerate my name to get the seed:

my_name = 'mac'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

The result was: 10979
I generated my sample with my seed

from numpy.random import seed
seed(10979)
sample = df.sample(100)

The z* value, which comes from the normal distribution rather than from the data, is 1.6448536269514722:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

I used this z* to calculate my confidence interval

m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

Which returned:

[66.28356013030313, 67.73643986969688]

The actual mean of 67.18 does fall in that interval.

dpulse

First I read the data…

import pandas as pd
df = pd.read_csv("https://www.marksmath.org/data/cdc.csv")

Seed it:

from numpy.random import seed
my_name = "David"
sum([10**i*ord(c) for i,c in enumerate(my_name)])

Then I get a sample:

from numpy.random import seed
seed(11524325)
sample = df.sample(100)
sample

Get the correct multiplier Z*

from scipy.stats import norm
zz = norm.ppf(.95)
zz
# Out: 1.6448

Get the Confidence Interval:

import numpy as np
m = sample["height"].mean()
se = sample["height"].std()/np.sqrt(100)
[m-zz*se,m+zz*se]
# Out: [67.3, 68.69]

My mean was 67.39, which was within my interval. The actual population mean of 67.18, however, falls just below it.