An archive of the questions from Mark's Fall 2018 Stat 225.

A confidence interval for heights

Mark

(10 pts)

Recall that I’ve got a CSV file on our website that contains data from the CDC on the health of 20,000 U.S. adults. You can grab and display a small sample of that data as follows:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
sample = df.sample(7)
sample

ID	genhlth	exerany	hlthplan	smoke100	height	weight	wtdesire	age	gender
10651	excellent	1	1	0	64	118	118	76	f
2042	excellent	1	1	0	62	165	165	88	m
8669	excellent	0	1	0	73	165	180	26	m
1115	fair	0	1	1	67	180	160	80	m
13903	good	0	1	1	71	170	170	64	m
11964	excellent	1	1	1	68	160	150	36	f
11073	very good	0	1	1	64	125	125	21	f

Note that this is a random sample; thus, your result will have the same structure but contain different rows.

Now, it’s easy to compute the mean of the heights of the whole dataset as follows:

df.height.mean()
# Out:
# 67.1829

You can do the same thing for the sample:

sample.height.mean()
# Out:
# 67.0

We can also compute the standard deviation of the whole group:

df.height.std()
# Out:
# 4.1259

Let’s use this information to test the process of finding confidence intervals!

The problem

Use your name to seed the random number generator using the output of the following code:

my_name = 'mark'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

Then run the code above to obtain a random sample of size 100 and then produce a 90% confidence interval for the mean of the heights of the whole data set. Report back your results in an answer below and clearly indicate if your interval contains the actual mean or not.
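As a sanity check on the seeding formula, here is a minimal sketch that wraps it in a function and works the 'mark' case by hand. The names `name_seed` and `numpy_safe_seed` are made up for illustration. The modulo step is an assumption, not part of the assignment: `numpy.random.seed` only accepts integers from 0 to 2**32 - 1, and a name of nine or more letters would exceed that.

```python
def name_seed(name):
    """Sum of 10**i * ord(c) over the characters of name, as in the problem."""
    return sum(10**i * ord(c) for i, c in enumerate(name))

# Worked by hand for 'mark': the ord values are m=109, a=97, r=114, k=107,
# so the seed is 109*1 + 97*10 + 114*100 + 107*1000 = 119479.
print(name_seed('mark'))  # → 119479

# Assumption (not part of the assignment): reduce the value so that it
# always fits the range numpy.random.seed accepts, 0 to 2**32 - 1.
def numpy_safe_seed(name):
    return name_seed(name) % 2**32
```

For every name used in the answers below, the reduction leaves the seed unchanged, since each of them is eight letters or fewer.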

joshua

First I find the data…

import pandas as pd
df =  pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then I seed it.

my_name = 'joshua'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

Output #:10986716

Then I get a sample to work with.

from numpy.random import seed
seed(10986716)
sample = df.sample(100)

Then I got the correct multiplier (Z*)

from scipy.stats import norm
zz = norm.ppf(0.95)
zz
Output#:1.6448536269514722

Then I get a confidence interval

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]
Output#:[66.55059674931451, 68.04940325068549]

The actual mean is 67.18, which is in my interval.

audrey

First, I think I’ll read in my data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then, I’ll seed my random number generator:

my_name = 'audrey'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

# Output: 13235267

and grab a sample:

from numpy.random import seed
seed(13235267)
sample = df.sample(100)

Now, the correct z^* multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

# Out: 1.6448536269514722

Thus, my confidence interval is:

m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]
# Out: 
# [67.68235919920785, 68.95764080079213]

Note that the actual mean of the whole population of size 20000 is 67.18, which is not in my interval!!!
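An interval that misses is not a mistake: a 90% procedure is expected to miss about one time in ten. The sketch below illustrates this with a simulation. It uses only the Python standard library and a synthetic, normally distributed population (mean 67.18, standard deviation 4.13) as a stand-in for the CDC heights, so the numbers are illustrative rather than a rerun of anyone's sample. The function name `coverage`, its parameters, and the seed are all hypothetical choices.

```python
import random
from statistics import NormalDist, mean, stdev

def coverage(trials=1000, n=100, confidence=0.90, seed=2018):
    """Fraction of simulated intervals that contain the population mean."""
    rng = random.Random(seed)
    # Synthetic stand-in for the height column: 20000 normal draws.
    population = [rng.gauss(67.18, 4.13) for _ in range(20000)]
    true_mean = mean(population)
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, n)   # sample without replacement
        m = mean(sample)
        se = stdev(sample) / n ** 0.5        # standard error of the mean
        if m - z * se <= true_mean <= m + z * se:
            hits += 1
    return hits / trials

rate = coverage()
print(rate)  # expected to land near 0.90
```

The printed rate should sit close to the stated confidence level, which is what "90% confidence" describes: the long-run success rate of the procedure, not a guarantee for any single interval.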

megan

I read in my data

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then I computed the seed for my random number generator

my_name = 'megan'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

#output = 1208419

and pulled a sample from it

from numpy.random import seed 
seed(1208419)
sample = df.sample(100)

The correct z^* multiplier is

from scipy.stats import norm
zz = norm.ppf(0.95)

#output = 1.6448536269514722

So my confidence interval is

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

#output = [66.543499622813, 67.836500377187]

Note that the actual mean of the whole population of size 20000 is 67.18, which is in my interval

vscala

Grabbing data and libraries:

from numpy.random import seed
from scipy.stats import norm
import numpy as np
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Seeding data based on my name:

seed(sum([10**i*ord(c) for i,c in enumerate("vincent")]))

Setting the sample size, confidence level (called accuracy below), sample, etc.

size = 100
accuracy = .90
spl = df.sample(size)
z = norm.ppf(accuracy + ((1-accuracy)/2.0))
sm = spl['height'].mean()
ss = spl['height'].std()/np.sqrt(size)

Finally, calculating the interval for the mean at 90% confidence:

[sm-z*ss, sm+z*ss]

This gives the output [66.11328245700338, 67.62671754299663], which contains the actual mean of 67.18.

dennis

First:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Second, seed RNG:

my_name = 'dennis'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
output:  12672110

Third, grab sample:

from numpy.random import seed
seed(12672110)
sample = df.sample(100)
sample

17232	17233	good	0	0	1	72	185	160	21	m
14669	14670	good	1	1	1	68	187	150	74	f
4029	4030	good	1	1	1	72	250	200	40	m
12757	12758	excellent	0	1	1	67	150	140	26	f
1095	1096	very good	1	1	0	67	145	130	41	f
15009	15010	good	1	0	0	62	178	160	50	f
3871	3872	excellent	1	1	0	68	150	140	50	f

z* multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz = 1.6448536269514722

verify:

norm.cdf(zz)-norm.cdf(-zz)
0.8999999999999999

To calculate confidence interval:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/np.sqrt(100)
[m-zz*se,m+zz*se]
output: [66.49505498168163, 67.78494501831837]

actual population (20,000) mean = 67.18 => within interval.

btucker

Seed random number generator:

my_name = 'ben'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
# Out: 12108

Next, import data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Grab a sample of size 100:

from numpy.random import seed
seed(12108)
sample = df.sample(100)
sample

The z* multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

# Out: 1.6448536269514722

So my confidence interval is:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/np.sqrt(100)
[m-zz*se,m+zz*se]
# Out: [67.5355, 69.1445] (rounded)

The actual mean of the population is 67.18, which is not in my interval.

john

First I read in my data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then I seeded my random number generator:

my_name = 'john'
my_seed = sum([10**i*ord(c) for i,c in enumerate(my_name)])
#Output: 121616

Grabbed a sample:

from numpy.random import seed
seed(my_seed)
sample = df.sample(100)

The correct z* multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz
#Output: 1.6448536269514722

So my confidence interval is:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/np.sqrt(100)
[m-zz*se,m+zz*se]
#Output: [66.4505, 67.6895] (rounded)

The actual mean of the population is 67.18, which is in my interval.
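The standard error of a sample mean divides the standard deviation by the square root of the sample size (100 here), not by the square root of the size of the whole dataset (20000). A small simulation can make the difference visible. This is a sketch on synthetic normal data using only the standard library. The population parameters, the seed, and the variable names are assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(225)
# Synthetic stand-in for the 20000 heights.
population = [rng.gauss(67.18, 4.13) for _ in range(20000)]
sigma = pstdev(population)

n = 100
# Means of many independent samples of size n.
sample_means = [mean(rng.sample(population, n)) for _ in range(2000)]
observed = pstdev(sample_means)

predicted_n = sigma / n ** 0.5                 # divide by sqrt(sample size)
predicted_N = sigma / len(population) ** 0.5   # divide by sqrt(population size)
print(observed, predicted_n, predicted_N)
```

The observed spread of the sample means should track `predicted_n` closely, while `predicted_N` comes out roughly fourteen times too small, which would make the interval far too narrow.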

Garrett

First I will generate a seed for my name:

my_name = 'Garrett'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

which gives: 128736441

I now will add my data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
sample = df.sample(7)
sample

I will now grab a sample of my data:

from numpy.random import seed
seed(128736441)
sample = df.sample(100)
sample

Now, the correct z∗ multiplier is:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

Out[8]: 1.6448536269514722

Thus my confidence interval is:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

[66.96344910710829, 68.33655089289172]

Note that the actual mean of the whole population of size 20000 is 67.18, which is in my interval

Tripp

First I read my data

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then I computed the seed for my RNG

my_name = 'Tripp'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
# seed number: 1243724

Pulled a sample from it

from numpy.random import seed 
seed(1243724)
sample = df.sample(100)
sample

z* multiplier

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

1.6448536269514722

Now I have a confidence interval:

import numpy as np
m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

[66.27885468640896, 67.76114531359103]

Note the actual mean of the population is 67.18, which is in my interval.

Rebecca

I’m able to tell the computer to go find the data and seed the random generator thingie and, I think?, grab a sample:

my_name='rebecca'
sum([10**i*ord(c) for i,c in enumerate(my_name)])
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
from numpy.random import seed
seed(108001924)
sample = df.sample(100)

Okay, so then I can ask for the sample mean…

sample.height.mean()

which is 66.95

and the sample standard deviation

sample.height.std()

which is 4.384027482771633

For some reason, we’ve agreed to find the z^* multiplier thusly:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

which spits out 1.6448536269514722.
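The reason for passing 0.95 rather than 0.90 is that a 90% interval leaves 10% outside, split evenly between the two tails, so the upper cutoff sits at the 95th percentile. A minimal sketch of that reasoning follows, using `statistics.NormalDist` from the standard library as a stand-in for `scipy.stats.norm`. The helper name `z_star` is hypothetical.

```python
from statistics import NormalDist

def z_star(confidence):
    """Multiplier leaving (1 - confidence)/2 in each tail of the standard normal."""
    upper = 1 - (1 - confidence) / 2   # 0.95 when confidence is 0.90
    return NormalDist().inv_cdf(upper)

z = z_star(0.90)
# The area between -z and z should come back as the confidence level.
middle = NormalDist().cdf(z) - NormalDist().cdf(-z)
print(z, middle)
```

The same helper gives the familiar 1.96 for a 95% interval, since that one cuts at the 97.5th percentile.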

Apparently, I can find my confidence interval by doing this:

m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

Which, in turn, spat out

[66.22889164943082, 67.67110835056918]

The actual mean, 67.18, does fall in that interval.
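The interval is the sample mean plus or minus z* times the standard error, where the standard error is the sample standard deviation divided by the square root of the sample size. As a check, the sketch below rebuilds the interval from the summary numbers reported in this post (mean 66.95, standard deviation 4.384027482771633, n = 100). The function name `ci_from_summary` is made up for illustration, and the standard library's `NormalDist` stands in for scipy.

```python
from math import sqrt
from statistics import NormalDist

def ci_from_summary(m, s, n, confidence=0.90):
    """Normal-based interval for a mean from summary statistics."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    se = s / sqrt(n)                 # standard error of the mean
    return [m - z * se, m + z * se]

# Sample mean and standard deviation as reported in the post above.
low, high = ci_from_summary(66.95, 4.384027482771633, 100)
print(low, high)
```

The result should agree with the posted interval of roughly [66.229, 67.671].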

goodmorning

First I got my data from the CDC database

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Then, in order to get my seed number, I enumerated my name

my_name = 'oscar'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

which gave me a result of 1248161.
Then I generated a sample with my seed

from numpy.random import seed
seed(1248161)
sample = df.sample(100)

We then get the z* value of 1.6448536269514722

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

Using the z* value, I calculate my interval

m = df['height'].mean()
se = df['height'].std()/10
[m-zz*se,m+zz*se]

which gives me
[66.50424091290787, 67.86155908709213]. The actual mean of 67.18 is within my interval.
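One thing to watch in the code above: `m` and `se` are taken from `df`, the whole dataset, rather than from `sample`. An interval built that way is centered on the population mean itself, so it contains that mean by construction and does not test the procedure. The sketch below checks this using only the rounded whole-dataset values quoted in the question (mean 67.1829, standard deviation 4.1259). The variable names are illustrative, and `NormalDist` stands in for scipy.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.95)

# Whole-dataset values quoted (rounded) in the question, not sample values.
population_mean = 67.1829
population_sd = 4.1259

half_width = z * population_sd / 10
interval = [population_mean - half_width, population_mean + half_width]
print(interval)
```

This lands on the posted interval to within rounding, which suggests the reported numbers describe the whole dataset. Replacing `df` with `sample` in the two lines that define `m` and `se` would give the interval the problem asks for.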

mac

Get the data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

Enumerate my name to get the seed:

my_name = 'mac'
sum([10**i*ord(c) for i,c in enumerate(my_name)])

The result was: 10979.
I generated my sample with my seed:

from numpy.random import seed
seed(10979)
sample = df.sample(100)

I then computed the z* value, 1.6448536269514722:

from scipy.stats import norm
zz = norm.ppf(0.95)
zz

I used this z* to calculate my confidence interval

m = sample['height'].mean()
se = sample['height'].std()/10
[m-zz*se,m+zz*se]

Which returned:

[66.28356013030313, 67.73643986969688]

The actual mean of 67.18 does fall in that interval.

dpulse

First I read the data…

import pandas as pd
df = pd.read_csv("https://www.marksmath.org/data/cdc.csv")

Seed it:

from numpy.random import seed
my_name = "David"
sum([10**i*ord(c) for i,c in enumerate(my_name)])

Then I get a sample:

from numpy.random import seed
seed(11524325)
sample = df.sample(100)
sample

Get the correct multiplier Z*

from scipy.stats import norm
zz = norm.ppf(.95)
zz
# Out: 1.6448

Get the Confidence Interval:

import numpy as np
m = sample["height"].mean()
se = sample["height"].std()/np.sqrt(100)
[m-zz*se,m+zz*se]
# Out: [67.3, 68.69]

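My mean was 67.39, which was within my interval. The actual population mean of 67.18, however, falls just below my lower endpoint of 67.3, so my interval does not contain it.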
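To pull the thread together, here is a small tally of how many of the posted intervals contain the actual mean of 67.18. It is a sketch over a selection of ten intervals copied from the answers above and rounded to four places (dpulse's is kept as posted). The dictionary and variable names are illustrative.

```python
actual_mean = 67.18

# A selection of intervals as posted in the thread, keyed by poster.
reported = {
    'joshua':  (66.5506, 68.0494),
    'audrey':  (67.6824, 68.9576),
    'megan':   (66.5435, 67.8365),
    'vscala':  (66.1133, 67.6267),
    'dennis':  (66.4951, 67.7849),
    'Garrett': (66.9634, 68.3366),
    'Tripp':   (66.2789, 67.7611),
    'Rebecca': (66.2289, 67.6711),
    'mac':     (66.2836, 67.7364),
    'dpulse':  (67.3, 68.69),
}

# True where the interval contains the actual mean.
contains = {name: lo <= actual_mean <= hi for name, (lo, hi) in reported.items()}
hits = sum(contains.values())
print(hits, 'of', len(reported))  # → 8 of 10
```

Eight hits out of ten is in line with what a 90% procedure would be expected to produce over so few intervals.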