Mark
When I execute the following Python code:
```python
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/county_small.csv')
df.head(5)
```
I get the following output:
state | name | FIPS | pop2010 | hs_grad | bachelors |
---|---|---|---|---|---|
Alabama | Autauga County | 1001 | 54571 | 85.3 | 21.7 |
Alabama | Baldwin County | 1003 | 182265 | 87.6 | 26.8 |
Alabama | Barbour County | 1005 | 27457 | 71.9 | 13.5 |
Alabama | Bibb County | 1007 | 22915 | 74.5 | 10.0 |
Alabama | Blount County | 1009 | 57322 | 74.7 | 12.5 |
Continuing further, I run the following code:
```python
from numpy.random import seed
seed(75)
df.sample(5)
```
and get the following output:
state | name | FIPS | pop2010 | hs_grad | bachelors |
---|---|---|---|---|---|
Kentucky | Hart County | 21099 | 18199 | 67.7 | 9.2 |
North Carolina | Buncombe County | 37021 | 238318 | 87.2 | 31.2 |
Missouri | Polk County | 29167 | 31137 | 79.5 | 16.6 |
Texas | Lamar County | 48277 | 49793 | 82.4 | 17.4 |
Indiana | Whitley County | 18183 | 33292 | 90.2 | 16.6 |
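As an aside, `DataFrame.sample` also accepts a `random_state` argument, which seeds that single call instead of NumPy's global generator. Here is a minimal sketch on a small made-up frame; the `toy` data and its column names are hypothetical and are not taken from the county file:

```python
import pandas as pd

# Hypothetical toy data standing in for the county file.
toy = pd.DataFrame({"name": list("ABCDEFGH"), "value": range(8)})

# random_state seeds this call only, so repeating the call
# with the same value selects the same rows.
first = toy.sample(5, random_state=75)
second = toy.sample(5, random_state=75)

print(first.equals(second))  # True
```

Either way of seeding makes the random selection repeatable from one run to the next.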
- What are the cases?
- What are the variables?
- Classify each variable as numerical or categorical.
- Further classify each numerical variable as continuous or discrete.
- Further classify each categorical variable as nominal or ordinal.
- What is the purpose of the `head` command (i.e. `df.head(5)`) in the first input?
- What is the purpose of the `sample` command (i.e. `df.sample(5)`) in the second input?
- Suppose I wanted to use the 5 values that I see in either displayed table to extrapolate from the values of the `pop2010` column to estimate the average population of a randomly chosen US county.
  - What problems do you see with the samples?
  - Do you have any reason to think that one sample would be better than the other?
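
For the last question, it may help to see the two five-row averages side by side. This is just a sketch: the `pop2010` values are copied by hand from the two tables displayed above, and the names `head_pops` and `sample_pops` are introduced here for illustration.

```python
import pandas as pd

# pop2010 values copied from the df.head(5) table (first five Alabama counties).
head_pops = pd.Series([54571, 182265, 27457, 22915, 57322])

# pop2010 values copied from the df.sample(5) table (seed 75).
sample_pops = pd.Series([18199, 238318, 31137, 49793, 33292])

# Each mean is one possible estimate of the average county population.
head_mean = head_pops.mean()
sample_mean = sample_pops.mean()

print(head_mean)    # 68906.0
print(sample_mean)  # 74147.8
```

On the full data frame, `df['pop2010'].mean()` would give the average over every county in the file for comparison.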