An archive of the questions from Mark's Fall 2018 Stat 225.

Examining data

Mark

When I execute the following Python code:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/county_small.csv')
df.head(5)

I generate the following output:

state	name	FIPS	pop2010	hs_grad	bachelors
Alabama	Autauga County	1001	54571	85.3	21.7
Alabama	Baldwin County	1003	182265	87.6	26.8
Alabama	Barbour County	1005	27457	71.9	13.5
Alabama	Bibb County	1007	22915	74.5	10.0
Alabama	Blount County	1009	57322	74.7	12.5

Continuing further, I run the following code:

from numpy.random import seed
seed(75)
df.sample(5)

and get the following output:

state	name	FIPS	pop2010	hs_grad	bachelors
Kentucky	Hart County	21099	18199	67.7	9.2
North Carolina	Buncombe County	37021	238318	87.2	31.2
Missouri	Polk County	29167	31137	79.5	16.6
Texas	Lamar County	48277	49793	82.4	17.4
Indiana	Whitley County	18183	33292	90.2	16.6
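
As a side note on why the seed matters: the sketch below suggests that re-seeding before each call makes sample repeatable, and that passing random_state directly to sample is an alternative way to get the same effect. The name df_demo and the shortened county names are made up for illustration; the pop2010 values are copied from the two tables above, so this is a ten-row stand-in and not the full data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: the ten rows displayed above (names shortened).
df_demo = pd.DataFrame({
    "name": ["Autauga", "Baldwin", "Barbour", "Bibb", "Blount",
             "Hart", "Buncombe", "Polk", "Lamar", "Whitley"],
    "pop2010": [54571, 182265, 27457, 22915, 57322,
                18199, 238318, 31137, 49793, 33292],
})

# head(5) always returns the first five rows, in file order.
head_rows = df_demo.head(5)

# Re-seeding NumPy's global generator before each call repeats the same draw.
np.random.seed(75)
first_draw = df_demo.sample(5)
np.random.seed(75)
second_draw = df_demo.sample(5)
print(first_draw.equals(second_draw))

# Alternative: hand the seed to sample itself, leaving global state alone.
third_draw = df_demo.sample(5, random_state=75)
fourth_draw = df_demo.sample(5, random_state=75)
print(third_draw.equals(fourth_draw))
```

Without a seed or random_state, each call to sample would generally return a different set of rows.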

What are the cases?
What are the variables?
Classify each variable as numerical or categorical.
Further classify each numerical variable as continuous or discrete.
Further classify each categorical variable as nominal or ordinal.
What is the purpose of the head command (i.e. df.head(5)) in the first input?
What is the purpose of the sample command (i.e. df.sample(5)) in the second input?
Suppose I wanted to use the 5 values that I see in either displayed table to extrapolate from the values of the pop2010 column to estimate the average population of a randomly chosen US county.
- What problems do you see with the samples?
- Do you have any reason to think that one sample would be better than the other?

joshua

Cases - the column labelled state
Variables - columns labelled name, FIPS, pop2010, hs_grad, and bachelors
name - categorical
FIPS - categorical
pop2010 - numerical
hs_grad - numerical
bachelors - numerical
pop2010 - discrete
hs_grad - continuous
bachelors - continuous
name - nominal
FIPS - ordinal
The purpose of the command “df.head(5)” is to show the first five rows of data.
The purpose of the sample command “df.sample(5)” is to get 5 random rows of data.
A. Problems with samples - the problem with the first group is that they are all from the same state and wouldn’t be an accurate representation of all of the counties in the US. Other possible problems are that there is a large variance in county sizes and populations, plus I would not use so few examples for an average that encompasses all of the US and its counties.
B. Reason to think a sample is better than another - Which sample is better depends on the goal. The first sample would be better at estimating the average population per county in just that state, since all five counties are from that state. For the US as a whole, the second one is better because it has selections from multiple states, giving it more variety.

megan

The state column labels the different cases. Each row is a different case.
The variables are the columns: name, FIPS, pop2010, hs_grad, bachelors
name - categorical
FIPS - numerical
pop2010 - numerical
hs_grad - numerical
bachelors - numerical
FIPS - discrete
pop2010 - discrete
hs_grad - continuous
bachelors - continuous
name - nominal
The purpose of the head command is to show the first 5 rows, or cases, of data
The purpose of the sample command is to pull a random 5 rows, or cases, of data
Problems I see with the samples - The first sample isn’t representative of the whole country because all the counties are in AL. Both samples are too small to get an accurate representation of the whole US.
Is one sample better than the other - The second sample is better because it is taken from 5 different states rather than just one, which makes it more representative of the entire population of the country.

Mark

@megan and @joshua I like megan’s answer better. I do see one issue with it, though: FIPS is not a numerical variable. Rather, it’s a categorical, ordinal variable - a variable that looks numeric (since its value is a number) but is really used to put things into categories.
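
One way to see that FIPS behaves like a label is to store it as text. The sketch below is only an illustration: the frame counties is a hypothetical stand-in built from four of the rows shown earlier. It relies on the fact that county FIPS codes are five characters long, with the first two identifying the state, so a value read in as the number 1001 stands for the code 01001.

```python
import pandas as pd

# Hypothetical stand-in frame using FIPS values from the tables above.
counties = pd.DataFrame({
    "name": ["Autauga County", "Baldwin County",
             "Hart County", "Buncombe County"],
    "FIPS": [1001, 1003, 21099, 37021],
})

# Treat the code as text and restore the leading zero lost when it was
# read as an integer.
counties["FIPS"] = counties["FIPS"].astype(str).str.zfill(5)

# The first two characters are the state code: a category, not a quantity.
counties["state_code"] = counties["FIPS"].str[:2]

print(counties)
```

Arithmetic on these codes, such as averaging them, would not describe anything about the counties, which is the usual sign that a number-like column is really categorical.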

dennis

cases: state/county combination
Variables: FIPS, pop2010, hs_grad, bachelors
FIPS - categorical
pop2010 - numerical
hs_grad - numerical
bachelors - numerical
pop2010 - discrete
hs_grad - continuous
bachelors - continuous
state - nominal
name - nominal
FIPS - ordinal
to display the first 5 rows of the population, to keep the page clean for clarity.
to pick 5 cases from the population
a) only 5 out of thousands of counties across the US
b) the second would be better as it’s randomly generated, vs. the 1st 5 cases of an alphabetised list.
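
To make the small-sample concern concrete, here is a quick sketch that computes the mean and median of pop2010 for each displayed table. The variable names are made up for illustration; the values are copied from the two outputs in the original question.

```python
import pandas as pd

# pop2010 values from the df.head(5) output (all Alabama counties).
head_pop = pd.Series([54571, 182265, 27457, 22915, 57322])

# pop2010 values from the df.sample(5) output (five different states).
random_pop = pd.Series([18199, 238318, 31137, 49793, 33292])

head_mean = head_pop.mean()        # 68906.0
random_mean = random_pop.mean()    # 74147.8

head_median = head_pop.median()      # 54571.0
random_median = random_pop.median()  # 33292.0

print(head_mean, random_mean)
print(head_median, random_median)
```

In both samples a single large county (Baldwin, Buncombe) pulls the mean well above the median, which illustrates how sensitive an estimate from only five counties can be to which counties happen to be included.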