An archive of the questions from Mark's Fall 2018 Stat 225.

Examining data

Mark

When I execute the following Python code:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/county_small.csv')
df.head(5)

I generate the following output:

state    name            FIPS  pop2010  hs_grad  bachelors
Alabama  Autauga County  1001    54571     85.3       21.7
Alabama  Baldwin County  1003   182265     87.6       26.8
Alabama  Barbour County  1005    27457     71.9       13.5
Alabama  Bibb County     1007    22915     74.5       10.0
Alabama  Blount County   1009    57322     74.7       12.5

Continuing further, I run the following code:

from numpy.random import seed
seed(75)
df.sample(5)

and get the following output:

state           name             FIPS   pop2010  hs_grad  bachelors
Kentucky        Hart County      21099    18199     67.7        9.2
North Carolina  Buncombe County  37021   238318     87.2       31.2
Missouri        Polk County      29167    31137     79.5       16.6
Texas           Lamar County     48277    49793     82.4       17.4
Indiana         Whitley County   18183    33292     90.2       16.6
  1. What are the cases?
  2. What are the variables?
  3. Classify each variable as numerical or categorical.
  4. Further classify each numerical variable as continuous or discrete.
  5. Further classify each categorical variable as nominal or ordinal.
  6. What is the purpose of the head command (i.e. df.head(5)) in the first input?
  7. What is the purpose of the sample command (i.e. df.sample(5)) in the second input?
  8. Suppose I wanted to use the 5 values that I see in either displayed table to extrapolate from the values of the pop2010 column to estimate the average population of a randomly chosen US county.
    • What problems do you see with the samples?
    • Do you have any reason to think that one sample would be better than the other?
joshua
  1. Cases - the column labelled state
  2. Variables - columns labelled name, FIPS, pop2010, hs_grad, and bachelors
  3. name - categorical
    FIPS - categorical
    pop2010 - numerical
    hs_grad - numerical
    bachelors - numerical
  4. pop2010 - discrete
    hs_grad - continuous
    bachelors - continuous
  5. name - nominal
    FIPS - ordinal
  6. The purpose of the command “df.head(5)” is to show the first five rows of data.
  7. The purpose of the sample command “df.sample(5)” is to get 5 random rows of data.
  8. A. Problems with samples - the problem with the first group is that they are all from the same state and wouldn’t be an accurate representation of all of the counties in the US. Other possible problems are that there is a large variance in county sizes and populations, plus I would not use so few examples for an average that encompasses all of the US and its counties.
    B. Reason to think a sample is better than another- It could be that one sample would be better than another because the first sample would be better at estimating the average population per county in just that state since they are all from that state. The second one is better because it has selections from multiple states giving it more variance.
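
A side note on answers 6 and 7, as a minimal sketch rather than part of the original exercise: head always returns the same first rows, while sample only repeats itself if the random number generator is pinned down. The small table df_small below is typed in by hand from the ten rows displayed in the question (an assumption made so the example runs without downloading the CSV), and the random_state argument is used here in place of the seed call from the question.

```python
import pandas as pd

# The ten rows displayed in the question, entered by hand.
# This is NOT the full county_small.csv file.
rows = [
    ("Alabama", "Autauga County", 1001, 54571, 85.3, 21.7),
    ("Alabama", "Baldwin County", 1003, 182265, 87.6, 26.8),
    ("Alabama", "Barbour County", 1005, 27457, 71.9, 13.5),
    ("Alabama", "Bibb County", 1007, 22915, 74.5, 10.0),
    ("Alabama", "Blount County", 1009, 57322, 74.7, 12.5),
    ("Kentucky", "Hart County", 21099, 18199, 67.7, 9.2),
    ("North Carolina", "Buncombe County", 37021, 238318, 87.2, 31.2),
    ("Missouri", "Polk County", 29167, 31137, 79.5, 16.6),
    ("Texas", "Lamar County", 48277, 49793, 82.4, 17.4),
    ("Indiana", "Whitley County", 18183, 33292, 90.2, 16.6),
]
columns = ["state", "name", "FIPS", "pop2010", "hs_grad", "bachelors"]
df_small = pd.DataFrame(rows, columns=columns)

# head(5) is deterministic: it is always the first five rows in file order.
first_five = df_small.head(5)
print(first_five["state"].unique())  # only Alabama appears

# sample(5) is random; passing the same random_state repeats the same draw.
draw_one = df_small.sample(5, random_state=75)
draw_two = df_small.sample(5, random_state=75)
print(draw_one.equals(draw_two))  # True
```

The rows drawn from this ten-row table will not match the rows shown in the question, since that draw was made from the full file.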
megan
  1. The state column labels the different cases. Each row is a different case.
  2. The variables are the columns: name, FIPS, pop2010, hs_grad, bachelors
  3. name - categorical
    FIPS - numerical
    pop2010 - numerical
    hs_grad - numerical
    bachelors - numerical
  4. FIPS - discrete
    pop2010 - discrete
    hs_grad - continuous
    bachelors - continuous
  5. name - nominal
  6. The purpose of the head command is to show the first 5 rows, or cases, of data
  7. The purpose of the sample command is to pull a random 5 rows, or cases, of data
  8. Problems I see with the samples - The first sample isn’t representative of the whole country because all the counties are in AL. Both samples are too small to get an accurate representation of the whole US.
    Is one sample better than the other - The second sample is better because it is taken from 5 different states rather than just one, which makes it more representative of the entire population of the country.
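
The point that both samples are too small can be illustrated with a short simulation. This is only a sketch under loud assumptions: the "population" below is synthetic, right-skewed made-up data generated with NumPy, not the real county populations, and the function name spread_of_sample_means is hypothetical. It compares how much the sample mean bounces around for samples of size 5 versus size 100.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for county populations.
# These are made-up numbers, NOT the values in county_small.csv.
population = rng.lognormal(mean=10, sigma=1.5, size=3000)

def spread_of_sample_means(n, reps=1000):
    """Standard deviation of the sample mean over many random samples of size n."""
    means = [
        rng.choice(population, size=n, replace=False).mean()
        for _ in range(reps)
    ]
    return np.std(means)

spread_5 = spread_of_sample_means(5)
spread_100 = spread_of_sample_means(100)

# Means of 5-value samples vary far more than means of 100-value samples.
print(spread_5 > spread_100)
```

With skewed data like this, a single very large value can dominate a sample of 5, which is one reason such a small sample gives an unreliable estimate of the average.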
Mark

@megan and @joshua I like megan’s answer better. I do see one issue with it, though: FIPS is not a numerical variable. Rather, it’s a categorical, ordinal variable - a variable that looks numeric (since its value is a number) but is really used to put things into categories.
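
To see why FIPS is easy to mistake for a numerical variable, here is a minimal sketch using three FIPS values from the tables above. The frame name df_fips is hypothetical. pandas reads the codes as numbers because they are made of digits, but they behave like labels: a county FIPS code is a five-digit identifier whose first two digits give the state, so storing it as a zero-padded string is one reasonable way to treat it as categorical.

```python
import pandas as pd

# Three rows taken from the tables in the question.
df_fips = pd.DataFrame({
    "name": ["Autauga County", "Hart County", "Buncombe County"],
    "FIPS": [1001, 21099, 37021],
})

# pandas treats the column as numeric simply because the values are digits.
looks_numeric = pd.api.types.is_numeric_dtype(df_fips["FIPS"])
print(looks_numeric)  # True

# Treat the codes as labels instead: five-character strings with leading zeros.
df_fips["FIPS"] = df_fips["FIPS"].astype(str).str.zfill(5)
print(df_fips["FIPS"].tolist())  # ['01001', '21099', '37021']

# The first two characters identify the state, which is label-like behavior.
df_fips["state_code"] = df_fips["FIPS"].str[:2]
print(df_fips["state_code"].tolist())  # ['01', '21', '37']
```

Arithmetic such as the mean of the FIPS column has no sensible interpretation, which is a quick practical test for a variable that is a category in disguise.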

dennis
  1. cases: state/county combination
  2. Variables: FIPS, pop2010, hs_grad, bachelors
  3. FIPS-categorical
    pop2010 - numerical
    hs_grad - numerical
    bachelors - numerical
  4. pop2010 - discrete
    hs_grad - continuous
    bachelors - continuous
  5. state - nominal
    name - nominal
    FIPS - ordinal
  6. to display the first 5 rows of the population, to keep the page clean for clarity.
  7. to pick 5 cases from the population
  8. a) only 5 out of thousands of counties across the US
    b) the second would be better as it’s randomly generated, vs. the 1st 5 cases of an alphabetised list.
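
As a rough illustration of question 8, the two estimates can be computed directly from the pop2010 values displayed in the two tables. This is a sketch only: the names head_pops and sample_pops are hypothetical, and the true average county population is not computed here because the full file is not loaded.

```python
import pandas as pd

# pop2010 values copied from the df.head(5) output (all Alabama counties).
head_pops = pd.Series([54571, 182265, 27457, 22915, 57322])

# pop2010 values copied from the df.sample(5) output (five different states).
sample_pops = pd.Series([18199, 238318, 31137, 49793, 33292])

head_mean = head_pops.mean()
sample_mean = sample_pops.mean()
print(head_mean)    # 68906.0
print(sample_mean)  # 74147.8

# In each table one county holds roughly half or more of the five-county total.
print(head_pops.max() / head_pops.sum())
print(sample_pops.max() / sample_pops.sum())
```

Both estimates rest on only five values, and in each table a single large county (Baldwin or Buncombe) pulls the mean upward. The random sample avoids the one-state bias of the first five rows, but with so few cases neither number should be trusted much.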