Stat 185 - Computer lab 1

In this first computer lab, we'll get acquainted with Python, the tool that I've been using to illustrate many of the ideas in class.

Import

Python is a general-purpose programming language. It's not designed to do everything out of the box; rather, it's designed to be highly adaptable so that you can program it to do just about anything you want. If you want to work in a specific domain, like statistics, you often need to import the appropriate libraries. Once you do that, you have all sorts of functionality at your disposal.

Here are some of the main libraries that we'll typically import when doing statistics with Python:

In [ ]:
# Graphics
%matplotlib inline   
from matplotlib import pyplot as plt

# Numerical tools
import numpy as np

# Data analysis
import pandas as pd
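
As a quick sanity check that the imports worked, you can call something from each library through its short name. The particular calls below are just illustrations chosen for this sketch, not part of the lab data:

```python
import numpy as np
import pandas as pd

# NumPy supplies numerical functions under the np prefix.
root = np.sqrt(16)

# Pandas supplies data structures, such as Series, under the pd prefix.
average = pd.Series([1, 2, 3]).mean()

print(root, average)
```

If both values print without an error, the libraries are available in your session.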

Reading data

OK, let's move straight to getting some actual data in our hands. It's super easy to read data into Python right over the internet using the Pandas library. Execute the following code:

In [ ]:
df = pd.read_csv('https://www.marksmath.org/data/county_small.csv')
print(len(df))
df.head()

Exercise:

  • What are the cases?
  • What do you think the variables represent?
  • Which variables are numerical and which are categorical?
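
If you'd like a hint for the last question, Pandas can report how each column is stored. The sketch below uses a tiny made-up table (the column names and values are invented for illustration and are not taken from county_small.csv); the same commands should work on df.

```python
import pandas as pd

# A tiny made-up table; these columns are illustrative only and are
# not the columns of county_small.csv.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'region': ['East', 'West', 'East'],
    'pop': [1200, 850, 4300],
    'rate': [12.5, 9.1, 30.2],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # storage type of each column

# Columns stored as numbers; text columns are left out.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```

Keep in mind that the storage type is only a hint: a column of numbers (an ID code, say) can still be categorical in meaning, so you'll need to think about what each variable represents.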

Box plots and percentiles

OK, let's take a look at a box plot of the percentage of people with a bachelor's degree in each county (ignoring outliers):

In [ ]:
df.boxplot('bachelors', vert=False, grid=False, showfliers=False);

Exercise: Based on that picture, what are the 25th percentile, the median, and the 75th percentile of this data?

You can check your answer with the following command:

In [ ]:
df.bachelors.describe()

Exercise: Why does the previous result disagree with the picture on the maximum value? (You might try removing the showfliers=False portion of the code.)
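
Percentiles can also be computed directly, which is handy when you want just one or two of them. Here is a minimal sketch on a small made-up list of numbers (not the county data), using the quantile method of a Pandas Series:

```python
import pandas as pd

# Made-up data, chosen so the quartiles are easy to verify by hand.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 25th percentile, median, and 75th percentile.
quartiles = s.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The interquartile range is the width of the box in a box plot.
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
print(iqr)
```

The same call on the lab data would be df.bachelors.quantile([0.25, 0.5, 0.75]), which should agree with the 25%, 50%, and 75% rows reported by describe().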

Histograms

Here's a histogram:

In [ ]:
df.hist('bachelors', edgecolor='black', grid=False,
    bins=[0,10,20,30,40,50,60]
);

Exercise: Draw a histogram by hand using the following data: $$ 45,56,65,42,53,57 $$ Assume that the bins break at the multiples of 10.

You can use the following code to check your answer:

In [ ]:
plt.hist([45,56,65,42,53,57],
    edgecolor='black',
    bins=[40,50,60,70]
);
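
If you'd rather check the bar heights as numbers than read them off the picture, NumPy can count the values in each bin. This is a small sketch using the same data and bins as above:

```python
import numpy as np

data = [45, 56, 65, 42, 53, 57]

# counts[i] is the number of values falling in the i-th bin;
# edges holds the bin boundaries that were used.
counts, edges = np.histogram(data, bins=[40, 50, 60, 70])
print(counts)
print(edges)
```

Each bin includes its left edge but not its right edge, except for the last bin, which includes both.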

While we're at it, let's check the mean and standard deviation:

In [ ]:
[np.mean([45,56,65,42,53,57]),
 np.std([45,56,65,42,53,57], ddof=1)]

Exercise: Write down the formulas that produce these results!
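
Once you have your formulas written down, one way to check them is to carry out the arithmetic step by step, without the NumPy shortcuts. The sketch below is one such check; the variable names are my own choices. Try the exercise first, since the code gives the answer away.

```python
import math

data = [45, 56, 65, 42, 53, 57]
n = len(data)

# Mean: add up the values and divide by how many there are.
mean = sum(data) / n

# Sample variance: add up the squared deviations from the mean and
# divide by n - 1 (this is what ddof=1 asks NumPy to do).
squared_devs = [(x - mean) ** 2 for x in data]
sample_var = sum(squared_devs) / (n - 1)

# Sample standard deviation: the square root of the sample variance.
sample_std = math.sqrt(sample_var)

print(mean, sample_std)
```

The two printed numbers should match the NumPy output above.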