Stat 185¶

Intro to class and data basics¶

Statistics vs data science¶

Check out Wikipedia's article on Data Science. According to that article:

Data science is "The Sexiest Job of the 21st Century" (Wikipedia cites the Harvard Business Review)
There could be a global shortage of 1.5 million data scientists (Wikipedia cites McKinsey & Company)

So...¶

just what is this hot, new field?¶

According to the same Wikipedia article, data science is a "concept to unify statistics, data analysis and their related methods" (Wikipedia cites a prominent text).
According to page 8 of our textbook, "Statistics is the study of how best to collect, analyze, and draw conclusions from data".
More bluntly, Nate Silver says that data science is a "sexed-up term for statistics".

Why so hot?¶

Statistics (or data science, if you want) is hot for good reason. Data is becoming easier and easier to come by and its impact more and more pervasive. Think

Medicine
Politics
Sports
Tech (Your car knows when you gain weight)

That's quite a variety of fields! Maybe that's why I recently stumbled on this article (written by a physician) asserting that statistics may be the most important class that you'll ever take.

What if I'm just an ordinary person?¶

What if you're not interested in being a techie or a doctor or anything like that? What if you just want to be an ordinary person?

You're extraordinary!¶

First off, if you complete your goal of obtaining a college degree, you're not exactly an "ordinary" person. By my estimates, barely 31% of US adults have a bachelor's degree or higher. Of course, only some fraction of those folks have taken a statistics class so, in a sense, you're already approaching the data elite!

Do you read the newspaper?¶

You really need some level of quantitative literacy in general and statistical literacy in particular to be an informed citizen these days. Politics provides all kinds of examples, as do health, sports, and entertainment. Here are just a few:

Election predictions from FiveThirtyEight
- 2018 House
- 2016 Presidential election race
Gerrymandering
- A general NYTimes article
- A more technical Wired article.
Is immigration linked to crime?
- A year-old Marshall Project article
- A more recent NYTimes article
Health
- JAMA on walking from this CNN article
- CBD?
The NBA finals
- ESPN's predictions
- FiveThirtyEight's predictions
GoT ratings
- Chart
- from this NYTimes article

A look at some actual data¶

One major objective of this class is to learn to deal with real-world data. So let's start by taking a look at our class data.

Note that, as we go through these examples, we'll meet several important concepts from section 1.2 of our text whose title is Data Basics.

Getting and examining the data¶

I've collected the results of our class survey and the last few lines look something like so:

import pandas as pd

# Load the raw survey export.
df = pd.read_csv('../records/results-survey538927.csv')
# Drop the bookkeeping columns added by the survey tool.
df = df.drop(['Date submitted', 'Access code','Last page', 'Start language', 'Seed'], axis=1)
#df['Where is your hometown?'] = df['Where is your hometown?'].apply(lambda s: 'looks like: 35.6,-82.55')
# Show the last five responses.
df.tail()

	Response ID	How old are you?	How tall are you? [Feet]	How tall are you? [Inches]	Are you left or right handed?	What is your gender (optional):	Choose your eye color	Choose your eye color [Other]	What is your major?	Where is your hometown?
34	37	19	5	9	Right	Male	Brown	NaN	Environmental Studies	42.6123;-71.41414
35	38	21	5	10	Right	Male	Blue	NaN	Computer science	35.59925;-82.56157
36	39	19	6	5	Right	Male	Brown	NaN	Management	37.55376;-77.46026
37	40	30	6	1	Right	Male	Hazel	NaN	Computer science	30.07994;-95.41716
38	41	19	6	2	Right	Male	Brown	NaN	Psychology	29.51372;-95.44922

That's actual data, though I've set everyone's hometown to Asheville for privacy purposes.

Data tables¶

The result of our import is called a data table or data frame.

Each row in a data table is called an observation and corresponds to a case in our study.

Each column corresponds to a variable or characteristic associated with the cases.

There are two main types of data and both types can be further classified into two sub-types.

Numerical data, which can be
- Discrete or
- Continuous
Categorical data, which can be
- Nominal or
- Ordinal
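
As a rough sketch of how these four sub-types might be represented in pandas, here is a tiny made-up table. The column names and values are hypothetical and are not taken from the class survey; the point is only that an ordinal variable carries an ordering while a nominal one does not.

```python
import pandas as pd

# Hypothetical mini-survey; every name and value here is made up for illustration.
toy = pd.DataFrame({
    'siblings': [0, 2, 1, 3],                            # numerical, discrete (counts)
    'height_ft': [5.75, 5.83, 6.42, 6.08],               # numerical, continuous (measurements)
    'eye_color': ['Brown', 'Blue', 'Brown', 'Hazel'],    # categorical, nominal
    'class_year': ['Freshman', 'Junior', 'Sophomore', 'Senior'],  # categorical, ordinal
})

# Nominal: categories with no natural order.
toy['eye_color'] = pd.Categorical(toy['eye_color'])

# Ordinal: categories with an explicit order.
year_order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
toy['class_year'] = pd.Categorical(toy['class_year'], categories=year_order, ordered=True)

print(toy.dtypes)

# Ordered categories support comparisons and max/min; unordered ones do not.
print(toy['class_year'].max())
print(toy['class_year'] > 'Sophomore')
```

Whether a variable like age counts as discrete or continuous is partly a matter of how it is recorded, so this classification is a modeling choice as much as a property of the data.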

Histograms for numerical data¶

We can get an understanding of what numerical data looks like by examining a histogram.

# Combine feet and inches into a single height measured in feet.
df['height'] = df['How tall are you? [Feet]'] + df['How tall are you? [Inches]'] / 12
df.hist(['height','How old are you?'],grid=False, edgecolor='black', bins=5, figsize = (12,6));
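
To see what a histogram actually counts, here is a minimal sketch using NumPy on a small list of made-up ages (an assumption for illustration, not the class data). It splits the range of the data into 5 equal-width bins, mirroring the bins=5 setting above, and counts how many values land in each.

```python
import numpy as np

# Hypothetical ages, made up for illustration (not the class survey).
ages = np.array([18, 19, 19, 19, 20, 20, 21, 21, 22, 30])

# Split the range [18, 30] into 5 equal-width bins and count the values in each.
counts, edges = np.histogram(ages, bins=5)

# Print each bin's left edge, right edge, and count.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:5.1f} to {right:5.1f}: {count}')
```

The height of each bar in the histogram is just one of these counts.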

Box plots for numerical data¶

Alternatively, we might look at a box plot, which illustrates the so-called five-number summary of

min, 1st quartile, median, 3rd quartile, and max.

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
df.boxplot(['How old are you?'] , grid=False)
plt.subplot(1,2,2)
df.boxplot(['height'] , grid=False);
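
The five values that a box plot summarizes can also be computed directly. The sketch below uses NumPy on a handful of made-up heights (hypothetical values, in feet). Note that there are several conventions for computing quartiles; this one uses NumPy's default linear interpolation. Also, by default the whiskers in a matplotlib box plot stop at the most extreme point within 1.5 times the interquartile range of the box, and anything beyond is drawn as an individual point, so the whisker ends need not equal the min and max.

```python
import numpy as np

# Hypothetical heights in feet, made up for illustration.
heights = np.array([5.5, 5.75, 5.25, 6.0, 6.5, 5.0, 6.25])

# The 0th, 25th, 50th, 75th, and 100th percentiles give
# min, 1st quartile, median, 3rd quartile, and max.
minimum, q1, median, q3, maximum = np.percentile(heights, [0, 25, 50, 75, 100])

print('min:   ', minimum)
print('Q1:    ', q1)
print('median:', median)
print('Q3:    ', q3)
print('max:   ', maximum)
```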

A pie chart for categorical data¶

Lots of people seem to love pie charts for categorical data. I only favor pie charts when comparing two values of a categorical variable.

df['What is your gender (optional):'].value_counts().plot.pie(figsize=(8,8));

A bar chart for categorical data¶

A better way to look at categorical data is with a bar chart, which makes it much easier to compare two values that are quite close to one another.

df['Choose your eye color'].value_counts().plot.bar(rot=0, figsize=(13,8));

Comparing variables¶

Sometimes we'll want to find out whether there's a relationship between two variables. In the following side-by-side box plot, we examine the relationship between gender and height.

import seaborn as sns
pic = sns.catplot(
    kind='box', x='What is your gender (optional):', y='height', data=df,
    aspect = 1.5)
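
A numerical companion to the side-by-side box plot is a per-group summary. Here is a minimal sketch with a made-up table; the short column names gender and height and all of the values are hypothetical stand-ins for the survey columns.

```python
import pandas as pd

# Hypothetical data, made up for illustration; heights are in feet.
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Male', 'Male', 'Female'],
    'height': [5.25, 5.75, 5.5, 6.0, 6.25, 5.75],
})

# Split the rows by gender, then summarize height within each group.
summary = toy.groupby('gender')['height'].agg(['count', 'median', 'min', 'max'])
print(summary)
```

Each row of the summary corresponds to one box in a plot like the one above.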

Geographic data¶

Geographic data is of tremendous importance and is often best illustrated on a map. Here's what the actual answers to the hometown question look like:

import folium
from numpy import random

# Start with a map centered near Asheville, NC.
hometown_map = folium.Map(location = [35.6, -82.6], zoom_start = 5)

random.seed(1)
def s_to_pt(s):
    # Parse a 'lat;lon' string and shift each coordinate by a small random amount.
    lat, lon = s.split(';')
    return  [float(lat) + (random.random()-0.5)/20, float(lon) + (random.random()-0.5)/20]

# Draw one circle of radius 5000 meters per response.
for pt in df['Where is your hometown?']:
    folium.Circle(s_to_pt(pt), 5000, fill=True, popup = pt).add_to(hometown_map)
hometown_map


The objectives of Statistics¶

One simple characterization of statistics is

the study of how to collect, analyze, and draw conclusions from data.

Note the three parts:

Collect: Using experiments or observational studies,
Analyze:
- Qualitatively - often using graphs
- Quantitatively - often using computations from probability theory
Draw conclusions: The process of inference

Collecting data¶

Collecting data often boils down to designing an experiment or observational study.

An observational study is one where the data collection does not interfere with how the data arises. Examples include surveys and reviews of records.
In an experiment, researchers actively work with the samples, applying treatments and observing effects. Medical researchers, for example, might study the efficacy of a vaccine where a large random sample of patients is broken into a treatment group and non-treatment group; researchers then compare the outcomes of the two groups.
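
As a rough sketch of how a sample might be broken into two groups at random, the following shuffles a list of made-up patient IDs and splits it in half. The IDs, the group size, and the seed are all assumptions for illustration; real trials typically use more careful randomization schemes.

```python
import random

# Fixed seed so the split is reproducible; the value is arbitrary.
rng = random.Random(185)

# Hypothetical patient IDs, made up for illustration.
patients = [f'patient_{i:02d}' for i in range(1, 21)]

# Shuffle a copy, then take the first half as treatment and the rest as non-treatment.
shuffled = patients[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
treatment = sorted(shuffled[:half])
control = sorted(shuffled[half:])

print('treatment:    ', treatment)
print('non-treatment:', control)
```

Because chance alone decides who lands in which group, differences in outcomes between the groups are less likely to be explained by how the groups were chosen.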

We'll talk a lot more about getting data next time.

Analyzing data¶

The images we've seen today allow us to draw rough qualitative conclusions about the data we see. Quantitative analysis is more numerical. For example, 2 out of the 39 people in the class are left-handed, or about 5.1%.

This is more precise information than we could glean from a pie chart.
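
Here is a minimal sketch of that computation in pandas. Since the full survey file isn't reproduced in these notes, the series below is a stand-in built to match the counts quoted above (2 left-handed out of 39); with the real data, the same calls could be applied to the handedness column of df.

```python
import pandas as pd

# Stand-in for the handedness column: 2 'Left' and 37 'Right',
# constructed to match the counts quoted in the text.
handedness = pd.Series(['Left'] * 2 + ['Right'] * 37,
                       name='Are you left or right handed?')

# Raw counts per category.
counts = handedness.value_counts()

# Proportions per category (counts divided by the total).
proportions = handedness.value_counts(normalize=True)

print(counts)
print((100 * proportions).round(1))
```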

Inference¶

Ultimately, we want to draw inferences or conclusions from data. To do so, it helps to have a little terminology:

Population refers to the complete set of entities under consideration,
Sample refers to some subset of the population,
Parameter refers to some summary characteristic of the population, and
Statistic refers to some summary characteristic computed from a sample.

The main question is: once you've computed a statistic from a sample, what might that tell you about the corresponding parameter for the whole population?

For example, about 5.1% of our class is left-handed. We could take that as an approximation to the proportion of left-handed folks in the whole population. How accurate an approximation might we expect that to be?
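
One way to get a feel for that question is simulation. The sketch below assumes, purely for illustration, that 10% of the population is left-handed and that a class of 39 behaves like a random sample from that population. Neither assumption comes from these notes, and a class is not really a random sample. It then draws many simulated classes and looks at how much the sample proportion bounces around.

```python
import numpy as np

rng = np.random.default_rng(185)   # arbitrary seed for reproducibility

p_true = 0.10       # ASSUMED population proportion of left-handers (hypothetical)
n = 39              # class size quoted in the notes
n_samples = 10_000  # number of simulated classes

# Each entry is the number of left-handers in one simulated class of n people.
left_counts = rng.binomial(n, p_true, size=n_samples)

# Convert counts to sample proportions (the statistic).
sample_props = left_counts / n

print('mean of sample proportions:', sample_props.mean().round(3))
print('std of sample proportions: ', sample_props.std().round(3))
print('share of classes with 2 or fewer left-handers:',
      (left_counts <= 2).mean().round(3))
```

The spread of the simulated proportions gives a rough sense of how far a statistic from a sample of this size can sit from the parameter it estimates.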