Check out Wikipedia's article on Data Science. Quoting from that article, I guess that
Statistics (or data science, if you want) is hot for good reason. Data is becoming easier and easier to come by and its impact more and more pervasive. Think
That's quite a variety of fields! Maybe that's why I recently stumbled on this article (written by a physician) asserting that statistics may be the most important class that you'll ever take.
What if you're not interested in being a techie or a doctor or anything like that? What if you just want to be an ordinary person?
First off, if you complete your goal of obtaining a college degree, you're not exactly an "ordinary" person. By my estimates, barely 31% of US adults have a bachelor's degree or higher. Of course, only some fraction of those folks have taken a statistics class so, in a sense, you're already approaching the data elite!
You really need some level of quantitative literacy in general, and statistical literacy in particular, to be an informed citizen these days. Politics provides all kinds of examples. Here are just a few:
One major objective of this class is to learn to deal with real world data. So let's start by taking a look at our class data.
Note that, as we go through these examples, we'll meet several important concepts from section 1.2 of our text whose title is Data Basics.
I've collected the results of our class survey and the last few lines look something like so:
 | Response ID | How old are you? | How tall are you? [Feet] | How tall are you? [Inches] | Are you left or right handed? | What is your gender (optional): | Choose your eye color | Choose your eye color [Other] | What is your major? | Where is your hometown? |
---|---|---|---|---|---|---|---|---|---|---|
28 | 31 | 19 | 5 | 3 | Right | Female | Blue | NaN | Computer Science | looks like: 35.6,-82.55 |
29 | 32 | 18 | 5 | 8 | Right | Female | Brown | NaN | Mass Communications | looks like: 35.6,-82.55 |
30 | 33 | 18 | 5 | 9 | Right | Female | Blue | NaN | undecided | looks like: 35.6,-82.55 |
31 | 34 | 19 | 5 | 4 | Right | Female | Hazel | NaN | health and wellness | looks like: 35.6,-82.55 |
32 | 35 | 20 | 5 | 7 | Right | Male | Blue | NaN | Political Science | looks like: 35.6,-82.55 |
That's actual data, though I've set everyone's hometown to Asheville for privacy purposes.
The result of our import is called a data table or data frame.
Each row in a data table is called an observation and corresponds to a case in our study.
Each column corresponds to a variable or characteristic associated with the cases.
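To make the row and column vocabulary concrete, here is a minimal sketch that re-types the five rows shown above and treats each row as an observation and each column as a variable. The short column names (`age`, `feet`, `inches`, and so on) are hypothetical stand-ins for the survey questions, and the hometown and "Other" eye color columns are left out for brevity; this is not necessarily how the class data was actually imported.

```python
import csv
import io

# The five rows shown above, re-typed as CSV. The short column names
# are hypothetical stand-ins for the full survey questions.
raw = """response_id,age,feet,inches,handed,gender,eye_color,major
31,19,5,3,Right,Female,Blue,Computer Science
32,18,5,8,Right,Female,Brown,Mass Communications
33,18,5,9,Right,Female,Blue,undecided
34,19,5,4,Right,Female,Hazel,health and wellness
35,20,5,7,Right,Male,Blue,Political Science
"""

# Each dict is one observation (row); each key is one variable (column).
observations = list(csv.DictReader(io.StringIO(raw)))

# Derive a new numerical variable (height in inches) from two existing ones.
for obs in observations:
    obs["height_in"] = 12 * int(obs["feet"]) + int(obs["inches"])

variables = list(observations[0].keys())
print(len(observations), "observations,", len(variables), "variables")
print([obs["height_in"] for obs in observations])
```

Note that `age` and `height_in` are numerical variables, while `handed`, `eye_color`, and `major` are categorical.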
There are two main types of data and both types can be further classified into two sub-types.
We can get an understanding of what numerical data looks like by examining a histogram.
Alternatively, we might look at a box plot, which illustrates the so-called five-point summary of
min, 1st quartile, median, 3rd quartile, and max.
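As a rough illustration, the five values behind a box plot can be computed directly. The sketch below uses the heights (in inches) of the five students in the table excerpt above, so it describes those five rows and not the whole class. Quartile conventions differ between textbooks and software; the `"inclusive"` method used here is just one common choice.

```python
import statistics

# Heights in inches for the five rows shown in the table excerpt:
# 5'3", 5'8", 5'9", 5'4", 5'7"
heights_in = [63, 68, 69, 64, 67]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(heights_in, n=4, method="inclusive")

five_point = (min(heights_in), q1, median, q3, max(heights_in))
print(five_point)
```

With only five observations the summary is not very informative, but the same few lines work unchanged on a full column of heights.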
Lots of people seem to love pie charts for categorical data. I only favor pie charts when comparing two values of a categorical variable.
A better way to look at categorical data is with a bar chart, which makes it much easier to compare two values that are quite close to one another.
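A bar chart is just a picture of a table of counts. As a small sketch, here are the counts for the eye colors of the five students in the table excerpt above, drawn as a crude text bar chart. A real plot would come from a plotting library; this only shows the counting step.

```python
from collections import Counter

# Eye colors from the five rows shown in the table excerpt.
eye_colors = ["Blue", "Brown", "Blue", "Hazel", "Blue"]

# Tally how many times each category occurs.
counts = Counter(eye_colors)

# One bar per category, longest first.
for color, n in counts.most_common():
    print(f"{color:<6} {'#' * n} ({n})")
```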
Sometimes we'll want to find out whether there's a relationship between two variables. In the following side-by-side boxplot, we examine the relationship between gender and height.
Geographic data is of tremendous importance and is often best illustrated on a map. Here's what the actual answers to the hometown question look like:
One simple characterization of statistics is
the study of how to collect, analyze, and draw conclusions from data.
Note the three parts:
Collecting data often boils down to designing an experiment or observational study.
We'll talk a lot more about getting data next time.
The images we've seen today allow us to draw rough qualitative conclusions about the data we see. Quantitative analysis is more numerical. For example, 2 out of the 33 people in the class are left handed, or about 6%.
This is more precise information than we could glean from a pie chart.
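The left-handed percentage is simple arithmetic, sketched below. The list is reconstructed from the counts quoted above (2 left handed out of 33) and is not the raw survey column.

```python
# Reconstructed from the counts in the text: 2 left handed, 31 right handed.
handedness = ["Left"] * 2 + ["Right"] * 31

n_left = handedness.count("Left")
proportion_left = n_left / len(handedness)

print(f"{n_left} of {len(handedness)} = {proportion_left:.1%}")
```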
Ultimately, we want to draw inferences or conclusions from data. To do so, it helps to have a little terminology:
The main question is: once you've computed a statistic from a sample, what might that tell you about the corresponding parameter for the whole population?
For example, about 6% of our class (2 out of 33) is left handed. We could take that as an approximation to the proportion of left handed folks in the whole population. How accurate an approximation might we expect that to be?