Continuous random variables
A continuous random variable is one that can take on (at least in principle) a continuous range of real numbers. Here are a few examples:
- Example A: Find somebody and choose \(X\) to be a very precise measure of their height.
- Example B: Randomly choose a college and choose \(X\) to be the average salary of all the professors.
- Example C: Let \(X\) be the average margin of victory in all Super Bowls with Taylor Swift in attendance as of the year 2050.
The tricky thing is figuring out how to describe the distribution of a continuously distributed random variable.
Example
Suppose we pick a number \(X\) uniformly at random from the interval \([-10,10]\). What is the probability that the number lies in the interval \([1,3]\)?
Solution: We simply divide the length of the sub-interval by the length of the larger interval to get
\[P(1<X<3) = \frac{2}{20} = 0.1.\]
Note that we’ve indicated the event using the inequality \(1<X<3\), as we will typically do.
A distribution as a limit
There should be an obvious relationship between the continuous uniform distribution and the discrete uniform distribution. The picture below illustrates this relationship and also helps us see how a continuous distribution might arise as a limit of discrete distributions.
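We can also watch this happen numerically. Here’s a minimal sketch in Python (the helper `discrete_prob` and the particular grid sizes are just for illustration): put a discrete uniform distribution on \(n\) evenly spaced points in \([-10,10]\) and compute the probability of landing in \((1,3)\), which should approach the continuous answer of \(0.1\).

```python
# Approximate the continuous uniform distribution on [-10, 10] by a
# discrete uniform distribution on n evenly spaced grid points, each
# carrying probability 1/n.
def discrete_prob(n, a=1, b=3):
    points = [-10 + 20 * k / (n - 1) for k in range(n)]
    hits = sum(1 for x in points if a < x < b)  # grid points landing in (a, b)
    return hits / n

for n in [8, 25, 123, 4567]:
    print(n, discrete_prob(n))  # approaches P(1 < X < 3) = 0.1 as n grows
```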
Probability density functions
One common way to define a continuous distribution is by integrating a non-negative function \(f\) with total integral 1. That is, \(f\) should satisfy
- \(f(x) \geq 0\) for all \(x\in\mathbb R\) and
- \(\displaystyle \int_{-\infty}^{\infty} f(x) \ dx = 1.\)
Then, if \(X\) is a random variable with distribution \(f\), we have \[P(a<X<b) = \int_a^b f(x) \ dx.\]
To generate the uniform distribution over an interval from \(x=A\) to \(x=B\), for example, we could define \[f(x) = \begin{cases} 1/(B-A) & A<x<B \\ 0 & \text{else}.\end{cases}\]
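As a quick check (a sketch using SciPy’s `quad`; SciPy is assumed to be available), we can verify that this density satisfies both conditions and reproduces the earlier computation of \(P(1<X<3)\):

```python
from scipy.integrate import quad

A, B = -10, 10

def f(x):
    # Uniform density: constant 1/(B - A) on (A, B), zero outside.
    return 1 / (B - A) if A < x < B else 0

total, _ = quad(f, -20, 20, points=[A, B])  # total mass: should be 1
prob, _ = quad(f, 1, 3)                     # P(1 < X < 3): should be 0.1
print(total, prob)
```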
Another example
As another simple example, let’s take \[f(x) = \begin{cases} \frac{3}{10}(x^2+1)(2-x) & 0<x<2 \\ 0 & \text{else}.\end{cases}\] It’s not too hard to show that \[\int_{-\infty}^{\infty} f(x) \ dx = \int_0^2 \frac{3}{10}(x^2+1)(2-x) \ dx = 1.\] If \(X\) has distribution \(f\) and we want to know, for example, \(P(\frac{1}{2}<X<1)\), we can compute directly: \[P(\frac{1}{2} < X < 1) = \int_{1/2}^1 \frac{3}{10}(x^2+1)(2-x) \ dx = \frac{187}{640} = 0.2921875.\]
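Both computations are easy to verify symbolically; here’s a sketch using SymPy (assumed to be available):

```python
from sympy import symbols, integrate, Rational

x = symbols('x')
f = Rational(3, 10) * (x**2 + 1) * (2 - x)

print(integrate(f, (x, 0, 2)))               # total mass: 1
print(integrate(f, (x, Rational(1, 2), 1)))  # P(1/2 < X < 1): 187/640
```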
The exponential distribution
Many important continuous distributions are positive over an unbounded interval. The exponential distribution, for example, has the form
\[f_{\lambda}(x) = \begin{cases} \lambda e^{-\lambda x} & x > 0 \\ 0 & \text{else}.\end{cases}\]
Note that the exponential distribution depends on a parameter, \(\lambda\). No matter what the value of \(\lambda\), though, we have
\[\int_{-\infty}^{\infty} f_{\lambda}(x) \ dx = \int_0^{\infty} \lambda e^{-\lambda x} \ dx = 1.\]
The larger \(\lambda\) is, the more concentrated the distribution is near zero.
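To see this concretely, note that \(\int_0^1 \lambda e^{-\lambda x} \ dx = 1 - e^{-\lambda}\), so \(P(0<X<1)\) is easy to tabulate for a few values of \(\lambda\) (a minimal sketch in pure Python):

```python
from math import exp

# P(0 < X < 1) = 1 - exp(-lam), since -exp(-lam * x) is an
# antiderivative of lam * exp(-lam * x).
for lam in [0.5, 1, 2, 5]:
    print(lam, 1 - exp(-lam))  # the mass near zero grows with lambda
```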
PDF vs CDF
Probability and statistics are rife with acronyms. Here are a couple of important ones:
- PDF or Probability Density Function
- CDF or Cumulative Distribution Function
You might see Density and Distribution swapped in the wild, but the essential difference between PDF and CDF is that the PDF is local, while the CDF is cumulative. Put another way, the PDF \(f\) is the function that you integrate to get the CDF \(F\):
\[F(x) = \int_{-\infty}^x f(\chi) \ d\chi.\]
The CDF of a continuous random variable is always continuous and non-decreasing, with \[\lim_{x\to -\infty} F(x) = 0 \: \text{ and } \: \lim_{x\to\infty} F(x) = 1.\]
Here’s the CDF for the exponential distribution with \(\lambda = 1\):
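In this case we can find the CDF in closed form: for \(x > 0\), \[F(x) = \int_0^x e^{-t} \ dt = 1 - e^{-x},\] while \(F(x) = 0\) for \(x \leq 0\). It’s easy to check that this function is continuous, non-decreasing, and has the correct limits at \(\pm\infty\).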
Mean and variance
We already know the mean and variance for discrete random variables:
- \(\displaystyle \mu = \sum_{i} x_i \ p(x_i)\) and
- \(\displaystyle \sigma^2 = \sum_{i} (x_i-\mu)^2 \ p(x_i)\).
There are analogous concepts for continuous random variables:
- \(\displaystyle \mu = \int_{-\infty}^{\infty} x \ f(x) \ dx\)
- \(\displaystyle \sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 \ f(x) \ dx\).
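To illustrate, here’s a sketch (again assuming SciPy is available) that applies these formulas to the density \(f(x) = \frac{3}{10}(x^2+1)(2-x)\) from earlier:

```python
from scipy.integrate import quad

def f(x):
    # The polynomial density from the earlier example, supported on (0, 2).
    return 0.3 * (x**2 + 1) * (2 - x) if 0 < x < 2 else 0

mu, _ = quad(lambda x: x * f(x), 0, 2)             # mean: 0.88
var, _ = quad(lambda x: (x - mu)**2 * f(x), 0, 2)  # variance: 0.2656
print(mu, var)
```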
As with discrete distributions, the variance is the square of the standard deviation. Both variance and standard deviation are important:
- The standard deviation has the same units as the data
- The variance is simpler algebraically
- For example, the variance of a sum of independent random variables is the sum of their variances.
Exponential mean and variance
The mean and variance of the exponential distribution can be computed using integration by parts:
\[\mu = \int_0^{\infty} \lambda x e^{-\lambda x} \ dx = \frac{1}{\lambda}\] \[\sigma^2 = \int_0^{\infty} \lambda (x-1/\lambda)^2 e^{-\lambda x} \ dx = \frac{1}{\lambda^2}\]
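For the mean, for example, take \(u = x\) and \(dv = \lambda e^{-\lambda x} \ dx\), so that \(du = dx\) and \(v = -e^{-\lambda x}\); then \[\mu = \int_0^{\infty} \lambda x e^{-\lambda x} \ dx = \Bigl[-x e^{-\lambda x}\Bigr]_0^{\infty} + \int_0^{\infty} e^{-\lambda x} \ dx = 0 + \frac{1}{\lambda}.\] The variance computation runs along the same lines.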
As \(\lambda\) increases, the distribution becomes more concentrated near zero; thus, the mean and variance both decrease to zero.
Which integration techniques do we need to know?
- We all need to know basic integration techniques up to and including \(u\)-substitution.
- Changing bounds of integration in \(u\)-subs is of particular importance.
- We’ll need to understand improper integrals, since the most important continuous distributions are defined over unbounded intervals.
- It’s worth knowing about integration by parts but it’s unlikely that specific IBP problems would occur on a quiz or exam.
- It’s also worth knowing about numerical integration.
- Probabilities involving the normal distribution, for example, must be computed numerically, as in the sketch below.
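The standard normal density has no elementary antiderivative, so even the classic fact that a standard normal variable lands within one standard deviation of its mean about 68% of the time comes from a numerical integral (a sketch assuming SciPy is available):

```python
from math import exp, pi, sqrt
from scipy.integrate import quad

def phi(x):
    # Standard normal density; it has no elementary antiderivative,
    # so probabilities must be computed numerically.
    return exp(-x**2 / 2) / sqrt(2 * pi)

prob, _ = quad(phi, -1, 1)  # P(-1 < Z < 1)
print(prob)                 # approximately 0.6827
```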