Point Estimation

A major objective of statistics is to use samples to draw conclusions about a population. The simplest example of this is point estimation.

Example

Recall that I've got a data set taken from the CDC's Behavioral Risk Factor Surveillance System. The first few lines look like so:

In [1]:
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
print(len(df.height))
df.head()
20000
Out[1]:
   Unnamed: 0    genhlth  exerany  hlthplan  smoke100  height  weight  wtdesire  age  gender
0           1       good        0         1         0      70     175       175   77       m
1           2       good        0         1         1      64     125       115   33       f
2           3       good        1         1         1      60     105       105   49       f
3           4       good        1         1         0      66     132       124   42       f
4           5  very good        0         1         0      61     150       130   55       f

This is a sample of 20,000 adults taken from the US population. We can compute the average height of folks taken from the sample as follows:

In [2]:
df['height'].mean()
Out[2]:
67.1829

Or, we might compute the proportion of people who exercise some.

In [3]:
df[df['exerany']==1].shape[0]/20000
Out[3]:
0.7457

These sample statistics are point estimates of the corresponding population parameters. The key questions are: how good are they, and can they be improved?

Sampling as a random variable

When we sample from a population we generate a statistic - a number approximating an actual population parameter. If we do it again, we get a different sample and a different result. In this sense, sampling produces a random variable.

In this context, we'll often write $\theta$ for the population parameter and $\hat{\theta}$ for the random variable that arises from the sampling estimate.
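
To see this concretely, here's a little sketch (my own, not part of the original notebook) that draws several samples of size 100 from the CDC data and prints each sample mean; the sample size and seeds are arbitrary choices.

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')

# Each resample yields a different value of the estimate
for i in range(5):
    print(df['height'].sample(100, random_state=i).mean())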

Bias

We say that $\hat{\theta}$ is unbiased if $E(\hat{\theta})= \theta$. Ideally, we should use an unbiased estimator whenever one is available.

Theorem: Sample proportions and sample means are unbiased estimators.
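
We won't prove that here, but a quick simulation is consistent with it. The sketch below (my own check, with an arbitrary sample size and repetition count) treats the 20,000 CDC heights loaded above as a finite population; the average of many sample means lands very close to the population mean.

import numpy as np

np.random.seed(1)
pop = df['height']            # treat the 20,000 heights as the population
sample_means = [pop.sample(50).mean() for _ in range(2000)]
print(pop.mean(), np.mean(sample_means))  # these should nearly agree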

Some biased estimators

1) Suppose we want to know the maximum height in a population. We might pick a large random sample and select the maximum height from the sample as our estimate. While this is an estimate, it is certainly biased, because the sample maximum can never exceed the population maximum; it always under-estimates.

If the sample size is $n$, and the maximum from the sample is $\hat{\theta}$ it might be more natural to use $$\frac{n+1}{n}\hat{\theta}.$$
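
That correction factor is exactly right when sampling from a uniform distribution on $[0,\theta]$, where one can show that $E(\hat{\theta}) = \frac{n}{n+1}\theta$; for other distributions it's just a heuristic. Here's a simulation sketch of the uniform case (the parameter values are arbitrary):

import numpy as np

np.random.seed(0)
theta, n = 10.0, 20
maxes = np.array([np.random.uniform(0, theta, n).max() for _ in range(10000)])
print(maxes.mean())                # biased low - about n/(n+1)*theta
print((n + 1) / n * maxes.mean())  # very close to theta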

2) Suppose we want to estimate the variance of the population heights. If $X_1,X_2,\ldots,X_n$ form a random sample with mean $\overline{X}$, I suppose that one natural choice would be $$ \frac{1}{n} \sum_{i=1}^n \left(X_i - \overline{X}\right)^2. $$ It turns out that this is a bit too small, because we're using $\overline{X}$ rather than the actual population mean, and the $X_i$s are generally closer to the sample mean than to the population mean. A way to fix it is to use $$ \frac{1}{n-1} \sum_{i=1}^n \left(X_i - \overline{X}\right)^2, $$ which is slightly larger. This is exactly what we call the sample variance, and there is a proof in our text that it is unbiased.
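
A simulation makes the size of the bias visible. In the sketch below (my own, with arbitrary choices of $n$ and the repetition count), the population is standard normal, so the true variance is 1; the $1/n$ version averages about $(n-1)/n = 0.9$, while the $1/(n-1)$ version averages about 1.

import numpy as np

np.random.seed(0)
n = 10
samples = [np.random.randn(n) for _ in range(10000)]
print(np.mean([((x - x.mean())**2).sum() / n       for x in samples]))  # about 0.9
print(np.mean([((x - x.mean())**2).sum() / (n - 1) for x in samples]))  # about 1.0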

Minimum variance

In addition to eliminating bias, we would like to minimize the variance of our estimator.

Example:

1) As already stated, if $X_1,X_2,\ldots,X_n$ form a random sample from a population with mean $\mu$ then the sample mean $\overline{X}$ is an unbiased estimator of $\mu$.

2) In addition, if the distribution is symmetric, then the sample median $\tilde{X}$ is also an unbiased estimator of $\mu$.

But the sample mean has smaller variance.

In fact, if the population is normally distributed, then the sample mean is the unique minimum variance unbiased estimator, or MVUE.

Here's a computation that compares the variance in the mean vs the median when drawing from a normal distribution. Generally, the variance of the mean is a bit smaller.

In [4]:
from scipy.stats import norm, cauchy
from numpy import mean, median, var

# Generate 100 samples of size 100
dd = [norm.rvs(0,1,100) for i in range(100)]
[
  var([mean(d) for d in dd]),  # Variance of the mean
  var([median(d) for d in dd]) # Variance of the median
]
Out[4]:
[0.009782403547020308, 0.015100659268001627]

Note, however, that this depends on the distribution. Here's the same computation for the Cauchy distribution, which is so heavy-tailed that it has no mean at all. The variance of the mean is suddenly HUGE.

In [5]:
dd = [cauchy.rvs(0,1,100) for i in range(100)]
[
  var([mean(d) for d in dd]),
  var([median(d) for d in dd])
]
Out[5]:
[187.56530912835547, 0.023172593883594394]

Standard error

The standard error of an estimator is the standard deviation of its sampling distribution. Since variance is additive over independent random variables, we have

$$ \text{var}\left(X_1+X_2+\cdots+X_n\right) = n\sigma^2, $$

whenever the $X_i$s are i.i.d. with variance $\sigma^2$. Thus,

$$ \text{sd}\left(X_1+X_2+\cdots+X_n\right) = \sqrt{n}\sigma $$

so

$$ \text{sd}\left(\frac{X_1+X_2+\cdots+X_n}{n}\right) = \frac{\sigma}{\sqrt{n}}. $$

Similarly, the standard error associated with a sample proportion is

$$\sqrt{\frac{p(1-p)}{n}},$$

where $p$ is the true population proportion.

Generally, a smaller standard error means our estimate is likely to be closer to the true value. Note also that the standard error decreases as the sample size increases.
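
Both formulas are easy to check numerically. The sketch below (my own, with arbitrary parameters) compares the standard deviation of simulated sample means against $\sigma/\sqrt{n}$, and then computes the standard error of the exercise proportion found earlier from the CDC data:

import numpy as np

np.random.seed(0)
sigma, n = 2.0, 100
means = [np.random.normal(0, sigma, n).mean() for _ in range(10000)]
print(np.std(means), sigma / np.sqrt(n))     # both should be near 0.2

p_hat = 0.7457                               # the CDC exercise proportion
print(np.sqrt(p_hat * (1 - p_hat) / 20000))  # about 0.003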

Probability plots

A normal probability plot is a useful tool for assessing normality. Here’s the basic idea: suppose we have some sample data consisting of $n$ numerical values. We’ll make a plot with $n$ points, one for each of the $n$ values. The vertical component of a given point is just the value of the corresponding data point. The horizontal component is the $z$-score corresponding to that data point's percentile within the sample.

For example, $z=1$ is greater than about 84% of all values produced by the standard normal distribution. You can look that up in a normal table or use the following Python code:

In [6]:
norm.cdf(1)
Out[6]:
0.8413447460685429

Furthermore, the $84^{\text{th}}$ percentile of the CDC height data is 72 inches, as the following computation reveals.

In [7]:
df['height'].quantile(0.84)
Out[7]:
72.0
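
Before using scipy's built-in tool, here's a hand-rolled version of the construction (my own sketch, reusing the df loaded above; probplot uses slightly more careful plotting positions, but the idea is the same):

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

heights = np.sort(df['height'].values)         # order statistics
n = len(heights)
percentiles = (np.arange(1, n + 1) - 0.5) / n  # a common plotting position
plt.plot(norm.ppf(percentiles), heights, '.')  # z-score vs data value
plt.xlabel('theoretical z-score')
plt.ylabel('height');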

Here's the normal probability plot for the heights with the point $(1,72)$ shown in yellow. Note that the horizontal coordinate is the $z$-score 1, not the percentile 0.84.

In [8]:
%matplotlib inline
from scipy.stats import probplot
import matplotlib.pyplot as plt

probplot(df.height, plot=plt);
plt.plot([1],[72], 'oy', markersize=9)
Out[8]:
[<matplotlib.lines.Line2D at 0x1a195ee780>]

The fact that this graph is just about a straight line indicates that our data is close to normally distributed.

On the other hand, weights are generally not normally distributed.

In [9]:
probplot(df.weight, plot=plt);

Obtaining a confidence interval

A point estimate is a single number that is supposed to approximate some population parameter. By itself, though, a point estimate gives no indication of how good an approximation it is. We can use the standard error of the estimator to quantify the quality of the point estimate.

Here's the idea: Suppose we have a sample mean:

$$ \overline{X} = \frac{X_1+X_2+\cdots+X_n}{n}. $$

By the central limit theorem, we expect $\overline{X}$ to be approximately normal. Furthermore, its standard deviation is:

$$ \text{sd}\left(\frac{X_1+X_2+\cdots+X_n}{n}\right) = \frac{\sigma}{\sqrt{n}}. $$

Using the properties of the normal, I guess we would expect the true mean of the population to be very likely to lie in the interval

$$ \left[\overline{X}-2\frac{\sigma}{\sqrt{n}},\ \overline{X}+2\frac{\sigma}{\sqrt{n}}\right]. $$

How likely? I guess 95\% likely! More generally,

$$ \left[\overline{X}-z^*\frac{\sigma}{\sqrt{n}},\ \overline{X}+z^*\frac{\sigma}{\sqrt{n}}\right] $$

represents a confidence interval whose confidence level depends on $z^*$.

Example

Find a 99\% confidence interval for the average height of an adult using the CDC data.

Solution: First, let's figure out what $z^*$ should be:

In [10]:
from scipy.stats import norm
zz = norm.ppf(0.995)
zz
Out[10]:
2.5758293035489004

This says that 99\% of the area under the standard normal should lie over [-zz,zz]:

In [11]:
norm.cdf(zz)-norm.cdf(-zz)
Out[11]:
0.99

Good! Now we can write down the confidence interval:

In [12]:
import numpy as np
m = df['height'].mean()
se = df['height'].std()/np.sqrt(20000)
[m-zz*se, m+zz*se]
Out[12]:
[67.10775043110804, 67.25804956889196]
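
As a sanity check (my own addition, reusing df and zz from above), we can test the coverage of this procedure by treating the 20,000 heights as the population, drawing many smaller samples, and counting how often the resulting interval captures the 'true' mean. The answer should be close to 99% of the time.

import numpy as np

np.random.seed(0)
pop = df['height'].values
true_mean = pop.mean()
n, hits, trials = 100, 0, 1000
for _ in range(trials):
    s = np.random.choice(pop, n)
    se = s.std(ddof=1) / np.sqrt(n)
    if s.mean() - zz*se <= true_mean <= s.mean() + zz*se:
        hits += 1
print(hits / trials)  # should be close to 0.99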