Comparing Polls

Mon, Nov 04, 2024

Comparing Proportions in Political Polls

Last time, we talked about comparing data sets. Today, we’re going to apply some similar ideas to compare poll results between two or more people running for an election. This is both a topic of intense current interest and one that requires a bit more analysis.

This topic is not covered in our text but you can read about it in this PDF from ABC News.

Questions we hope to assess

The fundamental question that we hope to answer with the tools that we have is:

Is my candidate likely to win an election determined by popular vote?

When we say “likely to win”, we mean to within a prescribed level of confidence, which we’ll typically take to be 95%.

Questions we will not assess

We might discuss the following questions a bit but we won’t use quantitative tools to really address them:

  • What’s the probability that a particular candidate will win a specific election?
  • How do we account for the electoral college in presidential elections?

Tools that we’ll use

  • Political polls
  • The statistical inference that we’ve learned to this point, including:
    • Confidence intervals and
    • Hypothesis tests

Political polls

The first question is: where can you find reliable political poll data?

There is an absolute shit-ton of political prognostication out there! What should we look for to find reliable information?

Shit-ton, by the way, is a technical term indicating:

  • There’s a lot of it and
  • Much of it is of low quality

What to look for?

Here are a few things to look for when trying to find reliable election predictions:

  • Data
    including historical data, demographics, and especially polls
  • Transparency
    i.e., clear descriptions of methodology and data
  • Probabilistic statements
Because that’s what we do in statistics!

Sites

Here are my two favorite sites for finding election predictions and polling data:

  • 538
    A site hosted by ABC News that focuses on data-based political analysis
  • Silver Bulletin
    A newer site by the founder of 538

Originally stylized as FiveThirtyEight, ABC News’s 538 began as an independent blog by Nate Silver in 2008. After a couple of very successful years, FiveThirtyEight moved to The New York Times; it then moved to ESPN/ABC/Disney in 2013. While Nate Silver maintained control over the content through those moves, he finally left FiveThirtyEight after a major restructuring at Disney in 2023. FiveThirtyEight was restylized as 538, and Nate Silver founded the Silver Bulletin shortly thereafter.

Both sites remain outstanding, data-based news outlets.

Political polls

Neither 538 nor Silver Bulletin conducts polls; rather, they aggregate many other polls and make predictions based on those.

In addition, they provide ratings of those polls, explain the justifications of those ratings, and weight those polls in their analysis accordingly:

National poll table

Nate Silver provides the data that he uses in his analysis right on his election forecast page. If you go to the section titled “Polls included in our model”, you can find tables for national polls and individual swing states and there’s a “Download the Data” button. I used the national data, for example, to produce the following table:

Histogram of differences

The data in the previous table is naturally paired. We can compute the pairwise differences of Harris’ percentage minus Trump’s for recent polls and plot a histogram of the resulting data. It suggests that Harris leads Trump by nearly two percentage points on average:
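As a sketch of the computation (these poll numbers are made up for illustration; the real download from 538 has many more rows and its own column names):

```python
import pandas as pd

# Made-up stand-ins for the downloaded national poll table.
polls = pd.DataFrame({
    'harris_pct': [48.0, 49.0, 47.2, 50.0, 48.5],
    'trump_pct':  [47.0, 47.0, 49.0, 47.0, 47.5],
})

# Pairwise differences: Harris's percentage minus Trump's.
diffs = polls['harris_pct'] - polls['trump_pct']

# diffs.hist() would draw the histogram of the differences.
print(diffs.mean())  # the average lead in this toy data
```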

Examples of probabilistic predictions

A closer look at one poll

Let’s take a close look at one particular poll, namely the AtlasIntel poll.

I chose this poll for several reasons:

  • It’s rated fairly highly by both 538 (2.7 out of 3) and Silver Bulletin (A),
  • They have very recent poll results for
    • The nationwide popular presidential election,
    • Swing state presidential elections, and
    • Swing state Gubernatorial elections
  • Their summary data is available online.

You can find their general release polls here. I’m specifically using this summary data released on November 3rd.

Nationwide methodology

The AtlasIntel nationwide summary data is distributed as a 29 page PDF. Page 4 tells us the sample size and margin of error. It looks like so:

Nationwide percentages

On page seven, we see data that tells us what percentage of voters are planning on voting for each candidate:

Questions

How can we use this kind of data to estimate:

  • What percentage of the votes will our candidates receive?
  • Who will win?

Margin of error

There is some redundancy in the information on the slides. The first one tells us

  • The sample size is \(N=2463\) and
  • The margin of error is \(ME = \pm 2\).

It turns out that we can deduce the margin of error from the sample size.

Definition

Recall that Margin of error is defined by

\[ ME = z^* \times \frac{\sqrt{p(1-p)}}{\sqrt{N}}. \]

In our case, we have \(z^* \approx 2\) for a 95% level of confidence. In addition, \(\sqrt{p(1-p)} \leq 1/2\) for all \(p\). Thus,

\[ ME \leq 2 \times \frac{1/2}{\sqrt{N}} = \frac{1}{\sqrt{N}} = \frac{1}{\sqrt{2463}} \approx 0.02015. \]

That’s where the \(ME = \pm 2\%\) comes from!
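We can check this computation directly:

```python
from math import sqrt

N = 2463

# p*(1-p) is maximized at p = 1/2, so sqrt(p*(1-p)) <= 1/2.
assert all(p * (1 - p) <= 0.25 for p in [i / 100 for i in range(101)])

ME = 2 * 0.5 / sqrt(N)  # = 1/sqrt(N)
print(ME)               # roughly 0.02015, i.e. about 2%
```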

A 95% confidence interval

Note that this indicates that 47.2% of voters plan to vote for Harris. Thus, a 95% confidence interval for her percentage of voters would be \[ [45.2\%, 49.2\%]. \] Of course, Trump has a similar interval centered at 49% and the two have significant overlap.
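A quick computation of this interval, using the roughly 2% margin of error from above:

```python
from math import sqrt

p_hat = 0.472        # Harris's share in the poll
ME = 1 / sqrt(2463)  # margin of error at 95% confidence, about 0.02

interval = (p_hat - ME, p_hat + ME)
print(interval)      # roughly (0.452, 0.492), i.e. [45.2%, 49.2%]
```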

Generalization

More generally, the margin of error for a simple random sample of size \(N\) is always bounded by \[ ME \leq 1/\sqrt{N}, \] when working at a 95% level of confidence.

We can use this to determine how large our sample should be to obtain a desired margin of error. We simply need

\[ N > 1/ME^2. \]

Example

Suppose we’d like to ensure that our margin of error is less than 1.5%, i.e. \[ ME < 0.015. \] How large a sample do we need?

Solution: I guess we need

\[ N > 1/0.015^2. \] The following computation suggests that \(N=4,445\) should do:

1/0.015**2
4444.444444444444

A hypothesis test

Let’s run a hypothesis test to check whether this data suggests that Trump will win the popular vote at a 95% level of confidence. We let \(p_T\) denote the proportion of Trump voters and \(p_H\) denote the proportion of Harris voters. We then run the hypothesis test \[ H_0: p_H - p_T = 0 \\ H_A: p_H - p_T < 0. \]

Standard error for a multinomial proportion

There is yet another definition of standard error for comparing proportions that are related in this fashion, namely:

\[ SE = \sqrt{\frac{p_1 + p_2 - (p_1 - p_2)^2}{n}}. \]

In Python:

from numpy import sqrt
n = 2463
pTrump = 0.49
pHarris = 0.472
se = sqrt(((pTrump+pHarris)-(pTrump-pHarris)**2)/n)
se
0.019759783548384566

Test-stat and \(p\)-value

We can then compute the test-statistic:

T = (pHarris-pTrump)/se
T
-0.9109411525649824

And the \(p\)-value:

from scipy.stats import norm
norm.cdf(T)
0.181163190493631

Since \(0.18 \not< 0.05\), we fail to reject the null hypothesis.

Election probabilities

I guess a pretty important question right now might be:

So, who’s going to win the election?

How to mis-interpret the \(p\)-value

First off, let’s be clear that \(p\)-values do not represent probabilities of events!

In particular, our \(p\)-value of \(0.18\) in the test of the null hypothesis that the candidates have equal support against the alternative that Trump has larger support than Harris does not imply that Trump is 82% likely to win the election.

The \(p\)-value represents the probability that the sampling process would generate data at least as extreme as our sample under the assumption of the null hypothesis.

The \(p\)-value is simply completely different from the probability of an event occurring.
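To illustrate that interpretation, here’s a small simulation (not part of the poll release) that repeatedly draws samples of size 2463 under the null hypothesis of equal support and checks how often the sampled difference is at least as extreme as the one we observed:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2463
p = (0.472 + 0.49) / 2        # common level of support under H0
observed_diff = 0.472 - 0.49  # Harris minus Trump in the actual poll

# Each row gives counts of (Harris, Trump, other) in one simulated poll.
sims = rng.multinomial(n, [p, p, 1 - 2 * p], size=10_000)
diffs = (sims[:, 0] - sims[:, 1]) / n

# Fraction of simulated polls at least as extreme as what we observed.
frac = np.mean(diffs <= observed_diff)
print(frac)  # close to the p-value of about 0.18
```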

Silver Bulletin estimates

As of this morning (Monday, November 4), Silver Bulletin gives

  • Harris a 74.2% chance of winning the popular vote and a 47% chance of winning the electoral vote.
  • Trump a 25.8% chance of winning the popular vote and a 52.6% chance of winning the electoral vote.

Hopefully, we’ll know who wins by our next class period!!