
Mon, Mar 02, 2026
Recently, we reviewed the basics of integration and then we saw an overview of how that might be applied to probability theory. Today, we start probability theory in earnest with a discussion of discrete probability.
Probability theory is largely concerned with computing or estimating the probability that certain events might happen. Let’s begin by clarifying our language.
Mathematically, probability is a function \(P\) that accepts events and assigns each one a number in \([0,1]\).
The serious, mathematical study of probability theory has its origins in the \(17^{\text{th}}\) century in the writings of Blaise Pascal, who studied gambling. Many elementary examples in probability theory are still commonly phrased in terms of gambling. As a result, it’s a long-standing tradition to use coin flips, dice rolls, and playing cards to illustrate the basic principles of probability, so it does help to have a basic familiarity with these things.
A standard deck of playing cards consists of 52 cards divided into 4 suits, each of which contains 13 ranks.

When we speak of a “well shuffled deck” we mean that, when we draw one card, each card is equally likely to be drawn.
Suppose we draw a single card from a well-shuffled deck.
Example events could be:
The probability function assigns probabilities to events like those in our previous examples.
That brings us to our first formula for computing probabilities!
The events \(A\) and \(B\) are called mutually exclusive if they cannot both occur; another term for this same concept is disjoint. If we know the probability that \(A\) occurs and the probability that \(B\) occurs, then we can compute the probability that \(A\) or \(B\) occurs by simply adding the probabilities. Symbolically,
\[P(A \text{ or } B) = P(A) + P(B).\]
Thus, for example, the probability of drawing a heart is \(13/52=1/4\). We can see this since there are 13 hearts, each with probability \(1/52\) of being drawn, and these draws are mutually exclusive.
We say that two events are independent if the outcome of one has no bearing on the outcome of the other. If I flip a coin twice, for example, the outcome of the first flip shouldn’t affect the outcome of the second.
When two events are independent, we can compute the probability that both occur by multiplying their probabilities. Symbolically, \[P(A \text{ and } B) = P(A)P(B).\]
If I flip a coin twice, for example, the probability that I get two heads is \(1/4\).
Independent events are well modelled by coin flips, so let’s develop some relevant terminology.
A binary event is an event with exactly two possible outcomes. We often express a randomly chosen, binary event in terms of a coin toss because…
Fact
Coins don’t land on their sides!
Generally, we think of a coin toss as “fair”;
that is, it comes up heads or tails 50:50.
We are often interested in modelling binary events that are not 50:50. Perhaps we have a 70:30 coin that comes up heads 70% of the time and tails 30% of the time.
We can apply similar ideas to dice.
The most widely used dice are small cubes with six sides. We think of them as “fair” in that they produce numbers one through six with equal probability of \(1/6\) for each.
In probability theory, we often speak of an \(n\)-sided die, which might be fair or unfair. For example, an unfair four-sided die might have the distribution
| X | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| p | 0.1 | 0.2 | 0.3 | 0.4 |
Suppose I flip a coin that comes up heads 70% of the time and I also roll a four-sided die with the probabilities
| X | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| p | 0.1 | 0.2 | 0.3 | 0.4 |
I count a coin flip as \(1\) if it’s heads and \(0\) if it’s tails. I then add the result of the coin flip and the die roll. What’s the probability that my combined total is \(3\)?
Answer
\[0.7\times0.2 + 0.3\times0.3 = 0.23.\]
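As a sanity check, the same answer can be found by brute-force enumeration. The sketch below (plain Python, with the coin and die tables from the example) sums the probabilities of all outcome pairs whose total is 3, using the independence of the flip and the roll.

```python
# Probability tables from the example: heads counts as 1, tails as 0.
coin = {1: 0.7, 0: 0.3}
die = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

# Sum P(coin outcome) * P(die outcome) over every pair totaling 3.
p_total_3 = sum(pc * pd
                for c, pc in coin.items()
                for d, pd in die.items()
                if c + d == 3)

print(p_total_3)  # ≈ 0.23
```

The only pairs that total 3 are (heads, 2) and (tails, 3), matching the two terms of the hand computation.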
A random variable is simply a function \(X\) defined on the sample space that returns real numbers. That is, \(X:\Omega\to\mathbb R\).
Often, it’s simpler and sufficient to think of a random variable as a random process with some numerical outcome.
Suppose, for example, I roll a 10-sided die and write down
Hopefully, it’s easy to see how this is a random process with a numerical outcome.
In the previous example, the sample space is \[ \Omega = \{1,2,3,4,5,6,7,8,9,10\}. \] The possible outcomes of \(X\) are \(1\), \(2\), or \(3\). The random variable can be defined by simply listing its values
Note that the range of this random variable is contained in the integers. That makes this random variable discrete, rather than continuous.
There are certainly random variables that can, in principle, produce any real number.
These are all continuous random variables. In the language of probability theory, we would say that the sample space is the set of real numbers. We’ll focus on these next time.
Roughly, the distribution of a random variable tells you how likely that random variable is to produce certain outputs. For a discrete random variable, this boils down to listing out the probability that the random variable hits every specific value.
The distribution of a discrete random variable is defined to be a table of all the possible outcomes together with their probabilities.
For example 3 above, we might write
| X | 1 | 2 | 3 |
|---|---|---|---|
| p | 3/10 | 4/10 | 3/10 |
Note that all the probabilities should be non-negative and they should sum to one.
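That check is easy to automate. A quick sketch in Python: store the table above as a dictionary and verify both conditions directly.

```python
# Distribution table from the example above.
dist = {1: 3/10, 2: 4/10, 3: 3/10}

# Every probability is non-negative...
assert all(p >= 0 for p in dist.values())

# ...and they sum to one (up to floating point error).
assert abs(sum(dist.values()) - 1) < 1e-12
```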
An alternative is to specify the values of a function \(P\) for the possible values of \(X\). For example, the previous table could be written:
As we’ll see next time, this notation will extend in a natural way to continuous distributions.
A final alternative is to represent a discrete distribution with a plot.
Pictorial representations of discrete distributions make particular sense when there are a large number of possibilities. The image below portrays a discrete distribution where the sample space is the set of integers between 1 and 100 and smaller numbers are more likely to be chosen than larger numbers.
We can extend the idea of independence of random events to independence of random variables in a natural way.
We say that the random variables \(X\) and \(Y\) are independent if \[ P(X=x \text{ and } Y=y) = P(X=x)P(Y=y) \] for all values \(x\) and \(y\). In particular, independence implies that \(E(XY) = E(X)E(Y)\).
Suppose I flip a 75:25 coin twice. Let \[ X_i = \begin{cases} 1 & \text{if } i^{\text{th}} \text{ flip is a head} \\ 0 & \text{if } i^{\text{th}} \text{ flip is a tail} \end{cases} \]
The following tables summarize the possibilities for \(X_1\) and \(X_2\), together with the probabilities for \(X_1X_2\) expressed as products.
| | \(X_1=1\) | \(X_1=0\) |
|---|---|---|
| \(X_2=1\) | \(\frac{3}{4}\times\frac{3}{4}\) | \(\frac{1}{4}\times\frac{3}{4}\) |
| \(X_2=0\) | \(\frac{3}{4}\times\frac{1}{4}\) | \(\frac{1}{4}\times\frac{1}{4}\) |
or
| | \(X_1=1\) | \(X_1=0\) |
|---|---|---|
| \(X_2=1\) | \(\frac{9}{16}\) | \(\frac{3}{16}\) |
| \(X_2=0\) | \(\frac{3}{16}\) | \(\frac{1}{16}\) |
Ultimately, this follows from the independence of the underlying events.
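We can verify the product table by building the joint distribution from the marginal probabilities in code. This sketch uses exact fractions to avoid any rounding.

```python
from fractions import Fraction

p = Fraction(3, 4)            # probability of heads
marginal = {1: p, 0: 1 - p}   # distribution of each flip

# By independence, each joint probability is a product of marginals.
joint = {(a, b): marginal[a] * marginal[b]
         for a in (1, 0) for b in (1, 0)}

assert joint[(1, 1)] == Fraction(9, 16)
assert joint[(1, 0)] == joint[(0, 1)] == Fraction(3, 16)
assert joint[(0, 0)] == Fraction(1, 16)
assert sum(joint.values()) == 1  # the four cells cover all outcomes
```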
The mean and standard deviation that we learned for data can be extended to random variables using the idea of a weighted average.
The mean of a discrete random variable is \[ E(X) = \sum x_i P(X=x_i) = \sum x_i p_i.\] We might think of this as a weighted mean.
The mean is also frequently referred to as the expectation or the expected value. This is very common in statistics and even more so in prediction algorithms.
Recall our weighted die roll with probability distribution
| X | 1 | 2 | 3 |
|---|---|---|---|
| p | 3/10 | 4/10 | 3/10 |
The expected value of a roll of this die is \[1\cdot\frac{3}{10} + 2\cdot\frac{4}{10} + 3\cdot\frac{3}{10} = 2.\]
I’ve got a weighted coin that comes up heads \(75\%\) of the time, in which case I write down a one. If it comes up tails, I write down a zero.
The expectation associated with one flip is \[E(X) = 1\times \frac{3}{4} + 0 \times \frac{1}{4} = \frac{3}{4}.\]
The variance of a discrete random variable \(X\) is \[\sigma^2(X) = \sum (x_i - \mu)^2 p_i.\] We might think of this as a weighted average of the squared difference of the possible values from the mean. The standard deviation is the square root of the variance.
The variance of our weighted die roll in the example above is \[(1-2)^2\frac{3}{10} + (2-2)^2\frac{4}{10} + (3-2)^2\frac{3}{10} = \frac{6}{10}.\]
The variance of our weighted coin flip is \[\sigma^2(X) = (1-3/4)^2\frac{3}{4} + (0-3/4)^2 \frac{1}{4}=\frac{3}{16}.\]
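Both weighted-average formulas translate directly into code. Here’s a minimal sketch (the helper names `mean` and `variance` are just for illustration), checked against the weighted die and weighted coin examples above.

```python
def mean(dist):
    """Expected value: the sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Weighted average of squared deviations from the mean."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

die = {1: 3/10, 2: 4/10, 3: 3/10}   # weighted die from above
coin = {1: 3/4, 0: 1/4}             # 75:25 coin

print(mean(die), variance(die))     # ≈ 2 and 6/10
print(mean(coin), variance(coin))   # ≈ 3/4 and 3/16
```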
One nice thing about expectation and variance is that they are additive. That is, if \(X_1\) and \(X_2\) are both random variables, then \[E(X_1 + X_2) = E(X_1) + E(X_2).\] If \(X_1\) and \(X_2\) are independent, then \[\sigma^2(X_1 + X_2) = \sigma^2(X_1) + \sigma^2(X_2).\]
Suppose I flip my weighted coin that comes up heads 75% of the time 100 times and let \(X_i\) denote the value of my \(i^{\text{th}}\) flip. Thus, \[X_1 + X_2 + \cdots + X_{100}\] represents the total number of heads that I get and, by the additivity of expectation, we get \[E(X_1 + X_2 + \cdots + X_{100}) = 100 \times \frac{3}{4} = 75.\]
Similarly, for the variance we get \[\sigma^2(X_1 + X_2 + \cdots + X_{100}) = 100 \times \frac{3}{16} = \frac{75}{4}.\] Of course, this means that the standard deviation is \(\sqrt{75/4}\).
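A Monte Carlo simulation agrees with these formulas. The sketch below (standard library only; the trial count of 10,000 is an arbitrary choice) repeats the 100-flip experiment many times and compares the sample mean and sample variance to \(75\) and \(75/4 = 18.75\).

```python
import random

random.seed(1)  # fixed seed for reproducibility

def total_heads(n=100, p=0.75):
    """Flip a p-weighted coin n times and count the heads."""
    return sum(1 for _ in range(n) if random.random() < p)

trials = [total_heads() for _ in range(10_000)]
m = sum(trials) / len(trials)
v = sum((t - m) ** 2 for t in trials) / len(trials)

print(m, v)  # close to 75 and 18.75
```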
Note: The standard deviation of one flip is \(\sqrt{3/16} \approx 0.433013\) and the standard deviation of 100 flips is \(\sqrt{75/4} \approx 4.33013\). The second is 10 times larger in magnitude but, relative to the total number of flips, it’s \(0.0433013\), which is ten times smaller.
The binomial distribution is a discrete distribution that plays a special role in statistics for many reasons. Importantly for us, the binomial distribution allows us to see how a bell curve (in fact the normal curve) arises as a limit of other types of distributions.
Suppose we flip a fair coin 5 times and count how many heads we get. This will generate a random number \(X\) between 0 and 5, but the values are not all equally likely. The probabilities are:

| X | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| p | 1/32 | 5/32 | 10/32 | 10/32 | 5/32 | 1/32 |
Note that the probability of getting any particular sequence of 5 heads and tails is \[\frac{1}{2^5} = \frac{1}{32}.\] That explains the denominator of 32 in the list of probabilities.
The numerator in that table is the number of ways to get the sum. For example, there are 10 ways to get 2 heads in 5 flips:
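Those sequences are easy to enumerate by machine. This sketch generates all \(2^5 = 32\) sequences of five flips and keeps the ones with exactly two heads.

```python
from itertools import product

# All sequences of 5 flips, as tuples of 'H' and 'T'.
sequences = list(product("HT", repeat=5))
assert len(sequences) == 32

# Keep the sequences with exactly two heads.
ways = [s for s in sequences if s.count("H") == 2]
print(len(ways))  # 10
```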
We’d like to formally generalize the binomial random variables. We’ll do so using the concept of a Bernoulli trial.
The random variable generated by a single flip of a (potentially) unfair coin is called a Bernoulli trial. Thus, \(X\) is Bernoulli if it can take the values zero and one with \[P(X=1)=p \quad \text{and} \quad P(X=0)=1-p.\]
Note that \(p\) is a real parameter satisfying \(0\leq p \leq 1\).
Now, a binomial random variable \(S\) is simply the sum of \(n\) independent Bernoulli trials with the same value of \(p\). That is, \[ S = \sum_{i=1}^{n} X_i \]
There’s a fabulous way to think about the binomial distribution in terms of combinations and permutations that leads to an exact and useful formula. The formula comes in two parts, one of which is the so-called “binomial coefficient” and the other of which is a product of Bernoulli probabilities \(p\) or \(1-p\).
Suppose I run \(n\) independent Bernoulli trials, each with probability of success \(p\). The probability of getting exactly \(k\) successes is
\[ \boxed{\frac{n!}{k!(n-k)!}} \:\: \times \:\: \boxed{p^k(1-p)^{n-k}}. \]
If it’s not clear by now, you may think of a Bernoulli trial as the flip of an unfair coin.
Let’s suppose we flip a \(p:(1-p)\) coin \(7\) times. Then, any particular sequence with five heads is equally likely to appear. Like the one below, they all have probability \[ p^5\times(1-p)^2. \]

More generally, the probability of getting \(k\) heads in \(n\) flips is \(p^k(1-p)^{n-k}\).
But there are a lot more ways to get exactly 5 heads in 7 flips. Here are all 21 ways:
Taken altogether, the probability of getting exactly five heads in seven flips of a \(70:30\) coin is \[ 21\times (0.7)^5 \times (0.3)^2 \]
More generally, the probability of getting exactly \(k\) heads in \(n\) flips of a \(p:(1-p)\) coin is \[ {n\choose k} p^k (1-p)^{n-k}. \]
The \(n \choose k\) term is called the binomial coefficient and denotes the number of ways to choose \(k\) thingamajigs from a larger collection of \(n\) thingamaboppers. It is read “\(n\) choose \(k\)” and has the formula \[ {n \choose k} = \frac{n!}{k!(n-k)!}. \] For example, \[ {7 \choose 5} = \frac{7!}{5!(7-5)!} = \frac{7\times 6}{2} = 21, \] as we’ve already seen.
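Python’s standard library computes binomial coefficients directly via `math.comb`, so the whole formula fits in a few lines. A sketch (the helper name `binom_pmf` is just for illustration):

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(comb(7, 5))            # 21, as computed above
print(binom_pmf(7, 5, 0.7))  # 21 * 0.7**5 * 0.3**2

# The probabilities over all possible k sum to one.
assert abs(sum(binom_pmf(7, k, 0.7) for k in range(8)) - 1) < 1e-12
```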
In the binomial coefficient formula, \[ \frac{n!}{(n-k)!} = n \times (n-1) \times \cdots \times (n-(k-1)) \] denotes the number of ways to pick \(k\) objects from \(n\) in order. We then divide by the number of permutations of those \(k\) objects to account for the fact that order doesn’t matter when forming a set.
We’d like to quantify our understanding of the binomial distribution. To do so, we’ll first derive the distribution of a single Bernoulli trial; we can then derive the binomial distribution using rules of combination for random variables.
By a Bernoulli trial, of course, we mean a random variable \(X\) such that
\[P(X=1)=p \text{ and } P(X=0)=1-p.\]
From there, we can compute the mean and variance of \(X\) straight away:
\[E(X) = p\times1+(1-p)\times0 = p\]
and
\[ \begin{aligned} \sigma^2(X) &= p(1-p)^2 + (1-p)(0-p)^2 \\ &= (1-p)\left[p(1-p) + (-p)^2\right] \\ &= (1-p)\left(p-p^2+p^2\right) = p(1-p). \end{aligned} \] The standard deviation is then \[\sigma(X) = \sqrt{p(1-p)}.\]
Now suppose that \(S\) represents the sum of \(n\) independent runs of \(X\). Then, since mean and variance are additive, we have
\[E(S) = np,\] \[\sigma^2(S) = np(1-p),\] and \[\sigma(S) = \sqrt{np(1-p)}.\]
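These formulas can be checked against the exact binomial probabilities from the previous section. A sketch for \(n = 100\) and \(p = 3/4\):

```python
from math import comb

n, p = 100, 0.75

# Exact binomial probabilities P(S = k) for k = 0, ..., n.
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean_S = sum(k * q for k, q in enumerate(pmf))
var_S = sum((k - mean_S) ** 2 * q for k, q in enumerate(pmf))

print(mean_S)  # ≈ n*p = 75
print(var_S)   # ≈ n*p*(1-p) = 18.75
```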
I’ve got a six-sided die with two sides labeled \(-1\) and four sides labeled \(5\).
What is the expected value of 1 roll?
Ans: \(\frac{2(-1)+4(5)}{6} = 3\)
What is the expected value of 100 rolls?
Ans: \(100\times3 = 300\)
What is the variance of 1 roll?
Ans: \((-1-3)^2\frac{1}{3} + (5-3)^2\frac{2}{3} = \frac{16}{3} + \frac{8}{3} = 8\)
What is the variance of 100 rolls?
Ans: \(100\times8 = 800\)
Of course, the standard deviation is the square root of the variance.
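We can double-check the one-roll computations with the same weighted-average approach in code (a quick sketch):

```python
# Two of the six sides show -1; four of the six show 5.
dist = {-1: 2/6, 5: 4/6}

mu = sum(x * p for x, p in dist.items())
var = sum((x - mu) ** 2 * p for x, p in dist.items())

print(mu, var)  # ≈ 3 and 8
```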
Why would we do all this?
Ultimately, we’d like to use random variables to model data. A single Bernoulli trial corresponds to the outcome of an experiment or survey. A binomial formed by the sum of Bernoulli trials allows us to estimate a related proportion for the whole population.
We might stop someone on the street and ask if they plan to vote for Roy Cooper in the coming NC Senate election. That might give us some information but it’s just one data point.
If we ask 1000 randomly selected people the same question and compute the proportion who say yes or no, we get an actual estimate to the corresponding proportion for the whole population. If we want to know how good that estimate is, we need to know how that random process is distributed.
When we actually model the data, we’ll end up using a normal curve. The way to do so is to use the normal curve with the same mean and standard deviation.
Thus, we need to know what mean and standard deviation are in these varying contexts.
Here’s what that fit looks like, when we model a binomial distribution with the corresponding normal: