Principal Component Analysis

Wed, Apr 01, 2026

Principal Component Analysis

Like variable selection, principal component analysis (or PCA) is a dimensional reduction technique. Rather than simply reducing the number of variables, though, PCA looks directly at the geometry of the data as it lies in feature space. To do so, PCA uses much of the linear algebra we’ve learned to this point to project the data onto a lower dimensional subspace that contains maximal information.

\[ \newcommand{\vect}[1]{\mathbf{#1}} \newcommand{\var}[1]{\text{Var}{#1}} \newcommand{\inverse}[1]{#1^{-1}} \newcommand{\similar}[2]{\inverse{#2}#1#2} \]

Imports

We’ll be looking at some code as we move through this stuff today; here are the libraries we’ll be using:

import random
import pickle
import numpy as np
import pandas as pd
from scipy.linalg import eig
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

The objective of PCA

The basic objective of PCA is to find a set of orthogonal axes (the so-called principal components) such that:

The variance of the data is maximized along the first axis and
each new axis captures as much of the remaining variance as possible.

Sample 2D data

Here’s some randomly generated data in two dimensions that should be concentrated about a line of slope \(1/3\); we’ll generate 20 points and display 5 of them:

x = [random.uniform(0,5) for i in range(20)]
y = [xx/3 + random.normalvariate(0,0.4) + 1 for xx in x]
data = np.array([x,y]).T
pd.DataFrame(data, columns = {'x': x, 'y': y}).head(5)

	x	y
0	4.864954	2.874621
1	2.465926	2.101910
2	4.719375	2.535226
3	0.369608	1.135162
4	4.723372	3.156440

Centered data

Let’s shift the data so that it’s centered on the origin and plot it:

Code

mean = np.mean(data, axis=0)
centered_data = data - mean

[x,y] = centered_data.T
plt.plot(x,y, 'ok')
ax = plt.gca()
ax.set_xlim(-2.5,2.5)
ax.set_ylim(-2,1)
ax.set_aspect(1)
fig = plt.gcf()
ax.spines['left'].set_position('zero')
ax.spines['bottom'].set_position('zero')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')

fig_data = pickle.dumps(fig)

Principal components 2D

The objective of PCA is to use the data to find the direction of maximum variation. It seems that direction should contain “the most information”. If you expand the code below, you should notice that the computation of that direction involves the eigenvectors of the so-called covariance matrix. It’s not a regression line!

Code

fig = pickle.loads(fig_data)
cov_matrix = np.cov(centered_data.T)
eigvals, eigvecs = np.linalg.eig(cov_matrix)
idx = eigvals.argsort()[::-1]
eigvals = eigvals[idx]
eigvecs = eigvecs[:, idx]

v1 = 2.5*eigvecs.T[0]
plt.plot([-v1[0], v1[0]], [-v1[1], v1[1]])
v2 = eigvecs.T[1]
plt.plot([-v2[0], v2[0]], [-v2[1], v2[1]])
fig_data = pickle.dumps(fig)

Projection onto PC1

If we project onto that first principal component, we get a new data set (of smaller dimension) that should contain nearly as much information as the full data set.

Code

fig = pickle.loads(fig_data)
# Project onto first principal component v1
def P(x):
    return (v1.dot(x)/v1.dot(v1)) * v1
for [x,y] in centered_data:
    [xx,yy] = np.array(P([x,y])).T
    plt.plot([x,xx],[y,yy], 'k', linestyle = (0,(1,2)))
    plt.plot([xx],[yy], 'or')

Higher dimensions

This can all be done in higher dimensions as well, though full visualization stops at 3D:

A working data set

We’re going to illustrate the ideas behind PCA with some actual data. The particular example we’ll use is a classic data set on wine.

The data set’s description says that

The data is the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. There are thirteen different measurements taken for different constituents found in the three types of wine.

The wine data set

There are 178 wines in the data let’s download it and examine the first 5 rows:

wine = pd.read_csv('https://marksmath.org/data/wine.csv')
sample = wine.head(5)
sample

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0

The data as a matrix

This particular data set is mostly numeric; thus it’s easy to imagine that it could be viewed as a matrix, like so:

\[ \tiny \begin{bmatrix} 14.230 & 1.710 & 2.430 & 15.600 & 127.000 & 2.800 & 3.060 & 0.280 & 2.290 & 5.640 & 1.040 & 3.920 & 1065.000 \\13.200 & 1.780 & 2.140 & 11.200 & 100.000 & 2.650 & 2.760 & 0.260 & 1.280 & 4.380 & 1.050 & 3.400 & 1050.000 \\13.160 & 2.360 & 2.670 & 18.600 & 101.000 & 2.800 & 3.240 & 0.300 & 2.810 & 5.680 & 1.030 & 3.170 & 1185.000 \\14.370 & 1.950 & 2.500 & 16.800 & 113.000 & 3.850 & 3.490 & 0.240 & 2.180 & 7.800 & 0.860 & 3.450 & 1480.000 \\13.240 & 2.590 & 2.870 & 21.000 & 118.000 & 2.800 & 2.690 & 0.390 & 1.820 & 4.320 & 1.040 & 2.930 & 735.000 \\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix} \]

Often, there’s a bit more to this. Missing data must typically be imputed and categorical data must be encoded. Of particular importance for us is that data is almost always standardized to have mean zero and standard deviation 1. Our presentation of PCA will assume that each data column is at least centered to have mean zero.

Feature space

Our data set has 178 rows (corresponding to individual wines or observations) and 14 columns (corresponding to features). Our objective will be to build an algorithm that can predict one of the features (the variety) from the others. Thus, we’ll have 13 purely numerical input features and 1 categorical target that can take on the values zero, one or two.

Thus, we have 178 input data points that live in 13 dimensional space. More generally, an \(N \times M\) data matrix generates \(N\) points that live in \(M\) dimensional space.

This space is sometimes called feature space.

A little math review

PCA is built on several mathematical concepts. One of those concepts is that of an eigenvalue/eigenvector pair that we discussed quite recently. A couple of other tools date back a bit earlier this semester so let’s take a minute to review those. WE’ll also extend one a bit.

Orthogonal projection

Recall that the the basic idea of orthogonal projection of the vector \(\vect{x}\) onto \(\vect{b}\) is to find a vector \(P(\vect{x})\) such that \(\vect{x}-P(\vect{x})\) is perpendicular to \(\vect{b}\). Intuitively, \(P(\vect{x})\) is the shadow of \(\vect{x}\) on the line determined by \(\vect{b}\).

Formula for orthogonal projection

You might recall that we’ve got a nice little formula for projection, namely

\[P(\vect{x}) = \frac{\vect{v}\cdot\vect{x}}{\vect{v}\cdot\vect{v}} \vect{v}.\]

Note that this simplifies when \(\|\vect{v}\| = 1\) to \[P(\vect{x}) = (\vect{v}\cdot\vect{x}) \vect{v}.\]

If we have a bunch of vectors that we want to project onto a unit vector, we might place those into rows of a matrix \(B\) and simply compute \[B\vect{v}.\]

This is the formula we’ll use for PCA.

Mean, variance and covariance

We’ve already discussed mean and variance quite fully. Here are those definitions as they commonly appear in the context of PCA. We assume we have a data vector \[ \vect{x} = \left[x_1 \: x_2 \: \cdots \: x_N\right]^{\mathsf{T}}. \]

Then, the mean \(E(\vect{x})\) and the variance \(\var({\vect{x}})\) of \(\vect{x}\) are defined by \[\begin{aligned} E(\vect{x}) &= \frac{1}{N}\sum_{n=1}^N x_n \\ \var({\vect{x}}) &= \frac{1}{N}\sum_{n=1}^N (x_n-E(\vect{x}))^2. \end{aligned}\]

Covariance

Where variance is a numerical measure of the variation inherent in a single variable, covariance measures the joint variability of two related variables. In the context of data, variance operates on single column in a data frame, while covariance operates on a pair of columns.

Thus, suppose we’ve got two data vectors \[ \vect{x} = \left[x_1 \: x_2 \: \cdots \: x_N\right]^{\mathsf{T}} \text{ and } \vect{y} = \left[y_1 \: y_2 \: \cdots \: y_N\right]^{\mathsf{T}} \] with means \(E(\vect{x})\) and \(E(\vect{y})\). Then, the covariance between them is defined by \[ \text{Cov}(\vect{x}, \vect{y}) = \frac{1}{N} \sum_{n=1}^N (x_n-E(\vect{x}))(y_n-E(\vect{y})). \]

Comments

Covariance between variables is certainly a bit trickier to understand than plain old variance. However,

Note that variance is just a special case of covariance; that is \[\var(\vect{x}) = \text{Cov}(\vect{x}, \vect{x}).\]
If the data is centered, so that the means are zero, then the covariance simplifies to \[ \text{Cov}(\vect{x}, \vect{y}) = \frac{1}{N} \sum_{n=1}^N x_ny_n = \frac{1}{N} \vect{x}^{\mathsf{T}} \vect{y}. \]
A look at a few pictures of covariance applied to 2D scatter plots should help develop your intuition.

Negative covariance

In the picture below, \(\text{Cov}(\vect{x},\vect{y})<0\) and \(\var(\vect{y}) < \var(\vect{x})\).

Zero covariance

The covariance between \(\vect{x}\) and \(\vect{y}\) shown in the picture below is very close to zero, while \(\var(\vect{y}) < \var(\vect{x})\).

Higer dimensional data

Now suppose we’ve got a purely numerical data frame

\(\vect{x}_1\)	\(\vect{x}_2\)	\(\vect{x}_3\)	\(\cdots\)	\(\vect{x}_M\)
\(x_{11}\)	\(x_{12}\)	\(x_{13}\)	\(\cdots\)	\(x_{1M}\)
\(x_{21}\)	\(x_{22}\)	\(x_{23}\)	\(\cdots\)	\(x_{2M}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)
\(x_{N1}\)	\(x_{N2}\)	\(x_{N3}\)	\(\cdots\)	\(x_{NM}\)

Which could be represented in a matrix \[X = \begin{bmatrix} x_{11}&x_{12}&x_{13}&\cdots&x_{1M} \\ x_{21}&x_{22}&x_{23}&\cdots&x_{2M} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{N1}&x_{N2}&x_{N3}&\cdots&x_{NM} \end{bmatrix}.\]

Covariance matrix

Assuming the data is centered again, the covariance matrix of \(X\) is the matrix \[\Sigma = \frac{1}{N} X^{\mathsf{T}} X.\] It’s easy to check that the entry \([\Sigma]_{ij}\) in row \(i\) and column \(j\) of \(\Sigma\) is exactly the covariance of the \(i^{\text{th}}\) column with the \(j^{\text{th}}\) column of \(X\).

It turns out that the principal components are exactly the eigenvectors of the covariance matrix!

How PCA works

Again, we’ve got an \(N\times M\) data matrix \(X\). The data lives in \(M\)-dimensional space, since the columns correspond to the features in the data. We’d like to project it onto a lower dimensional subspace that contains maximal variance. We start by considering how to project it onto a one-dimensional subspace with maximal information.

Put another way, we wish to find a single vector \(\vect{v}\) such that \[\var(X\vect{v})\] is maximized. Of course, we can make that as large as we want by choosing \(\vect{v}\) to be as long as we want. Thus, we add the restriction that \(\|\vect{v}\| = 1\).

The crux of the solution

Since the columns of \(X\) have mean zero and \(X\vect{v}\) is a linear combination of those columns, the mean of the column vector \(X\vect{v}\) must also be zero. Thus, \[N\,\var(X\vect{v}) = (X\vect{v})^{\mathsf{T}} (X\vect{v}) = \vect{v}^{\mathsf{T}} X^{\mathsf{T}} X \vect{v}.\] Now, let’s suppose that \(\vect{v}\) is a normalized eigenvector of \(X^{\mathsf{T}} X\) with eigenvalue \(\sigma\) and that \(\sigma\) is the largest eigenvalue. Then, \[\vect{v}^{\mathsf{T}} X^{\mathsf{T}} X \vect{v} = \sigma \vect{v}^{\mathsf{T}} \vect{v} = \sigma.\] Since \(\sigma\) is the maximum eigenvalue, there is no way to make this variance larger. Thus, \(\vect{v}\) is exactly the direction of maximum variance.

Moving beyond one dimension

Once you’ve found the direction of largest variation, you can project the remaining data along that direction onto the \(M-1\) dimensional subspace called the orthogonal complement. You can then find the direction of next greatest variability.

Proceeding recursively, you can find as many principal components as you need.

Example

Here’s an example that looks kinda like a forum problem over this material. The idea is to find the principal components for the data set \[ \{(-1,-1), (-1,1), (0,1), (1,1), (2,0)\}. \] I’ll show you how plot this data and illustrate the first principal component here using the computer. Note that forum problem is certainly workable by hand.

Plotting the data

Let’s place the data into a matrix and plot it:

Code

X = np.matrix([[-1,-1],[-1,1],[0,1],[1,1],[2,0]])
x = X[:,0]
y = X[:,1]
plt.plot(x,y,'ok')
plt.gca().set_aspect('equal')

Computing the principal components

The principal components are exactly the eigenvectors ordered by the eigenvalues in decreasing order of magnitude.

S = X.transpose()*X
print("S: ", S)
eigen_info = eig(S)
print("eigen_info: ", eigen_info)

S:  [[7 1]
 [1 4]]
eigen_info:  (array([7.30277564+0.j, 3.69722436+0.j]), array([[ 0.95709203, -0.28978415],
       [ 0.28978415,  0.95709203]]))

Interpretation

The output of eig is a tuple, the first of which is an array whose entries are the eigenvalues of \(S\) and the second of which is an array whose columns are the eigenvectors. Thus, when I see

array([
  [ 0.95709203, -0.28978415],
  [ 0.28978415,  0.95709203]
])

I know that the first principal component must be \[\begin{bmatrix}0.95709203 & 0.28978415\end{bmatrix}^{\mathsf{T}}.\]

Illustration

Now, we plot the data together with the line determined by the first principal component.

Code

v = np.asarray(eigen_info[1][:, 0]).ravel()
plt.plot(x,y,'ok')
plt.plot([-2*v[0],2*v[0]],[-2*v[1],2*v[1]])
plt.gca().set_aspect('equal')

Applied example

Finally, let’s see how PCA is applied in a standard ML workflow. Specifically, we’ll apply PCA as a preprocessing step in logistic regression for classification in our wine data set. The mission is still to determine wine variety from our other variables.

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0

Setting up the data

Let’s split the data into training and testing sets:

# The predictor variables
X = wine.drop(columns = 'variety')

# The target variable 
y = wine.variety

# The train/test split
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.33, random_state=1, stratify=y)

Pure logistic regression

Now, let’s train a pure logistic regression and print the accuracy results:

pipeline1 = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression())])
pipeline1.fit(X_train, y_train)
y_pred1 = pipeline1.predict(X_test)
accuracy_score(y_test, y_pred1)

0.9830508474576272

Logistic regression with PCA

Let’s try that again using PCA to reduce the predictors down to just two variables:

pipeline2 = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)), 
    ("logreg", LogisticRegression())
])
pipeline2.fit(X_train, y_train)
y_pred2 = pipeline2.predict(X_test)
accuracy_score(y_test, y_pred2)

0.9661016949152542

The point

The first regression attains 98% accuracy using all thirteen variables. The second regression attains only 96% accuracy but uses only two principal components as predictors.

In a situation where we have tens of thousands of variables, that could be a huge deal!

Principal components

Here are the proportions of variance for each of the 13 principal components.