Like variable selection, principal component analysis (or PCA) is a dimensional reduction technique. Rather than simply reducing the number of variables, though, PCA looks directly at the geometry of the data as it lies in feature space. To do so, PCA uses much of the linear algebra we’ve learned to this point to project the data onto a lower dimensional subspace that contains maximal information.
The basic objective of PCA is to find a set of orthogonal axes (the so-called principal components) such that:
The variance of the data is maximized along the first axis and
each new axis captures as much of the remaining variance as possible.
Sample 2D data
Here’s some randomly generated data in two dimensions that should be concentrated about a line of slope \(1/3\); we’ll generate 20 points and display 5 of them:
```python
import random
import numpy as np
import pandas as pd

x = [random.uniform(0, 5) for i in range(20)]
y = [xx/3 + random.normalvariate(0, 0.4) + 1 for xx in x]
data = np.array([x, y]).T
pd.DataFrame(data, columns=['x', 'y']).head(5)
```
|   | x | y |
|---|---|---|
| 0 | 4.864954 | 2.874621 |
| 1 | 2.465926 | 2.101910 |
| 2 | 4.719375 | 2.535226 |
| 3 | 0.369608 | 1.135162 |
| 4 | 4.723372 | 3.156440 |
Centered data
Let’s shift the data so that it’s centered on the origin and plot it:
Code
```python
import pickle
import matplotlib.pyplot as plt

mean = np.mean(data, axis=0)
centered_data = data - mean
[x, y] = centered_data.T
plt.plot(x, y, 'ok')
ax = plt.gca()
ax.set_xlim(-2.5, 2.5)
ax.set_ylim(-2, 1)
ax.set_aspect(1)
fig = plt.gcf()
ax.spines['left'].set_position('zero')
ax.spines['bottom'].set_position('zero')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
fig_data = pickle.dumps(fig)
```
Principal components 2D
The objective of PCA is to use the data to find the direction of maximum variation. It seems that direction should contain “the most information”. If you expand the code below, you should notice that the computation of that direction involves the eigenvectors of the so-called covariance matrix. It’s not a regression line!
If we project onto that first principal component, we get a new data set (of smaller dimension) that should contain nearly as much information as the full data set.
Code
```python
fig = pickle.loads(fig_data)

# The first principal component v1: the eigenvector of the
# covariance matrix with the largest eigenvalue
S = centered_data.T @ centered_data
vals, vecs = np.linalg.eigh(S)
v1 = vecs[:, np.argmax(vals)]

# Project each point onto the first principal component v1
def P(x):
    return (v1.dot(x) / v1.dot(v1)) * v1

for [x, y] in centered_data:
    [xx, yy] = np.array(P([x, y])).T
    plt.plot([x, xx], [y, yy], 'k', linestyle=(0, (1, 2)))
    plt.plot([xx], [yy], 'or')
```
Higher dimensions
This can all be done in higher dimensions as well, though full visualization stops at 3D.
We’re going to illustrate the ideas behind PCA with some actual data. The particular example we’ll use is a classic data set on wine.
The data set’s description says that
The data is the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. There are thirteen different measurements taken for different constituents found in the three types of wine.
The wine data set
There are 178 wines in the data set; let's download it and examine the first 5 rows:
Often, there’s a bit more to this. Missing data must typically be imputed and categorical data must be encoded. Of particular importance for us is that data is almost always standardized to have mean zero and standard deviation 1. Our presentation of PCA will assume that each data column is at least centered to have mean zero.
Feature space
Our data set has 178 rows (corresponding to individual wines or observations) and 14 columns (corresponding to features). Our objective will be to build an algorithm that can predict one of the features (the variety) from the others. Thus, we’ll have 13 purely numerical input features and 1 categorical target that can take on the values zero, one or two.
Thus, we have 178 input data points that live in 13 dimensional space. More generally, an \(N \times M\) data matrix generates \(N\) points that live in \(M\) dimensional space.
This space is sometimes called feature space.
A little math review
PCA is built on several mathematical concepts. One of those concepts is that of an eigenvalue/eigenvector pair, which we discussed quite recently. A couple of other tools date back a bit earlier this semester, so let's take a minute to review those. We'll also extend one a bit.
Orthogonal projection
Recall that the basic idea of orthogonal projection of the vector \(\vect{x}\) onto \(\vect{b}\) is to find a vector \(P(\vect{x})\) such that \(\vect{x}-P(\vect{x})\) is perpendicular to \(\vect{b}\). Intuitively, \(P(\vect{x})\) is the shadow of \(\vect{x}\) on the line determined by \(\vect{b}\).
Formula for orthogonal projection
You might recall that we've got a nice little formula for projection onto \(\vect{v}\), namely \[P(\vect{x}) = \frac{\vect{v}\cdot\vect{x}}{\vect{v}\cdot\vect{v}}\,\vect{v}.\]
Note that this simplifies when \(\|\vect{v}\| = 1\) to \[P(\vect{x}) = (\vect{v}\cdot\vect{x}) \vect{v}.\]
If we have a bunch of vectors that we want to project onto a unit vector, we might place those into rows of a matrix \(B\) and simply compute \[B\vect{v}.\]
This is the formula we’ll use for PCA.
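To make the matrix form concrete, here's a small numpy sketch; the vector `v` and the rows of `B` are made up for illustration:

```python
import numpy as np

# A made-up direction, normalized so that B @ v gives scalar projections
v = np.array([3.0, 1.0])
v = v / np.linalg.norm(v)

# Three made-up 2D points, one per row
B = np.array([[1.0, 2.0],
              [4.0, 0.0],
              [-2.0, 1.0]])

lengths = B @ v                    # scalar projection of each row onto v
projected = np.outer(lengths, v)   # the projected points P(x) themselves
print(projected)
```

Each row of `projected` lies on the line through the origin in the direction of `v`, which is exactly the picture of a shadow falling on that line.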
Mean, variance and covariance
We’ve already discussed mean and variance quite fully. Here are those definitions as they commonly appear in the context of PCA. We assume we have a data vector \[
\vect{x} = \left[x_1 \: x_2 \: \cdots \: x_N\right]^{\mathsf{T}}.
\]
Then, the mean \(E(\vect{x})\) and the variance \(\var({\vect{x}})\) of \(\vect{x}\) are defined by
\[\begin{aligned}
E(\vect{x}) &= \frac{1}{N}\sum_{n=1}^N x_n \\
\var({\vect{x}}) &= \frac{1}{N}\sum_{n=1}^N (x_n-E(\vect{x}))^2.
\end{aligned}\]
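As a quick check of these definitions on a made-up data vector:

```python
import numpy as np

# Mean and variance computed straight from the definitions above
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)
mean = x.sum() / N                    # E(x)
var = ((x - mean) ** 2).sum() / N     # Var(x), dividing by N as in the text
print(mean, var)
```

Note that numpy's `np.var` also divides by \(N\) by default, so it agrees with this definition.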
Covariance
Where variance is a numerical measure of the variation inherent in a single variable, covariance measures the joint variability of two related variables. In the context of data, variance operates on a single column in a data frame, while covariance operates on a pair of columns.
Thus, suppose we’ve got two data vectors \[
\vect{x} = \left[x_1 \: x_2 \: \cdots \: x_N\right]^{\mathsf{T}}
\text{ and }
\vect{y} = \left[y_1 \: y_2 \: \cdots \: y_N\right]^{\mathsf{T}}
\] with means \(E(\vect{x})\) and \(E(\vect{y})\). Then, the covariance between them is defined by \[
\text{Cov}(\vect{x}, \vect{y}) = \frac{1}{N} \sum_{n=1}^N
(x_n-E(\vect{x}))(y_n-E(\vect{y})).
\]
Comments
Covariance between variables is certainly a bit trickier to understand than plain old variance. However,
Note that variance is just a special case of covariance; that is \[\var(\vect{x}) = \text{Cov}(\vect{x}, \vect{x}).\]
If the data is centered, so that the means are zero, then the covariance simplifies to \[
\text{Cov}(\vect{x}, \vect{y}) = \frac{1}{N} \sum_{n=1}^N x_ny_n = \frac{1}{N} \vect{x}^{\mathsf{T}} \vect{y}.
\]
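Here's a small numerical sketch of the centered-data formula, using two made-up vectors:

```python
import numpy as np

# Two made-up data vectors, already centered so each has mean zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 0.0, 1.0, 1.0])
N = len(x)

cov_xy = (x @ y) / N      # (1/N) x^T y, valid since the data is centered
var_x = (x @ x) / N       # Cov(x, x) recovers Var(x)
print(cov_xy, var_x)
```

One caveat: numpy's `np.cov` divides by \(N-1\) by default, so pass `ddof=0` if you want it to match this definition exactly.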
A look at a few pictures of covariance applied to 2D scatter plots should help develop your intuition.
Negative covariance
In the picture below, \(\text{Cov}(\vect{x},\vect{y})<0\) and \(\var(\vect{y}) < \var(\vect{x})\).
Zero covariance
The covariance between \(\vect{x}\) and \(\vect{y}\) shown in the picture below is very close to zero, while \(\var(\vect{y}) < \var(\vect{x})\).
Higher dimensional data
Now suppose we’ve got a purely numerical data frame
| \(\vect{x}_1\) | \(\vect{x}_2\) | \(\vect{x}_3\) | \(\cdots\) | \(\vect{x}_M\) |
|---|---|---|---|---|
| \(x_{11}\) | \(x_{12}\) | \(x_{13}\) | \(\cdots\) | \(x_{1M}\) |
| \(x_{21}\) | \(x_{22}\) | \(x_{23}\) | \(\cdots\) | \(x_{2M}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(x_{N1}\) | \(x_{N2}\) | \(x_{N3}\) | \(\cdots\) | \(x_{NM}\) |
This could be represented in a matrix \[X = \begin{bmatrix}
x_{11}&x_{12}&x_{13}&\cdots&x_{1M} \\
x_{21}&x_{22}&x_{23}&\cdots&x_{2M} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{N1}&x_{N2}&x_{N3}&\cdots&x_{NM}
\end{bmatrix}.\]
Covariance matrix
Assuming the data is centered again, the covariance matrix of \(X\) is the matrix \[\Sigma = \frac{1}{N} X^{\mathsf{T}} X.\] It’s easy to check that the entry \([\Sigma]_{ij}\) in row \(i\) and column \(j\) of \(\Sigma\) is exactly the covariance of the \(i^{\text{th}}\) column with the \(j^{\text{th}}\) column of \(X\).
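For instance, on a tiny made-up data matrix:

```python
import numpy as np

# A small made-up data matrix; center the columns, then form Sigma = X^T X / N
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 4.0]])
X = X - X.mean(axis=0)
N = X.shape[0]

Sigma = (X.T @ X) / N
print(Sigma)
```

The diagonal entries of `Sigma` are the variances of the two columns, and the off-diagonal entry is their covariance; the matrix is symmetric, as the formula demands.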
It turns out that the principal components are exactly the eigenvectors of the covariance matrix!
How PCA works
Again, we’ve got an \(N\times M\) data matrix \(X\). The data lives in \(M\)-dimensional space, since the columns correspond to the features in the data. We’d like to project it onto a lower dimensional subspace that contains maximal variance. We start by considering how to project it onto a one-dimensional subspace with maximal information.
Put another way, we wish to find a single vector \(\vect{v}\) such that \[\var(X\vect{v})\] is maximized. Of course, we can make that as large as we want by choosing \(\vect{v}\) to be as long as we want. Thus, we add the restriction that \(\|\vect{v}\| = 1\).
The crux of the solution
Since the columns of \(X\) have mean zero and \(X\vect{v}\) is a linear combination of those columns, the mean of the column vector \(X\vect{v}\) must also be zero. Thus, \[N\,\var(X\vect{v}) = (X\vect{v})^{\mathsf{T}} (X\vect{v}) = \vect{v}^{\mathsf{T}} X^{\mathsf{T}} X \vect{v}.\] Now, let’s suppose that \(\vect{v}\) is a normalized eigenvector of \(X^{\mathsf{T}} X\) with eigenvalue \(\sigma\) and that \(\sigma\) is the largest eigenvalue. Then, \[\vect{v}^{\mathsf{T}} X^{\mathsf{T}} X \vect{v} = \sigma \vect{v}^{\mathsf{T}} \vect{v} = \sigma.\] Since \(\sigma\) is the maximum eigenvalue, there is no way to make this variance larger. Thus, \(\vect{v}\) is exactly the direction of maximum variance.
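We can sanity-check this argument numerically: on randomly generated (made-up) data, the variance of \(X\vect{u}\) over unit vectors \(\vect{u}\) peaks at the top eigenvector of \(X^{\mathsf{T}}X\).

```python
import numpy as np

# Made-up correlated data, centered column-by-column
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0],
                                          [1.0, 1.0]])
X = X - X.mean(axis=0)

# Unit eigenvector of X^T X with the largest eigenvalue
vals, vecs = np.linalg.eigh(X.T @ X)
v = vecs[:, np.argmax(vals)]

# Variance along v dominates the variance along any other unit direction
top_var = np.var(X @ v)
for theta in np.linspace(0, np.pi, 50):
    u = np.array([np.cos(theta), np.sin(theta)])
    assert np.var(X @ u) <= top_var + 1e-9
print(top_var)
```

Consistent with the derivation, \(N\,\var(X\vect{v})\) equals the largest eigenvalue itself.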
Moving beyond one dimension
Once you've found the direction of largest variation, you can project the data onto the \(M-1\) dimensional subspace orthogonal to that direction (its orthogonal complement) and then find the direction of next greatest variability there.
Proceeding recursively, you can find as many principal components as you need.
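In practice, there's no need to deflate one direction at a time: a single eigen-decomposition of the covariance matrix yields all the components at once. Here's a minimal sketch on made-up data:

```python
import numpy as np

# Made-up centered data in 3 dimensions
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)

# Eigen-decompose the covariance matrix and sort by decreasing eigenvalue
Sigma = (X.T @ X) / X.shape[0]
vals, vecs = np.linalg.eigh(Sigma)
order = np.argsort(vals)[::-1]
components = vecs[:, order]       # columns are the principal components, in order

# Projecting onto the first k components reduces dimension from 3 to k
k = 2
reduced = X @ components[:, :k]
print(reduced.shape)
```

Because the covariance matrix is symmetric, `eigh` returns an orthonormal set of eigenvectors, so projecting onto several components at once is just a matrix product.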
Example
Here’s an example that looks kinda like a forum problem over this material. The idea is to find the principal components for the data set \[
\{(-1,-1), (-1,1), (0,1), (1,1), (2,0)\}.
I'll show you how to plot this data and illustrate the first principal component here using the computer. Note that the forum problem is certainly workable by hand.
Plotting the data
Let’s place the data into a matrix and plot it:
Code
```python
X = np.matrix([[-1, -1], [-1, 1], [0, 1], [1, 1], [2, 0]])
x = X[:, 0]
y = X[:, 1]
plt.plot(x, y, 'ok')
plt.gca().set_aspect('equal')
```
Computing the principal components
The principal components are exactly the eigenvectors of the covariance matrix, ordered by decreasing magnitude of the corresponding eigenvalues.
```python
from numpy.linalg import eig

S = X.transpose() * X
print("S: ", S)
eigen_info = eig(S)
print("eigen_info: ", eigen_info)
```
The output of eig is a tuple: the first entry is an array whose entries are the eigenvalues of \(S\), and the second is an array whose columns are the corresponding eigenvectors. Thus, reading off the eigenvector for the larger eigenvalue, I know that the first principal component must be \[\begin{bmatrix}0.95709203 & 0.28978415\end{bmatrix}^{\mathsf{T}}.\]
Illustration
Now, we plot the data together with the line determined by the first principal component.
Code
```python
v = np.asarray(eigen_info[1][:, 0]).ravel()
plt.plot(x, y, 'ok')
plt.plot([-2*v[0], 2*v[0]], [-2*v[1], 2*v[1]])
plt.gca().set_aspect('equal')
```
Applied example
Finally, let’s see how PCA is applied in a standard ML workflow. Specifically, we’ll apply PCA as a preprocessing step in logistic regression for classification in our wine data set. The mission is still to determine wine variety from our other variables.
|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | variety |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
Setting up the data
Let’s split the data into training and testing sets:
```python
from sklearn.model_selection import train_test_split

# The predictor variables
X = wine.drop(columns='variety')
# The target variable
y = wine.variety
# The train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1, stratify=y)
```
Pure logistic regression
Now, let’s train a pure logistic regression and print the accuracy results:
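The fitting code isn't shown here, so the following is a self-contained sketch of the two fits described below; sklearn's built-in copy of the wine data stands in for the downloaded frame, and the scaling and solver settings are assumptions:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# sklearn's built-in wine data: 178 wines, 13 features, 3 varieties
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1, stratify=y)

# Pure logistic regression on all 13 standardized features
full = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
full.fit(X_train, y_train)

# Logistic regression on just the first two principal components
reduced = make_pipeline(StandardScaler(), PCA(n_components=2),
                        LogisticRegression(max_iter=1000))
reduced.fit(X_train, y_train)

print(full.score(X_test, y_test), reduced.score(X_test, y_test))
```

The pipeline standardizes before applying PCA, consistent with the earlier remark that data is almost always standardized to mean zero and standard deviation 1.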
The first regression attains 98% accuracy using all thirteen variables. The second regression attains only 96% accuracy but uses only two principal components as predictors.
In a situation where we have tens of thousands of variables, that could be a huge deal!
Principal components
Here are the proportions of variance for each of the 13 principal components.
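One way to compute those proportions (again with sklearn's built-in wine data as a stand-in for the downloaded frame) is via the fitted PCA object's `explained_variance_ratio_` attribute:

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the 13 features, then fit PCA with all components retained
X, _ = load_wine(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_)   # 13 proportions, decreasing, summing to 1
```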