Principal component analysis (or PCA) is a dimensionality reduction technique that goes well beyond simply dropping variables. PCA uses much of the linear algebra we’ve learned to this point to project the data onto a lower-dimensional subspace that retains as much information as possible.
Find a set of orthogonal axes (the so-called principal components) such that:

- the variance of the data is maximized along the first axis, and
- each new axis captures as much of the remaining variance as possible.
Sample 2D data
Here’s some randomly generated data in two dimensions that should be concentrated about a line of slope \(1/3\); we’ll generate 20 points and display 5 of them:
    import random
    import numpy as np
    import pandas as pd

    # Generate 20 points scattered about the line y = x/3 + 1.
    x = [random.uniform(0, 5) for i in range(20)]
    y = [xx/3 + random.normalvariate(0, 0.4) + 1 for xx in x]
    data = np.array([x, y]).T
    pd.DataFrame(data, columns=['x', 'y']).head(5)
|   | x | y |
|---|---|---|
| 0 | 0.427452 | 0.996450 |
| 1 | 2.017142 | 2.064896 |
| 2 | 3.763688 | 2.352849 |
| 3 | 4.055300 | 2.381146 |
| 4 | 0.760795 | 1.581746 |
Centered data
Let’s shift the data so that it’s centered on the origin and plot it:
    import pickle
    import matplotlib.pyplot as plt

    # Shift the data so its mean sits at the origin, then plot it.
    mean = np.mean(data, axis=0)
    centered_data = data - mean
    [x, y] = centered_data.T
    plt.plot(x, y, 'ok')
    ax = plt.gca()
    ax.set_xlim(-2.5, 2.5)
    ax.set_ylim(-2, 1)
    ax.set_aspect(1)
    fig = plt.gcf()
    ax.spines['left'].set_position('zero')
    ax.spines['bottom'].set_position('zero')
    ax.spines['right'].set_color('none')
    ax.spines['top'].set_color('none')
    fig_data = pickle.dumps(fig)    # save the figure so we can draw on top of it later
Principal components 2D
The objective of PCA is to use the data to find the direction of maximum variation; it seems that direction should contain “the most information”. If you look at the code below, you should notice that the computation of that direction involves the eigenvectors of the so-called covariance matrix. It’s not the regression line!
If we project onto that first principal component, we get a new data set (of smaller dimension) that should contain nearly as much information as the full data set.
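Here’s a minimal sketch (not necessarily the code used on the original page) of how that direction, the first principal component v1 that appears in the projection code below, can be computed from the eigenvectors of the covariance matrix of centered_data:

```python
# Sketch: first principal component of the centered 2D data.
cov = centered_data.T @ centered_data / len(centered_data)   # 2 x 2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)               # eigh handles symmetric matrices
v1 = eigenvectors[:, np.argmax(eigenvalues)]                  # eigenvector with the largest eigenvalue
```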
    # Redraw the saved figure and project each point onto the line spanned by v1.
    fig = pickle.loads(fig_data)

    def P(x):
        """Orthogonal projection onto the span of the first principal component v1."""
        return (v1.dot(x) / v1.dot(v1)) * v1

    for [x, y] in centered_data:
        [xx, yy] = np.array(P([x, y])).T
        plt.plot([x, xx], [y, yy], 'k', linestyle=(0, (1, 2)))   # dotted segment to the projection
        plt.plot([xx], [yy], 'or')                               # projected point
Higher dimensions
This can all be done in higher dimensions as well, though full visualization stops at 3D.
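As an illustration (with hypothetical randomly generated data, not a data set from the original page), the same eigenvector computation works verbatim in three dimensions:

```python
# Sketch: principal components of 3D data concentrated near a line.
rng = np.random.default_rng(1)
t = rng.uniform(0, 5, 200)
data3d = np.column_stack([t, t/3, t/2]) + rng.normal(0, 0.4, (200, 3))
centered3d = data3d - data3d.mean(axis=0)
cov3d = centered3d.T @ centered3d / len(centered3d)
eigenvalues, eigenvectors = np.linalg.eigh(cov3d)
order = np.argsort(eigenvalues)[::-1]                  # largest eigenvalue first
print(eigenvectors[:, order])                          # columns are the principal components
```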
Variance and Covariance
To understand how PCA works, it helps to make sure we understand the basics of variance and covariance.
Variance is a numerical measure of the variation inherent in a single variable, represented by a column vector.
Covariance measures the joint variability of two related variables, represented by a pair of column vectors.
Mean and Variance
Let’s get more precise. Suppose we’ve got a data vector \[
\vect{x} = \left[x_1 \: x_2 \: \cdots \: x_N\right]^T
\]
Then, the mean and variance of \(\vect{x}\) are defined by
\[\begin{aligned}
\mu &= M(\vect{x}) = \frac{1}{N}\sum_{n=1}^N x_n \\ \sigma^2 &= V(\vect{x}) = \frac{1}{N}\sum_{n=1}^N (x_n-\mu)^2.
\end{aligned}\]
Thus, \(\sqrt{V(\vect{x})}\) (also called the standard deviation) measures how far the values of \(\vect{x}\) deviate from the mean on average.
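Here’s a quick numerical check of these formulas against NumPy, using a small hypothetical vector:

```python
import numpy as np

# Check the mean and variance formulas directly.
x = np.array([1.0, 2.0, 4.0, 7.0])
N = len(x)
mu = x.sum() / N
sigma2 = ((x - mu)**2).sum() / N
print(mu, sigma2, np.sqrt(sigma2))      # mean, variance, standard deviation
print(x.mean(), x.var(), x.std())       # NumPy agrees (it also divides by N by default)
```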
Covariance
Now suppose we’ve got two data vectors \[
\vect{x} = \left[x_1 \: x_2 \: \cdots \: x_N\right]^T
\text{ and }
\vect{y} = \left[y_1 \: y_2 \: \cdots \: y_N\right]^T
\] with means \(\mu_x\) and \(\mu_y\). Then, the covariance between them is defined by \[
\text{cov}(\vect{x}, \vect{y}) = \frac{1}{N} \sum_{n=1}^N
(x_n-\mu_x)(y_n-\mu_y).
\]
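And a quick check of the covariance formula, this time against np.cov with bias=True so that it divides by \(N\) rather than \(N-1\):

```python
import numpy as np

# Check the covariance formula on small hypothetical vectors.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.5, 3.5, 5.0])
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / len(x)
print(cov_xy)
print(np.cov(x, y, bias=True)[0, 1])    # off-diagonal entry of the 2x2 covariance matrix
```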
Covariance between variables is trickier to understand, but maybe some pictures will help!
Negative covariance
In the picture below, \(\text{cov}(\vect{x},\vect{y})<0\) and \(V(\vect{y}) < V(\vect{x})\).
Zero covariance
The covariance between \(\vect{x}\) and \(\vect{y}\) shown in the picture below is very close to zero, while \(V(\vect{y}) < V(\vect{x})\).
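The same relationships are easy to check numerically by generating hypothetical data with each property:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 500)
y_neg = -0.5 * x + rng.normal(0, 0.3, 500)    # y tends to decrease as x increases
y_zero = rng.normal(0, 0.3, 500)              # y is unrelated to x
print(np.cov(x, y_neg, bias=True)[0, 1])      # negative covariance
print(np.cov(x, y_zero, bias=True)[0, 1])     # covariance close to zero
print(x.var(), y_neg.var(), y_zero.var())     # V(y) < V(x) in both cases
```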
Higher dimensional data
Now suppose we’ve got a purely numerical data frame
| \(\vect{x}_1\) | \(\vect{x}_2\) | \(\vect{x}_3\) | \(\cdots\) | \(\vect{x}_M\) |
|---|---|---|---|---|
| \(x_{11}\) | \(x_{12}\) | \(x_{13}\) | \(\cdots\) | \(x_{1M}\) |
| \(x_{21}\) | \(x_{22}\) | \(x_{23}\) | \(\cdots\) | \(x_{2M}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| \(x_{N1}\) | \(x_{N2}\) | \(x_{N3}\) | \(\cdots\) | \(x_{NM}\) |
This data frame can be represented as the matrix \[X = \begin{bmatrix}
x_{11}&x_{12}&x_{13}&\cdots&x_{1M} \\
x_{21}&x_{22}&x_{23}&\cdots&x_{2M} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{N1}&x_{N2}&x_{N3}&\cdots&x_{NM}
\end{bmatrix}.\]
Covariance matrix
To simplify matters, we shift the data so that the columns have mean zero. We’ll assume that’s been done for our matrix \(X\).
Then, the covariance matrix of \(X\) is the matrix \[\Sigma = \frac{1}{N} X^T X.\] It’s easy to check that the entry \([\Sigma]_{ij}\) in row \(i\) and column \(j\) of \(\Sigma\) is exactly the covariance of the \(i^{\text{th}}\) column of \(X\) with its \(j^{\text{th}}\) column.
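Here’s a quick check of this definition against np.cov, using a small hypothetical centered matrix:

```python
import numpy as np

# The covariance matrix as a matrix product.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)                      # center the columns
Sigma = X.T @ X / len(X)
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))   # True
```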
Connection with PCA
It turns out that the principal components are exactly the eigenvectors of the covariance matrix!
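One way to see this in action (with hypothetical data, since we haven’t loaded a real data set yet) is to compare an eigendecomposition of \(\Sigma\) with the components reported by scikit-learn’s PCA; they agree up to sign:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))    # correlated hypothetical data
X = X - X.mean(axis=0)
Sigma = X.T @ X / len(X)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]                      # largest eigenvalue first
pca = PCA(n_components=3).fit(X)
for v, c in zip(eigenvectors[:, order].T, pca.components_):
    print(np.allclose(v, c) or np.allclose(v, -c))         # each component matches an eigenvector
```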
How PCA works
Again, we’ve got an \(N\times M\) data matrix \(X\). The data lives in \(M\)-dimensional space and we’d like to project it onto a lower dimensional subspace that contains maximal variance. We start by considering how to project it onto a one-dimensional subspace with maximal information.
Put another way, we wish to find a single vector \(\vect{v}\) for which the variance \(V(X\vect{v})\) is as large as possible. Of course, we can make that variance as large as we want by choosing \(\vect{v}\) to be as long as we want, so we add the restriction that \(\|\vect{v}\| \leq 1\).
The crux of the solution
The key is to write \[NV(X\vect{v}) = \|X\vect{v}\|^2 = (X\vect{v})^T (X\vect{v}) = \vect{v}^T X^T X \vect{v},\] where the first equality holds because the columns of \(X\) are centered, so \(X\vect{v}\) has mean zero. Now, let’s suppose that \(\vect{v}\) is a normalized eigenvector of \(X^T X\) with eigenvalue \(\lambda\) and that \(\lambda\) is the largest eigenvalue. Then, \[\vect{v}^T X^T X \vect{v} = \lambda \vect{v}^T \vect{v} = \lambda.\] At that point, it becomes pretty easy to believe that we’ve found the direction of maximum variability. In fact, it’s certainly true: \(X^T X\) is symmetric, so it has a complete set of orthonormal eigenvectors, and writing any unit vector in that basis shows that \(\vect{v}^T X^T X \vect{v}\) is a weighted average of the eigenvalues, which can be no larger than \(\lambda\).
Strategy of the proof
A complete proof can be found by solving the constrained optimization problem \[\begin{aligned}
&\text{maximize } \vect{v}^T X^T X \vect{v} \\
&\text{subject to } \|\vect{v}\| \leq 1.
\end{aligned}\] That can be done fairly easily using the technique of Lagrange multipliers.
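In outline: the maximum occurs on the boundary \(\|\vect{v}\| = 1\), where the gradient of the objective must be a scalar multiple of the gradient of the constraint, and that condition is exactly the eigenvector equation:
\[\begin{aligned}
\nabla_{\vect{v}}\left(\vect{v}^T X^T X \vect{v}\right) &= \lambda \, \nabla_{\vect{v}}\left(\vect{v}^T \vect{v}\right) \\
2\,X^T X \vect{v} &= 2\lambda \vect{v} \\
X^T X \vect{v} &= \lambda \vect{v}.
\end{aligned}\]
Multiplying the last line by \(\vect{v}^T\) shows that the multiplier \(\lambda\) equals the value of the objective, so the maximum occurs at an eigenvector belonging to the largest eigenvalue.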
Moving beyond one dimension
Once you’ve found the direction of largest variation, you can project the data onto the \((M-1)\)-dimensional subspace orthogonal to that direction (its orthogonal complement) and find the direction of next greatest variability there. Repeating this process yields the remaining principal components, as sketched below.
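Here’s a minimal sketch of that deflation idea on hypothetical data:

```python
import numpy as np

# Find the first principal component, deflate, then find the second.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)                                    # center the columns
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
v1 = eigenvectors[:, np.argmax(eigenvalues)]              # first principal component
X_deflated = X - np.outer(X @ v1, v1)                     # project rows onto the orthogonal complement of v1
eigenvalues2, eigenvectors2 = np.linalg.eigh(X_deflated.T @ X_deflated)
v2 = eigenvectors2[:, np.argmax(eigenvalues2)]            # second principal component
print(abs(v1 @ v2))                                       # essentially zero: v2 is orthogonal to v1
```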
An applied example
Let’s look at an example using a classic data set on wine. There are 178 wines in the data but we’ll examine just a sample of size 5:
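Here’s a minimal sketch of how the data might be loaded, assuming the classic wine data set that ships with scikit-learn, with its target column renamed to variety to match the code that follows:

```python
import pandas as pd
from sklearn.datasets import load_wine

# Load the wine data (178 rows, 13 chemical measurements, 3 classes) as a data frame.
wine = load_wine(as_frame=True).frame.rename(columns={'target': 'variety'})
wine.sample(5, random_state=1)
```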
The data consists of the results of a chemical analysis of wines grown in the same region of Italy by three different cultivators. Thirteen measurements were taken of different constituents found in the three types of wine.
The idea is to predict the type of wine from the chemical data. The “type” of a wine is determined by the grape or grapes used in making it; in Italian wines, those might be Pinot Gris, Merlot, Sangiovese, or any of quite a few others. There are only three types in this data set, and we’re going to try to predict that type using logistic regression, both without and with PCA.
Setting up the data
Let’s split the data into training and testing sets:
    from sklearn.model_selection import train_test_split

    # Separate the thirteen predictors from the wine variety; hold out a third of the data for testing.
    X = wine.drop(columns='variety')
    y = wine.variety
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=1, stratify=y)
Pure logistic regression
Now, let’s train a pure logistic regression and print the accuracy results:
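The model-fitting code isn’t shown here; the following is a hedged sketch of how both fits might look with scikit-learn, assuming a StandardScaler preprocessing step (the accuracies reported below may differ slightly from what this sketch produces):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Logistic regression on all thirteen variables.
full_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
full_model.fit(X_train, y_train)
print(full_model.score(X_test, y_test))

# Logistic regression on just the first two principal components.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=2),
                          LogisticRegression(max_iter=1000))
pca_model.fit(X_train, y_train)
print(pca_model.score(X_test, y_test))
```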
The first regression attains 98% accuracy using all thirteen variables. The second attains only 96% accuracy but uses just two principal components as predictors.
In a situation where we have tens of thousands of variables, that could be a huge deal!
Principal components
Here are the proportions of variance captured by each of the 13 principal components.
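These proportions come from the explained_variance_ratio_ attribute of a fitted PCA; a minimal sketch, again assuming standardized training data:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Proportion of the total variance captured by each of the 13 principal components.
pca_all = PCA(n_components=13).fit(StandardScaler().fit_transform(X_train))
print(pca_all.explained_variance_ratio_)
```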