Understanding PCA
I am writing this article because the material I found for PCA on the internet is a bit too practical. They explain PCA like steps of an algorithm while missing out the mathematical gist behind its working. So this post would make you feel sleepy, as I would be diving deep into the working of PCA. For those who don’t know what PCA is, this is the complete stop for their reference. And for those who know how to use/implement PCA, this would increase their understanding of PCA and the mathematics behind it. Hold tight and don’t let your eyelids fall.
What is PCA ?
Real-world datasets are complex and large. Large in both the number of data points and the number of dimensions representing each data point. This is because, in machine learning, a very simple observation is that the more data your model is trained on the more generic it gets. So when preparing the sample out of the population, beforehand you don’t know about how many variables would be enough to capture a data point, so you try to capture every feature that you can. This makes the feature set very large but many times these features are not different between themselves i.e. many of the features can be derived from some other feature set in the same dataset. If you somehow remove those redundant features then you can reduce the cardinality of the dimensions of your dataset. Reducing the number of features helps a lot as it reduces a lot of computation time. PCA is a dimensionality reduction technique. PCA derives new dimensions from the old dimensions which are rich in capturing the variance of the data and linearly independent (any one of these dimensions cannot be derived by any linear combination of the other dimensions). Those of you not interested in mathematics can jump to the last section (Algorithm).
Let’s dive
Let,
\(\begin{align}
m &= \text{number of datapoints}\newline
n &= \text{number of features}\newline
X &= m \times n \;\;\;\text{input matrix of dataset where columns represent the features and rows represent the datapoints} \newline
x_i &= i^{th} \text{ row of }X
\end{align}\)
The intuition behind PCA is projecting the data on a new basis such that:
- Data has a high variance along the new basis vectors.
- The basis vectors are linearly independent i.e. orthonormal
Let,
\(\begin{align}
P &= Projection\; matrix.\;n \times n \; matrix \; whose\; columns \; are\; basis\; vectors (orthonormal)\newline
p_i &= i^{th} \; vector\;of\;P
\end{align}\)
We want to represent each data point in terms of these vectors.
Basically,
\(\begin{align}
x_i &= a_{i1}p_1+a_{i2}p_2+a_{i3}p_3+.....+a_{in}p_n \tag{1}\newline
p_i^ \intercal p_j &= 0\;\;\;\; \text{for} \; i \ne j. \;\text{As new basis is orthonormal and,}\newline
p_i^\intercal p_i &= I \;\;\;\; \text {As we want our basis vectors to be unit vectors}\newline
\end{align}\)
Taking the transpose of the equation $(1)$ and having dot product with $p_j$ from the right side gives us $a_{ij}$.
\(\begin{align}
a_{ij} &= x_i ^\intercal p_j\newline
\therefore \hat{x}_i &= x_i^\intercal P \newline
\therefore \hat{X} &= X ^\intercal P \tag{2}
\end{align}\)
Now, let us see how to produce covariance matrix $C$ out of data. $C_{ij}$ is the covariance between $i^{th}$ and $j^{th}$ feature in the data. So, C is $n \times n$.
\(\begin{align}
C_{ij} &= \frac{1}{m} \sum_{k=1}^{m} (X_{ki} -\mu_i)(X_{kj}-\mu_j) \newline
&= \frac{1}{m} \sum_{k=1}^{m} (X_{ki}X_{kj} -\mu_iX_{kj} - \mu_jX_{ki} + \mu_i\mu_j) \;\;\;\;
\text{Let's assume the data has zero mean and unit variance.} \newline
\therefore C_{ij} &= \frac{1}{m} \sum_{k=1}^{m} X_{ki}X_{kj} \newline
C_X &= \frac{1}{m} X^\intercal X \tag{3}
\end{align}\)
It is important to note that to arrive at such a simplified form we need to standard normalize the data.
Also, $X^\intercal X$ is a symmetric matrix $ \because (X^\intercal X)^\intercal = X^\intercal(X^\intercal)^\intercal = X^\intercal X $. In other perspective, we can see that this matrix defines covariance and covariance is commutative, so $C_{ij} = C_{ji}$.
We want the covariance matrix of transformed data so that we can minimize the covariance between the features (make it zero). Let’s see if we can get covariance in $\hat{X}$ in this simple form. To get this, we need the $\hat{X}$ to be zero mean. But we already did that by making $X$ zero mean.
Proof: Let’s examine feature-wise sum of both $X$ and $\hat{X}$. $1^\intercal X$ gives us the feature-wise mean of $X$ and this will be a zero row vector.
To Prove: $1^\intercal \hat{X}$ is a zero row vector.
\(Eq (2) \implies 1^\intercal \hat{X} = (1^\intercal X) P =0\)
This allows us to write the covariance matrix of our transformed data as $C_{\hat{X}} = \frac{1}{m}\hat{X}^\intercal \hat{X}$
Ideally, we want, \(\begin{align} (C_{\hat{X}})_{ij} &= 0 \;\;\;\; i\ne j \newline (C_{\hat{X}})_{ij} &\neq 0 \;\;\;\; i=j \end{align}\)
In other words, we want $C_\hat{X}$ to be diagonal.
\(\begin{align}
C_\hat{X} &= \frac{1}{m}\hat{X}^\intercal \hat{X} = \frac{1}{m}(XP)^\intercal(XP) = \frac{1}{m}P^\intercal X^\intercal XP \newline
C_\hat{X} &=P^\intercal \Sigma P \;\;\;\; \text{where }\; \Sigma = \frac{1}{m}X^\intercal X \tag{4}
\end{align}\)
We want $P^\intercal \Sigma P = D \;\; i.e.\; Diagonal\;matrix$
To find $P$ and $D$, Let’s see Eigen Value Decomposition.
Eigen Value Decomposition
Eigen vectors are special kind of vectors which are very less affected during transformations (space/basis transformations). When a transformation is applied these vectors only scale but doesn’t change directions.
\(\begin{align}
Av &= \lambda v \\ \tag{4.1}
\text{where}, A&:\text{the transformation applied}\\
v &: \text{the eigen vector}\\
\lambda &: \text{constant scalar | Eigen value of }v
\end{align}\)
Each eigen vector is associated with its corresponding eigen value. Writing this for many eigen vectors in matrix form:
\(\begin{align}
AV &= A\begin{bmatrix}
\uparrow & \uparrow & & \uparrow \\
v_1 & v_2 & ... & v_n\\
\downarrow & \downarrow & & \downarrow \\
\end{bmatrix}
=
\begin{bmatrix}
\uparrow & \uparrow & \uparrow \\
Av_1 & Av_2 & Av_n\\
\downarrow & \downarrow & \downarrow \\
\end{bmatrix}
\\ \\
&= \begin{bmatrix}
\uparrow & \uparrow & & \uparrow \\
\lambda_1 v_1 & \lambda_2v_2 & ... & \lambda_nv_n\\
\downarrow & \downarrow & & \downarrow \\
\end{bmatrix}
\\
AV &= V\Lambda ,\; where \; \Lambda=
\begin{bmatrix}
\lambda_1 & 0 & 0 & ... & 0 \\
0 & \lambda_2 & 0 & ... & 0 \\
0 & 0 & \lambda_3 & ... & 0 \\
. & . & . & . & . \\
0 & 0 & 0 & 0 & \lambda_n \tag{4.2} \\
\end{bmatrix}
\end{align}\)
If $A^{-1}$ exists, then we can write ( Invertibility of matrix):
\(\begin{align}
&A = V \Lambda V^{-1} \;\;\;\; \text{This is called Eigen value decomposition of A} \tag{4.3}\\
&V^{-1}AV = \Lambda \;\;\;\; \text{This is called Diagonalization of A}
\end{align}\)
Now let’s slowly move to our case. Consider the case when $A$ is a square symmetric matrix.
Theorem: The Eigenvectors of square symmetric matrix are orthogonal
Proof:
\(\langle Ax,y \rangle = \langle x,A^\intercal y \rangle \;\;\;\;\text{ Matrix multiplication on inner product} \tag*{}\)
Now, $A$ is symmetric. Let $x$ and $y$ be eigenvectors of $A$ corresponding to distinct eigen values $\lambda$ and $v$. Then,
\(\lambda \langle x,y \rangle = \langle \lambda x,y \rangle = \langle Ax,y\rangle = \langle x, A^\intercal y \rangle = \langle x, \mu y \rangle = \mu \langle x, y \rangle \\
\begin{align}
\therefore \lambda \langle x, y\rangle &= \mu\langle x,y\rangle \\
(\lambda - \mu)\langle x,y\rangle &= 0 \\
\langle x, y\rangle &= 0 \;\;\;\; (\because \lambda \ne \mu) \\
\therefore x \perp y \\ \tag*{}
\end{align}\)
Also, consider the case when these eigen vectors are of unit norm. $||x||=1$.
\(\begin{align*}
V^\intercal V &=
\begin{bmatrix}
\leftarrow & v_1 & \rightarrow \\
\leftarrow & v_2 & \rightarrow \\
& . & \\
\leftarrow & v_n & \rightarrow
\end{bmatrix}
\begin{bmatrix}
\uparrow & \uparrow & & \uparrow \\
v_1 & v_2 & ... & v_n \\
\downarrow & \downarrow & & \downarrow
\end{bmatrix}
= I \\
\therefore V^{-1} &= V^\intercal \tag{4.4}
\end{align*}\)
Eventually, our eigen decomposition takes the form of…..
$$ A = V\Lambda V^\intercal \tag{4.5} $$And the diagonalization of A would be
$$ V^\intercal A V = V \tag{4.6} $$Comparing equation $(4)$ and $(4.6)$, we get
\(\begin{align}
D &= C_\hat{X} = \Lambda\; \text{ Diagonal matrix of Eigenvalues }\\
\Sigma &= \frac{1}{m}X^\intercal X = A\\
P &= V\;, \;\; \text{ the eigen vectors of A}\\ \tag*{}
\end{align}\)
Let’s have the summary so far. We wanted to transform our data to a new basis which has low covariance and low noise. We found out that the new basis would be eigen vectors of $\dfrac{1}{m}X^\intercal X$. The new basis is orthogonal. As $\dfrac{1}{m}X^\intercal X$ is symmetric, the eigen vectors are linearly independent. The transformation of data to such a new basis is called Principal Component Analysis. Doing this we can inherently lead to dimension reduction i.e. If we find one or more eigen vectors/values to be zero. But we can explicitly reduce the dimension by choosing the eigen vectors along which the variance is the highest (dropping low variance eigen vectors). In practise, we choose $top-k$ eigen vectors.
Let’s find out which eigen vectors we should drop so that we do not lose much information.
Reconstruction error
We can represent any vector $x_i$ using our newly found $n$ eigen vectors $P = p_1, p_2, p_3, …. , p_n$ as their linear combination. This is going back to equation $(1)$.
\(\begin{align}
x_i &= a_{i1}p_1+a_{i2}p_2+a_{i3}p_3+.....+a_{in}p_n \\
x_i &= \sum_{j=1}^{n}a_{ij}p_j \tag*{}
\end{align}\)
But we want to select only k vectors, so,
\(\hat{x}_i =\sum_{j=1}^{k}a_{ij}p_j \tag*{}\)
Let’s see what error we would be introducing while doing so. We will calculate the squared error and try to minimise it.
\(\begin{align}
e &= \sum_{i=1}^{m}(x_i - \hat{x}_i)^\intercal (x_i - \hat{x}_i) \\
&= \sum_{i=1}^{m}(\sum_{j=1}^{n}a_{ij}p_j - \sum_{j=1}^{k}a_{ij}p_j)^2 \\
&= \sum_{i=1}^{m}(\sum_{j=k+1}^{n}a_{ij}p_j)^2 \\
&= \sum_{i=1}^{m}(\sum_{j=k+1}^{n}a_{ij}p_j)^\intercal (\sum_{j=k+1}^{n}a_{ij}p_j) \\ \tag*{}
\end{align}\)
Here, $a_{ij}$ are constants so $p_j$’s are being multiplied to one another. P is formed of eigen vectors which are orthonormal. Eventually, $p_j^\intercal p_j = 1$ and $p_i^\intercal p_j$ for $i \ne j$.
\(\begin{align}
e &= \sum_{i=1}^{m}\sum_{j=k+1}^{n} a_{ij}^2 \\
&= \sum_{i=1}^{m}\sum_{j=k+1}^{n} (x_i ^\intercal p_j)^2 ;\;\;\;\text{from}\;(2) \\
&= \sum_{i=1}^{m}\sum_{j=k+1}^{n} (x_i ^\intercal p_j)^\intercal (x_i ^\intercal p_j) \\
&= \sum_{i=1}^{m}\sum_{j=k+1}^{n}p_j^\intercal x_i x_i ^\intercal p_j \\
&= \sum_{j=k+1}^{n}p_j^\intercal (\sum_{i=1}^{m} x_i x_i ^\intercal) p_j \\
&= \sum_{j=k+1}^{n}p_j^\intercal (X^\intercal X) p_j \\
&= \sum_{j=k+1}^{n}p_j^\intercal mC p_j; \;\;\;\;\; \text{from}\;(3)\tag{5} \\
\end{align}\)
To minimise this value, let’s find the solution to $\min_{x} x^ \intercal C x$ given the constraint $x^\intercal x=1$. This is a constrained optimisation problem so we will use Lagrangian multiplier for our solution.
\(\begin{align}
L &= x^\intercal Cx - \lambda (x^\intercal x - 1) \\
\frac{\partial L}{\partial x} &= 2Cx - 2\lambda x = 0 \\
Cx &= \lambda x \tag{5.2}
\end{align}\)
What a surprise!!! We didn’t even consider $x$ as eigen vector still it turned out to be eigen vector same as our $p$.
Mutiplying equation $(5.2)$ by $x^\intercal$.
\(\begin{align}
x^\intercal Cx &= \lambda x^\intercal x \\
&= \lambda \tag*{}
\end{align}\)
The solution lies in the extremities of eigen values. Largest eigen value would give us maximum value and smallest eigen value will give us the minimum value. The solution to equation $(5)$ lies in the eigen vectors with the smallest eigen values.
While picking top k vectors, we would first sort the vectors according to their eigen values in descending order and pick the eigen vectors corresponding to the first k eigen values and leave others.
Let’s calculate the variance of data using this new basis. Taking equation $(2)$ and seeing the $i^th$ dimension of new data.
\(\begin{align}
\hat{X}_i &= X p_i \\
\end{align}\)
The variance along this dimesion would be,
\(\begin{align}
\frac{1}{m}\hat{X}_i ^ \intercal \hat{X}_i &= \frac{1}{m}p_i^\intercal (X_i ^ \intercal X_i p_i) \\
&= p_i^\intercal \lambda _i p_i \;\;\;\;\text{ $\because p_i$ is the eigen vector of $\frac{1}{m}X_i ^ \intercal X_i$} \\
&= \lambda _i \\ \tag*{}
\end{align}\)
Voila!!! The eigen values are the variances of each eigen vector. So, we have done nothing wrong by selecting the vectors having maximum eigen values.
That is all to understand how PCA works. Let’s summarise the steps to perform PCA on a dataset.
Algorithm
1) Standardise the data to zero mean and unit variance.
2) Find covariance matrix $ C = \dfrac{1}{m} X^\intercal X $.
3) Find the eigen values and eigen vectors of this covariance matrix.
4) Decide the $k$ for selecting top $k$ eigen vectors. You can do it by deciding how much variance you want to keep by deciding a percentage on it. For example, find k such that,
\(\dfrac{\sum_{i=1}^{k}\lambda _i}{\sum_{i=1}^{n}\lambda _i} \ge 0.95\)
to keep $95\%$ variance in the data. In this step, you will find the number of dimensions being reduced.
5) Create matrix $P$ by aligning these k vectors.
6) Transform the data using $ \hat{X} = X ^\intercal P $.
7) Done
Now you are ready to train any model (classifier/regressor/etc.) model on this transformed data. But be careful while running inference as this also would also need to transform the test data points.