Reconstructing the Original Raw Dataset by Reversing Principal Component Analysis (PCA)
George Jen, Jen Tek LLC
Principal Component Analysis (PCA) is one of the popular methods for reducing the dimensions of a raw dataset (in simple terms, reducing the number of feature columns). The column-reduced dataset that results from PCA keeps a percentage of the information in the raw dataset, and that percentage meets a desired threshold, for example, 90%.
Here is the step-by-step process for PCA:
1. Let Xnxm (n rows and m columns) be the original raw dataset.
2. Compute the covariance matrix of Xnxm and assign it to Ymxm. The covariance matrix of the m feature columns is always an m x m square matrix.
3. Compute the m eigenvalues and the corresponding m eigenvectors of Ymxm. The number of eigenvalues = the number of eigenvectors = the number of columns of the original dataset.
4. Sum up all the eigenvalues.
5. Sort the m eigenvectors by their corresponding eigenvalues in descending order, so that the eigenvector with the largest eigenvalue is the 1st eigenvector and the eigenvector with the smallest eigenvalue is the mth eigenvector.
6. Assign the sorted eigenvectors to Wmxm, a matrix whose columns are these eigenvectors in sorted order.
7. Iteratively find k, starting from 1. If sum(top k abs(eigenvalues)) / sum(all abs(eigenvalues)) >= the desired threshold, such as 0.90, stop iterating and pick this k.
8. Multiply the original Xnxm by Wmxm and assign the resulting matrix to Znxm, which is Xnxm projected by PCA, with the same dimensions.
9. Dataset Znxm has 100% of the information of Xnxm. The feature columns of Znxm are arranged so that the 1st (leftmost) column is the 1st principal component, the 2nd column is the 2nd principal component, and so on.
10. You can get the reduced-dimension dataset Z~nxk by taking the top k columns of Znxm (k from step 7), from the 1st principal component to the kth principal component. Z~nxk keeps at least the threshold share (90% in this example) of the information in the original raw dataset Xnxm. For this to count as dimension reduction, k < m.
11. This is how you achieve dimension reduction using PCA, from Xnxm to Z~nxk.
12. You can use Z~nxk to train the model instead of Xnxm.
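The steps above can be sketched compactly in NumPy. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name pca_project and its parameters are hypothetical, it uses numpy.linalg.eigh (which is intended for symmetric matrices such as a covariance matrix), and, following the steps as written, it does not mean-center X before projecting.

```python
import numpy as np

def pca_project(X, threshold=0.90):
    """Hypothetical helper: return (Z, W, k) for dataset X and a threshold."""
    X = np.asarray(X, dtype=float)
    m = X.shape[1]
    cov = np.cov(X, rowvar=False)               # m x m covariance matrix
    eigenvals, eigenvecs = np.linalg.eigh(cov)  # eigenvectors are the columns
    order = np.argsort(eigenvals)[::-1]         # largest eigenvalue first
    eigenvals = eigenvals[order]
    W = eigenvecs[:, order]                     # sorted eigenvectors as columns
    # Cumulative share of the eigenvalue total covered by the top k eigenvalues
    ratios = np.cumsum(np.abs(eigenvals)) / np.sum(np.abs(eigenvals))
    # Smallest k whose cumulative share reaches the threshold, capped at m
    k = min(int(np.searchsorted(ratios, threshold)) + 1, m)
    Z = X @ W                                   # full projection, n x m
    return Z, W, k

X = np.array([[11, 2, 25], [12, 20, 31], [5, 6, 7], [200, 10, 22]])
Z, W, k = pca_project(X, threshold=0.90)
Z_reduced = Z[:, :k]                            # dimension-reduced dataset, n x k
print(Z.shape, Z_reduced.shape, k)
```

Taking the cumulative sum once avoids recomputing the ratio for each candidate k, but the result is the same k the iterative search in step 7 would find.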
Is a PCA-projected dataset reversible to the original raw dataset? No, and yes.
No, you cannot exactly reverse a dimension-reduced (column-reduced) dataset after PCA to the original raw dataset, that is, from Z~nxk to Xnxm, because the dropped columns carried the remaining share of the information.
Yes, you can reverse a PCA-projected dataset whose columns have not been reduced to the original dataset, that is, from Znxm to Xnxm.
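To make the "no" and the "yes" concrete, here is a small sketch. The value k = 1 is a hypothetical choice for illustration, and multiplying the reduced dataset by the transpose of the first k eigenvectors is a common way to get an approximation, not something the reduced dataset can turn into the exact original.

```python
import numpy as np

X = np.array([[11, 2, 25], [12, 20, 31], [5, 6, 7], [200, 10, 22]], dtype=float)

# Sorted eigenvectors of the covariance matrix, stored as the columns of W
eigenvals, eigenvecs = np.linalg.eigh(np.cov(X, rowvar=False))
W = eigenvecs[:, np.argsort(eigenvals)[::-1]]

Z = X @ W                  # full projection, n x m
k = 1                      # hypothetical number of components kept
Z_reduced = Z[:, :k]       # dimension-reduced dataset, n x k

# "Yes": all m columns of Z recover X
X_exact = Z @ np.linalg.inv(W)

# "No": the reduced dataset only yields an approximation of X
X_approx = Z_reduced @ W[:, :k].T

print(np.allclose(X_exact, X))
print(np.allclose(X_approx, X))
print(np.abs(X - X_approx).max())  # largest absolute reconstruction error
```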
To reverse or reconstruct the original raw dataset from the PCA-projected dataset, simply reverse the matrix multiplication. Since Znxm = Xnxm multiplied by Wmxm, it follows that Xnxm = Znxm multiplied by the inverse of Wmxm.
1. Get the inverse of the matrix Wmxm and assign the inverted matrix to invWmxm.
2. Multiply the PCA-projected dataset Znxm by invWmxm.
3. The result of Znxm multiplied by invWmxm is Xnxm, the original raw dataset.
4. You have reconstructed the original raw dataset from the PCA-projected dataset.
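One detail worth knowing about the inverse: a covariance matrix is symmetric, so its eigenvectors can be chosen orthonormal, and for such a W the inverse equals the transpose. The sketch below checks this numerically. It assumes the eigenvectors come from numpy.linalg.eigh, which returns orthonormal eigenvectors for symmetric input; the variable names are illustrative.

```python
import numpy as np

X = np.array([[11, 2, 25], [12, 20, 31], [5, 6, 7], [200, 10, 22]], dtype=float)

# Orthonormal eigenvectors of the symmetric covariance matrix, sorted by
# descending eigenvalue and stored as the columns of W
eigenvals, eigenvecs = np.linalg.eigh(np.cov(X, rowvar=False))
W = eigenvecs[:, np.argsort(eigenvals)[::-1]]

invW = np.linalg.inv(W)
print(np.allclose(invW, W.T))      # the inverse matches the transpose

Z = X @ W
restoredX = Z @ W.T                # same result as Z @ invW
print(np.allclose(restoredX, X))
```

Using the transpose avoids an explicit matrix inversion, which is cheaper and numerically gentler when m is large.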
The following code demonstrates this:
import numpy as np
# Original dataset X
X = np.array([[11, 2, 25],
              [12, 20, 31],
              [5, 6, 7],
              [200, 10, 22]])
# Get the covariance matrix from X; when using numpy.cov,
# make sure to set the parameter rowvar to False
covX = np.cov(X, rowvar=False)
# Get eigenvalues and eigenvectors from covX
eigenvals, eigenvecs = np.linalg.eig(covX)
# Pair each eigenvalue with its corresponding eigenvector.
# numpy.linalg.eig returns the eigenvectors as columns, hence eigenvecs.T
pair = [(i[0], i[1]) for i in zip(eigenvals, eigenvecs.T)]
# Sort the eigenvectors by their eigenvalues in descending order
sortedPair = [[i[0], i[1]] for i in sorted(pair, key=lambda x: abs(x[0]), reverse=True)]
# Get the sorted eigenvectors as the columns of W, keeping the sort order
W = np.array([i[1] for i in sortedPair]).T

def findTopPrincipalComponentNumber(eigenvals: np.ndarray, threshold: float = 0.80) -> tuple:
    '''findTopPrincipalComponentNumber returns the number k that meets the threshold'''
    total = sum(eigenvals)
    # Sort the eigenvalues in descending order and calculate the share of
    # the information in the raw dataset that each eigenvalue carries
    eachPercentList = [(i / total) for i in sorted(eigenvals, reverse=True)]
    cumulativePercept = 0.0
    cumulativePerceptList = []
    for i in eachPercentList:
        cumulativePercept += i
        cumulativePerceptList.append(cumulativePercept)
        if cumulativePercept >= threshold:
            break
    return (len(cumulativePerceptList), float(cumulativePerceptList[-1]))

# Get the k that keeps at least 95% of the original information from the raw dataset
bestK, coveredPercentage = findTopPrincipalComponentNumber(eigenvals, 0.95)
# The PCA-projected dataset is X times W
Z = np.matmul(X, W)
# Get the first bestK columns from Z to create the dimension-reduced dataset pcaZ,
# which has about 98% of the information from the original X for this dataset
pcaZ = []
for i in Z.tolist():
    row = []
    for j in range(bestK):
        row.append(i[j])
    pcaZ.append(row)
pcaZ = np.array(pcaZ)
# To reconstruct, pcaZ cannot be used; Z is needed.
# Since Z = X times W, X = Z times inverse(W)
restoredX = np.matmul(Z, np.linalg.inv(W))
# Restored dataset, which should match the original raw dataset X
restoredX
array([[ 11.,   2.,  25.],
       [ 12.,  20.,  31.],
       [  5.,   6.,   7.],
       [200.,  10.,  22.]])
# Original raw dataset
X
array([[ 11,   2,  25],
       [ 12,  20,  31],
       [  5,   6,   7],
       [200,  10,  22]])
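The two printed arrays differ in type: the restored values are floating point (11.) while the original values are integers (11). Rather than comparing the printouts by eye, the match can be checked programmatically. The sketch below is one way to do that; it rebuilds a small example so that it runs on its own, and it relies only on W being invertible.

```python
import numpy as np

X = np.array([[11, 2, 25], [12, 20, 31], [5, 6, 7], [200, 10, 22]])

# Eigenvectors of the covariance matrix; any invertible W works for this check
_, W = np.linalg.eig(np.cov(X, rowvar=False))
restoredX = np.matmul(np.matmul(X, W), np.linalg.inv(W))

print(restoredX.dtype, X.dtype)    # floating point vs integer
# Equal within floating-point tolerance
print(np.allclose(restoredX, X))
# Equal exactly after rounding back to the original integer type
print(np.array_equal(np.rint(restoredX).astype(X.dtype), X))
```

The matrix products and the inversion introduce tiny rounding errors, so a tolerance-based comparison such as numpy.allclose is the appropriate test; an exact equality check on the raw floating-point result may fail even though the reconstruction is correct.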
Thank you for taking the time to read this article.