Reconstructing the Original Raw Dataset from a PCA Projection (Reversing Principal Component Analysis)

George Jen, Jen Tek LLC

**Principal Component Analysis (PCA) is a popular method for reducing the dimensions (in simple terms, the number of feature columns) of a raw dataset. The column-reduced dataset produced by PCA keeps a percentage of the information in the raw dataset that meets a chosen threshold, for example 90%.**

**Here is the step-by-step process for PCA:**

1. Let X*nxm* (*n* rows and *m* columns) be the original raw dataset.

2. Get the covariance matrix of X*nxm* and assign it to Y*mxm*. A covariance matrix is always an m x m square matrix.

3. Get the *m* eigenvalues and the *m* corresponding eigenvectors of Y*mxm*. The number of eigenvalues = the number of eigenvectors = the number of columns of the original dataset.

4. Sum up all the eigenvalues.

5. Sort the *m* eigenvectors by their corresponding eigenvalues in descending order, so that the eigenvector with the largest eigenvalue is the 1*st* eigenvector and the eigenvector with the smallest eigenvalue is the m*th* eigenvector.

6. Assign these sorted eigenvectors to W*mxm*, a matrix whose columns are the eigenvectors in sorted order.

7. Iteratively find *k*, starting from 1. If *sum(top k abs(Eigenvalues))/sum(all abs(Eigenvalues)) >= desired threshold*, such as 0.90, then break the iteration and pick this *k*.

Now multiply the original X*nxm* by W*mxm* and assign the resulting matrix to Z*nxm*, which is X*nxm* projected by PCA, with the same dimensions.

8. Dataset Z*nxm* has 100% of the information of X*nxm*. The feature columns of the dataset (matrix) Z*nxm* are arranged so that the 1*st* (leftmost) column is the 1*st* principal component, the 2*nd* column is the 2*nd* principal component, and so on.

9. You can get the reduced-dimension dataset Z*~nxk* by taking the top *k* columns (with *k* obtained in the earlier step), from the 1*st* column, the 1*st* principal component, to the k*th* column, the k*th* principal component. That dataset Z*~nxk* keeps at least the threshold share, here 90%, of the information in the original raw dataset X*nxm*. For this to count as dimension reduction, *k* < *m*.

10. This is how PCA achieves dimension reduction, from X*nxm* to Z*~nxk*.

11. You can use Z*~nxk* to train the model instead of X*nxm*.
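The steps above can be sketched compactly with NumPy. This is a minimal sketch rather than the article's own code: it assumes `np.linalg.eigh` (which is intended for symmetric matrices such as a covariance matrix), and the variable names are chosen only for illustration. Like the steps above, it projects the raw X without subtracting the column means first.

```python
import numpy as np

# Sample data; rows are observations, columns are features.
X = np.array([[11, 2, 25], [12, 20, 31], [5, 6, 7], [200, 10, 22]], dtype=float)

# Covariance matrix (m x m) and its eigen-decomposition.
# eigh returns the eigenvectors as the columns of its second result.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort by eigenvalue, largest first; W keeps the eigenvectors as columns.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
W = eigenvectors[:, order]

# Smallest k whose cumulative eigenvalue share reaches the threshold.
threshold = 0.90
cumulative_share = np.cumsum(eigenvalues) / eigenvalues.sum()
k = min(int(np.searchsorted(cumulative_share, threshold)) + 1, len(eigenvalues))

# Full projection (n x m) and the reduced dataset (n x k).
Z = X @ W
Z_reduced = Z[:, :k]
print(k, Z.shape, Z_reduced.shape)
```

For this sample X the first column has by far the largest variance, so the first principal component alone clears the 0.90 threshold and `k` comes out as 1.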

**Is a PCA-projected dataset reversible to the original raw dataset?** **No, and yes.**

No, you cannot exactly reverse a dimension-reduced (column-reduced) dataset after PCA to the original raw dataset, specifically from Z*~nxk* to X*nxm*, because the information in the dropped columns is gone.

Yes, you can reverse a PCA-projected dataset whose columns have not been reduced to the original dataset, from Z*nxm* to X*nxm*.
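A small numerical check can illustrate both answers. The sketch below is only an illustration under stated assumptions: the names `full_back` and `reduced_back` are hypothetical, `k = 1` is picked by hand, and mapping the reduced dataset back with the transpose of the first *k* eigenvector columns is one common way to get an approximation, since an m x k matrix is not square and has no inverse.

```python
import numpy as np

X = np.array([[11, 2, 25], [12, 20, 31], [5, 6, 7], [200, 10, 22]], dtype=float)

# Eigenvectors of the covariance matrix, sorted by eigenvalue (largest first).
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
W = eigenvectors[:, np.argsort(eigenvalues)[::-1]]

k = 1                  # hand-picked number of retained components
Z = X @ W              # full projection, n x m
Z_reduced = Z[:, :k]   # reduced projection, n x k

# Full projection: W is square and invertible, so X is recovered.
full_back = Z @ np.linalg.inv(W)

# Reduced projection: only an approximation of X can be rebuilt.
reduced_back = Z_reduced @ W[:, :k].T

print(np.allclose(full_back, X))     # full projection matches X
print(np.allclose(reduced_back, X))  # reduced projection does not
```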

To reconstruct the original raw dataset from the PCA-projected dataset, simply reverse the matrix multiplication. Since Z*nxm* = X*nxm* multiplied by W*mxm*, it follows that X*nxm* = Z*nxm* multiplied by the inverse of W*mxm*.

1. Get the inverse of the matrix W*mxm* and assign the inverted matrix to invW*mxm*.

2. Multiply the PCA-projected dataset Z*nxm* by invW*mxm*.

3. The result of Z*nxm* multiplied by invW*mxm* is X*nxm*, the original raw dataset.

4. You have reconstructed the original raw dataset from the PCA-projected dataset.
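One detail worth noting about step 1: the eigenvectors of a symmetric matrix such as a covariance matrix can be chosen orthonormal, in which case the inverse of W is simply its transpose. The sketch below checks this numerically. It assumes W is built with `np.linalg.eigh`, which returns orthonormal eigenvectors as columns; the names `inv_W` and `restored` are illustrative.

```python
import numpy as np

X = np.array([[11, 2, 25], [12, 20, 31], [5, 6, 7], [200, 10, 22]], dtype=float)

# Orthonormal eigenvectors of the covariance matrix, sorted by eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
W = eigenvectors[:, np.argsort(eigenvalues)[::-1]]

# For an orthogonal matrix, the inverse and the transpose coincide.
inv_W = np.linalg.inv(W)
print(np.allclose(inv_W, W.T))

# So the reconstruction can use the transpose instead of an explicit inverse.
restored = (X @ W) @ W.T
print(np.allclose(restored, X))
```

Using the transpose avoids computing a matrix inverse, but it relies on W being orthogonal; for a W built some other way, the explicit inverse described in the steps above is the safer choice.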

The following code demonstrates this:

`import numpy as np`

#Original dataset X

X=np.array([[11,2,25],[12,20, 31],[5,6,7],[200,10,22]])

#Get the covariance matrix of X. When using numpy.cov,

#set rowvar=False so that each column is treated as a variable

covX=np.cov(X, rowvar=False)

#Get eigenvalues and eigenvectors from covX

#np.linalg.eig returns the eigenvectors as the columns of eigenvecs

eigenvals, eigenvecs = np.linalg.eig(covX)

#Pair each eigenvalue with its corresponding eigenvector (a column of eigenvecs)

pair = [(val, vec) for val, vec in zip(eigenvals, eigenvecs.T)]

#Sort the pairs by eigenvalue in descending order

sortedPair = sorted(pair, key=lambda x: abs(x[0]), reverse=True)

#Stack the sorted eigenvectors as the columns of W, keeping the sort order

W = np.column_stack([vec for _, vec in sortedPair])

def findTopPrincipalComponentNumber(eigenvals: list, threshold: float = 0.80) -> (int, float):

    '''Return the smallest k whose cumulative share meets the threshold, and that cumulative share'''

    total = sum(eigenvals)

    #Sort the eigenvalues in descending order and calculate the share

    #of the information in the raw dataset that each eigenvalue carries

    eachPercentList = [(i / total) for i in sorted(eigenvals, reverse=True)]

    cumulativePercent = 0

    cumulativePercentList = []

    for i in eachPercentList:

        cumulativePercent += i

        cumulativePercentList.append(cumulativePercent)

        #Stop as soon as the cumulative share reaches the threshold

        if cumulativePercent >= threshold:

            break

    return (len(cumulativePercentList), float(cumulativePercentList[-1]))

#Get the k that keeps at least 95% of the information in the raw dataset

bestK, coveredPercentage = findTopPrincipalComponentNumber(eigenvals, 0.95)

#pca projected dataset is X times W

Z=np.matmul(X, W)

#Take the first bestK columns of Z to create the dimension-reduced dataset pcaZ,

#which keeps at least 95% of the information in the original X

#(about 98% for this X, see coveredPercentage)

pcaZ = Z[:, :bestK]

#To reconstruct X exactly, use the full projection Z, not the reduced pcaZ

#Since Z = X times W, X = Z times inverse(W)

restoredX = np.matmul(Z, np.linalg.inv(W))

#Restored original dataset, that should match original raw dataset X

restoredX

'''

array([[ 11., 2., 25.],

[ 12., 20., 31.],

[ 5., 6., 7.],

[200., 10., 22.]])

'''

#Original raw dataset

X

'''

array([[ 11, 2, 25],

[ 12, 20, 31],

[ 5, 6, 7],

[200, 10, 22]])

'''

Thank you for taking the time to read this.