Principal Component Analysis (PCA)

Lecture Notes on 9-10-2023
Author
Published

October 11, 2023


1 Continuous latent space

The idea of latent variables is that the data can be described in a low-dimensional latent space, which is then somehow mapped to the high-dimensional observation space.

2 Principal Component Analysis (PCA)

Imagine the bottom-right plot, but in more than two dimensions. We can compute a mean and a covariance matrix to fit this data.

Now pick a single line and fit the data in one dimension, i.e. with a Gaussian whose mean and covariance come from the original plot. This mapping depends on the direction of the line: if we instead take the green line below, the resulting Gaussian will look different.


  • So we want to find the direction that maximizes the variance of the projected data. This is important because if the variance of the projection is low, many points collapse onto (almost) the same location and may look like a single point when in reality there are many.

3 Recall orthonormal projections

  • The span of a vector, and how to project a point onto it (see the reminder below).
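
As a quick reminder of the standard result: projecting a point \(\mathbf{x}\) onto the span of a unit vector \(\mathbf{u}\) gives

\[
z = \mathbf{u}^\top \mathbf{x}, \qquad
\operatorname{proj}_{\mathbf{u}}(\mathbf{x}) = z\,\mathbf{u} = (\mathbf{u}^\top \mathbf{x})\,\mathbf{u},
\qquad \|\mathbf{u}\| = 1 .
\]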

4 1D Projection

Remarks:

  • Here \(z_1\) is the projection of the point into the 1-dimensional space; if we wanted more than one dimension, \(z_1\) would become a vector \(\mathbf{z}_1\) of dimension \(M\).

5 Maximizing the variance of 1 component

The principal components are the vectors \(u_1\), i.e. the lines (directions) onto which we project.


5.1 PCA via maximum variance
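
A brief sketch of the standard maximum-variance derivation, where \(S\) denotes the sample covariance of the data and \(u_1\) the first principal direction (unit norm):

\[
\max_{u_1}\; u_1^\top S u_1 \quad \text{s.t.} \quad u_1^\top u_1 = 1
\;\;\Longrightarrow\;\;
\frac{\partial}{\partial u_1}\!\left[\, u_1^\top S u_1 + \lambda_1\,(1 - u_1^\top u_1)\,\right] = 0
\;\;\Longrightarrow\;\;
S u_1 = \lambda_1 u_1 .
\]

So the variance of the projection, \(u_1^\top S u_1 = \lambda_1\), is maximized by choosing \(u_1\) as the eigenvector of \(S\) with the largest eigenvalue.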


5.2 Reminder: eigendecomposition


Remarks:

  • The total variance of our data points in the new space can be computed by summing the eigenvalues.
  • We should think of \(U\) as a change of basis from the \(D\)-dimensional space to the new space. The dimensionality of this new space is determined by how many eigenvectors we choose to keep at the end.
  • Because the eigenvectors are orthonormal, applying the change of basis with \(U\) decorrelates our data (see the sketch below).
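
A minimal sketch of why the change of basis decorrelates the data, assuming the eigendecomposition \(S = U \Lambda U^\top\) with \(U^\top U = I\): defining \(y = U^\top (x - \bar{x})\), we get

\[
\operatorname{Cov}[y] = U^\top S\, U = U^\top U \Lambda U^\top U = \Lambda ,
\]

which is diagonal, so the transformed features are uncorrelated, and the total variance \(\operatorname{tr}(\Lambda) = \sum_i \lambda_i\) is exactly the sum of the eigenvalues.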

5.3 How to choose M?
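
A common criterion (a standard heuristic; the threshold \(\tau\) is an assumed hyperparameter, not fixed by the lecture): keep the smallest \(M\) whose components explain a desired fraction of the total variance,

\[
M = \min\left\{ m \;:\; \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{D} \lambda_i} \geq \tau \right\},
\qquad \text{e.g. } \tau = 0.95 .
\]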


5.4 Feature Decorrelation


5.5 Applications: Whitening
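
A minimal NumPy sketch of PCA whitening, assuming the data matrix `X` has shape `(N, D)`; the function name `pca_whiten` and the `eps` stabilizer are illustrative additions, not from the lecture:

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Decorrelate the features of X and rescale them to unit variance."""
    X_centered = X - X.mean(axis=0)           # subtract the mean
    S = np.cov(X_centered, rowvar=False)      # sample covariance (D x D)
    eigvals, U = np.linalg.eigh(S)            # eigendecomposition S = U diag(eigvals) U^T
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    eigvals, U = eigvals[order], U[:, order]
    Z = X_centered @ U                        # rotate: decorrelated features
    return Z / np.sqrt(eigvals + eps)         # rescale: unit variance per feature

# The whitened data has (approximately) identity covariance.
X = np.random.randn(500, 3) @ np.array([[2.0, 0.0, 0.0],
                                        [0.5, 1.0, 0.0],
                                        [0.0, 0.3, 0.2]])
print(np.round(np.cov(pca_whiten(X), rowvar=False), 2))
```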


6 Probabilistic PCA

Recall the difference between discrete and continuous latent variables:
  • Discrete latent models: k-means, Gaussian Mixture models
  • Continuous latent models: probabilistic PCA, unsupervised regression

If k-means can be seen as a latent classifier, then its continuous counterpart can be thought of as a form of unsupervised regression.

Similarly, the continuous counterpart of the Gaussian mixture model is the probabilistic PCA (PPCA) model.

Probabilistic view of PCA:

  • Learn it via maximum likelihood
  • (Third) alternative view of PCA
  • Both latent and observed variables are Gaussian

6.1 Continuous latent variable model


  • We assume that there is a latent space from which we can sample a point \(z\).
  • Then we have the conditional \(p(x \mid z)\): given \(z\), what is the probability that the observation \(x\) lands where it does?

6.2 PPCA modelling assumptions


Recall that \(z\) is the latent variable, the hidden variable that generates the observation \(x\). Remember the analogy: thoughts –> words.

  • We assume that \(x\) is formed by a linear combination of the latent variable \(z\) with \(W\), \(\mu\) and the noise \(\epsilon\) (see the equations below).
    • Here we are saying that there is a linear relation between \(z\) and \(x\).
  • Here, \(W\), \(\mu\) and the noise variance \(\sigma^2\) are the parameters that we want to recover.
  • Because \(z\) is Gaussian and the noise is also modelled as Gaussian, \(x\) will also be Gaussian.
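
For reference, the standard PPCA modelling assumptions written out as equations:

\[
p(z) = \mathcal{N}(z \mid 0, I), \qquad
x = W z + \mu + \epsilon, \qquad
\epsilon \sim \mathcal{N}(0, \sigma^2 I), \qquad
p(x \mid z) = \mathcal{N}(x \mid W z + \mu,\; \sigma^2 I).
\]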

6.3 It follows


  • In the covariance computation we can pull \(W\) out of the expectation because it is a (deterministic) parameter, not a random variable (see the derivation below).
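
A minimal sketch of the marginal of \(x\), using \(\mathbb{E}[zz^\top] = I\), \(\mathbb{E}[\epsilon\epsilon^\top] = \sigma^2 I\) and the independence of \(z\) and \(\epsilon\):

\[
\mathbb{E}[x] = \mu, \qquad
\operatorname{Cov}[x] = \mathbb{E}\!\left[(Wz + \epsilon)(Wz + \epsilon)^\top\right]
= W\,\mathbb{E}[zz^\top]\,W^\top + \mathbb{E}[\epsilon\epsilon^\top]
= W W^\top + \sigma^2 I,
\]

so \(p(x) = \mathcal{N}(x \mid \mu,\; C)\) with \(C = W W^\top + \sigma^2 I\).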

6.4 Probabilistic PCA in a picture


From here we can find the parameters by maximum likelihood estimation (MLE).

6.5 The log-likelihood
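
For reference, the PPCA log-likelihood takes the standard form, with \(C = WW^\top + \sigma^2 I\), \(\mu\) set to the sample mean \(\bar{x}\), and \(S\) the sample covariance of the data \(X = \{x_n\}_{n=1}^{N}\):

\[
\ln p(X \mid \mu, W, \sigma^2)
= \sum_{n=1}^{N} \ln \mathcal{N}(x_n \mid \mu, C)
= -\frac{N}{2}\Big( D \ln(2\pi) + \ln|C| + \operatorname{tr}\!\big(C^{-1} S\big) \Big).
\]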


6.6 PPCA has closed-form solutions
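
The standard closed-form maximum-likelihood solutions (Tipping & Bishop), where \(U_M\) holds the top \(M\) eigenvectors of \(S\), \(\Lambda_M\) the corresponding eigenvalues, and \(R\) is an arbitrary \(M \times M\) rotation matrix:

\[
\mu_{\text{ML}} = \bar{x}, \qquad
W_{\text{ML}} = U_M \left( \Lambda_M - \sigma^2 I \right)^{1/2} R, \qquad
\sigma^2_{\text{ML}} = \frac{1}{D - M} \sum_{i=M+1}^{D} \lambda_i .
\]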


6.7 PPCA Summary


Three views:

  • Max variance, min reconstruction error, probabilistic

Applications

  • Dimensionality reduction
  • 2D/3D visualization
  • Compression
  • Whitening (de-correlating features)
  • (not mentioned) De-noising: discard the smallest variance features = the noise components (hopefully!)

Limitations:

  • Only linear transformations

6.8 Comparing PCA & PPCA

Conventional PCA can be expressed as the maximum-likelihood solution of probabilistic PCA (in the limit \(\sigma^2 \to 0\)).

Advantages of the probabilistic PCA over the conventional PCA:

  • We can associate a likelihood function to the probabilistic PCA which allows a direct comparison with other probabilistic density models
  • Probabilistic PCA can be used to model class-conditional densities and can thus be used in classification problems
  • We can run the model generatively to provide samples from the modeled distribution.

7 Non Linear PCA

Find the subspace that maximizes the variance of the projection or minimizes the reconstruction error

By linear we mean that all the data clusters around a single straight contour line, like the green line. The question is: can we do non-linear PCA when the data instead follows something like the red line, which is clearly non-linear?

8 PCA using basis functions

The question is how to map this non-linear data to a linear subspace using basis functions. We can craft the basis functions in two ways:

  1. Using neural networks, for instance \(\mathbf{\phi}(\textbf{x}) = NN(\textbf{x})\), where \(\textbf{x}\) is the input vector.
  2. Using kernel methods, where we can do the mapping from the original space to the new space without explicitly modelling the basis functions.

We will first sketch the structure of an autoencoder

9 Auto-encoders (auto-associative neural nets)


If \(f(x) = W^\top x\) is a linear projection from \(D\) to \(M\) dimensions, then this resembles PCA, where we have \(z = U_M^\top(x - \bar{x})\). So if the function \(f\) has this linear form, we have a linear projection that transforms our points into a lower-dimensional space.

10 Autoencoder objective


  • The mapping from the input through the encoder gives what we call the latent \(z_n\). The latter acts like a compression.
  • Before, we would project the data onto our principal components; now we let the neural network learn what the latent mapping should be.
  • When we build this encoder and decoder without activation functions, we end up with the PCA structure.
  • If we do use activation functions and more than two layers, we have a non-linear model with no closed-form solution, so we train it with SGD; the error is then non-convex. A minimal sketch is given below.
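
A minimal PyTorch sketch of the autoencoder objective described above; the layer sizes (784, 128, 32) and the dummy batch are illustrative assumptions, not from the lecture. Dropping the ReLU activations and the hidden layers would reduce this to the linear (PCA-like) case.

```python
import torch
import torch.nn as nn

D, M = 784, 32  # input and latent dimensions (illustrative choices)

encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, M))
decoder = nn.Sequential(nn.Linear(M, 128), nn.ReLU(), nn.Linear(128, D))

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3)
loss_fn = nn.MSELoss()  # reconstruction error ||x - decoder(encoder(x))||^2

x = torch.randn(64, D)      # dummy batch standing in for real data
for _ in range(10):         # a few SGD steps on the non-convex objective
    z = encoder(x)          # latent code z_n (the "compression")
    x_hat = decoder(z)      # reconstruction of the input
    loss = loss_fn(x_hat, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```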

11 Autoencoder as generator


12 Kernel PCA

We are used to expressing things in terms of feature vectors, i.e. the latent is obtained by taking the feature vectors and projecting them onto a lower-dimensional space.

Now we are going to express the result in terms of the kernel \(k(\textbf{x}, \textbf{x}_n)\). The result is that the projection is written purely in terms of the other data points via the kernel, rather than through explicit basis functions or weight parameters (see the sketch below).

This is another way to do non-linear PCA.
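
For reference, a sketch of the standard kernel PCA result, assuming the kernel matrix \(K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)\) has been centred in feature space. The coefficient vectors \(\mathbf{a}_i\) solve the eigenproblem

\[
K \mathbf{a}_i = \lambda_i N\, \mathbf{a}_i ,
\]

and the projection of a point \(\mathbf{x}\) onto the \(i\)-th non-linear component is expressed purely through kernel evaluations against the data points:

\[
z_i(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{v}_i = \sum_{n=1}^{N} a_{in}\, k(\mathbf{x}, \mathbf{x}_n).
\]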