Online Kernel PCA

 

Abstract


We describe a new method for performing kernel principal component analysis (KPCA) that is online and has a fast convergence rate. The method uses the Rayleigh quotient to obtain a fixed-point update rule that extracts the leading eigenvalue and eigenvector. Online deflation is used to estimate the remaining components. These operations are performed in a reproducing kernel Hilbert space (RKHS) with memory and computation complexity that grow only linearly with the number of samples.

 

Background


Principal component analysis

Given a set of N centered observations \mathbf{x}_k \in \mathbb{R}^l,\;k=1,\dots,N, such that \sum_{k=1}^{N}\mathbf{x}_k = 0, the covariance matrix is \mathbf{C} = \mathrm{E}\left[\mathbf{x}\mathbf{x}^{\mathrm{T}}\right] and its estimate is \hat{\mathbf{C}} = \frac{1}{N}\sum\limits_{k=1}^{N}\mathbf{x}_k\mathbf{x}_k^{\mathrm{T}}. PCA diagonalizes the covariance matrix by solving the eigenvalue problem \hat{\mathbf{C}}\mathbf{u} = \lambda \mathbf{u} for eigenvalues \lambda \geq 0 and their corresponding eigenvectors \mathbf{u} \in \mathbb{R}^l\setminus \{0\}.
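
For reference, the batch solution is a plain eigendecomposition of the sample covariance. A minimal NumPy sketch is given below; the function and variable names are ours and purely illustrative.

```python
import numpy as np

# Minimal batch PCA sketch on centered data X of shape (N, l).
def batch_pca(X, n_components=2):
    N = X.shape[0]
    C_hat = (X.T @ X) / N                     # sample covariance (1/N) sum x_k x_k^T
    eigvals, eigvecs = np.linalg.eigh(C_hat)  # symmetric eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]         # reorder by decreasing eigenvalue
    return eigvals[order][:n_components], eigvecs[:, order[:n_components]]
```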

 

Rayleigh quotient algorithm (RQA)

An important result from the Rayleigh-Ritz theorem is that to compute the largest eigenvalue and its corresponding eigenvector, one only needs to maximize the Rayleigh quotient R_{\mathbf{C}}(\mathbf{u}) = \frac{\mathbf{u}^{\mathrm{T}}\mathbf{C}\mathbf{u}}{\mathbf{u}^{\mathrm{T}}\mathbf{u}} with respect to \mathbf{u}. The maximum of the quotient, R_{\mathbf{C}}(\mathbf{u}), is the largest eigenvalue. Setting the derivative of the quotient with respect to \mathbf{u} to zero and using the normalization \mathbf{u}^{\mathrm{T}}\mathbf{u} = 1, which is valid for eigenvectors, results in the fixed-point relation \mathbf{u} = \frac{\mathbf{C}\mathbf{u}}{\mathbf{u}^{\mathrm{T}}\mathbf{C}\mathbf{u}}. To solve the PCA problem online, we estimate the eigenvector at sample n [1] as \mathbf{u}(n) = \frac{\hat{\mathbf{C}}(n)\mathbf{u}(n-1)}{\mathbf{u}(n-1)^{\mathrm{T}}\hat{\mathbf{C}}(n)\mathbf{u}(n-1)}, where \hat{\mathbf{C}}(n) is the estimate of the covariance matrix after the n^{th} observation and \mathbf{u}(n-1) is the estimate of the eigenvector at observation n-1. Replacing the estimate of the covariance matrix results in \mathbf{u}(n) = \sum\limits_{k=1}^{n}\left(\mathbf{x}_k^{\mathrm{T}}\mathbf{u}(k-1)\right)\mathbf{x}_k / \sum\limits_{k=1}^{n}\left(\mathbf{x}_k^{\mathrm{T}}\mathbf{u}(k-1)\right)^2. The online version simplifies to

\mathbf{u}(n) = \frac{v(n-1)}{v(n)}\mathbf{u}(n-1)+\frac{\mathbf{x}_n^{\mathrm{T}}\mathbf{u}(n-1)}{v(n)}\mathbf{x}_n

where v(n) = v(n-1)+\left(\mathbf{x}_n^{\mathrm{T}}\mathbf{u}(n-1)\right)^2, providing a simple rule for updating the eigenvector as more samples are introduced.
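
As an illustration, a minimal NumPy sketch of this online update could look as follows; the function name, the synthetic data, and the initialization of \mathbf{u}(0) and v(0) are assumptions made for the example, not part of the derivation.

```python
import numpy as np

# One step of the online Rayleigh-quotient update for the leading eigenvector.
def rqa_update(u, v, x):
    proj = x @ u                       # x_n^T u(n-1)
    v_new = v + proj ** 2              # v(n) = v(n-1) + (x_n^T u(n-1))^2
    u_new = (v / v_new) * u + (proj / v_new) * x
    return u_new, v_new

# Usage sketch: stream centered samples through the update.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ np.diag([3.0, 1.0, 0.5, 0.2, 0.1])
X -= X.mean(axis=0)
u, v = X[0] / np.linalg.norm(X[0]), 1e-8   # assumed initialization
for x in X:
    u, v = rqa_update(u, v, x)
# After the pass, u approximates the leading eigenvector of the sample covariance.
```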

 

Kernel principal component analysis

Since PCA fails to capture the structure of highly nonlinear datasets, KPCA was introduced as a nonlinear extension to PCA. KPCA first transforms the data into a feature space \mathcal{F} of much higher dimension and then computes the PCA of the data in that space. The inner product between the transformed data in the feature space is evaluated using the kernel function \kappa(\mathbf{x}_i,\mathbf{x}_j) = \langle \phi(\mathbf{x}_i),\phi(\mathbf{x}_j)\rangle, where \phi(\cdot) is the nonlinear mapping from the input space \mathbb{R}^l into the feature Hilbert space \mathcal{F}, \phi:\mathbb{R}^l \mapsto \mathcal{F}. The kernel function allows us to compute the inner product between two points in feature space without having to know the mapping \phi(\cdot) explicitly.

Since PCA can be written in terms of inner products, it can be implicitly computed in feature space. Assuming that the transformed data in feature space are also centered, the covariance matrix is estimated as \hat{\mathbf{C}} = \frac{1}{N}\sum\limits_{k=1}^{N}\phi(\mathbf{x}_k)\phi(\mathbf{x}_k)^{\mathrm{T}}. To perform PCA, since the eigenvectors \mathbf{w} lie within the span of \left\{ \phi(\mathbf{x}_i)\right\}_{i=1}^N, the following equivalent problem is considered: \phi(\mathbf{X})^{\mathrm{T}}\hat{\mathbf{C}}\mathbf{w} = \lambda \phi(\mathbf{X})^{\mathrm{T}}\mathbf{w}, where \mathbf{X} = \left(\mathbf{x}_1,\cdots,\mathbf{x}_N\right) and \phi(\mathbf{X}) = \left(\phi(\mathbf{x}_1),\cdots,\phi(\mathbf{x}_N)\right), and \mathbf{w} can be represented in terms of an N-dimensional vector \mathbf{q} as \mathbf{w} = \phi(\mathbf{X})\mathbf{q}. Defining the N \times N Gram matrix \mathbf{K} = \phi(\mathbf{X})^{\mathrm{T}}\phi(\mathbf{X}) and combining the two equations above, we get \mathbf{K}\mathbf{q} = N\lambda\mathbf{q}, which provides the eigenvalues and eigenvectors of the data in feature space. The main problem with KPCA is that the Gram matrix grows with the number of observations, which makes it computationally infeasible to solve directly and, in particular, online.
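
For reference, a minimal batch KPCA sketch is shown below; the choice of a homogeneous second-order polynomial kernel, the centering step, and all names are ours for the example, not a prescribed implementation.

```python
import numpy as np

# Minimal batch KPCA sketch: solve K q = N*lambda*q on the centered Gram matrix.
def kpca(X, n_components=3, degree=2):
    N = X.shape[0]
    K = (X @ X.T) ** degree                    # polynomial kernel Gram matrix
    one = np.ones((N, N)) / N
    K = K - one @ K - K @ one + one @ K @ one  # center the data in feature space
    eigvals, Q = np.linalg.eigh(K)             # eigenvalues of K in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lambdas = eigvals[order] / N               # eigenvalues of the covariance in F
    Q = Q[:, order] / np.sqrt(eigvals[order])  # scale coefficients so that w^T w = 1
    return lambdas, Q                          # project a new sample via its kernel vector times Q
```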

 

Kernel Rayleigh quotient algorithm (KRQA)


If we consider the observations transformed into the feature space, the leading eigenvector is computed as:

\mathbf{w}(n) = \frac{\sum\limits_{k=1}^{n}\langle \phi(\mathbf{x}_k),\mathbf{w}(k-1) \rangle \phi(\mathbf{x}_k)}{\sum\limits_{k=1}^{n}\langle \phi(\mathbf{x}_k),\mathbf{w}(k-1) \rangle^2}

where \mathbf{w}(k) = \sum_{j=1}^{k}\alpha_j^k\phi(\mathbf{x}_j), since all nonzero solutions lie within the span of \{\phi(\mathbf{x}_i)\}_{i=1}^{N}. Using the reproducing kernel property, we have

\mathbf{w}(n) = \sum\limits_{k=1}^{n}\left(\frac{\sum\limits_{j=1}^{k-1}\alpha_j^{k-1}\kappa(\mathbf{x}_j,\mathbf{x}_k)}{\sum\limits_{i=1}^{n}\left(\sum\limits_{j=1}^{i-1}\alpha_j^{i-1}\kappa(\mathbf{x}_j,\mathbf{x}_i)\right)^2}\right)\phi(\mathbf{x}_k).

Expanding further, we have a similar update rule as in the input space

\mathbf{w}(n) = \frac{v(n-1)}{v(n)}\mathbf{w}(n-1)+\frac{\sum\limits_{j=1}^{n-1}\alpha_j^{n-1}\kappa(\mathbf{x}_j,\mathbf{x}_n)}{v(n)}\phi(\mathbf{x}_n),

where v(n) = v(n-1)+\left(\sum_{j=1}^{n-1}\alpha_j^{n-1}\kappa(\mathbf{x}_j,\mathbf{x}_n)\right)^2. The only difference is that the eigenvector in feature space is never available explicitly; it is only represented as an expansion over the previous observations, so (n-1) kernel evaluations are required to compute the coefficient of the new observation.
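
A minimal sketch of this kernelized update is given below, where \mathbf{w}(n) is carried implicitly through the expansion coefficients \alpha and the stored samples. The function names, the initialization of the first coefficient and of v, and the example kernel are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# One KRQA step for the leading eigenvector, represented as
# w(n) = sum_j alphas[j] * phi(X_seen[j]).
def krqa_update(alphas, X_seen, v, x_new, kernel):
    if not X_seen:                      # assumed initialization: w(1) = phi(x_1), v(1) = 1
        return [1.0], [x_new], 1.0
    # proj = <phi(x_n), w(n-1)> = sum_j alpha_j^{n-1} kappa(x_j, x_n)
    proj = sum(a * kernel(xj, x_new) for a, xj in zip(alphas, X_seen))
    v_new = v + proj ** 2               # v(n) = v(n-1) + proj^2
    alphas = [a * v / v_new for a in alphas]   # rescale old coefficients by v(n-1)/v(n)
    alphas.append(proj / v_new)                # coefficient of the new sample
    X_seen.append(x_new)
    return alphas, X_seen, v_new

# Usage sketch with a second-order polynomial kernel, as in the toy example.
poly2 = lambda a, b: np.dot(a, b) ** 2
alphas, X_seen, v = [], [], 0.0
for x in np.random.default_rng(1).standard_normal((200, 2)):
    alphas, X_seen, v = krqa_update(alphas, X_seen, v, x, poly2)
```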

 

Ensuring the observations are centered

The derivations above assume that the data are centered in \mathcal{F}, which is not generally true. Therefore, the mean needs to be subtracted from each sample. The observation \phi(\mathbf{x}_j) is then replaced with \tilde{\phi}(\mathbf{x}_j) = \phi(\mathbf{x}_j) - \frac{1}{n}\sum_{k=1}^{n}\phi(\mathbf{x}_k). The update rule above is extended to include subtraction of the mean from each pair of samples, replacing \kappa(\mathbf{x}_j,\mathbf{x}_n) with \kappa(\mathbf{x}_j,\mathbf{x}_n) - \overline{\kappa}(\mathbf{x}_j) - \overline{\kappa}(\mathbf{x}_n) + \overline{\kappa}, where \overline{\kappa}(\mathbf{x}_j) = (1/n)\sum_{l=1}^{n}\kappa(\mathbf{x}_j,\mathbf{x}_l) and \overline{\kappa} = (1/n)\sum_{k=1}^{n}\overline{\kappa}(\mathbf{x}_k).
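
The centering correction is a plain combination of kernel averages. In the sketch below it is computed from an explicit n x n Gram matrix for clarity; an online implementation would instead maintain the running averages \overline{\kappa}(\mathbf{x}_j) and \overline{\kappa} so that memory stays linear in n. Names are ours for the example.

```python
import numpy as np

# Centered kernel value: kappa~(x_j, x_n) = k(x_j,x_n) - kbar(x_j) - kbar(x_n) + kbar.
def centered_kernel(K, j, n_idx):
    kbar_rows = K.mean(axis=1)      # kbar(x_j) = (1/n) sum_l k(x_j, x_l)
    kbar = kbar_rows.mean()         # kbar = (1/n) sum_k kbar(x_k)
    return K[j, n_idx] - kbar_rows[j] - kbar_rows[n_idx] + kbar
```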

 

Rule for minor components

The minor eigencomponents can be estimated using the deflation technique. The largest eigencomponent is calculated over the whole space as described above. The next largest eigencomponent is calculated by restricting attention to the subspace that is the orthogonal complement of the first eigenvector. The covariance matrix becomes \mathbf{C}' = (\mathbf{I} - \mathbf{u}_1\mathbf{u}_1^{\mathrm{T}})\mathbf{C}. The observations are also restricted to this subspace, \mathbf{x}_k' = \mathbf{x}_k - \mathbf{u}_1(\mathbf{u}_1^{\mathrm{T}}\mathbf{x}_k). The update equations remain the same except that the observations \mathbf{x}_k are replaced with the new observations \mathbf{x}_k', from which the projection onto the first eigenvector has been removed. The same process is then followed for the remaining eigencomponents, removing the projections onto all previously estimated eigenvectors. Keep in mind that while updates in input space are very simple and previous values can be reused to estimate future ones, this is not the case in feature space, where, as mentioned above, the exact transformation is not known and the actual vectors cannot be stored. While in the input space the computation complexity per sample remains \mathcal{O}(1), in the feature space the complexity increases linearly, \mathcal{O}(N), in contrast to the direct solution, \mathcal{O}(N^2).
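
In input space the deflation step is a one-line projection removal, as in the sketch below (names are ours). In feature space the same subtraction is carried out implicitly on the expansion coefficients and kernel evaluations, since \phi(\mathbf{x}_k) cannot be stored explicitly.

```python
import numpy as np

# Remove the components of x along previously extracted (unit-norm) eigenvectors,
# so the next RQA update estimates the following eigencomponent.
def deflate(x, eigvecs):
    for u in eigvecs:
        x = x - u * (u @ x)      # x' = x - u (u^T x)
    return x
```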

 

Experiments and Results


Toy Example

To demonstrate the capability of kernel RQA (KRQA), we compare it against KPCA and the kernel Generalized Hebbian Algorithm (KGHA) on a toy example. The data are generated as follows: the x-values are uniformly distributed in [-1,1], and the y-values are generated from y_i = x_i^2 + \xi, where \xi is Gaussian noise with standard deviation 0.2. Two hundred samples were used in the iterative method (KGHA) until convergence was reached. The same 200 samples were used in our online update algorithm, but they were repeated 300 times (since a single pass over 200 samples is not enough). In both cases, the kernel used was a second-order polynomial. The figure below shows part of the dataset and the three eigenvectors representing the data.

The results of 100 Monte Carlo runs are shown in the table below, where we compare the accuracy of our method, KRQA, against the iterative method KGHA and the exact eigenvalues obtained from KPCA. The table shows that KRQA is far more accurate than KGHA. In addition, KRQA converges much faster, needing only 300 iterations compared to the 7752 required for KGHA. Recall that in both KGHA and KRQA, the time and memory complexity per iteration are \mathcal{O}(len) and \mathcal{O}((l+e)n), where e is the number of principal components, n is the current number of samples, and l is the input space dimensionality.

             KPCA        KGHA          KRQA
Iterations   -           7752          300
λ1           0.32198     ±2.5×10^-3    ±1.0×10^-5
λ2           0.16891     ±1.9×10^-2    ±3.0×10^-5
λ3           0.02044     ±3.9×10^-4    ±1.0×10^-4

 

Noise Reduction

We test KRQA on noise reduction for data collected incrementally. The data are sampled from a modulated sine wave with Gaussian noise of standard deviation 0.1. The system was trained on 5000 samples presented only once online, and then tested on the set shown in the figure below. A Gaussian kernel with kernel size 0.2 was used to train the system. Three principal components were extracted and used to reconstruct the data samples. Since the manipulations occur in feature space, a preimaging method, \phi^{-1}(\cdot), was used to map the data back from the feature space to the input space. The figure below shows the training results, where the data samples are in blue, the data transferred to feature space and retrieved, \phi^{-1}(\phi(\mathbf{x}_k)), are in green, and the reconstructed samples are in red. The figure shows that the reconstruction removes most of the noise in the signal with only minor errors.

 

Data Reconstruction

We test the method on reconstructing data using only a small number of principal components. The system was trained with a Gaussian kernel of kernel size 7.5. Handwritten digits were introduced sequentially and the first ten principal components were updated online. The figure below shows the original testing digits and the reconstruction results after 250, 500, 750, and 1000 images per digit were introduced to the system. It is important to emphasize that the system was trained online and the training images were presented only once. The MSE results show how the system continues to improve as more samples are introduced.

The table below shows the classification error rates of KRQA and KPCA using least-squares regression in a one-versus-all configuration, for different numbers of extracted principal components. As expected, the performance improves with the number of PCs extracted. In addition, KRQA performs close to KPCA.

Number of PCs   16      32      64
KPCA            0.251   0.172   0.124
KRQA            0.261   0.197   0.147