# Content Addressable Memories in RKHS

## Abstract

We implement content addressable memories (CAMs) in a reproducing kernel Hilbert space (RKHS) to increase the amount of information they can store. CAMs are one of the few technologies that can store and retrieve information based on content. Even more useful is their ability to recall data from noisy or incomplete inputs. However, the input data dimension limits the amount of data that CAMs can store and successfully retrieve. We lift this limitation by implementing CAMs in an RKHS, where the input dimension is practically infinite.

## Background – CAMs

Content addressable memories follow the principle of human memory, where learning is done through association: events are linked to one another in such a way that the occurrence of one event, the stimulus, triggers the emergence of another event, the response. This association is strengthened over time by the repeated triggering of the response by the stimulus, a learning process known as Hebb's rule. The connections between the patterns are stored in a matrix ${\bf W}$, computed using the outer product rule ${\bf W} = \eta \cdot {\bf d} \cdot {\bf x}^{T}$. The output is retrieved by multiplying the connection matrix with the input vector as follows:

${\bf d}_{r} = {\bf W} \cdot {\bf x}_{r} = \sum^{N}_{i=1} {\bf d}_{i} \cdot {\bf x}^{T}_{i} \cdot {\bf x}_{r} = \sum^{N}_{i=1} {\bf d}_{i} \cdot \left\langle {\bf x}_{i}, {\bf x}_{r} \right\rangle$.

CAMs work well when the vectors are orthogonal. For example, assume the input vectors ${\bf x}$ are normalized and orthogonal, then for every pair ${\bf x}_{i} \rightarrow {\bf d}_{i}$ of associations we have an associative matrix ${\bf W}_{i} = {\bf d}_{i} \cdot {\bf x}^{T}_{i}$. The overall matrix $\textbf{W}$ is then the sum of all these individual matrices ${\bf W} = \sum_{i}{\bf W}_{i}$. If we want to retrieve $\textbf{d}_{j}$ associated with $\textbf{x}_{j}$ we proceed as follows:

${\small {\bf W} \cdot {\bf x}_{j} = \sum_{i} {\bf W}_{i} \cdot {\bf x}_{j} = \sum_{k \neq j} {\bf W}_{k} \cdot {\bf x}_{j} + {\bf W}_{j} \cdot {\bf x}_{j} = \sum_{k \neq j} {\bf d}_{k} \cdot \underbrace{{\bf x}^{T}_{k} \cdot {\bf x}_{j}}_{=0} + {\bf d}_{j} \cdot \underbrace{{\bf x}^{T}_{j} \cdot {\bf x}_{j}}_{=1} = {\bf d}_{j} }$.

Since the input vectors are, in general, not orthogonal, there is potential for cross-talk among different association pairs. In addition, CAMs have limited capacity: the number of pairs that can be successfully stored in the connection matrix depends on the dimensionality of the state vectors. For example, if the data dimensionality is $N$, only $N$ orthogonal vector pairs can be stored without interference. This number decreases even further when the vectors are not orthogonal.
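As a toy illustration of the outer product rule and of exact recall under orthonormality, the following sketch stores four associations and retrieves one; the 4-dimensional data and the use of the standard basis as inputs are hypothetical choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Orthonormal inputs: the standard basis of R^4 (hypothetical toy data).
X = np.eye(4)                                       # rows are the inputs x_i
D = rng.integers(0, 2, size=(4, 4)).astype(float)   # rows are the outputs d_i

# Outer product (Hebbian) rule: W = sum_i d_i x_i^T
W = sum(np.outer(D[i], X[i]) for i in range(4))

# Retrieval: W x_j = d_j exactly, because the inputs are orthonormal.
recalled = W @ X[2]
print(np.allclose(recalled, D[2]))  # exact recall of the stored pattern
```

With non-orthogonal inputs the same retrieval would instead return $d_j$ plus cross-talk terms from the other pairs.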

## Error correction

To reduce cross-talk, an error correction mechanism can be incorporated into the formula. It follows the steepest descent approach and learns from the error between the desired vector and the output of the association matrix, ${\bf e} = {\bf d}_{k} - {\bf W} \cdot {\bf x}_{k}$, using the least mean square (LMS) algorithm, as shown in the following equation:

${\bf W}(t+1) = {\bf W}(t) + \mu \cdot \left[{\bf d}_{k} - {\bf W}(t) \cdot {\bf x}_{k} \right] \cdot {\bf x}^{T}_{k}$,

where ${\bf W}(0) = {\bf 0}$. This is a combination of Hebbian and anti-Hebbian rules. The anti-Hebbian term decorrelates the input and the system output, reducing the cross-talk and consequently improving the performance of the associative memory. However, re-training is required whenever a new pair is added, which means all previous pairs must be stored; this defeats the purpose of CAMs.
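The LMS update above can be sketched as follows; the data, dimensions, step size, and number of training sweeps are all hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

N, M = 16, 8                        # dimension and number of pairs (hypothetical)
X = rng.standard_normal((M, N))     # non-orthogonal input patterns x_k
X /= np.linalg.norm(X, axis=1, keepdims=True)
D = rng.standard_normal((M, N))     # desired output patterns d_k

# LMS / error-correction rule: W <- W + mu * (d_k - W x_k) x_k^T, W(0) = 0
W = np.zeros((N, N))
mu = 0.5
for _ in range(200):                # cycle repeatedly through the stored pairs
    for k in range(M):
        err = D[k] - W @ X[k]       # error between desired and recalled output
        W += mu * np.outer(err, X[k])

# After training, cross-talk is largely removed and recall is near-exact.
print(np.max(np.abs(W @ X.T - D.T)))
```

Note that every sweep revisits all stored pairs, which is exactly the re-training burden mentioned above.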

To increase the memory capacity and reduce the interference among the stored data, we introduce a memory model implemented in an RKHS, where the input dimension is practically infinite, effectively lifting all CAM limitations. Kernel methods transform the data from the input space into a feature space of usually much higher dimension. The inner product between the transformed data in the feature space is calculated using a kernel function as follows: let $\Phi(\cdot)$ represent the mapping from the input space $X$ into the feature Hilbert space $F$, $\Phi : X \rightarrow F$; then the kernel function is $\kappa({\bf x}_i, {\bf x}_j) = \left\langle \Phi({\bf x}_i),\Phi({\bf x}_j) \right\rangle$.

This follows from Mercer's theorem \cite{Aronszajn50}, which states that any non-negative definite kernel can be expanded as $\kappa({\bf x}_i, {\bf x}_j) = \sum^{\infty}_{k=1} \lambda_k \phi_k({\bf x}_i)\phi_k({\bf x}_j)$, where $\lambda_k$ and $\phi_k$ are the eigenvalues and eigenfunctions respectively, and the eigenvalues are non-negative. As a result, the mapping $\Phi({\bf x})$ can be represented as $\Phi({\bf x}) = \left[ \sqrt{\lambda_1}\phi_1({\bf x}), \sqrt{\lambda_2}\phi_2({\bf x}), \dots \right]^T$. This feature space can also be thought of as a reproducing kernel Hilbert space, since the span of the functions $\{\kappa(\cdot, {\bf x}) : {\bf x} \in X\}$ defines a unique functional Hilbert space, in which the nonlinear mapping from the input space can be defined as $\Phi({\bf x}) = \kappa(\cdot, {\bf x})$ such that $\left\langle \Phi({\bf x}_i),\Phi({\bf x}_j) \right\rangle = \left\langle \kappa(\cdot, {\bf x}_i), \kappa(\cdot, {\bf x}_j) \right\rangle = \kappa({\bf x}_i, {\bf x}_j)$. We use the Gaussian kernel because, in principle, it produces an infinite dimensional space. To retrieve the desired pattern from its corresponding input vector in the RKHS, we compute:

${\bf d}_r = \sum^{N}_{i=1} {\bf d}_i \cdot \Phi^T({\bf x}_i) \cdot \Phi({\bf x}_r) = \sum^{N}_{i=1} {\bf d}_i \cdot \kappa({\bf x}_i, {\bf x}_r)$,

where the retrieved output is the sum of all the stored output patterns, weighted by the closeness of the current stimulus to the stored input patterns. The transformation of the input patterns into the RKHS can be thought of as simply a feature extraction. The kernel function computes the inner product by implicitly mapping the data into the feature space, allowing us to obtain a nonlinear transformation in terms of inner products without knowing the exact mapping $\Phi(\cdot)$. Since the $\Phi(\cdot)$ transformation is unknown, the actual input-output pairs need to be stored. During retrieval, the kernel function between the stimulus and all the stored inputs is computed to decide which output vector ${\bf d}_i$ is the desired response.
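A minimal sketch of this retrieval step, assuming a Gaussian kernel; the stored pairs, dimensions, and kernel width are hypothetical:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """kappa(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# The mapping Phi is implicit, so the actual input-output pairs are stored.
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 10))   # stored inputs x_i (hypothetical data)
D = np.eye(5)                      # stored outputs d_i (one-hot for clarity)

def recall(x_r):
    # d_r = sum_i d_i * kappa(x_i, x_r): outputs weighted by kernel similarity
    weights = np.array([gaussian_kernel(x, x_r) for x in X])
    return weights @ D

# A noisy version of a stored input still recalls the right association.
noisy = X[3] + 0.1 * rng.standard_normal(10)
print(np.argmax(recall(noisy)))    # index of the strongest association
```

Because the Gaussian kernel decays quickly with distance, the stored pair closest to the stimulus dominates the sum.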

Given that all the association pairs need to be stored, this method usually requires more storage space than CAMs. For example, assume that $M$ vector pairs need to be stored, where the vector dimension is $N$ for both the input and the output. In the case of a content addressable memory, the connection matrix is $N \times N$, so the memory required is $N^2$ regardless of the number of pairs. In the kernel CAM case, $M$ pairs of two $N$-dimensional vectors need to be stored, so the memory required is $2MN$. Consequently, more storage space is needed whenever $M > \frac{N}{2}$. There are ways of reducing the number of pairs stored, such as KRQA to extract the principal components, or using surprise to identify the most important data samples.

## Experiments and Results

### Face Recognition

The PIE database was used to test the performance of the KCAM on image recall. The database includes 67 subjects under 21 different illumination conditions, captured both with and without background lighting. The figures below show examples of several subjects captured with the room lights switched off (left) and on (right). Figure \ref{fig:person1} shows subject 1 under all illumination variations for both light conditions.

#### Test retrieval capability of KCAM

We test the KCAM recall capability. Images are randomly introduced to the memory (one example per subject). Next, the memory is tested on its recall of the remaining 'not-seen' examples. The experiment was repeated 100 times, and the best, worst, and average recall are shown in the figures below for both light conditions. The figures show the recall capability of the memory as 1, 2, …, 10 images per subject are introduced. As expected, as more examples are introduced into the memory, its recall capability improves. Also, the results on the data where the room lights are on show better performance, as the faces are not obscured by the poor lighting conditions present when the lights are off.

#### Test retrieval with occlusion

We test the KCAM recall capability on images with occlusion, where the area near the eyes, rows 16–35 of the image, is removed. The figure below shows examples of images with the area occluded and the memory recall. On the left are two examples of correct recall, and on the right two examples of incorrect recall.

The experiment follows the same procedure as before: images are randomly introduced to memory, and the memory is tested on the 'not-seen' examples. The experiment was repeated 100 times, and the best, worst, and average recall are shown on the top row of the figure below for both light conditions. The KCAM associates the input image with images previously seen; however, the input image is missing a significant portion of its information ($31\%$), and its comparison to the images in memory may be misleading, since even this missing portion is used during the comparison. To lower the significance of this portion of the image during the association, the weight of the corresponding pixels in the images in memory was reduced to zero. The recall results after this modification are shown in the figure below. When this 'masking' is included in the recall, the results are very close to the recall without any occlusion.
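The masking modification might be sketched as follows, assuming hypothetical 64x64 images (chosen so that the 20 occluded rows make up roughly 31% of the pixels) and an illustrative kernel width:

```python
import numpy as np

def masked_gaussian_kernel(a, b, mask, sigma=10.0):
    # Zero-weight the occluded pixels so they contribute nothing to the
    # distance, and hence nothing to the kernel similarity.
    diff = (a - b) * mask
    return np.exp(-np.sum(diff ** 2) / (2 * sigma ** 2))

# Hypothetical 64x64 face images flattened into vectors.
h, w = 64, 64
rng = np.random.default_rng(3)
stored = rng.random((6, h * w))        # stored face images in memory
probe = stored[4].copy()

# Occlude rows 16-35 (the eye region, ~31% of pixels) in the probe.
mask = np.ones((h, w))
mask[15:35, :] = 0.0                   # rows 16-35, as a 0-indexed slice
mask = mask.ravel()
probe_occluded = probe * mask

# Apply the same mask to the stored images during the comparison.
sims = [masked_gaussian_kernel(s, probe_occluded, mask) for s in stored]
print(np.argmax(sims))                 # best association despite the occlusion
```

Without the mask, the zeroed eye region would be compared against real pixel values in memory and could pull the similarity toward the wrong subject.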

#### Test retrieval under the effect of noise

We test the KCAM recall capability on images in the presence of additive noise; salt and pepper noise was considered. The figure below shows the effect of this noise on a particular image as the noise density is increased from 0.05 to 0.5.
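A sketch of how such salt and pepper corruption could be generated; the helper `salt_and_pepper` and the image data are hypothetical, with the density parameter playing the role of the noise level swept in the experiment:

```python
import numpy as np

def salt_and_pepper(img, density, rng):
    """Corrupt a fraction `density` of the pixels: half set to 0 (pepper),
    half set to 1 (salt)."""
    noisy = img.copy()
    flat = noisy.ravel()               # view into the copy
    idx = rng.choice(flat.size, size=int(density * flat.size), replace=False)
    half = len(idx) // 2
    flat[idx[:half]] = 0.0             # pepper
    flat[idx[half:]] = 1.0             # salt
    return noisy

rng = np.random.default_rng(4)
img = rng.random((64, 64))             # hypothetical grayscale image in [0, 1)

# Sweep the noise density from 0.05 to 0.5, as in the experiment.
for d in np.arange(0.05, 0.55, 0.05):
    corrupted = salt_and_pepper(img, d, rng)
```

At each density, roughly that fraction of the pixels is replaced, which is what progressively degrades the kernel similarity to the stored images.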

The KCAM was assumed to have already observed 5 images per subject, giving an accuracy of 0.85 and 0.97 on the illumination and lights data respectively. Its recall capability is then tested over the full range of noise. The results are shown in the figure below.

### Handwritten Digit Recall

We test the recall capabilities of the KCAM on a handwritten digits database. We used a dataset composed of 5,000 images per digit: 4,000 images per digit to create the memory and the remaining 1,000 for the recall test. The figure below shows the percentage of correct recall versus the number of digits stored in the memory. During this test, the best association is chosen for the recall. The recall improves as the number of images stored in memory increases.

The figure below provides an example where the recall is based on selecting the best (closest) association to the test image, and also on the weighted average of all the associations. The figure shows that selecting either the best or the average association has its advantages and drawbacks. This can be seen in the reconstruction of the digits 8 and 9: 'best' is better for 8 but fails on 9, while 'average', although blurry for 8, works well on 9.