Surprise on Synthetic Data

The goal of these experiments is to demonstrate surprise on clustering/classification and outlier detection. The dataset used here is a popular one, comprising several clusters of varying shapes and sizes, as shown in the figure below. In addition, noise is added to the dataset as scattered samples and as small directional (horizontal or vertical) clusters that lie very close to, or are even connected to, the original clusters, making the identification process difficult.
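The original dataset itself is not reproduced here, but a comparable synthetic set can be sketched in a few lines of Python. Every shape, position, and sample count below is an assumption chosen only to mimic the description above:

    import numpy as np

    rng = np.random.default_rng(0)

    # Main clusters of varying shapes: an elongated oval, a round blob,
    # and a horizontal strip.
    oval  = rng.normal(loc=[0.0, 0.0], scale=[3.0, 1.0], size=(500, 2))
    blob  = rng.normal(loc=[8.0, 5.0], scale=[1.0, 1.0], size=(300, 2))
    strip = np.column_stack([rng.uniform(-4, 4, 200), rng.normal(8.0, 0.3, 200)])

    # Noise: uniform scatter plus a small vertical mini-cluster that
    # nearly touches the blob.
    scatter = rng.uniform(low=[-6, -4], high=[12, 11], size=(150, 2))
    mini    = np.column_stack([rng.normal(9.5, 0.1, 30), rng.uniform(2, 8, 30)])

    X = np.vstack([oval, blob, strip, scatter, mini])
    y = np.concatenate([np.ones(1000), np.zeros(180)])  # 1 = good data, 0 = outlier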

We analyze the outlier detection problem using the three fundamental learning approaches:

Supervised learning

The system has prior knowledge of both the good data and the outliers. Analogous to supervised classification, pre-labeled training data that represent the entire distribution well are provided, so that the system can generalize, as shown in the left figure below.

The figure on the right shows the results of the supervised approach. As expected, parts of the clusters that are not well represented in the training data are misclassified as outliers (e.g., the left parts of the big oval and of the top horizontal strip). In addition, there are many other sporadic misclassifications in cluster regions where the training data were sparse.
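As a minimal sketch of this setting, the snippet below trains a generic off-the-shelf classifier (a k-nearest-neighbor model, standing in for the surprise-based detector, which is not reproduced here) on a pre-labeled split of the data. Because the model is frozen after training, under-represented regions are misclassified exactly as described:

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # X, y come from the dataset sketch above (1 = good data, 0 = outlier).
    # A deliberately small training split mimics a training set that does
    # not represent every part of the distribution equally well.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.3, random_state=0)

    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    pred = clf.predict(X_test)  # frozen model: no adaptation to new samples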

Semi-supervised learning

The system has prior knowledge of the clusters, as above, but it is also allowed to adapt and learn online. As new samples arrive, the model is adjusted to incorporate the information they carry. The system identifies the boundary between good data and outliers on its own and continues to update it as new samples arrive. The figure below on the left shows the initial training data. Even though there are fewer samples than in the supervised case, this lack of initial samples does not lower the performance of the system, since the model updates online, as long as the initial samples cover the entire distribution.

The figure on the right shows the results of the semi-supervised approach. Despite the small number of initial samples, the results of semi-supervised learning are very good, and in fact better than those of the supervised approach. This is because the system was able to learn from the new samples and adapt its model to integrate the new information.
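A minimal sketch of such an online loop follows, assuming surprise can be proxied by the distance to the nearest sample already accepted as good (the actual surprise measure is not specified here); the threshold value is an illustrative assumption:

    import numpy as np

    def online_outlier_detection(initial_good, stream, threshold=1.0):
        # `initial_good`: the small set of pre-labeled good samples.
        # `stream`: new samples arriving one at a time.
        good = [np.asarray(g, dtype=float) for g in initial_good]
        labels = []
        for x in stream:
            x = np.asarray(x, dtype=float)
            # Proxy for surprise: distance to the closest known-good sample.
            surprise = min(np.linalg.norm(x - g) for g in good)
            if surprise <= threshold:
                labels.append("good")
                good.append(x)            # adapt: the boundary moves with new data
            else:
                labels.append("outlier")  # too surprising: model left unchanged
        return labels

For example, online_outlier_detection([[0, 0], [1, 0]], [[0.5, 0.2], [5, 5]]) labels the first stream sample good (and absorbs it into the model) and the second an outlier.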

Unsupervised learning

Finally, we determine the outliers with no prior knowledge of the data. If the data were available offline, the unsupervised approach would be straightforward: the samples with the highest surprise values would be treated as outliers, and the samples with the lowest surprise would be treated as good data. Starting from this initial assignment, we could generate an initial model and then use the semi-supervised approach to classify the remaining samples, those that fall in the middle of the surprise spectrum.
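Sketched in code, this offline bootstrap could look as follows; the surprise() scoring function and the quantile cutoffs are placeholders, not values from the text:

    import numpy as np

    def bootstrap_from_surprise(X, surprise, low_q=0.20, high_q=0.95):
        # Score every sample, then split the surprise spectrum in three.
        s = np.asarray([surprise(x) for x in X])
        lo, hi = np.quantile(s, low_q), np.quantile(s, high_q)
        seed_good = X[s <= lo]              # lowest surprise -> initial good data
        outliers  = X[s >= hi]              # highest surprise -> initial outliers
        middle    = X[(s > lo) & (s < hi)]  # ambiguous: defer to the
                                            # semi-supervised procedure
        return seed_good, outliers, middle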

However, in the online case we do not have all the data available to derive an initial model for the system and an outlier/good-data boundary. Gaussian Mean Shift (GMS) is therefore used to obtain the initial cluster(s) from the samples in the outlier bin, and it is applied throughout the process to identify any new clusters. Also, unlike in the two approaches above, the data samples are presented in order, from left to right and top to bottom, so that GMS can form initial clusters for the model. The figure on the left shows the results of unsupervised learning about halfway through the dataset, and the one on the right shows the final results. The unsupervised approach models the clusters well; however, it tends to incorrectly classify some of the outer areas as outliers. This is due to the difficulty GMS has in generating new clusters in these areas, and also due to the order in which these samples are introduced. It is important to note that this process is more time consuming than the previous approaches.
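The clustering step can be sketched with scikit-learn's MeanShift, which uses a flat kernel and is only a readily available stand-in for the Gaussian Mean Shift used here; the bandwidth and minimum-cluster-size values are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import MeanShift

    def propose_clusters(outlier_bin, bandwidth=1.0, min_size=10):
        # Run mean shift on the samples accumulated in the outlier bin and
        # promote only sufficiently dense modes to new clusters in the model.
        if len(outlier_bin) < min_size:
            return []                      # too few samples to form a cluster
        ms = MeanShift(bandwidth=bandwidth).fit(np.asarray(outlier_bin))
        centers = []
        for k, center in enumerate(ms.cluster_centers_):
            if np.sum(ms.labels_ == k) >= min_size:
                centers.append(center)     # this mode becomes a new cluster
        return centers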

The table below summarizes the results of the three approaches. It shows the percentage of good data incorrectly classified as outliers (false negatives) and the percentage of outliers incorrectly classified as good data (false positives) for each approach. The supervised approach has the most false negatives because learning is done offline on the training dataset and surprise is not used to update the model as new samples arrive; the low false-negative rates of both the semi-supervised and unsupervised approaches confirm this. The supervised approach has the fewest misclassified outliers, since the algorithm is conservative, relying on the knowledge extracted from the training data. The semi-supervised approach also performs well on false positives. For the unsupervised approach, the number of misclassified outliers is much larger: with no prior knowledge of the clusters and outliers, the system tends to misclassify the outliers that lie close to the clusters, especially the mini outlier-clusters.

Incorrect (%)    Supervised    Semi-supervised    Unsupervised
Good Data           4.86             1.24              1.67
Outliers            1.32             1.46              3.79
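For reference, the two rows of the table can be computed from ground-truth labels and predictions as follows (variable names are illustrative; 1 = good data, 0 = outlier):

    import numpy as np

    def error_rates(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        good, out = (y_true == 1), (y_true == 0)
        good_as_outlier = 100 * np.mean(y_pred[good] == 0)  # "Good Data" row
        outlier_as_good = 100 * np.mean(y_pred[out] == 1)   # "Outliers" row
        return good_as_outlier, outlier_as_good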