# Surprise Measure

Abstract

Learning is the process of acquiring knowledge. In psychology, learning is defined as the modification of behavior through training. In this work, we define learning as the modification of a system model to incorporate the knowledge acquired by new observations. During learning, a system creates and modifies its model to improve performance. As new samples are introduced, the system updates its model based on the new information provided by the samples. However, this update may not necessarily improve the model.

A Bayesian surprise metric was proposed to differentiate good data (beneficial) from outliers (detrimental), and thus help to selectively adapt the model parameters. The surprise metric is calculated based on the difference between the prior and the posterior distributions of the model when a new sample is introduced. The metric is useful not only to identify outlier data, but also to differentiate between the data carrying useful information for improving the model and those carrying no new information (redundant). Allowing only the relevant data to update the model would speed up the learning process and prevent the system from overfitting.

Relation to Information Theory

A quantitative definition of the relevance a new observation has to the system is required to subjectively measure the information available in the observation based on the current system knowledge.

The information content of an outcome $x$, whose probability is $p(x)$, is defined as $I(x) = log \frac{1}{p(x)}$ . For a random variable (r.v.) $X$, the average information content is defined as $H(X) = \sum_{x} p(x) \log \frac{1}{p(x)}$, which is also called Shannon’s entropy of r.v. $X$.

To account for the subjective information which the outcome $x$ brings to a system with its own subjective belief, we need to consider its subjective probability $q(x)$. The subjective information content of outcome $x$ to the system with subjective probability $q(x)$ is then defined as $I_s(x) = \log \frac{1}{q(x)}$. The average subjective information, $H_s(X)$, is given by the expectation value of the subjective information, $I_s(x)$, taken with the true probabilities $p(x)$ as $H_s(X) = \sum_{x} p(x) I_s(x) = \sum_{x} p(x) \log \frac{1}{q(x)}$.

Since $H_s(X)$ measures the uncertainty of a system that does not know the correct probabilities, it should be larger than the uncertainty of an ideal observer that knows the true probabilities. Thus, we can state that $H_s(X) \geq H(X)$, with equality being true only when the system has full knowledge of its environment. The difference between $H_s(X)$ and $H(X)$ defined as:

$H_m(X) = H_s(X) - H(X) = \sum_{x} p(x) \log \frac{\displaystyle p(x)}{\displaystyle q(x)}$,

represents the information that the system is missing from the data and that can be learned by adjusting the subjective probabilities closer to the true probabilities. This is called the information gain and is the basis for surprise as it measures the system’s ignorance. This can be understood from the fact that $H_m(X)$ is nonnegative and vanishes if and only if $p(x)$ and $q(x)$ coincide (Kullback-Leibler divergence).

Surprise Metric

Surprise is a subjective information measure that quantifies how much information a new observation contains in relation to the current knowledge of the system. It exists only in the presence of uncertainty and is related to the expectations of the system, where the same data carries different amount of surprise to different systems or to the same system at different times.

To quantify the surprise factor of an observation, we measure the amount of information gained by a model from the introduction of a new observation. We compare the probability of a model $M$ given a new data observation $x_n$, $P(M|x_n)$, against its prior $P(M)$ over all models $M \in \mathcal{M}$ [1]. The surprise factor is measured using the information gain as:

$S(x_n,\mathcal{M}) = H_m(x_n) = D_{KL}\left(P(M|x_n),P(M)\right)$.

Since Kullback-Leibler divergence, $D_{KL}$, does not yield an analytic closed-form expression, we use a different divergence measure, the Cauchy-Schwarz divergence $D_{CS}$ [2], whose closed-form solution is easily computed. $D_{CS}$ is derived from the Cauchy-Schwarz inequality $\sqrt{\int{f^2(x)dx}\int{g^2(x)dx}} \geq \int{f(x)g(x)dx}$, where the equality holds if and only if $f(x)=Cg(x)$ for a constant scalar $C$. The Cauchy-Schwarz divergence measure of two PDFs is defined as:

$D_{CS}(f,g)=-\log\left(\frac{\int f(x)g(x)dx}{\sqrt{\int f(x)^{2}dx\int g(x)^{2}dx}}\right)$.

$D_{CS}(f,g)$ is a symmetric measure and $D_{CS}(f,g)\geq0$, where the equality holds if and only if $f(x)=g(x)$. However, the triangle inequality property does not hold, so it cannot be considered as a metric. $D_{CS}(f,g)$ can be broken down and rewritten as $\mathcal{D_{CS}}(f \| g) = -2\log\int{f(x)g(x)dx} + \log\int{f^2(x)dx} + \log\int{g^2(x)dx}$, where the argument of the first term, $\int{f(x)g(x)dx}$, estimates the interactions on locations within the support of $f(x)$ when exposed to the potential created by $g(x)$ (or viceversa). This term is Renyi’s quadratic cross entropy and measures the similarity between the two PDFs. It can also be interpreted as the information gain from observing $g$ with respect to the “true” density $f$ and plays an important role in calculating surprise as it measures the amount of new information that is gained from the new observation. The other two terms are the negative Renyi’s quadratic entropies of the respective PDFs and are considered as normalizing terms. The surprise measure using Cauchy-Schwarz divergence is then defined as:

$S(x_n,\mathcal{M})= 2\mathcal{H}_2(P(M|x_n); P(M)) - \mathcal{H}_2(P(M|x_n)) - \mathcal{H}_2(P(M))$,

where $\mathcal{H}_2(P(M|x_n); P(M))$ is the quadratic Renyi’s cross-entropy, and $\mathcal{H}_2(P(M)), \mathcal{H}_2(P(M|x_n))$ are Renyi’s quadratic entropy of model prior and posterior distributions respectively.

In essence, surprise measures the change to the posterior distribution caused by the information on the new observation $x_n$. If the posterior is the same as the prior, the observation carries no new information, thus leaving knowledge about the model unaffected. However, if the posterior significantly differs from the prior, the observation carries an element of surprise that can be used to either improve the model (reasonable difference) or significantly change it (drastic difference). We use the surprise value to categorize $x_n$ into one of three categories:

• Outlier: $S(x_n, \mathcal{M}) > T_{out}$, the observation will be discarded as unwanted/outlier.
• Interesting: $T_{out} \geq S(x_n, \mathcal{M}) \geq T_{red}$, the observation contains useful information and should be used to update the current belief of the model $M$.
• Redundant: $T_{red} > S(x_n, \mathcal{M})$, the observation lacks uncertainty and while it is included as a member of the model $M$, posterior parameters are not updated. (As the number of observations increases, redundant samples become more likely, thus by electing not to update the parameters, overfitting is prevented.)

Surprise behavior

The figure below shows the surprise field of a model generated from a dataset whose samples are shown in blue dots. To measure the surprise factor of a new sample $x_n$, the sample is included as part of the model and its parameters are recomputed (in this example, model is represented as a mixture of Gaussians).The surprise value of two data samples is also shown in the figure: a good data sample (S=0.0080), and an outlier (S=0.2166). As expected, the surprise factor for a good sample is very small compared to an outlier. The surprise field also reiterates this point and illustrates that as data samples move further away from the model data, their respective surprise values become larger.

The surprise factor depends on the system knowledge, and as the number of samples introduced to the system increases the surprise value decreases. This poses a challenge when utilizing surprise in clustering problems as each cluster has its own model and surprise depends on the state of the model. The two figures below illustrate the problem that we face in classification/clustering. Two clusters are represented in blue squares and red circles and the new sample is represented by a black star. By intuition, we would expect that the new sample should belong to the left cluster, however the surprise value of the cluster on the right is much smaller, 0.007 compared to 0.025 for the cluster on the left. This occurs because the surprise value decreases as the number of samples introduced to the system increases, and the right cluster model has observed more samples than the left cluster. This results in the surprise value of any sample, even an outlier in this case, to be smaller than that of the left cluster model. There are many ways to mitigate this depending on the problem you are solving and the model chosen for the system. A quick solution we used in this example was to normalize surprise by the high-threshold. The normalized surprise value provides a fair comparison among clusters where the surprise value for the left cluster is 2.125 and for the right cluster is 3.704.

Experiments and Results

We have utilized surprise on several clustering/classification and outlier detection problems.

Browse the links below to see the details and results of these tests:

(to be continued)

this work is currently under review, more information to follow.

References

1. L. Itti and P. F. Baldi, “Bayesian surprise attracts human attention,” in Advances in Neural Information Processing Systems, Vol. 19 (NIPS2005).
2. J. C. Principe, “Information Theoretic Learning, Renyi’s Entropy and Kernel Perspectives”. New York, NY: Springer, 2010.