Project SAAAD:
Scalable Adaptive Auto-encoded Anomaly Detection

Prepared by Håkan Persson and Raazesh Sainudiin partly for Combient AB.

Project SAAAD aims to explore the use of autoencoders for anomaly detection in various ‘big-data’ problems. Specifically, these problems have the following complexities:

  • data volumes are big and one needs distributed in-memory fault-tolerant computing frameworks such as Apache Spark
  • learning is
    • semi-supervised (so a small fraction of the dataset is labelled) and
    • interactive (with humans-in-the-loop)
  • the phenomenon is time-varying

Introduction

Anomaly detection is a subfield of data mining where the task is to detect observations that deviate from the expected pattern of the data. Important applications include fraud detection, where the task is to detect criminal or fraudulent activity, for example in credit card transactions or insurance claims. Another important application is predictive maintenance, where recognizing anomalous observations could help predict the need for maintenance before the equipment actually fails.

The basic idea of anomaly detection is to establish a model for what is normal data and then flag data as anomalous if it deviates too much from the model. The biggest challenge for an anomaly detection algorithm is to discriminate between anomalies and natural outliers, that is, to tell the difference between data that is uncommon and unnatural and data that is uncommon but natural. This distinction depends strongly on the application at hand, and it is very unlikely that one single classification algorithm can make this distinction in all applications. However, a human expert is often able to tell the difference, and it would be of great value to develop methods to feed this information back to the algorithm in an interactive loop. This idea is sometimes called active learning.
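
As a minimal illustration of this basic idea (a sketch, not the project's method), the following fits a Gaussian model of normal behaviour to one-dimensional data and flags observations that lie more than a chosen number of standard deviations from the mean. The function names, the sample values and the threshold of 3 are all assumptions made for the example.

```python
import statistics

def fit_normal_model(values):
    """Model 'normal' data by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomalous(x, model, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations from the mean."""
    mean, std = model
    return abs(x - mean) / std > threshold

# Hypothetical transaction amounts that are assumed to represent normal behaviour.
training = [52.0, 48.5, 50.1, 49.7, 51.3, 47.9, 50.6, 49.2]
model = fit_normal_model(training)

# One typical observation and one that deviates strongly from the model.
flags = {x: is_anomalous(x, model) for x in (50.4, 75.0)}
```

Such a detector cannot tell an anomaly from a natural outlier: both are simply far from the mean. That is exactly the distinction for which expert feedback is needed.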

Normally, anomaly detection is treated as an unsupervised learning problem, where the machine tries to build a model of the training data. Since an anomaly by definition is a data point that in some way is uncommon, it will not fit the machine’s model, and the model can flag it as an anomaly. In the case of fraud detection, it is often the case that a small fraction (perhaps 1 out of a million) of the data points represent known cases of fraud attempts. It would be wasteful to throw away this information when building the model. This would mean that we no longer have an unsupervised learning problem, but a semi-supervised learning problem. In a typical semi-supervised learning setting, the problem is to assign predefined labels to datapoints when the correct label is only known for a small fraction of the dataset. This is a very useful idea, since it allows for leveraging the power of big data, without having to incur the cost of correctly labeling the whole dataset.
If we translate our fraud detection problem to this setting, it means that we have a big collection of datapoints which we want to label as either “fraud” or “non-fraud”, with the correct label known for only a small fraction of them.

Summary of Deep Learning Algorithms for Anomaly Detection

  • Autoencoders (AE) are neural networks that are trained to reproduce their input data. They come in different flavours, both with respect to depth and width and with respect to how overfitting is prevented. They can be used to detect anomalies as those datapoints that are poorly reconstructed by the network, as quantified by the reconstruction error. AEs can also be used for dimension reduction or compression in a data preprocessing step, after which other anomaly detection techniques can be applied to the transformed data.

  • Variational autoencoders are latent space models where the network tries to transform the data to a prior distribution (usually multivariate normal). That is, the lower-dimensional representation of the data that you get from a standard autoencoder will, in the case of a variational autoencoder, be distributed according to the prior distribution. This means that you can feed samples from the prior distribution backwards through the network to generate new data from a distribution close to that of the original authentic data. You can also use the network to detect outliers in the dataset by comparing the transformed dataset with the prior distribution.

  • Adversarial autoencoders have some conceptual similarities with variational autoencoders in that they are also capable of generating new (approximate) samples of the dataset. The difference between them lies not so much in how they model the network as in how they are trained. Adversarial autoencoders are based on the idea of GANs (generative adversarial networks). In a GAN, two neural networks are randomly initialized, and random input data from a specified distribution is fed into the first network (the generator). The output of the first network is then fed as input to the other network (the discriminator). The job of the discriminator is to correctly distinguish forged data coming from the generator network from authentic data. The job of the generator network is to fool the discriminator as often as possible. This can be interpreted as a zero-sum game in the sense of game theory, and training a GAN is then seen to be equivalent to finding the Nash equilibrium of this game. Finding such an equilibrium is of course far from trivial, but it seems that good results can be achieved by training the networks iteratively side by side through backpropagation. The learning framework is interesting on a meta level because this generator/discriminator rivalry is a bit reminiscent of the relationship between the fraudster and the anomaly detector.

  • Ladder networks are a class of networks specially developed for semi-supervised learning. They aim to combine supervised and unsupervised learning at every level of the network. The method has achieved very impressive results on classifying the MNIST dataset, but how well it performs on other datasets is still an open question.

  • Active Anomaly Discovery (AAD) is a method for incorporating expert feedback into the learning algorithm. The basic idea is that the loss function is calculated based on how many non-interesting anomalies it presents to the expert instead of the usual loss functions, like the reconstruction error. The original implementation of AAD is based on an anomaly detector called Loda (Lightweight on-line detector of anomalies), but it has also been implemented on other ensemble methods, like tree-based methods. It can also be incorporated into methods that use other autoencoders by replacing the reconstruction error. In the Loda method, the idea is to project the data to a random one-dimensional subspace, form a histogram and predict the log probability of an observed data point. Of course this is a very poor anomaly detector, but by taking the mean of a large number of these weak anomaly detectors, we end up with a good anomaly detector.
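
The Loda idea described in the last item can be sketched in a few lines of NumPy. This is an illustrative sketch under simplifying assumptions (dense Gaussian projection vectors, a fixed number of equal-width bins, synthetic data), not the published implementation. The names `fit_loda` and `loda_score` are hypothetical.

```python
import numpy as np

def fit_loda(X, n_projections=50, n_bins=20, seed=0):
    """Fit one 1-D histogram per random projection of the training data X."""
    rng = np.random.default_rng(seed)
    detectors = []
    for _ in range(n_projections):
        w = rng.standard_normal(X.shape[1])      # random projection direction (assumed dense)
        density, edges = np.histogram(X @ w, bins=n_bins, density=True)
        detectors.append((w, density, edges))
    return detectors

def loda_score(x, detectors, eps=1e-12):
    """Mean negative log density of x over all projections; higher is more anomalous."""
    scores = []
    for w, density, edges in detectors:
        z = x @ w
        if z < edges[0] or z > edges[-1]:
            p = 0.0                              # outside the range seen in training
        else:
            idx = min(np.searchsorted(edges, z, side="right") - 1, len(density) - 1)
            p = density[idx]
        scores.append(-np.log(p + eps))
    return float(np.mean(scores))

# Synthetic training data (an assumption): 2000 points from a 5-D standard normal.
rng = np.random.default_rng(1)
X_train = rng.standard_normal((2000, 5))
detectors = fit_loda(X_train)

inlier_score = loda_score(np.zeros(5), detectors)        # centre of the data
outlier_score = loda_score(np.full(5, 6.0), detectors)   # far from the data
```

Each histogram on its own is a weak detector, since a point can look ordinary along any single direction. Averaging over many directions is what makes the ensemble score useful.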

Questions for AIM Day

Prepared by Håkan Persson and Raazesh Sainudiin for Combient AB.

We aim to explore the use of autoencoders for anomaly detection in various ‘big-data’ problems that have the following complexities:

  • data volumes are big and one needs distributed in-memory fault-tolerant computing frameworks such as Apache Spark
  • learning is
    • semi-supervised (so a small fraction of the dataset is labelled) and
    • interactive (with humans-in-the-loop)
  • the phenomenon is time-varying

These questions are addressed to experts in statistical/machine learning and visualization or human-computer interactions. The background information is given in the list of references below for concreteness. Please see https://tinyurl.com/yaep8k2w for further industrial/academic context.

Semi-supervised Anomaly Detection with Human-in-the-Loop

  • What algorithms are there for incorporating expert human feedback into anomaly detection, especially with auto-encoders, and what are their limitations when scaling to terabytes of data?
  • Can one incorporate expert human feedback with anomaly detection for continuous time series data of large networks (e.g. network log data such as netflow logs)?
  • How do you avoid overfitting to known types of anomalies that make up only a small fraction of all events?
  • How can you allow new (as yet unknown) anomalies to be discovered by the model, i.e. account for new types of anomalies over time?
  • Can Ladder Networks, which were specially developed for semi-supervised learning, be adapted for generic anomaly detection (beyond standard datasets)?
  • Can a loss function be specified for an auto-encoder with additional classifier node(s) for rare anomalous events of several types via interaction with the domain expert?
  • Are there natural parametric families of loss functions for tuning hyper-parameters, where the loss functions can account for the budgeting costs of distinct sets of humans with different hourly costs and tagging capabilities within a generic human-in-the-loop model for anomaly detection?

Some ideas to start brain-storming:

  • For example, the loss function in the last question above could perhaps be justified using notions such as query-efficiency in the sense of involving only a small amount of interaction with the teacher/domain-expert (Supervised Clustering, NIPS Proceedings, 2010).
  • When dealing with time series of large networks, where the data matrices are tall and skinny, perhaps do an SVD of the network data and look at the distances between the dominant singular vectors?
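
To make the SVD idea in the last item concrete, here is a small NumPy sketch. It assumes that each time window of network data is a tall and skinny matrix (many events, few features) and compares the dominant right singular vectors of different windows. The synthetic windows, the helper names and the sign-invariant distance are illustrative assumptions, not an established method from the references.

```python
import numpy as np

def dominant_right_singular_vector(A):
    """Right singular vector belonging to the largest singular value of A."""
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    return vt[0]

def direction_distance(u, v):
    """Sign-invariant distance between two unit vectors: 1 - |cos(angle)|."""
    return 1.0 - abs(float(u @ v))

rng = np.random.default_rng(0)
n_rows, n_cols = 5000, 8          # tall and skinny: many events, few features

def window(direction, noise=0.1):
    """Hypothetical time window whose rows vary mostly along `direction`."""
    strength = rng.standard_normal((n_rows, 1))
    return strength * direction + noise * rng.standard_normal((n_rows, n_cols))

usual = np.ones(n_cols) / np.sqrt(n_cols)   # usual traffic pattern (assumed)
shifted = np.zeros(n_cols)
shifted[0] = 1.0                            # pattern concentrated on one feature

v_ref = dominant_right_singular_vector(window(usual))
d_same = direction_distance(v_ref, dominant_right_singular_vector(window(usual)))
d_changed = direction_distance(v_ref, dominant_right_singular_vector(window(shifted)))
```

In this toy setting the distance stays near zero between windows with the same structure and becomes large when the dominant direction changes. Whether this carries over to real netflow data is one of the open questions here.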

Interactive Visualization for the Human-in-the-Loop

Given the crucial requirement for rich visual interactions between the algorithm and the human-in-the-loop, what are natural open-source frameworks for programmatically enriching this human-algorithm interaction via visual inspection and interrogation (such as SVDs of activations of rare anomalous events)?

For example, how can open source tools be integrated into Active-Learning and other human-in-the-loop Anomaly Detectors? Some such tools include:

Beyond visualizing the ML algorithms, the human-in-the-loop often needs to see the details of the raw event that triggered the anomaly. Typically this event needs to be seen in the context of other related and relevant events, including its anomaly score, with some historical comparisons of similar events from a no-SQL query. What are some natural frameworks for clicking on the event of interest (say, one of those alerted by the algorithm) and visualizing the raw event details (usually a JSON record or a row of a CSV file) in order to make an informed decision? Some such frameworks include:

for visualizations possibly powered by scalable fault-tolerant near-real-time SQL query engines such as:

Background Readings

Statistical Regular Pavings for Auto-encoded Anomaly Detection

This sub-project aims to explore the use of statistical regular pavings in Project SAHDE, including auto-encoded statistical regular pavings via appropriate tree arithmetics, for anomaly detection.

The Loda method might be extra interesting to this sub-project as we may be able to use histogram tree arithmetic for the multiple low-dimensional projections in the Loda method (see above).

Background Reading

General References

Funding

This programme is partly supported by:

  • Databricks Academic Partners Program for distributed cloud computing
  • research time for this project was partly due to:
    • 2015, 2016 by the project CORCON: Correctness by Construction, Seventh Framework Programme of the European Union, Marie Curie Actions-People, International Research Staff Exchange Scheme with counter-part funding by The Royal Society of New Zealand
    • 2017, Researcher Position, Department of Mathematics, Uppsala University, Uppsala, Sweden.

Whiteboard discussion notes on 2017-08-18.

auto-encoder mapped regular pavings and naive probing

Some Background on Existing Industrial Solutions

The content in the next section is just the result of the following commands:

$ wget -k https://community.tibco.com/wiki/anomaly-detection-autoencoder-machine-learning-template-tibco-spotfirer
$ pandoc -f html -t markdown anomaly-detection-autoencoder-machine-learning-template-tibco-spotfirer > ex.md
$ vim ex.md
$ !date
$ Fri Aug 18 18:33:05 CEST 2017

It is meant to give a brief introduction to the problem and a reasonably standard industrial solution and thus help set the context for industrially beneficial research directions. There are other competing solutions, but we will focus on this example for concreteness.



Anomaly Detection with Autoencoder Machine Learning - Template for TIBCO Spotfire®

By Venkata Jagannath from https://community.tibco.com/users/venkata-jagannath

Anomaly detection is a way of detecting abnormal behavior. The technique first uses machine learning models to specify expected behavior and then monitors new data to match and highlight unexpected behavior.

Use cases for Anomaly detection

Fighting Financial Crime – In the financial world, trillions of dollars’ worth of transactions happen every minute. Identifying suspicious ones in real time can provide organizations the necessary competitive edge in the market. Over the last few years, leading financial companies have increasingly adopted big data analytics to identify abnormal transactions, clients, suppliers, or other players. Machine Learning models are used extensively to make predictions that are more accurate.

Monitoring Equipment Sensors – Many different types of equipment, vehicles and machines now have sensors.  Monitoring these sensor outputs can be crucial to detecting and preventing breakdowns and disruptions.  Unsupervised learning algorithms like Auto encoders are widely used to detect anomalous data patterns that may predict impending problems. 

Healthcare claims fraud – Insurance fraud is a common occurrence in the healthcare industry. It is vital for insurance companies to identify claims that are fraudulent and ensure that no payout is made for those claims. The Economist recently published an article that estimated $98 Billion as the cost of insurance fraud and expenses involved in fighting it. This amount would account for around 10% of annual Medicare & Medicaid spending. In the past few years, many companies have invested heavily in big data analytics to build supervised, unsupervised and semi-supervised models to predict insurance fraud.

Manufacturing defects – Auto encoders are also used in manufacturing for finding defects. Manual inspection to find anomalies is a laborious & offline process and building machine-learning models for each part of the system is difficult. Therefore, some companies implemented an auto encoder based process where sensor equipment data on manufactured components is continuously fed into a database and any defects (i.e. anomalies) are detected using the auto encoder model that scores the new data.

Techniques for Anomaly detection

Companies around the world have used many different techniques to fight fraud in their markets. While the below list is not comprehensive, three anomaly detection techniques have been popular -

Visual Discovery - Anomaly detection can also be accomplished through visual discovery. In this process, a team of data analysts, business analysts, etc. builds bar charts, scatter plots, etc. to find unexpected behavior in their business. This technique often requires prior business knowledge in the industry of operation and a lot of creative thinking to use the right visualizations to find the answers.

Supervised Learning - Supervised Learning is an improvement over visual discovery. In this technique, persons with business knowledge in the particular industry label a set of data points as normal or anomaly. An analyst then uses this labelled data to build machine learning models that will be able to predict anomalies on unlabeled new data.

Unsupervised Learning - Another technique that is very effective but is not as popular is Unsupervised learning. In this technique, unlabeled data is used to build unsupervised machine learning models.  These models are then used to predict new data. Since the model is tailored to fit normal data, the small number of data points that are anomalies stand out.

Some examples of unsupervised learning algorithms are -

Auto encoders – Unsupervised neural networks or auto encoders are used to replicate the input dataset by restricting the number of hidden units in a neural network. A reconstruction error is generated upon prediction. The higher the reconstruction error, the higher the possibility of that data point being an anomaly.

Clustering – In this technique, the analyst attempts to classify each data point into one of many pre-defined clusters by minimizing the within cluster variance. Models such as K-means clustering, K-nearest neighbors, etc. are used for this purpose. A K-means or a KNN model serves the purpose effectively since it assigns a separate cluster to all those data points that do not look similar to normal data.

One-class support vector machine – In a support vector machine, the effort is to find a hyperplane that best divides a set of labelled data into two classes. For this purpose, the distance between the two nearest data points that lie on either side of the hyperplane is maximized. For anomaly detection, a One-class support vector machine is used and those data points that lie much farther away than the rest of the data are considered anomalies.

Time Series techniques – Anomalies can also be detected through time series analytics by building models that capture trend, seasonality and levels in time series data. These models are then used along with new data to find anomalies.
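
As a rough illustration of the time series technique (not part of the Spotfire template), the sketch below builds a seasonal baseline from historical data and scores new observations by their distance from that baseline. It models level and seasonality only, with no trend. The hourly series, the 24-step period and the deterministic jitter standing in for noise are all assumptions made for the example.

```python
import math
import statistics

def fit_seasonal_profile(series, period):
    """Mean and standard deviation for each position within the seasonal cycle."""
    profile = []
    for phase in range(period):
        values = series[phase::period]
        profile.append((statistics.mean(values), statistics.stdev(values)))
    return profile

def seasonal_z_score(value, t, profile):
    """Distance of a new observation from the seasonal baseline, in standard deviations."""
    mean, std = profile[t % len(profile)]
    return abs(value - mean) / std

# Hypothetical history: ten days of hourly readings with a daily cycle.
period = 24
history = [
    10.0 + 5.0 * math.sin(2 * math.pi * t / period) + 0.3 * ((7 * t) % 5 - 2)
    for t in range(10 * period)
]
profile = fit_seasonal_profile(history, period)

# Score two new readings arriving at the next time step.
t_new = len(history)
z_typical = seasonal_z_score(10.2, t_new, profile)
z_spike = seasonal_z_score(18.0, t_new, profile)
```

A reading would then be flagged when its score exceeds a chosen threshold, for instance 3 standard deviations.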

Auto encoders explained

Autoencoders use unsupervised neural networks that are both similar to and different from a traditional feed forward neural network. They are similar in that they use the same principles (i.e. Backpropagation) to build a model. They are different in that they do not use a labelled dataset containing a target variable for building the model. An unsupervised neural network, also known as an Auto encoder, uses the training dataset and attempts to replicate it at the output while restricting the hidden layers/nodes.

The focus of this model is to learn an identity function or an approximation of it that would allow it to predict an output that is similar to the input. The model is kept from learning a trivial identity by placing restrictions on the number of hidden units in the network. For example, if we have 10 columns in a dataset and only five hidden units, the neural network is forced to learn a more restricted representation of the input. By limiting the hidden units, we can force the model to learn a pattern in the data if there indeed exists one.
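
The 10-column, five-hidden-unit example can be sketched directly in NumPy. This is an illustrative toy, not the H2O model used by the template. The synthetic data (10 columns driven by 3 hidden factors), the tanh hidden layer, the plain full-batch gradient descent and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data (an assumption): 10 columns driven by 3 hidden factors.
n_rows, n_cols, n_hidden = 500, 10, 5
mixing = 0.5 * rng.standard_normal((3, n_cols)) / np.sqrt(3)
X = rng.standard_normal((n_rows, 3)) @ mixing + 0.02 * rng.standard_normal((n_rows, n_cols))

# Encoder (10 -> 5, tanh) and decoder (5 -> 10, linear) with small random weights.
W1 = 0.1 * rng.standard_normal((n_cols, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_cols))
b2 = np.zeros(n_cols)

def reconstruct(x):
    return np.tanh(x @ W1 + b1) @ W2 + b2

def reconstruction_error(x):
    """Squared reconstruction error, one value per row."""
    return np.sum((reconstruct(x) - x) ** 2, axis=-1)

# Full-batch gradient descent on the mean squared reconstruction error.
learning_rate = 0.05
for _ in range(3000):
    H = np.tanh(X @ W1 + b1)
    grad_out = (H @ W2 + b2 - X) / n_rows
    grad_hidden = (grad_out @ W2.T) * (1.0 - H ** 2)   # backpropagate through tanh
    W2 -= learning_rate * (H.T @ grad_out)
    b2 -= learning_rate * grad_out.sum(axis=0)
    W1 -= learning_rate * (X.T @ grad_hidden)
    b1 -= learning_rate * grad_hidden.sum(axis=0)

# An anomaly of typical magnitude that breaks the relations between the columns:
# it lies outside the 3-dimensional subspace spanned by the hidden factors.
basis, _ = np.linalg.qr(mixing.T)
g = rng.standard_normal(n_cols)
off_subspace = g - basis @ (basis.T @ g)
anomaly = off_subspace / np.linalg.norm(off_subspace) * np.linalg.norm(X, axis=1).mean()

normal_errors = reconstruction_error(X)
anomaly_error = reconstruction_error(anomaly)
```

Because only five hidden units are available, the network can reproduce rows that follow the learned pattern but not a row that violates it. That row therefore shows up with a much larger reconstruction error.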

Not restricting the number of hidden units and instead specifying a ‘sparsity’ constraint on the neural network can also find an interesting structure.

Each of the hidden units can be either active or inactive and an activation function such as ‘tanh’ or ‘Rectifier’ can be applied to the input at these hidden units to change their state.

Some forms of auto encoders are as follows –

  • Undercomplete Auto encoders
  • Regularized Auto encoders
  • Representational Power, Layer Size and Depth
  • Stochastic Encoders and Decoders
  • Denoising Auto encoders

A detailed explanation of each of these types of auto encoders is available here.

Spotfire Template for Anomaly detection

TIBCO Spotfire’s Anomaly detection template uses an auto encoder trained in H2O for best in the market training performance. It can be configured with document properties on Spotfire pages and used as a point and click functionality.

Download the template from the Component Exchange. See the documentation in the download distribution for details on how to use this template.