Anomaly detection (also known as outlier detection) is the process of finding data objects (points, events, observations) whose behavior differs markedly from the dataset’s standard behavioral patterns. Such objects are called outliers or anomalies. Anomaly detection has many applications in business and is used to find critical incidents such as fraud, technical glitches, or logistical obstacles.
Understanding the type of anomaly is one of the most important steps in avoiding mistakes. Without it, you risk raising false alarms or missing real outliers. Generally speaking, anomalies fall into three main categories: global outliers, contextual outliers, and collective outliers.
Global outliers are also known as point anomalies. They are the most common type and correspond to the most basic idea of an anomaly: a value that is extremely high or extremely low compared with the rest of the data points. The main task in detecting global anomalies is to determine how much deviation separates a potential anomaly from the rest of the data. Global anomaly detection is often used in transactional auditing systems to detect fraudulent transactions. In this case, global anomalies are those transactions that violate the general regulations.
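One common way to quantify "how much deviation" is the z-score, the distance of a value from the mean measured in standard deviations. The sketch below is only an illustration: the function name `global_outliers`, the threshold of 3.0, and the transaction amounts are assumptions, not part of any particular auditing system.

```python
from statistics import mean, pstdev

def global_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can stand out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts: fifteen ordinary ones and one extreme value.
amounts = [52, 48, 50, 47, 53, 49, 51, 50, 46, 54, 50, 49, 51, 48, 52, 5000]
print(global_outliers(amounts))  # [(15, 5000)]
```

Note that in very small samples a single extreme value inflates the standard deviation itself, so a threshold of 3 may never be reached; the cut-off has to be chosen with the sample size in mind.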
Contextual outliers are also known as conditional anomalies. Their values deviate significantly from the other data points in the same context, so the same value can be an anomaly in one context but not in another. These outliers are frequent in time series data, because a time series is a sequence of values over time and a particular period can be treated as a particular context. The value may lie within global expectations yet appear anomalous against a specific seasonal pattern. Contexts are almost always very domain-specific.
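To make the role of context concrete, the following sketch scores each value only against other values that share its context. The season labels, temperature readings, function name, and the threshold of 2.5 are all hypothetical choices made for this illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_outliers(records, threshold=2.5):
    """records: list of (context, value). Flag values that deviate within their own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)
    # Mean and standard deviation are computed separately for every context.
    stats = {c: (mean(v), pstdev(v)) for c, v in by_context.items()}
    flagged = []
    for i, (context, value) in enumerate(records):
        mu, sigma = stats[context]
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((i, context, value))
    return flagged

# Hypothetical daily temperatures in degrees Celsius.
winter = [-2, 0, 1, -1, 2, 0, -3, 1, 0, 25]
summer = [24, 26, 25, 27, 23, 25, 26, 24, 25, 25]
records = [("winter", t) for t in winter] + [("summer", t) for t in summer]
print(contextual_outliers(records))  # [(9, 'winter', 25)]
```

A reading of 25 is flagged in winter, while the identical value in summer is not, because each reading is judged only against its own season.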
When a subset of data points is anomalous with respect to the entire dataset, those points are called collective outliers. The key idea behind collective anomalies is that the individual data points in the collection may not be anomalies, globally or contextually, when considered on their own; only their occurrence together is anomalous.
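One simple case of a collective anomaly is a signal that normally oscillates but suddenly goes flat: every single reading is within the usual range, yet the run as a whole is unusual. The sketch below looks for such runs with a sliding window. It covers only this one kind of collective anomaly, and the function name, window size, and `min_std` value are assumptions for the example.

```python
from statistics import pstdev

def flat_windows(signal, window=5, min_std=0.1):
    """Return start indices of windows whose standard deviation falls below min_std."""
    return [
        start
        for start in range(len(signal) - window + 1)
        if pstdev(signal[start:start + window]) < min_std
    ]

# Hypothetical sensor signal: it oscillates between 1 and 3, then sticks at 2.
signal = [1, 3, 1, 3, 1, 3, 2, 2, 2, 2, 2, 2, 1, 3, 1, 3, 1, 3]
print(flat_windows(signal))  # [6, 7]
```

No individual value of 2 would be flagged by a point-wise check, since it lies between the usual extremes of 1 and 3; only the windows starting at positions 6 and 7, which contain nothing but 2s, are reported.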
Anomaly detection approaches also fall into three major groups. What distinguishes them is how many outliers the dataset contains and how much is known about them in advance. The first group unites methods that can be used without any prior knowledge about anomalies in the data. The most straightforward approach is to find the data points that deviate significantly from common statistical properties of the distribution, including the mean, median, mode, and quantiles. Sometimes a visual analysis of a boxplot or a histogram is enough.
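The quantile-based variant of this idea can be sketched with the interquartile range (IQR) rule, which is the same rule a boxplot uses to draw its whiskers. The multiplier of 1.5 is the conventional default; the function name and the readings are made up for the example.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 14, 12, 13, 11, 12, 40]
print(iqr_outliers(readings))  # [40]
```

Because quartiles are barely affected by a few extreme values, this rule is less sensitive to the outliers it is trying to find than a rule based on the mean and standard deviation.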
More complex techniques in this group rely on unsupervised clustering, that is, grouping similar objects together. Mathematically, similarity is measured by distance functions such as Euclidean distance or Manhattan distance. K-Means clustering is commonly used, with an appropriate distance measure selected empirically. The approach is predominantly retrospective and is analogous to a batch-processing system: it requires that all data be available before processing and that the data be static. Once the model has been learned, it can compare new items with the existing data.
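A minimal sketch of clustering-based detection follows: run K-Means with Euclidean distance, then flag points that lie unusually far from their nearest centroid. The starting centroids are passed in explicitly to keep the example deterministic, and the flagging rule (a distance greater than `factor` times the median distance) is an assumption chosen for illustration, not a standard part of K-Means.

```python
from math import dist
from statistics import median

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with Euclidean distance; returns the final centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids

def distance_outliers(points, centroids, factor=5.0):
    """Flag points whose distance to the nearest centroid exceeds factor * median distance."""
    nearest_dist = [min(dist(p, c) for c in centroids) for p in points]
    cutoff = factor * median(nearest_dist)
    return [p for p, d in zip(points, nearest_dist) if d > cutoff]

# Two hypothetical tight groups plus one stray point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
          (5, 20)]
centroids = kmeans(points, centroids=[points[0], points[5]])
flagged = distance_outliers(points, centroids)
print(flagged)  # [(5, 20)]
```

In practice the number of clusters, the initialization, and the cut-off all need tuning, and the whole dataset has to be available up front, which is the batch-processing character described above.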
Two commonly used sub-techniques can be distinguished here: diagnosis and accommodation. A diagnostic approach highlights the potential outlying points, and the system may then remove these points as errors from future processing of the dataset. Many diagnostic approaches prune the outliers iteratively until no more outliers are detected. The main idea of the accommodation methodology is to incorporate the outliers into the distribution model and use them in classification methods.
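Iterative pruning in the diagnostic style can be sketched as a loop that removes the single most extreme point, recomputes the statistics, and repeats until nothing exceeds the threshold. The z-score criterion, the threshold of 2.5, and the data are assumptions for this illustration.

```python
from statistics import mean, pstdev

def prune_outliers(values, threshold=2.5):
    """Repeatedly drop the most extreme value while its z-score exceeds the threshold."""
    kept = list(values)
    removed = []
    while len(kept) > 2:
        mu, sigma = mean(kept), pstdev(kept)
        if sigma == 0:
            break
        worst = max(kept, key=lambda v: abs(v - mu))
        if abs(worst - mu) / sigma <= threshold:
            break  # nothing left that looks like an outlier
        kept.remove(worst)
        removed.append(worst)
    return kept, removed

# Hypothetical data with two outliers of different size.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 60, 100]
kept, removed = prune_outliers(data)
print(removed)  # [100, 60]
```

The example shows why the iteration matters: in the first pass the value 100 inflates the mean and standard deviation so much that 60 stays below the threshold, and 60 is exposed only after 100 has been removed.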
The second group of anomaly detection approaches requires pre-labeled data, tagged as normal or abnormal. It consists of supervised classification models. This is a binary classification task: each exemplar is assigned to exactly one of the two classes. Classifiers are best suited to static data, as the classification model usually needs to be rebuilt if the data distribution shifts. This type of approach can be used for online classification, where the classifier learns the classification model and then classifies new exemplars at runtime with the learned model. If a new exemplar falls in the region of normality, it is classified as normal; otherwise, it is flagged as an outlier. To learn well, classification algorithms require a good spread of both normal and abnormal data; that is, the dataset should contain enough abnormal exemplars to learn from.
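As an illustration of the supervised setting, the sketch below uses a k-nearest-neighbours classifier. The text does not name a specific classifier, so this choice is an assumption, and so are the labeled points and the value of `k`.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """train: list of (features, label). Return the majority label among the k nearest points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical pre-labeled exemplars of both classes.
train = [
    ((1.0, 1.0), "normal"), ((1.2, 0.9), "normal"), ((0.8, 1.1), "normal"),
    ((1.1, 1.2), "normal"), ((0.9, 0.8), "normal"),
    ((5.0, 5.0), "abnormal"), ((5.2, 4.8), "abnormal"),
    ((4.9, 5.3), "abnormal"), ((5.1, 5.1), "abnormal"),
]

print(knn_predict(train, (1.0, 1.05)))  # normal
print(knn_predict(train, (5.0, 4.9)))   # abnormal
```

The example works only because both classes are well represented in the training set; with one or two abnormal exemplars the majority vote would almost always come out as normal.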
The third group of approaches is used for datasets with very few or even no abnormal cases. In many fault detection domains, such as aircraft engine monitoring or fraud detection, abnormal data is difficult or expensive to obtain. This technique needs pre-labeled data but learns only from data marked normal. It is analogous to a semi-supervised recognition or detection task: only the normal class is taught, yet the algorithm learns to recognize abnormality. These methods are suitable for static or dynamic data, because they learn only one class, which provides the model of normality, and they can learn that model incrementally, tuning it to improve the fit as each new exemplar becomes available. A support vector machine (SVM) is one of the third-group approaches. SVMs are usually associated with supervised learning, but there are extensions (the one-class SVM) that can be used to identify anomalies as an unsupervised problem (in which the training data are not labeled). The algorithm learns a soft boundary around the normal data instances using the training set, and all new data points that fall outside this boundary are classified as outliers.
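A one-class SVM of this kind is available in scikit-learn as `sklearn.svm.OneClassSVM`. The sketch below trains it on synthetic "normal" data only and then scores two new points. The data, the RBF kernel, and the `nu` and `gamma` settings are assumptions picked for the example and would need tuning on real data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic training set containing only normal behavior, centered on the origin.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu is an upper bound on the fraction of training points treated as
# boundary violations, which is what makes the learned boundary "soft".
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X_normal)

# One point in the middle of the normal region and one far away from it.
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])
labels = model.predict(X_new)  # +1 = inside the boundary, -1 = outside
print(labels)
```

Here the point at the center of the training data is expected to be labeled +1 and the distant point -1. How tightly the boundary wraps the normal data depends on `nu` and `gamma`.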