Sunday, September 1, 2013

Outlier Detection

I recently prepared an SPSS Modeler manual for Anomaly Detection! It was a great experience.

• Outlier detection, including fraud detection, can largely be treated as an imbalanced-class problem, because the event rate can be quite low (e.g., 1-2%).
• Fraud can happen in different ways, e.g., involving many people spread over different geographies, or a single person using his expertise to attempt fraud. Moreover, there can be many instances where one type of fraud is related, and hence linked, to other types of fraud. The time span also varies: one fraud may take place over just a few days, whereas another may run over several months.
• It is not enough to classify an entity as outlier vs. normal, because it is entirely possible that a legitimate customer carries out an “outlier” transaction worth hundreds of dollars. Hence, it is important to keep false alarms as low as possible.
• One way is to detect outlier customers in the context of customer groups, describing each customer by a) contextual attributes and b) behavioral attributes. But what if we cannot clearly partition the data into contexts (e.g., I was always a domestic user who made many local calls from my mobile phone, but over the last fifteen days there has been a lot of international activity from my phone)? In that case, model the Normal behavior with respect to context: using the training dataset, train a model that predicts the expected behavioral attribute values from the contextual attribute values. An outlier is then an instance whose behavioral attribute values deviate significantly from the values predicted by the model.
• Solving the above step: in the real world, labeled instances can be difficult as well as expensive to obtain. Hence, we should consider applying semi-supervised learning for fraud detection in mobile networks. In this approach we model only Normality: the Normal class is generally learned using unsupervised techniques such as clustering (we may also try auto-associative neural networks), yet the algorithm learns to recognize abnormality. It aims to define a boundary of Normality even though it learns only one class. This type of learning can detect new patterns of fraud too. How? Because the model generates closed boundaries around the Normal patterns, novelties (or outliers) can be detected as instances that fall outside them. Also, when novel data is available, using it helps Novelty Detection (ND), since the model can specialize from that data, which can also increase its classification accuracy.
• Similarly, in network intrusion detection systems, unsupervised learning techniques are applied to detect fraud while minimizing the number of false alarms.
• Hence, the ND problem can be summarized as one where only samples from Normal ("unchanged") regions are available for detecting outliers / novelties ("changed" regions).
• We can also consider semi-supervised learning as a scenario where some clusters are well formed because the instances in them are all labeled, while the other clusters are not yet labeled; we can then use the labeled clusters (such as Normal instances) to assign labels to the other clusters (such as Outlier instances).
• Imbalanced classes can pose a problem for supervised learning techniques.
• SSND (semi-supervised novelty detection) has been applied to a setup framed as imbalanced binary classification between labeled and unlabeled samples, without using any labeled “changed” samples. The methods, based on the Cost-Sensitive Support Vector Machine, assign different error costs to the two classes: errors made on the unlabeled samples, which contain both “unchanged” (i.e., Normal) and “changed” (i.e., Outlier) samples, are penalized less than those committed on labeled “unchanged” samples. The novelty of the proposed methods resides in retrieving the solutions for different cost asymmetries in a single optimization [6].
• The One-Class Support Vector Machine proposed by Schölkopf et al. [1, 2] maximizes the margin between the data and the origin in a higher-dimensional feature space. Similarly, the Support Vector Data Description (SVDD) [3] defines a sphere around the target data in the induced feature space and detects outliers outside its boundary.
• While semi-supervised clustering deals effectively with small amounts of labeled instances when creating a classifier, SNA (Social Network Analysis) contributes new attributes for characterizing agents, which makes it possible to introduce more information and domain knowledge when describing instances [5].
• Spotting fraud through SNA: fraudsters typically create a community with the sole purpose of committing fraud. Because there may be no viral effect, they may not spread their ideas the way people do who are thinking of changing providers or who have found bargain prices. So SNA can help telecom operators spot the differences in calling patterns within a community. These differences can then be measured so that outliers become obvious and thus investigable.
• Note: SSND deals with situations having only labeled “unchanged” samples. These labeled “unchanged” instances are exploited jointly with a large set of unlabeled samples, which contain both “unchanged” and “changed” samples. No information about the changes that lie among the unlabeled data is available a priori.
• When data relationships become complex and non-linear, neural networks can help. Fraud patterns today are not linear: e.g., a customer makes a large purchase while on vacation abroad, and does so only once a year; a neural network can understand this as an anomaly and not fraud.
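
The "closed boundary around Normal patterns" idea can be sketched in a few lines. This is a deliberately simplified stand-in, not the kernelized SVDD or One-Class SVM optimization: it fits a plain centroid-and-radius sphere in the original feature space, using Normal samples only. The feature names (calls per day, average call minutes), the synthetic data and the 95% coverage value are assumptions made for illustration.

```python
import math
import random
from statistics import mean

def fit_sphere(normal_points, coverage=0.95):
    """Fit a centre and a radius that enclose about `coverage` of the Normal training points."""
    dims = len(normal_points[0])
    centre = [mean(p[d] for p in normal_points) for d in range(dims)]
    distances = sorted(math.dist(p, centre) for p in normal_points)
    # Radius = distance below which roughly `coverage` of the training points fall.
    index = min(len(distances) - 1, int(coverage * len(distances)))
    return centre, distances[index]

def is_outlier(point, centre, radius):
    """A point is flagged when it lies outside the learned boundary."""
    return math.dist(point, centre) > radius

random.seed(0)
# Hypothetical Normal-only training data: (calls per day, average call minutes).
normal = [(random.gauss(20, 3), random.gauss(5, 1)) for _ in range(500)]
centre, radius = fit_sphere(normal)

print(is_outlier((21, 5.5), centre, radius))  # typical usage -> False
print(is_outlier((90, 40), centre, radius))   # unlike anything seen in training -> True
```

No "changed" sample is needed for training, which is the point of modeling only Normality. In practice the features would usually be scaled first, and a kernel method would give a tighter, non-spherical boundary.
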
• Data mining techniques selected will be Clustering, Outlier detection and Association mining (sequence rules). If feasible, SNA (Social Network Analysis) can be used. The analytical tool used could be SPSS Statistics.

REFERENCES:
[1] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt, “Support vector method for novelty detection,” Advances in Neural Information Processing Systems, vol. 12, pp. 582–588, 2000
[2] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Computat., vol. 13, no. 7, pp. 1443–1471, 2001
[3] D.M.J. Tax and R.P.W. Duin, “Support vector data description,” Machine Learn., vol. 54, no. 1, pp. 45–66, 2004
[4] M. Jans, N. Lybaert, and K. Vanhoof, “Data Mining for Fraud Detection: Toward an Improvement on Internal Control Systems?”
[5] J. Botelho and C. Antunes, “Combining Social Network Analysis with Semi-supervised Clustering: a case study on fraud detection,” Instituto Superior Técnico, Lisboa, Portugal
[6] F. de Morsier, D. Tuia, M. Borgeaud, V. Gass, and J.-P. Thiran, “Semi-Supervised Novelty Detection using SVM entire solution path”
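
As a closing illustration of the contextual-outlier point made earlier (train a model that predicts the behavioral attribute from the contextual attribute, then flag large deviations), here is a minimal sketch. It assumes a single numeric contextual attribute and a straight-line model fitted by ordinary least squares; the attribute names (plan fee, international minutes), the synthetic data and the cut-off of 3 residual standard deviations are all hypothetical choices, not something prescribed by the references.

```python
import random
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one contextual attribute x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

def deviation_score(x, y, a, b, resid_sd):
    """Number of residual standard deviations between y and the predicted value."""
    return abs(y - (a + b * x)) / resid_sd

random.seed(1)
# Hypothetical training data reflecting Normal behavior only.
plan_fee = [random.uniform(10, 60) for _ in range(300)]                   # contextual attribute
intl_minutes = [2 + 0.5 * fee + random.gauss(0, 2) for fee in plan_fee]  # behavioral attribute

a, b = fit_line(plan_fee, intl_minutes)
resid_sd = stdev(y - (a + b * x) for x, y in zip(plan_fee, intl_minutes))

THRESHOLD = 3.0  # assumed cut-off, to be tuned against the false-alarm rate
print(deviation_score(20, 13, a, b, resid_sd) > THRESHOLD)  # close to expected -> False
print(deviation_score(20, 60, a, b, resid_sd) > THRESHOLD)  # far above expected -> True
```

The same customer value (60 international minutes) could be perfectly normal in another context, which is why the score is measured against the prediction for that context and not against a global average.
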