Thursday, March 2, 2017

The Popping Pink



I was seated comfortably on the sofa, watching a sports show on television. I had a fond and vivid memory of the three chocolates I had munched and then gobbled one after the other in quick succession. Precisely three chocolate wrappers were making funny sounds in the left pocket of my pyjamas.
Suddenly, the sun shone stupendously, and so great was its vigor that it drew my eyes out of their comfort zone and all the way up to the center of the garden of our house. So perfect and precise were the rays that I was forced to fixate on a very large pink flower. I was so surprised by the radius of that flower that it evoked both a moment of happiness and tears of emotion on my face.
I instantly took my phone and switched on its camera to take a snap of the popping pink flower. Once I had satisfied my desire to take that snap, I decided to go to the front of the house and take another from there. It was as if the entire flower were pellucid, a quality that forced me to take an all-round view of it along a peripatetic path.
When I told my wife about the flower, she asked me, “How old is it?” Swiftly came my reply: “How is it old?”
The queer syndrome did not stop there. Later I looked into the pocket of my pyjamas and found, to my surprise, that the chocolate wrappers were brick red and white in color, colors which, when mixed, yield a bright pink hue just like the one painted on that popping pink flower.

Sunday, January 17, 2016

Variable Selection

First, apply simple screening steps to remove junk variables. For instance, drop variables that have too high a proportion of missing values, too low a coefficient of variation, etc.
Then apply Weight of Evidence (WoE) and Information Value (IV).
Then drop variables whose IV is either too low or too high.
Then choose the variables that have a high WoE.
You can do this directly for categorical variables. For continuous variables, first apply binning and then compute WoE and IV on the bins.
Then apply the Variance Inflation Factor (VIF) to remove variables with high multicollinearity.
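As a sketch of the WoE/IV step above, here is a minimal pure-Python illustration (the helper name `woe_iv` and the toy data are my own; production scorecard tools also apply smoothing for categories with zero counts, which is omitted here):

```python
import math

def woe_iv(categories, targets):
    """Compute Weight of Evidence per category and the total Information Value.

    categories: list of category labels (one per record)
    targets:    list of 0/1 outcomes (1 = event)
    """
    total_events = sum(targets)
    total_non_events = len(targets) - total_events
    # Count events / non-events per category
    stats = {}
    for c, y in zip(categories, targets):
        ev, ne = stats.get(c, (0, 0))
        stats[c] = (ev + y, ne + (1 - y))
    woe, iv = {}, 0.0
    for c, (ev, ne) in stats.items():
        # Share of all events vs. all non-events that fall in this category
        p_event = ev / total_events
        p_non_event = ne / total_non_events
        w = math.log(p_event / p_non_event)   # WoE of this category
        woe[c] = w
        iv += (p_event - p_non_event) * w     # category's IV contribution
    return woe, iv

# Toy example: a 'region' variable against a binary target
cats = ["north", "north", "south", "south", "south", "east", "east", "east"]
ys   = [1,        0,       1,       1,       0,       0,      0,      1]
w, iv = woe_iv(cats, ys)
```

A variable whose IV comes out near zero carries little information about the target (drop it), while an extremely high IV usually signals leakage (drop it too), matching the rule of thumb above.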

Addressing Multicollinearity

First, X and X^2, or for that matter X and XZ, cannot be considered collinear in a mathematical sense. The correlation coefficient, as commonly used to measure collinearity between two variables, is bound to give erroneous results here. So collinearity measured this way is not useful in such a case.
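This point is easy to verify numerically: for X symmetric around zero, X and X^2 are perfectly dependent, yet their Pearson correlation is essentially zero. A quick sketch with NumPy (the symmetric grid is just an illustrative choice):

```python
import numpy as np

# X symmetric around 0: X and X^2 are functionally dependent,
# but linearly uncorrelated.
x = np.linspace(-1, 1, 1001)
x2 = x ** 2

r = np.corrcoef(x, x2)[0, 1]
print(r)  # essentially 0: Pearson r misses the (quadratic) dependence
```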

One should use the correlation coefficient only when relatively sure that the relationship between the two variables is linear.

In the case of logistic regression, all predictors are assumed to follow a normal distribution, and normalization is done to make sure that predictors are unit-less (and hence additive in a true sense). So if X and Z follow normal distributions, then neither X^2, Z^2 nor XZ follows a normal distribution. One can argue that, by the CLT, X^2, Z^2 or XZ will follow a normal distribution, but only in the limit (n -> inf).
Hence X^2, Z^2 or XZ does not, in the first place, qualify as a predictor in logistic regression.

The idea of any predictive model is to include all those predictors that have predictive value (i.e., carry information about the target variable). If X is a predictor and brings some information, the question one should consider is what more information X^2 can bring. In fact, by the nature of the function X^2, it conceals or confounds some of the information that X brings. The case is similar for XZ. (I need not explain that there are better ways to handle the interaction effect of factors X and Z.)

From the above three points it is clear that if you decide to include X and/or Z as predictors, you should not include X^2, Z^2 or XZ as predictors in the same model.

About p-values: it is clear that the CLT is applied before calculating p-values; these are asymptotic p-values. Once I assume X^2 or XZ follows a normal distribution, it does not matter whether I standardize X^2 or XZ; the p-values are bound to be the same. There is no magic.
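The invariance claim can be checked numerically. A minimal sketch, using ordinary least squares with plain NumPy rather than logistic regression to stay self-contained (the conclusion about the slope's test statistic carries over): standardizing a predictor rescales the coefficient and its standard error by the same factor, so the test statistic, and hence the p-value, is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

def slope_t_stat(x, y):
    """t-statistic of the slope in a simple OLS fit of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - 2
    sigma2 = resid @ resid / dof            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # covariance of the estimates
    return beta[1] / np.sqrt(cov[1, 1])

t_raw = slope_t_stat(x, y)
t_std = slope_t_stat((x - x.mean()) / x.std(), y)  # standardized predictor
# t_raw == t_std (up to floating point): the p-value cannot change.
```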



How often do you see customers rebuild/refresh models?

I am interested in how frequently you see customers rebuilding/refreshing models they have deployed in production. Are they using our C&DS automation to schedule these? Do they refresh their models automatically, or do analysts perform these refreshes manually?

This is what I see customers wanting:
-Champion/challenger
-Self-improving models, i.e., models that learn as new data comes in (à la our naive Bayes 'self-learning' model).

This is what I see customers doing:
-Use CADS to store models.
-Manually score.
-Sometimes schedule scoring.
-Real-time deployment.
-Never: champion/challenger. Reason: a model needs thorough checking when re-created, and the interface in CADS is not on par.
-Never: refresh. Reason: a model needs thorough checking. Refresh itself works well, but storing and replacing the model is too complex because it involves scripting and Modeler/CADS interplay.

This is what I see customers wanting to do:

Analytical reporting:
-Being able to set up an experiment in CADS to keep track of model performance over the lifetime of the model.
-Having a comprehensive (i.e., prebuilt) and configurable model-evaluation dashboard.

Operational reporting:
-Being able to see what the model predicts (without knowing the outcomes yet; hence operational reporting).
-Having a comprehensive and configurable model-scoring dashboard.

In addition: having both abilities work conveniently when one has many models in production. On top of this massive model deployment, having a way to quickly gain insight into model trends and to alert on the worst-performing models.
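The alerting idea above can be sketched very simply: track a performance metric per model per scoring period and flag any model whose recent value drops well below its own baseline. A minimal sketch (the model names, the metric values, and the 10% drop threshold are all illustrative assumptions, not anything from CADS):

```python
# Per-model accuracy history, newest period last (illustrative numbers).
history = {
    "churn_model":  [0.81, 0.80, 0.79, 0.78],
    "upsell_model": [0.74, 0.73, 0.60, 0.55],   # clearly degrading
    "fraud_model":  [0.90, 0.91, 0.90, 0.90],
}

def models_to_alert(history, max_relative_drop=0.10):
    """Flag models whose latest metric fell more than 10% below their baseline."""
    flagged = []
    for name, series in history.items():
        baseline, latest = series[0], series[-1]
        if (baseline - latest) / baseline > max_relative_drop:
            flagged.append(name)
    return flagged

print(models_to_alert(history))  # only the degrading model is flagged
```

In a real deployment the history would come from the evaluation dashboard the post asks for, and the alert would feed a refresh or champion/challenger decision rather than a print statement.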


CLV (Customer Lifetime Modeling) for Retail (supermarket and grocery chain)

The first challenge is identifying the customer across visits. Some entity analytics can be done here: associate loyalty cards with credit-card/bank-account numbers, so you can even identify when the same customer changes cards. Don't expect 100% identification. Divide your purchases into loose baskets vs. identified customers and treat the two separately. For the former you can only provide margin details per basket.

Revenue = easy: sum of purchases minus returns.
Costs = tricky. Take a top-down approach. Get the details from finance and collaborate with them on this. If they sell, say, fresh goods and electronics, finance likely has a separate P&L for each. If you find that costs at the department level are 35% of total revenue, use that number for every product in that category. Try to get as deep as possible; product level is probably out of reach. An important distinction is own brand vs. foreign brand, so you can have several factors to account for. Likely you only have cost data for the main effects (own brand vs. foreign brand, and category A vs. B) rather than for each of the four combinations. Use raking (iterative proportional fitting) to get to the level of the four combinations. (Raking is used in reweighting surveys to give the research group properties similar to the population's.) It is available in SPSS.
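Raking, as used above, can be sketched in a few lines: start from a seed table and alternately scale rows and columns until the totals match the known main-effect margins. A pure-Python sketch (the cost numbers are made up for illustration):

```python
def rake(table, row_targets, col_targets, iterations=100):
    """Iterative proportional fitting of a 2D table to given row/column margins."""
    t = [row[:] for row in table]
    for _ in range(iterations):
        # Scale each row to match its target total
        for i, target in enumerate(row_targets):
            s = sum(t[i])
            t[i] = [v * target / s for v in t[i]]
        # Scale each column to match its target total
        for j, target in enumerate(col_targets):
            s = sum(t[i][j] for i in range(len(t)))
            for i in range(len(t)):
                t[i][j] *= target / s
    return t

# Seed for (own brand, foreign brand) x (category A, category B) costs,
# with only the main-effect totals known: rows 30/70, columns 40/60.
seed = [[1.0, 1.0],
        [1.0, 1.0]]
fitted = rake(seed, row_targets=[30.0, 70.0], col_targets=[40.0, 60.0])
# Row and column sums of `fitted` now match the known main-effect totals,
# giving an estimate for each of the four brand x category combinations.
```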

Now revenue minus cost is margin. At the product-group level this is very interesting to visualize (revenue vs. margin %, revenue vs. costs; spot the outliers, do some segmentation, color the scatterplots with the segmentation and the various other characteristics you have).

Now roll up to the customer level. Use 1+ year of purchases and provide numbers on a yearly basis. Again, show that customer margin is not just a matter of taking, say, 15% of revenue, but differs for everyone. Segment customers on margin and, per resulting segment, profile the product groups they buy from.

Next, lifetime. Properly divide your available timeline into parts and use the purchase data from, say, months 1 to 6 to predict the purchase amount for the next six months. Validate this model by back-testing it on the previous year's data. Depending on the structure in the data, the model can try to predict 1) whether the customer returns (0/1), 2) the customer's revenue class, or 3) the revenue itself. You can try the same for margin, to get a sense of which customers are changing their buying habits.
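The time-split setup described above, sketched on toy purchase records (pure Python; the record format and the simple "did the customer return" target are illustrative assumptions):

```python
# (customer_id, month, amount) purchase records over months 1..12
purchases = [
    ("a", 2, 40.0), ("a", 5, 25.0), ("a", 9, 30.0),
    ("b", 1, 10.0), ("b", 6, 15.0),
    ("c", 8, 50.0),
]

def split_features_target(purchases):
    """Months 1-6 spend as the feature; any purchase in months 7-12 as the target."""
    spend_h1, returned = {}, {}
    for cust, month, amount in purchases:
        if month <= 6:
            spend_h1[cust] = spend_h1.get(cust, 0.0) + amount
        else:
            returned[cust] = 1
    # Only customers observed in the first half can be scored
    return {c: (spend, returned.get(c, 0)) for c, spend in spend_h1.items()}

data = split_features_target(purchases)
# data["a"] -> (65.0, 1): spent 65 in H1 and returned in H2
# data["b"] -> (25.0, 0): spent 25 in H1 but did not return
```

Back-testing then means running the same split one year earlier and comparing the model's predicted return/revenue against what actually happened.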

Now you can foretell the customer value. You could try to account for net present value, etc., but I believe the models will never give you enough resolution to justify that additional logic.

Relate the future revenue, costs and margin to the current ones, and segment down to the store level (store type, province, area characteristics, etc.).

You can use the results as follows:

1) Predict overall revenue and margin for the next 6 months (interesting for finance in order to determine strategy, especially at the store level).
2) Spot customers who are trending upward or downward (interesting for campaigning purposes).
3) Understand the effects of promotions on article categories in light of the newly obtained KPIs (interesting for category managers).

Tuesday, March 4, 2014

Workforce Attrition Predictive Modeling (SPSS)

Recently, I was working on an analytics / data mining project to predict the chances of an employee's attrition. I used SPSS, although I learned that other tools, such as SAS, can also do this kind of predictive modeling. Here are a few factors I came up with for predicting a company's (employee-initiated) attrition. If you can suggest more factors, that would be really good to know!

Tenure in system in years
Any department-level features, e.g., average years between promotions
Last increment % (was it greater than 1%?)
Last annual bonus % (was it greater than 5%?)
Number of perks, e.g., entitlement for a car, work from home, etc.
Last Promotion date? Was the employee ever promoted?
External factors (e.g., geography)
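The factor list above could be encoded into a model-ready feature vector along these lines (pure Python sketch; the employee record format, field names and reference date are my own assumptions, following the thresholds in the list):

```python
from datetime import date

def attrition_features(emp, today=date(2014, 3, 4)):
    """Turn one employee record into numeric features per the factors above."""
    return {
        "tenure_years": (today - emp["joined"]).days / 365.25,
        "increment_gt_1pct": int(emp["last_increment_pct"] > 1.0),
        "bonus_gt_5pct": int(emp["last_bonus_pct"] > 5.0),
        "num_perks": len(emp["perks"]),
        "ever_promoted": int(emp["last_promotion"] is not None),
    }

# Hypothetical employee record
employee = {
    "joined": date(2010, 3, 1),
    "last_increment_pct": 0.8,     # below the 1% threshold
    "last_bonus_pct": 6.0,         # above the 5% threshold
    "perks": ["car", "work_from_home"],
    "last_promotion": None,        # never promoted
}
features = attrition_features(employee)
```

Department-level and external (geography) factors would join the record the same way; the resulting vectors can then feed any SPSS classification model.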

Sunday, September 1, 2013

Outlier Detection

Recently I prepared an SPSS Modeler manual for anomaly detection! It was a great experience.

• Outlier detection, including fraud detection, can largely be considered an imbalanced-class problem, because the event rate can be quite low (e.g., 1-2%).
• There are different ways in which fraud can happen, e.g., involving many people spread over different geographies, or a single person using his expertise to attempt fraud. Moreover, there can be many instances where one type of fraud is related, and hence linked, to other types of fraud. Also, some frauds happen over a span of just a few days, whereas others unfold over several months.
• It is not enough to classify an entity as an outlier vs. a normal entity, because it is entirely possible that a legitimate customer is carrying out an “outlier” transaction worth hundreds of dollars. Hence, it is important to keep false alarms as low as possible.
• One way is to detect outlier customers in the context of customer groups, using a) contextual attributes and b) behavioral attributes. But what if we cannot clearly partition the data into contexts (e.g., I was always a domestic user making many local calls, but for the last fifteen days there has been too much international activity from my phone)? In that case, model the normal behavior with respect to context: using a training dataset, train a model that predicts the expected behavioral attribute values from the contextual attribute values. An outlier is then an entity whose behavioral attribute values deviate significantly from the values predicted by the model.
• Solving the above step: in the real world, labeled instances can be difficult and expensive to obtain. Hence, we should consider applying semi-supervised learning for fraud detection in mobile networks. Here, we model only normality: the normal class is typically learned using unsupervised techniques like clustering (we may also try auto-associative neural networks), and the algorithm learns to recognize abnormality. It aims to define a boundary of normality even though it learns only one class. This type of learning can detect new patterns of fraud too: novelty (outliers) can be detected because the model generates closed boundaries around the normal patterns. Utilizing novel data also helps the objective of novelty detection (ND), since the model can specialize from the data and thereby increase its classification accuracy.
• Similarly, to detect fraud in network intrusion detection systems, unsupervised learning techniques are applied to minimize the number of false alarms.
• Hence, the ND problem can be summarized as: only samples from normal, “unchanged” regions are available, and the goal is to detect outliers / novelties / “changed” regions.
• We can also view semi-supervised learning as a scenario where some clusters are sufficiently formed because the instances in them are all labeled, while other clusters are not yet labeled; we can then use the labeled clusters (e.g., normal instances) to assign labels to the other clusters (e.g., outlier instances).
• Imbalanced classes can pose a problem for supervised learning techniques.
• SSND (semi-supervised novelty detection) has been applied to a setup framed as imbalanced binary classification between labeled and unlabeled samples, without using any labeled “changed” samples. The methods, based on the cost-sensitive support vector machine, assign different error costs to the two classes: errors on the unlabeled samples, which contain both “unchanged” (i.e., normal) and “changed” (i.e., outlier) samples, are penalized less than errors committed on labeled “unchanged” samples. The novelty of the proposed methods resides in retrieving the solutions for different cost asymmetries in a single optimization.
• The one-class support vector machine proposed by Schölkopf et al. maximizes the margin between the data and the origin in a higher-dimensional feature space. Similarly, the support vector data description (SVDD) defines a sphere around the target data in the induced feature space and detects outliers outside its boundary.
• While semi-supervised clustering deals effectively with reduced amounts of labeled instances for creating a classifier, SNA (social network analysis) contributes new attributes for characterizing agents, making it possible to introduce more information and domain knowledge when describing instances.
• Spotting fraud through SNA: fraudsters typically create a community with the sole purpose of committing fraud. Because there may not be a viral effect, they may not spread their ideas the way people do who are thinking of changing providers or who have found bargain prices. SNA can therefore help telecom operators spot differences in calling patterns within a community. These differences can be measured so that outliers stand out and become investigable.
• Note: SSND deals with situations having only labeled “unchanged” samples. These labeled “unchanged” instances are exploited jointly with a large set of unlabeled samples; no information about the changes that lie among the unlabeled data is available a priori, and the unlabeled samples contain both “unchanged” and “changed” samples.
• Data relationships can also be complex and non-linear. Fraud patterns today are not linear: e.g., a customer makes a large purchase while on vacation abroad, and that only once a year; here, neural networks can recognize this as an anomaly rather than fraud.
• The data mining techniques selected will be clustering, outlier detection and association mining (sequence rules). If feasible, SNA (social network analysis) can be used. The analytical tool could be SPSS Statistics.

REFERENCES:
[1] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt, “Support vector method for novelty detection,” Advances in Neural Information Processing Systems, vol. 12, pp. 582–588, 2000.
[2] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[3] D.M.J. Tax and R.P.W. Duin, “Support vector data description,” Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.
[4] M. Jans, N. Lybaert, and K. Vanhoof, “Data mining for fraud detection: Toward an improvement on internal control systems?”
[5] J. Botelho and C. Antunes, “Combining social network analysis with semi-supervised clustering: A case study on fraud detection,” Instituto Superior Técnico, Lisboa, Portugal.
[6] F. de Morsier, D. Tuia, M. Borgeaud, V. Gass, and J.-P. Thiran, “Semi-supervised novelty detection using SVM entire solution path.”
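The “model normal behavior with respect to context” idea above can be sketched with a simple per-context z-score: learn the mean and spread of the behavioral attribute for each context from normal data only, and flag values that deviate strongly. A pure-Python illustration (the call-volume numbers and the 3-sigma threshold are made-up assumptions; SPSS would be the tool in practice):

```python
import math

# (context, behavioral value): e.g., (user segment, intl. calls per day)
normal_data = [
    ("domestic", 1.0), ("domestic", 2.0), ("domestic", 1.5), ("domestic", 2.5),
    ("roamer", 10.0), ("roamer", 12.0), ("roamer", 11.0),
]

def fit_context_model(data):
    """Learn mean/std of the behavioral attribute per context (normal class only)."""
    groups = {}
    for ctx, val in data:
        groups.setdefault(ctx, []).append(val)
    model = {}
    for ctx, vals in groups.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        model[ctx] = (mean, math.sqrt(var))
    return model

def is_outlier(model, ctx, value, z_threshold=3.0):
    """Flag a value whose deviation from its context's norm exceeds the threshold."""
    mean, std = model[ctx]
    return abs(value - mean) / std > z_threshold

model = fit_context_model(normal_data)
# 12.5 intl. calls/day is normal for a roamer but anomalous for a domestic user.
print(is_outlier(model, "roamer", 12.5), is_outlier(model, "domestic", 12.5))
```

This is the simplest possible stand-in for the "predict expected behavior from context" model described above; a one-class SVM or SVDD, as in the cited references, generalizes the same idea to non-linear boundaries.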