Sunday, January 17, 2016

Variable Selection

First, apply simple screening steps to remove junk variables. For instance, drop variables that have too high a proportion of missing values, too low a coefficient of variation, etc.
Then apply Weight of Evidence (WoE) and Information Value (IV).
Drop the variables whose IV is either too low or too high.
Then keep the categories (values) that have a high WoE.
You can do this directly for categorical variables. For continuous variables, first apply binning and then compute WoE and IV on the bins.
Then apply VIF (Variance Inflation Factor) to remove variables with high multicollinearity.
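A minimal sketch of the WoE/IV step for a binned continuous variable, using pandas. The column names and the IV cut-offs in the usage note are illustrative assumptions, not fixed rules:

    import numpy as np
    import pandas as pd

    def woe_iv(df, feature, target, bins=10):
        # Bin a continuous feature and compute WoE/IV against a binary target (1 = event).
        d = pd.DataFrame({"bin": pd.qcut(df[feature], q=bins, duplicates="drop"),
                          "y": df[target]})
        g = d.groupby("bin")["y"].agg(["count", "sum"])
        g["events"] = g["sum"]
        g["non_events"] = g["count"] - g["sum"]
        # share of events / non-events per bin; +0.5 avoids log(0) in empty cells
        g["pct_events"] = (g["events"] + 0.5) / g["events"].sum()
        g["pct_non_events"] = (g["non_events"] + 0.5) / g["non_events"].sum()
        g["woe"] = np.log(g["pct_events"] / g["pct_non_events"])
        g["iv"] = (g["pct_events"] - g["pct_non_events"]) * g["woe"]
        return g[["woe", "iv"]], g["iv"].sum()

    # hypothetical usage: keep variables whose total IV is neither too low nor suspiciously high
    # woe_table, iv = woe_iv(df, "age", "default_flag", bins=10)
    # keep_variable = 0.02 < iv < 0.5

And a sketch of the VIF screen, assuming the surviving candidates sit in a numeric DataFrame X; the threshold of 5.0 is a common rule of thumb, not something prescribed above:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def vif_filter(X, threshold=5.0):
        # Iteratively drop the predictor with the highest VIF until all VIFs <= threshold.
        X = X.copy()
        while True:
            Xc = sm.add_constant(X)
            vifs = pd.Series([variance_inflation_factor(Xc.values, i)
                              for i in range(1, Xc.shape[1])], index=X.columns)
            if vifs.max() <= threshold:
                return X, vifs
            X = X.drop(columns=[vifs.idxmax()])

    # hypothetical usage:
    # X_reduced, final_vifs = vif_filter(df[numeric_candidates])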

Addressing Multicollinearity

First, X and X^2, or for that matter X and XZ, cannot be considered collinear in a mathematical sense. The correlation coefficient, as we commonly use it to measure collinearity between two variables, is bound to give erroneous results here. So collinearity measured this way is not useful in such a case.
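A small illustration of that point, as a sketch with simulated data (the distribution and sample size are arbitrary): when X is roughly symmetric around zero, the Pearson correlation between X and X^2 comes out near zero even though X^2 is a deterministic function of X, and it jumps towards 1 once X is shifted away from zero.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0, 1, 100_000)                   # X roughly symmetric around 0
    x2 = x ** 2                                     # perfectly determined by X

    print(np.corrcoef(x, x2)[0, 1])                 # close to 0: "no collinearity" by this measure
    print(np.corrcoef(x + 5, (x + 5) ** 2)[0, 1])   # shift X away from 0: correlation close to 1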

One should use the correlation coefficient only if you are relatively sure the relationship between the two variables is linear.

In the case of logistic regression, all predictors should follow a normal distribution, and normalization is done to make sure the predictors are unitless (and hence additive in the true sense). So if X and Z follow normal distributions, then neither X^2, Z^2 nor XZ follows a normal distribution. One can argue that by the CLT, X^2, Z^2 or XZ will follow a normal distribution, but only in the limit (n -> inf).
Hence X^2, Z^2 or XZ does not qualify as a predictor for logistic regression in the first place.

The idea of any predictive model is to include all predictors that have predictive value (i.e., carry information about the target variable). If X is a predictor and brings some information, the question to consider is what more information X^2 can bring. In fact, by the nature of the function X^2, it conceals or confounds some of the information that X brings; the same holds for XZ. (I need not explain that there are better ways to handle the interaction effect of factors X and Z.)

From the above three points it is clear that if you decide to include X and/or Z as predictors, you should not include X^2, Z^2 or XZ as predictors in the same model.

About p-values: it is clear that the CLT is applied before calculating them, so these are asymptotic p-values. Once I assume X^2 or XZ follows a normal distribution, it does not matter whether I standardize X^2 or XZ; the p-values are bound to be the same anyway, there is no magic.
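A quick check of the "no magic" point, as a sketch with statsmodels on simulated data (the coefficients and sample size are made up): standardizing a single predictor rescales its coefficient and standard error by the same factor, so its z-statistic and asymptotic p-value are unchanged.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.normal(size=5_000)
    xz = x * rng.normal(size=5_000)                          # an interaction-style term
    y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * x + 0.3 * xz))))

    for term in (xz, (xz - xz.mean()) / xz.std()):           # raw vs standardized
        X = sm.add_constant(np.column_stack([x, term]))
        fit = sm.Logit(y, X).fit(disp=0)
        print(fit.pvalues[2])                                # identical p-value both times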



How often do you see customers rebuild/refresh models?

I am interested in how frequently you see customers rebuilding/refreshing models they have deployed in production. Are they using our C&DS automation to schedule these? Do they refresh their models automatically, or do they have analysts perform the refreshes manually?

This is what I see customers wanting:
-Champion challenger
-Self-improving models: the model learns as new data comes in (à la our naive Bayes 'self-learning' model).

This is what I see customers doing:
-use CADS to store models.
-manually score.
-sometimes schedule scoring.
-real-time deployment.
-Never: champion challenger. Reason: a model needs thorough checking when re-created. The interface in CADS is not on par.
-Never: refresh. Reason: a model needs thorough checking. Refresh works well, but storing and replacing the model is too complex because it involves scripting and Modeler/CADS interplay.

This is what I see customers wanting to do:

Analytical reporting:
-being able to set up an experiment in CADS to keep track of model performance over the lifetime of the model. 
-having a comprehensive (=prebuilt) and configurable model evaluation dashboard.  

Operational reporting:
-being able to see what the model predicts (without knowing the outcomes yet, hence the operational reporting).
-having a comprehensive and configurable model scoring dashboard.  

In addition: having both abilities work in a convenient way when one has many models in production. On top of such massive model deployment, having a way to quickly get insight into model trends and being able to raise alerts on the worst-performing models.


CLV (Customer Lifetime Value) Modeling for Retail (supermarket and grocery chain)

The first challenge is identifying the customer across visits. There is some entity analytics that can be done: you associate loyalty cards with credit card/bank account numbers, and so you can even identify when the same customer changes cards. Don't expect 100% identification here. Divide your purchases into loose baskets vs. identified customers and give each a separate treatment; for the former you can only provide margin details per basket.
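A minimal sketch of the card-linking idea, assuming you can extract (loyalty_card, payment_account) pairs seen on the same transaction; linking cards that share an account (union-find / connected components) yields one customer id per group. The column names are hypothetical, and real entity analytics will be fuzzier than this:

    import pandas as pd

    def link_customers(pairs):
        # pairs: DataFrame with columns loyalty_card, payment_account (one row per observed pair).
        # Returns a dict loyalty_card -> customer id, via union-find over shared accounts.
        parent = {}

        def find(a):
            parent.setdefault(a, a)
            while parent[a] != a:
                parent[a] = parent[parent[a]]        # path compression
                a = parent[a]
            return a

        def union(a, b):
            parent[find(a)] = find(b)

        for card, account in pairs[["loyalty_card", "payment_account"]].itertuples(index=False):
            union(("card", card), ("acct", account))
        return {c: find(("card", c)) for c in pairs["loyalty_card"].unique()}

    # hypothetical usage:
    # customer_of = link_customers(transactions[["loyalty_card", "payment_account"]].dropna())
    # transactions["customer_id"] = transactions["loyalty_card"].map(customer_of)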

Revenue = easy: sum of purchases minus returns.
Costs = tricky. Take a top-down approach: get the details from finance and collaborate with finance on this. If they sell, say, fresh goods and electronics, finance likely has a separate P&L for each. If you find that costs at department level are 35% of total revenue, you use that number for every product in that category. Try to get as deep as possible; you likely can't do product level. An important distinction is own brand vs. foreign brand, which gives you several factors to account for. Likely you only have cost data for the main effects (own brand vs. foreign brand, and category A vs. B, rather than for each of the 4 combinations). Use raking (iterative proportional fitting) to get to the 4-combination level. (Raking is used in reweighting surveys to make the research group have properties similar to the population.) It is available in SPSS.
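A sketch of the raking step under these assumptions: you know the cost totals for the two main effects (own vs. foreign brand, category A vs. B) and seed the 4 cells with, say, revenue shares; iterative proportional fitting then rescales rows and columns until both margins match. All numbers below are made up.

    import numpy as np

    def rake(seed, row_totals, col_totals, iters=100, tol=1e-9):
        # Iterative proportional fitting: adjust the seed table until its row and
        # column sums match the given totals (which must themselves be consistent).
        table = seed.astype(float).copy()
        for _ in range(iters):
            table *= (row_totals / table.sum(axis=1))[:, None]   # fit row margins
            table *= (col_totals / table.sum(axis=0))[None, :]   # fit column margins
            if np.allclose(table.sum(axis=1), row_totals, rtol=tol):
                break
        return table

    # rows = own brand / foreign brand, columns = category A / B (made-up figures)
    seed = np.array([[40.0, 60.0],
                     [80.0, 120.0]])          # e.g. revenue used as a proxy for the cost split
    row_totals = np.array([90.0, 210.0])      # known cost totals per brand type
    col_totals = np.array([110.0, 190.0])     # known cost totals per category
    print(rake(seed, row_totals, col_totals))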

Now revenue minus cost is margin. At product-group level this is very interesting to visualize (revenue vs. margin %, revenue vs. costs; spot the outliers, do some segmentation, color the scatterplots by the segmentation and the various other characteristics you have).
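For the product-group view, a small matplotlib sketch; the product groups, figures and segment labels are invented purely to show the shape of the plot:

    import pandas as pd
    import matplotlib.pyplot as plt

    # invented product-group table for illustration
    pg = pd.DataFrame({
        "product_group": ["dairy", "produce", "electronics", "own-brand snacks"],
        "revenue":       [1_200_000, 900_000, 2_500_000, 400_000],
        "margin_pct":    [22.0, 30.0, 8.0, 35.0],
        "segment":       ["fresh", "fresh", "non-food", "own brand"],
    })

    fig, ax = plt.subplots()
    for segment, grp in pg.groupby("segment"):
        ax.scatter(grp["revenue"], grp["margin_pct"], label=segment, alpha=0.8)
    for _, row in pg.iterrows():
        ax.annotate(row["product_group"], (row["revenue"], row["margin_pct"]))
    ax.set_xlabel("revenue")
    ax.set_ylabel("margin %")
    ax.legend(title="segment")
    plt.show()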

Now roll up to customer level. Use 1+ year of purchases and report the numbers on a yearly basis. Again, show that customer margin is not just a matter of taking, say, 15% of revenue, but differs per customer. Segment customers on margin and, per resulting segment, profile the product groups they buy from.
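A sketch of the customer roll-up and margin segmentation, assuming a transaction-level DataFrame with hypothetical columns customer_id, revenue, cost and product_group; the quartile split into four segments is just one possible choice:

    import pandas as pd

    def customer_rollup(tx):
        # Roll transactions up to customer level and segment customers on margin %.
        cust = tx.groupby("customer_id").agg(revenue=("revenue", "sum"),
                                             cost=("cost", "sum"))
        cust["margin"] = cust["revenue"] - cust["cost"]
        cust["margin_pct"] = 100 * cust["margin"] / cust["revenue"]
        cust["margin_segment"] = pd.qcut(cust["margin_pct"], 4,
                                         labels=["low", "mid-low", "mid-high", "high"])
        return cust

    # hypothetical usage: per margin segment, profile the product groups customers buy from
    # cust = customer_rollup(transactions)
    # profile = (transactions.join(cust["margin_segment"], on="customer_id")
    #            .groupby(["margin_segment", "product_group"])["revenue"].sum())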

Next, lifetime. Properly divide your available timeline into parts and use the purchase data from, say, months 1 to 6 to predict the purchase amount for the next six months. Validate this model by back-testing it on the data from the year before. Depending on the structure in the data, the model can try to predict 1) whether the customer returns (0/1), 2) what their revenue class will be, or 3) what their revenue will be. You can try to do the same for margin in order to get a sense of which customers change their buying habits.
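A sketch of that back-testing setup under these assumptions: a transaction table with hypothetical columns customer_id, date (datetime) and revenue; months 1-6 provide the features, months 7-12 the target, and the same split one year earlier is the back-test. The gradient-boosting regressor is just a placeholder model:

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    def build_xy(tx, feat_start, feat_end, target_end):
        # Features from [feat_start, feat_end), target = revenue in [feat_end, target_end).
        feat = tx[(tx["date"] >= feat_start) & (tx["date"] < feat_end)]
        targ = tx[(tx["date"] >= feat_end) & (tx["date"] < target_end)]
        X = feat.groupby("customer_id").agg(revenue_6m=("revenue", "sum"),
                                            visits_6m=("revenue", "size"),
                                            avg_basket=("revenue", "mean"))
        y = targ.groupby("customer_id")["revenue"].sum().reindex(X.index, fill_value=0.0)
        return X, y                                   # non-returning customers get target 0

    # hypothetical usage, with tx the transaction DataFrame:
    # X_tr, y_tr = build_xy(tx, "2015-01-01", "2015-07-01", "2016-01-01")   # this year's split
    # X_bt, y_bt = build_xy(tx, "2014-01-01", "2014-07-01", "2015-01-01")   # same split, year before
    # model = GradientBoostingRegressor().fit(X_tr, y_tr)
    # print("back-test R^2:", model.score(X_bt, y_bt))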

Now you can forecast the customer value. You could properly try to account for net present value etc., but I believe the models will never give you enough resolution to justify going through that additional logic.

Use the future revenue, costs and margin in relation to the current ones and segment to store level (store type, province, area characteristics, etc.).

You can use the results as follows:

1) predict overall revenue and margin for the next 6 months (interesting for finance in order to determine strategy, especially at store level)
2) spot customers who are trending upward or downward (interesting for campaigning purposes)
3) understand the effects of promotions on article categories in light of the newly obtained KPIs (interesting for category managers).