Sunday, January 17, 2016

Addressing Multicollinearity

First, X and X^2 (or, for that matter, X and XZ) cannot be considered collinear in the mathematical sense, because the relationship between them is not linear. The correlation coefficient, which we commonly use to measure collinearity between two variables, is bound to give erroneous results here. So collinearity as measured by the correlation coefficient is not useful in such a case.
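
As a rough illustration of why the correlation coefficient misleads here, the sketch below simulates X and computes its correlation with X^2. The sample size, the seed and the two choices of mean (0 and 5) are assumptions picked only for the demonstration. X^2 is an exact function of X in both cases, yet the reported correlation depends entirely on where X is centred.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 100_000

# Same spread, different centre: X around 0 versus X around 5.
x_centered = rng.normal(loc=0.0, scale=1.0, size=n)
x_shifted = rng.normal(loc=5.0, scale=1.0, size=n)

# Pearson correlation between X and X^2 in each case.
corr_centered = np.corrcoef(x_centered, x_centered**2)[0, 1]
corr_shifted = np.corrcoef(x_shifted, x_shifted**2)[0, 1]

print(f"corr(X, X^2), X centred at 0: {corr_centered:.3f}")
print(f"corr(X, X^2), X centred at 5: {corr_shifted:.3f}")
```

With X centred at 0 the coefficient should come out close to zero, and with X centred at 5 close to one, although the dependence between X and X^2 is the same deterministic one both times.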

One should use the correlation coefficient only when one is relatively sure that the relationship between the two variables is linear.

In the case of logistic regression, all predictors should follow a normal distribution, and normalization is done to make sure that the predictors are unitless (and hence additive in the true sense). So if X and Z follow a normal distribution, then none of X^2, Z^2 or XZ follows a normal distribution. One can argue that after applying the CLT, XZ, X^2 or Z^2 will follow a normal distribution, but only in the limit (n -> inf).
Hence X^2, Z^2 and XZ do not qualify as predictors for logistic regression in the first place.
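
The claim that X^2 and XZ are not normal when X and Z are can be checked with a small simulation. This is only a sketch: it assumes X and Z are independent standard normal variables, and the helper names `skewness` and `excess_kurtosis` are made up for this example. A normal variable has skewness 0 and excess kurtosis 0, so values far from 0 indicate a non-normal shape.

```python
import numpy as np

def skewness(v):
    """Sample skewness: third central moment over variance^1.5."""
    d = v - v.mean()
    return (d**3).mean() / (d**2).mean() ** 1.5

def excess_kurtosis(v):
    """Sample kurtosis minus 3, so a normal variable scores about 0."""
    d = v - v.mean()
    return (d**4).mean() / (d**2).mean() ** 2 - 3.0

rng = np.random.default_rng(1)  # fixed seed, chosen arbitrarily
n = 200_000
x = rng.normal(size=n)  # assumed: X ~ N(0, 1)
z = rng.normal(size=n)  # assumed: Z ~ N(0, 1), independent of X

for name, values in [("X", x), ("X^2", x**2), ("XZ", x * z)]:
    print(f"{name:>4}: skewness={skewness(values):6.2f}, "
          f"excess kurtosis={excess_kurtosis(values):6.2f}")
```

X^2 should show a strong right skew and XZ much heavier tails than a normal variable, while X itself stays near 0 on both measures.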

The idea of any predictive model is to include all those predictors which have predictive value (that is, which carry information about the target variable). If X is a predictor and brings some information, the question one should consider is what more information X^2 can bring. In fact, by the nature of the function, X^2 conceals or confounds some of the information that X brings; for example, it discards the sign of X. The case of XZ is similar. (I need not explain that there are better ways to handle the interaction effect of factors X and Z.)

From the above three points it is clear that if you decide to include X and/or Z as predictors, you should not include X^2, Z^2 or XZ as predictors in the same predictive model.

About p-values: it is clear that the CLT is applied before calculating the p-values, so these are asymptotic p-values. Once I assume that X^2 or XZ follows a normal distribution, it does not matter whether I standardize X^2 or XZ; the p-values are bound to be the same anyway. There is no magic.
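
The point about standardization can be checked numerically. The sketch below fits a logistic regression twice on simulated data, once with X^2 as it is and once with X^2 standardized, and compares the asymptotic (Wald) z-statistics and p-values. Everything here is an assumption made for illustration: the simulated data, the coefficients used to generate it, and the hand-written Newton-Raphson helper `fit_logit`, which stands in for whatever fitting routine one actually uses.

```python
import math
import numpy as np

def fit_logit(design, y, iterations=50):
    """Fit a logistic regression by Newton-Raphson.

    Returns the Wald z-statistics and two-sided asymptotic p-values.
    """
    beta = np.zeros(design.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))
        weights = p * (1.0 - p)
        hessian = design.T @ (design * weights[:, None])
        beta = beta + np.linalg.solve(hessian, design.T @ (y - p))
    # Covariance of the estimates: inverse information at the solution.
    p = 1.0 / (1.0 + np.exp(-(design @ beta)))
    weights = p * (1.0 - p)
    cov = np.linalg.inv(design.T @ (design * weights[:, None]))
    z = beta / np.sqrt(np.diag(cov))
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return z, p_values

rng = np.random.default_rng(2)  # fixed seed, chosen arbitrarily
n = 5_000
x = rng.normal(size=n)
x2 = x**2
# Hypothetical data-generating coefficients, for illustration only.
prob = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * x + 0.3 * x2)))
y = (rng.uniform(size=n) < prob).astype(float)

x2_std = (x2 - x2.mean()) / x2.std()  # standardized X^2
ones = np.ones(n)

z_raw, p_raw = fit_logit(np.column_stack([ones, x, x2]), y)
z_std, p_std = fit_logit(np.column_stack([ones, x, x2_std]), y)

print("z (X, X^2), raw X^2:         ", z_raw[1:])
print("z (X, X^2), standardized X^2:", z_std[1:])
print("p (X, X^2), raw X^2:         ", p_raw[1:])
print("p (X, X^2), standardized X^2:", p_std[1:])
```

Standardizing X^2 only shifts and rescales that column, so the shift is absorbed by the intercept and the rescaling cancels between the coefficient and its standard error. The z-statistics and p-values for X and X^2 should therefore agree between the two fits; only the intercept row is expected to differ.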


