Variable Transformations Flashcards
Log Transformations and Root Transformations
Log draws large numbers down toward 0, compressing the upper end of the scale far more than the lower end.
-Only positive values can be log transformed.
-Values between 0 and 1 map to negative numbers, heading toward negative infinity as the value approaches 0.
-The base is arbitrary, but it is common to use base e (the natural log).
Square root and cube root transformations (on non-negative values) are similar to a log transformation in that they pull large numbers down toward 0, though less aggressively.
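A quick illustration in R of how these transformations compress the scale (values chosen for demonstration):

    x <- c(0.1, 1, 10, 100, 10000)
    log(x)   # -2.30  0.00  2.30  4.61   9.21  -- values in (0, 1) go negative
    sqrt(x)  #  0.32  1.00  3.16 10.00 100.00  -- large values pulled in, but less so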
Useful for reducing right skewness
->Skewness and outliers tend to strongly influence a variable's average.
A transformation could reduce the leverage that large outliers have in fitting the model.
The transformation shrinks the large values relative to the smaller values. This should reduce heteroskedasticity: in an OLS fit on the transformed target, the residuals should grow less in variability as the fitted values increase than they do in an OLS fit on the untransformed target.
When modeling a transformed target, keep in mind that the model's predictions are in the transformed unit rather than the original unit
If the natural log of the target is modeled to follow a normal distribution, it means the target is modeled to follow a lognormal distribution
–Use exp(predict(model, newdata = dataset)) when computing RMSE, etc.
–The lognormal distribution is valid only for positive numbers
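A minimal sketch in R of modeling a log-transformed target and back-transforming (the data frame dataset and the column target are hypothetical names):

    # Fit OLS on the log of the target (requires target > 0)
    model <- lm(log(target) ~ ., data = dataset)

    # predict() returns values on the log scale; exponentiate them
    preds <- exp(predict(model, newdata = dataset))

    # RMSE is then computed in the original unit of the target
    rmse <- sqrt(mean((dataset$target - preds)^2))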
Numeric vs Factor
Two considerations:
1. Is the variable a factor by nature?
->Yes: convert it with factor() or as.factor()
->No: consider #2, and also check whether numeric computations (e.g., averaging the values) make sense. If not, the variable may still need to be converted
2. How many unique values does the variable have?
->If many, keep it numeric. It can also be transformed by grouping the values into ranges (see the sketch after this list)
->If few, make sure it makes sense as a factor before transforming.
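A minimal sketch in R of both conversions (dataset and its columns are hypothetical):

    # A by-nature factor stored as numbers (e.g., a region code)
    dataset$region <- as.factor(dataset$region)

    # A numeric variable with many unique values, grouped into ranges
    dataset$age_group <- cut(dataset$age,
                             breaks = c(0, 25, 45, 65, Inf),
                             labels = c("0-25", "26-45", "46-65", "65+"))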
Factor rather than numeric:
Reason 1: Converting to a factor variable allows each level of the factor to be examined separately. For example, a regularized regression can distinguish which levels of the factor are significant.
Reason 2: A factor variable may be more interpretable than numeric. For example, it may be simpler to convey discrete age group comparisons than a general linear effect of age.
Reason 3: Converting to a factor variable allows for more complex relationships to the target variable since separate coefficients are estimated for each level.
Dummy Variables
Parametric models require predictors to be numeric
Dummy variables are how factors can be represented as numeric predictors. A dummy variable takes on the values of 0 or 1, and is tied to one level of a factor.
It takes w - 1 dummy variables to represent a factor with w levels. The level that does not have a dummy variable is the reference level which is represented by all the dummy variables equaling 0.
Interpretation: b_j is the change in ŷ for an observation at level j compared to an observation at the reference level, assuming all other predictors are held constant
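For example, model.matrix() in R shows the dummy variables a model would build from a factor (the factor below is hypothetical):

    # A factor with w = 3 levels needs w - 1 = 2 dummy variables
    color <- factor(c("red", "green", "blue", "green"))
    model.matrix(~ color)
    #   (Intercept) colorgreen colorred
    # 1           1          0        1   <- "red"
    # 2           1          1        0   <- "green"
    # 3           1          0        0   <- "blue" (reference: all dummies 0)
    # 4           1          1        0   <- "green"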
o Linear Models: Linear models fit a coefficient for each level of a categorical variable except the base level. The coefficient for each level represents its impact relative to the base level. This is essentially “one-hot” encoding with the base level’s column dropped: each level gets a new variable with value 1 for observations at that level and 0 otherwise.
o Tree-Based Models: Decision trees split the levels of a categorical variable into groups. The more levels the variable has, the more potential ways there are to split the category into groups. The decision tree algorithm identifies which variables to split, and into which groups, by maximizing information gain. Decision trees naturally allow for interactions between categorical variables based on how the tree is built: a leaf node could have two or more ancestor nodes that split on categorical variables, which represents an interaction of those variables. The tree may also split on the same variable more than once.
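A minimal sketch of a regression tree on a categorical predictor, using the rpart package (dataset, target, and occupation are hypothetical names):

    library(rpart)

    # The tree can split the factor's levels into any two groups, and may
    # split on the same variable again deeper in the tree
    fit <- rpart(target ~ occupation + age, data = dataset, method = "anova")
    fit  # the printed splits show which levels were grouped together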
Binarization
Binarization is the process of creating dummy variables manually and supplying them to the R modeling function, rather than supplying the factor itself
Binarization allows algorithms to drop individual factor levels that are not significant relative to the base level. This can lead to a simpler model with fewer coefficients to estimate. The disadvantage of binarization is that nonsensical results can be obtained if only a handful of factor levels are included in the model. For example, if I binarized education_num, I might find that education_num = 7 is the only factor level included in the model and that it leads to higher-value applicants, suggesting that education only matters if you stop at that exact education_num.
Binarization allows us to select the reference level, while automatic factor handling does not. If a different base level is chosen with manual binarization, the coefficient estimates will change because they are relative to the reference level. The p-values will also change because they are relative to the base level.
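A minimal sketch of manual binarization in R, including choosing the reference level (dataset, education, target, and the level "HS-grad" are hypothetical):

    # Choose the reference level explicitly before binarizing
    dataset$education <- relevel(dataset$education, ref = "HS-grad")

    # Build the dummy variables manually, dropping the intercept column
    binarized <- model.matrix(~ education, data = dataset)[, -1]

    # Supply the dummies instead of the factor; individual columns (levels)
    # can now be dropped from the model if they are not significant
    model <- lm(dataset$target ~ binarized)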