Past Exams Flashcards
Features in k-means clustering
Increasing the number of features may capture more complex patterns but:
-Interpretation becomes harder and the clusters become less useful for a predictive model. With two features the clusters can be interpreted visually with a scatterplot.
-Outliers in every added feature must be considered. In k-means, an outlier can end up in its own cluster if it is far enough from all other points.
-If the numeric variables have different ranges, they need to be scaled; otherwise the variable with the largest range dominates the distance calculation.
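A minimal sketch of the scaling point, using only the Python standard library. The ages and incomes are made-up numbers, and the helper names `standardize` and `distance` are hypothetical. It shows that on raw data the income gap swamps the age gap, while after standardizing both features contribute on a similar scale.

```python
import math

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]

def distance(a, b):
    """Euclidean distance, the measure k-means uses to assign points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up illustration: age in years and income in dollars.
ages = [25, 60, 27, 58]
incomes = [50_000, 51_000, 90_000, 91_000]

raw = list(zip(ages, incomes))
scaled = list(zip(standardize(ages), standardize(incomes)))

# Point 0 vs point 1: very different age, similar income.
# Point 0 vs point 2: similar age, very different income.
raw_age_gap = distance(raw[0], raw[1])
raw_income_gap = distance(raw[0], raw[2])
scaled_age_gap = distance(scaled[0], scaled[1])
scaled_income_gap = distance(scaled[0], scaled[2])

print("raw:   ", round(raw_age_gap, 1), round(raw_income_gap, 1))
print("scaled:", round(scaled_age_gap, 3), round(scaled_income_gap, 3))
```

On the raw data a 35-year age difference barely registers next to a 40,000-dollar income difference, so k-means would effectively cluster on income alone.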
How does cost complexity pruning work and how is it optimized?
Cost complexity pruning grows a large tree and then prunes it back by dropping splits that do not reduce the model error by at least a fixed amount determined by the complexity parameter (cp). The cp can be tuned with cross-validation (CV), which trains and tests models on different folds of the data. This is done for multiple values of the cp, and the value with the lowest CV error is selected.
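A rough sketch of the penalty idea, assuming the common formulation where each subtree is scored as training error plus cp times the number of leaves. The subtree sizes and errors below are invented for illustration, and `penalized_error` and `best_subtree` are hypothetical names, not functions from any tree library.

```python
# Hypothetical nested subtrees pruned from one large tree:
# (number of leaves, training error). The numbers are invented.
subtrees = [(1, 100.0), (2, 70.0), (3, 55.0), (5, 50.0), (8, 48.5)]

def penalized_error(leaves, error, cp):
    """Training error plus a charge of cp for every leaf in the tree."""
    return error + cp * leaves

def best_subtree(cp):
    """Return the (leaves, error) pair with the lowest penalized error."""
    return min(subtrees, key=lambda t: penalized_error(t[0], t[1], cp))

# Number of leaves chosen at each candidate cp value.
chosen = {cp: best_subtree(cp)[0] for cp in (0.0, 1.0, 5.0, 20.0, 40.0)}

for cp, leaves in chosen.items():
    print(f"cp={cp}: keep the subtree with {leaves} leaves")
```

A larger cp makes each extra leaf more expensive, so the chosen tree shrinks. Cross-validation would then compare the test error of the tree chosen at each cp and keep the cp with the lowest error.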
Finding lambda for lasso
A set of candidate lambda values is chosen for the search, and a cross-validation error is calculated for each value.
The cross-validation error is calculated by dividing the data into k folds. A single fold is held out for testing and the rest are used to train a lasso model with the current lambda value. This process is repeated for each of the k folds, and the CV error is the average of the error measures across the k held-out folds.
The optimal lambda is the one with the lowest CV error.
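The search can be sketched in plain Python under some simplifying assumptions: a single predictor, no intercept, simulated data, and mean squared error as the error measure. With one predictor the lasso slope has a closed form (soft-thresholding), which keeps the example short. The names `fit_lasso_1d` and `cv_error` are hypothetical helpers, not a library API.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within lam of zero become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_lasso_1d(xs, ys, lam):
    """Lasso slope for one predictor with no intercept.

    Minimizes (1 / (2n)) * sum((y - b * x)^2) + lam * |b|.
    """
    n = len(xs)
    xy = sum(x * y for x, y in zip(xs, ys)) / n
    xx = sum(x * x for x in xs) / n
    return soft_threshold(xy, lam) / xx

def cv_error(xs, ys, lam, k=5):
    """Average held-out MSE over k folds for one lambda value."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Rows are simulated independently, so interleaved folds are fine here.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        b = fit_lasso_1d([xs[i] for i in train_idx],
                         [ys[i] for i in train_idx], lam)
        mse = sum((ys[i] - b * xs[i]) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k

# Simulated data (an assumption for illustration): y = 2x + noise.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

lambdas = [0.0, 0.01, 0.1, 0.5, 1.0, 3.0]
errors = {lam: cv_error(xs, ys, lam) for lam in lambdas}
best_lambda = min(errors, key=errors.get)
print("best lambda:", best_lambda)
```

Here the largest lambda shrinks the slope to zero and gives a much higher CV error, while small values keep the real signal. In practice a library routine would handle many predictors and a finer lambda grid.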
What the variance and bias values indicate in a predictive model
The variance indicates how much the predictions vary depending on the training set used. As more predictors are used, the variance increases because the model fits the training data more closely and generalizes less well. The bias indicates how close the expected predictions are to the actual results on unseen data. Generally, as more predictors are used, the bias decreases and the expected predictions move closer to the actual values.
Stepwise vs LASSO
* Stepwise requires the modeler to remove predictors manually. LASSO removes predictors automatically.
* Stepwise removes an entire categorical variable. LASSO binarizes categorical variables and can remove individual levels.
* Stepwise removes one predictor at a time. LASSO assesses all predictors in a single model fitting.
High deviance
A higher test deviance for the GLM indicates that the model has overfit the training data. This overfitting results in higher deviance when the model is evaluated on unseen test data.