CA Mock exams Flashcards
How binarization might impact feature selection in stepwise selection
Binarization helps simplify the model to only the predictors deemed necessary, since individual factor levels can be left out if they do not contribute significantly to the model (a high p-value indicates a level is not contributing).
However, including only some of the dummy variables from a factor will result in the merging of certain levels. Since the resulting merger is purely performance-based, it can complicate the model's interpretability when unintuitive levels are combined.
what a cutoff is and how it is involved in calculating the AUC
After a model produces predicted probabilities, we choose a cutoff to convert them into positive and negative predictions: every predicted probability above the cutoff is classified as positive. If the cutoff is set too high, most observations will be predicted negative, producing a high specificity but a low sensitivity. In general, the cutoff dictates the trade-off between sensitivity and specificity.
Plotting every possible pair of sensitivity and specificity values obtained by varying the cutoff produces the ROC curve. The points are connected from the bottom left of the plot, where sensitivity is 0 and specificity is 1, to the top right, where sensitivity is 1 and specificity is 0. The area under the ROC curve is the AUC.
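The sweep described above can be sketched in numpy (the toy labels and probabilities are illustrative, not from the source):

```python
import numpy as np

def roc_auc(y_true, p_hat):
    """Sweep every cutoff, record (sensitivity, specificity), integrate the AUC."""
    # Cutoffs run from above every probability down to below every probability,
    # so the curve starts at (sensitivity 0, specificity 1) and ends at (1, 0).
    cutoffs = np.concatenate(([1.1], np.sort(np.unique(p_hat))[::-1], [-0.1]))
    pos, neg = y_true == 1, y_true == 0
    sens, spec = [], []
    for c in cutoffs:
        pred_pos = p_hat > c                              # positive above the cutoff
        sens.append(np.sum(pred_pos & pos) / pos.sum())   # true positive rate
        spec.append(np.sum(~pred_pos & neg) / neg.sum())  # true negative rate
    sens = np.array(sens)
    fpr = 1 - np.array(spec)
    # Trapezoidal area under sensitivity vs (1 - specificity)
    return np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2)

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc(y, p)   # 0.75 for this toy example
```

Raising a single cutoff only moves you to one point on this curve; the AUC summarizes all of them at once.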
GLM vs decision tree for numeric predictor?
For decision trees, a good numeric predictor has distinct intervals that lead to clear differences in the target; the tree identifies the split point where the difference is largest.
For a GLM, a good numeric predictor has a consistent (roughly monotonic) relationship with the target on the link scale; if there is no strong slope, a GLM will not capture the effect well.
Difference in resulting features generated by principal components vs clustering
PCA results in numeric features called principal components (PCs). Each PC is a linear combination of the analyzed variables, meaning a PC summarizes the variables by specifying how much each variable contributes to its calculation.
Clustering identifies clusters or groupings based on the analyzed variables, meaning it results in a factor: similar observations are grouped into the same cluster, while dissimilar observations are placed in different clusters.
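The contrast can be sketched with numpy on made-up data (the fixed centroids stand in for one assignment step of k-means and are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                  # center the data before PCA

# PCA via eigendecomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
loadings = eigvec[:, ::-1]               # columns ordered by decreasing variance
pc1 = Xc @ loadings[:, 0]                # numeric feature: a linear combination
                                         # (the loadings say how much each
                                         # variable contributes)

# Clustering: assign each row to its nearest centroid
centroids = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
cluster = d.argmin(axis=1)               # factor: a cluster label per observation
```

`pc1` is a continuous score per observation; `cluster` is a label in {0, 1}, which is exactly the numeric-vs-factor distinction above.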
Stepwise selection and regularization
Similarities:
Both perform feature selection, dropping predictors that do not contribute.
Both help avoid overfitting, especially when the number of observations is small compared to the number of predictors.
Reduce complexity
Difference
The way they measure flexibility differs: stepwise selection measures it by the number of predictors, while regularization measures it by the shrinkage parameter. Stepwise selection typically uses AIC or BIC to compare models; regularization tunes its parameter using a model accuracy metric calculated from cross-validation.
best subset vs stepwise
Best subset selection is performed by fitting all p models that contain exactly one predictor and picking the model with the smallest deviance, then fitting all p choose 2 models that contain exactly two predictors and picking the model with the lowest deviance, and so forth. A single best model is then selected from the models picked, using a metric such as AIC. Since 2^p models are considered in total, the search space can grow quite large as p increases.
Stepwise selection is a computationally more efficient alternative to best subset selection, since it considers a much smaller set of models. For example, forward stepwise selection begins with a
model containing no predictors, and then adds predictors to the model, one-at-a-time, until adding a
predictor leads to a worse model by a measure such as AIC. At each step the predictor that gives the
greatest additional improvement to the fit is added to the model. The best model is the one fit just
before adding a variable leads to a decrease in performance.
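A minimal numpy sketch of forward stepwise selection with a Gaussian AIC (the data, seed, and helper names are illustrative assumptions, not from the source):

```python
import numpy as np

def aic(X, y):
    """Gaussian AIC (up to an additive constant) for an OLS fit with intercept."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta) ** 2)
    return n * np.log(rss / n) + 2 * Z.shape[1]

def forward_stepwise(X, y):
    remaining = list(range(X.shape[1]))
    chosen = []
    best = aic(X[:, chosen], y)           # start from the intercept-only model
    while remaining:
        # At each step, try adding each remaining predictor...
        scores = [(aic(X[:, chosen + [j]], y), j) for j in remaining]
        new_best, j = min(scores)         # ...and keep the biggest improvement
        if new_best >= best:              # stop just before AIC gets worse
            break
        best = new_best
        chosen.append(j)
        remaining.remove(j)
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=200)
```

On this toy data only columns 0 and 2 carry signal, so the loop should pick them up and stop before the pure-noise columns pay off their AIC penalty.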
Impurity measure in classification trees
- decide which split in the decision tree (if any) should be made next
- decide which branches of the tree to prune back after building a decision tree, by removing branches that don't achieve a defined threshold of impurity reduction through cost-complexity pruning
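The three standard impurity measures can be written directly from their formulas (a sketch over class proportions, not any particular library's implementation):

```python
import numpy as np

def gini(p):
    """Gini index: sum of p_k * (1 - p_k) over class proportions p."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    """Entropy: -sum of p_k * log2(p_k), with 0*log(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def classification_error(p):
    """1 minus the proportion of the majority class."""
    return float(1 - np.max(np.asarray(p, dtype=float)))

# All three are largest at a 50/50 node and zero at a pure node,
# which is why a split is chosen to reduce them.
```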
single vs complete linkage
The clustering algorithm starts out with n clusters and fuses them together in an iterative process based
on which observations are most similar. The complete linkage method considers the maximum
intercluster dissimilarity, and the single linkage method uses the minimum intercluster dissimilarity. As such,
single linkage tends to fuse observations one at a time, resulting in less balanced clusters.
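The two linkage rules differ only in how the pairwise distances between two clusters are reduced, which a short numpy sketch makes concrete (the two toy clusters are made up):

```python
import numpy as np

def intercluster_dissimilarity(A, B, linkage="complete"):
    """Distance between clusters A and B: all pairwise point distances,
    reduced with max (complete linkage) or min (single linkage)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.max() if linkage == "complete" else d.min()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
# complete: farthest pair (0,0)-(5,0) gives 5; single: closest pair (1,0)-(3,0) gives 2
```

Because single linkage only needs one close pair to fuse two clusters, it tends to chain observations on one at a time, as described above.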
calculation to determine a split in classification vs regression tree
Regression trees determine splits by measuring the residual sum of squares (RSS) between the target and predicted values, choosing the split that reduces it the most.
Classification trees measure impurity using entropy, the Gini index, or classification error. These measures all attempt to increase the homogeneity of the target variable at each split.
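The regression-tree calculation can be sketched for a single numeric predictor (the toy data are illustrative; each child node predicts its mean, and the split minimizing total RSS wins):

```python
import numpy as np

def best_split(x, y):
    """Try every midpoint between consecutive sorted x values; return the
    cut point that minimizes the combined RSS of the two child nodes."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_rss, best_cut = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                      # no cut between equal x values
        cut = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        rss = (np.sum((left - left.mean()) ** 2)
               + np.sum((right - right.mean()) ** 2))
        if rss < best_rss:
            best_rss, best_cut = rss, cut
    return best_cut

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.2, 4.8, 20.0, 19.5, 20.5])
# The obvious gap between x = 3 and x = 10 yields the cut 6.5
```

A classification tree runs the same search but scores each candidate cut with the weighted impurity of the two children instead of RSS.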
accuracy vs auc
Accuracy is measured by the ratio of the number of correct predictions to the total number of predictions made, with the classifications based on a fixed cutoff point.
AUC measures performance across the full range of cutoffs, while accuracy measures performance only at the selected cutoff.
Explain how changing the link function in the GLM impacts the model fitting and how
this can impact predictor significance
The link function specifies a functional relationship between the linear predictor and the mean of the
distribution of the outcome conditional on the predictor variables. Different link functions have different
shapes and can therefore fit to different nonlinear relationships between the predictors and the target
variable.
When the link function matches the relationship of a predictor variable, the mean of the outcome
distribution (the prediction) will generally be closer to the actual values for the target variable, resulting
in smaller residuals and more significant p-values.
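The effect can be illustrated without fitting a full GLM: on made-up data whose mean is truly exponential in the predictor, a log-scale fit (mimicking a log link) leaves far smaller residuals than an identity-scale linear fit. All data and coefficients below are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 3, 50)
# True relationship is exponential, with small multiplicative noise
y = np.exp(1.0 + 0.8 * x) * np.exp(rng.normal(scale=0.05, size=50))

Z = np.column_stack([np.ones_like(x), x])

# Identity "link": model the mean directly as b0 + b1*x
b_id, *_ = np.linalg.lstsq(Z, y, rcond=None)
rss_identity = np.sum((y - Z @ b_id) ** 2)

# Log "link" (approximated here by a log transform): model log(mean) as b0 + b1*x
b_log, *_ = np.linalg.lstsq(Z, np.log(y), rcond=None)
rss_log = np.sum((y - np.exp(Z @ b_log)) ** 2)
# rss_log << rss_identity: the link matching the true shape leaves
# smaller residuals, which is what drives more significant p-values
```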
Proxy variable
Proxy variables are variables that are used in place of other information, usually because the desired
information is either impossible or impractical to measure. For a variable to be a good proxy it must
have a close relationship with the variable of interest.
why there are potential legal or ethical concerns, including whether proxy variables should be used
Data such as race, age, and income are generally considered sensitive information. Some jurisdictions
have legal constraints on the use of sensitive information. Before proceeding there should be
consideration of any applicable law. There are no clear rules for what ethical use of data is. Good
professional judgement must be used to ensure that inappropriate discrimination is not occurring within
the model or the project. Public perception should also be considered. The politician or the city could
suffer bad press if there is a belief that the project inappropriately discriminates.
Stepwise selection and regularization pt 2
Stepwise selection takes iterative steps, starting either from a model with no predictors or from a model with all predictors, and adds or drops predictors until there is no further improvement as measured by AIC.
Shrinkage methods fit coefficients for all predictors by optimizing a loss function that includes a penalty parameter penalizing large coefficients. Shrinkage methods can reduce the size of coefficients without eliminating variables.
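Ridge regression makes the shrinkage point concrete: its closed form keeps every coefficient but pulls them toward zero as the penalty grows. A numpy sketch on invented, centered data (no intercept, so the closed form applies directly):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients: (X'X + lam*I)^(-1) X'y.
    Assumes X and y are centered so no intercept is needed."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)
y = y - y.mean()

b_ols = ridge(X, y, 0.0)     # lam = 0 recovers ordinary least squares
b_pen = ridge(X, y, 10.0)    # coefficients shrink toward zero...
# ...but none is set exactly to zero (that elimination is lasso's behavior)
```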
prescriptive analytics
A study that emphasizes the outcomes or consequences of the decisions made in modeling or implementation.