Machine Learning Flashcards
What is a confidence interval?
A 95% confidence interval is a range of values computed from the sample such that, if we repeatedly drew new samples and constructed an interval from each, approximately 95% of those intervals would contain the true value of the unknown parameter.
What is residual standard error?
It is an estimate of the standard deviation of epsilon, the irreducible error term. Roughly, it is the average amount by which the response deviates from the true regression line.
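A minimal sketch of the computation, using made-up response values and fitted values (for a simple linear regression, the degrees of freedom are n - 2):

```python
import numpy as np

# Hypothetical data: observed responses y and fitted values y_hat
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
n, p = len(y), 1                     # p = number of predictors
rse = np.sqrt(rss / (n - p - 1))     # residual standard error
```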
Define the R2 statistic?
The proportion of the variability in Y that can be explained by X.
R2 = (TSS - RSS)/TSS
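A quick sketch of the formula above with toy values (y and y_hat are made up for illustration):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])       # observed responses
y_hat = np.array([3.5, 4.5, 7.5, 8.5])   # fitted values

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
r2 = (tss - rss) / tss              # proportion of variance explained
```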
Define population regression line?
The best linear approximation to the true relationship between X and Y (which need not itself be linear).
Why is the F-statistic preferred over individual t-statistics in multiple regression?
With many predictors, examining individual t-statistics inflates the chance of false positives: even if no predictor is related to the response, about 5% of the p-values will fall below 0.05 by chance. The F-statistic tests all coefficients jointly and adjusts for the number of predictors.
Types of variable selection methods
Forward selection, backward selection, and mixed selection.
What are the assumptions for linear regression model?
- Linear relationship between X and Y.
- Uncorrelated error terms
- Constant variance of the error terms
- No outliers
- No high-leverage points
- No collinearity
Difference between prediction intervals and confidence intervals?
A confidence interval quantifies how close y-hat is to f(x), the average response; a prediction interval quantifies how close y-hat is to an individual response y. Prediction intervals are wider than confidence intervals because they account for both the reducible and the irreducible error.
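A sketch of both intervals for simple linear regression, using toy data and a fixed approximate t quantile rather than a statistics library. The extra "1 +" under the square root in the prediction interval is the irreducible-error term that makes it wider:

```python
import numpy as np

# Toy data for a simple linear regression y = b0 + b1*x + eps
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # least-squares slope
b0 = y.mean() - b1 * x.mean()                        # least-squares intercept
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))            # residual standard error

x0 = 3.5        # new point at which to form intervals
t = 2.776       # approximate 97.5% t quantile for df = 4 (hard-coded here)

# Confidence interval half-width: uncertainty about f(x0) only
ci_half = t * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
# Prediction interval half-width: adds the irreducible error (the leading 1)
pi_half = t * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
```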
Explain how decision trees work in brief?
Decision trees divide the predictor space into distinct and non-overlapping regions. We construct the tree using recursive binary splitting, a top-down greedy approach: we start at the top of the tree, where all the observations belong to a single region, and as we move down the tree we successively split the predictor space into two new branches.
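A minimal sketch of one step of recursive binary splitting for a regression tree, assuming a single predictor and an RSS criterion (the data and the `best_split` helper are illustrative, not from any library):

```python
import numpy as np

def best_split(x, y):
    """Greedy search for the cutpoint s minimizing the combined RSS
    of the two regions {x < s} and {x >= s}."""
    best_s, best_rss = None, np.inf
    for s in np.unique(x)[1:]:                 # candidate cutpoints
        left, right = y[x < s], y[x >= s]
        rss = (np.sum((left - left.mean()) ** 2)
               + np.sum((right - right.mean()) ** 2))
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

# Two well-separated clusters: the best split should fall between them
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.9])
s, rss = best_split(x, y)
```

A full tree would apply `best_split` recursively within each resulting region.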
What is tree pruning?
Tree pruning is done to avoid overfitting. We grow a very large tree and then prune it back to obtain a subtree, typically via cost complexity pruning, which trades off the subtree's fit against its number of terminal nodes.
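A toy sketch of the cost-complexity criterion, RSS + alpha * |T|, where |T| is the number of terminal nodes. The two candidate subtrees and their RSS values are hypothetical:

```python
# Hypothetical subtrees: (training RSS, number of terminal nodes |T|)
subtrees = {"full": (2.0, 8), "pruned": (3.0, 3)}

def cost(rss, n_leaves, alpha):
    # Cost-complexity criterion: penalize fit by tree size
    return rss + alpha * n_leaves

alpha = 0.5
costs = {name: cost(rss, T, alpha) for name, (rss, T) in subtrees.items()}
best = min(costs, key=costs.get)   # subtree with the lowest penalized cost
```

With a larger alpha the smaller subtree wins, even though its raw RSS is worse.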
What are some measures that can be used for splitting a node in tree-based methods?
- Classification error rate: the fraction of training observations in a region that do not belong to the most common class in that region.
- Gini index: a measure of total variance across the K classes, and also a measure of node purity; a small value indicates that the node is pure. G = sum_k ( p_mk * (1 - p_mk) )
- Entropy: like the Gini index, entropy is sensitive to node purity. D = - sum_k ( p_mk * log(p_mk) )
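The two purity measures above can be sketched directly from their formulas (natural log used for entropy; the example class proportions are made up):

```python
import numpy as np

def gini(p):
    # G = sum_k p_mk * (1 - p_mk)
    p = np.asarray(p)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    # D = -sum_k p_mk * log(p_mk), with 0 * log(0) taken as 0
    p = np.asarray(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

pure = [1.0, 0.0]    # all observations in one class: minimal impurity
mixed = [0.5, 0.5]   # evenly split classes: maximal impurity for K = 2
```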
Why is node purity important?
High node purity indicates higher confidence in the predictions of the model.
Why are random forest, bagging and boosting more commonly used than decision trees?
Decision trees tend to have high variance. Bagging and random forests average the predictions of many trees, which reduces the variance of the final model; boosting instead combines many weak learners fit sequentially.
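A simulation sketch of why averaging reduces variance. Each "tree" is modeled as the true value plus independent noise (a stand-in for a high-variance base learner, not a real tree fit); averaging B of them shrinks the variance of the ensemble prediction by roughly a factor of B:

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 5.0
B = 200          # number of base learners averaged per ensemble

# 10,000 single-learner predictions vs 10,000 bagged (averaged) predictions
single_preds = true_value + rng.normal(0, 1.0, size=10_000)
bagged_preds = (true_value + rng.normal(0, 1.0, size=(10_000, B))).mean(axis=1)

var_single = single_preds.var()   # about 1
var_bagged = bagged_preds.var()   # about 1/B
```

In practice the trees' errors are correlated, so the reduction is smaller than 1/B; random forests decorrelate the trees by restricting the predictors considered at each split.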
Why is recursive binary splitting a greedy approach?
At each step, the best split is chosen at that particular step, rather than looking ahead and picking a split that might lead to a better tree at some future step.
What are the advantages and disadvantages of decision trees?
Advantages:
1. easy to interpret.
2. can capture non-linear relationships between predictor variables and the response.
3. can handle qualitative predictors without creating dummy variables.
Disadvantages:
1. high variance
2. non-robust: small changes in the training data can cause large changes in the final tree.