Machine Learning Flashcards
What is a confidence interval?
A 95% confidence interval is a range of values computed from the sample such that, if we repeatedly drew new samples and constructed an interval from each, approximately 95% of those intervals would contain the true value of the unknown parameter.
What is residual standard error?
It is an estimate of the standard deviation of epsilon, the irreducible error term. Roughly, it is the average amount by which the response deviates from the true regression line.
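A minimal sketch of the computation, using made-up response values and fitted values (for a simple linear regression, the degrees of freedom are n - 2):

```python
import numpy as np

# Hypothetical data: observed responses y and fitted values y_hat
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
n, p = len(y), 1                     # p = number of predictors
rse = np.sqrt(rss / (n - p - 1))     # residual standard error
```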
Define the R2 statistic?
The proportion of the variability in Y that can be explained by X.
R2 = (TSS - RSS)/TSS
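A quick sketch of the formula above with toy values (y and y_hat are made up for illustration):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])       # observed responses
y_hat = np.array([3.5, 4.5, 7.5, 8.5])   # fitted values

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
r2 = (tss - rss) / tss              # proportion of variance explained
```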
Define population regression line?
The best linear approximation to the true relationship between X and Y (which need not itself be linear).
Why is the F-statistic preferred over individual t-statistics in multiple regression?
With many predictors, examining individual t-statistics inflates the chance of false positives: even if no predictor is related to the response, about 5% of the p-values will fall below 0.05 by chance. The F-statistic tests all coefficients jointly and adjusts for the number of predictors.
Types of variable selection methods
Forward selection, backward selection, and mixed selection.
What are the assumptions for linear regression model?
- Linear relationship between X and Y.
- Uncorrelated error terms
- Constant variance of the error terms
- No outliers
- No high-leverage points
- No collinearity
Difference between prediction intervals and confidence intervals?
A confidence interval quantifies how close y-hat is to f(x), the average response; a prediction interval quantifies how close y-hat is to an individual response y. Prediction intervals are wider than confidence intervals because they account for both the reducible and the irreducible error.
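A sketch of both intervals for simple linear regression, using toy data and a fixed approximate t quantile rather than a statistics library. The extra "1 +" under the square root in the prediction interval is the irreducible-error term that makes it wider:

```python
import numpy as np

# Toy data for a simple linear regression y = b0 + b1*x + eps
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # least-squares slope
b0 = y.mean() - b1 * x.mean()                        # least-squares intercept
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))            # residual standard error

x0 = 3.5        # new point at which to form intervals
t = 2.776       # approximate 97.5% t quantile for df = 4 (hard-coded here)

# Confidence interval half-width: uncertainty about f(x0) only
ci_half = t * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
# Prediction interval half-width: adds the irreducible error (the leading 1)
pi_half = t * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
```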
Explain how decision trees work in brief?
Decision trees divide the predictor space into distinct and non-overlapping regions. We construct the tree using recursive binary splitting, a top-down greedy approach: we start at the top of the tree, where all the observations belong to a single region, and as we move down the tree we successively split the predictor space into two new branches.
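A minimal sketch of one step of recursive binary splitting for a regression tree, assuming a single predictor and an RSS criterion (the data and the `best_split` helper are illustrative, not from any library):

```python
import numpy as np

def best_split(x, y):
    """Greedy search for the cutpoint s minimizing the combined RSS
    of the two regions {x < s} and {x >= s}."""
    best_s, best_rss = None, np.inf
    for s in np.unique(x)[1:]:                 # candidate cutpoints
        left, right = y[x < s], y[x >= s]
        rss = (np.sum((left - left.mean()) ** 2)
               + np.sum((right - right.mean()) ** 2))
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

# Two well-separated clusters: the best split should fall between them
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.9])
s, rss = best_split(x, y)
```

A full tree would apply `best_split` recursively within each resulting region.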
What is tree pruning?
Tree pruning is done to avoid overfitting. We grow a very large tree and then prune it back to obtain a subtree, typically via cost complexity pruning, which trades off the subtree's fit against its number of terminal nodes.
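A toy sketch of the cost-complexity criterion, RSS + alpha * |T|, where |T| is the number of terminal nodes. The two candidate subtrees and their RSS values are hypothetical:

```python
# Hypothetical subtrees: (training RSS, number of terminal nodes |T|)
subtrees = {"full": (2.0, 8), "pruned": (3.0, 3)}

def cost(rss, n_leaves, alpha):
    # Cost-complexity criterion: penalize fit by tree size
    return rss + alpha * n_leaves

alpha = 0.5
costs = {name: cost(rss, T, alpha) for name, (rss, T) in subtrees.items()}
best = min(costs, key=costs.get)   # subtree with the lowest penalized cost
```

With a larger alpha the smaller subtree wins, even though its raw RSS is worse.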
What are some measures that can be used for splitting a node in tree-based methods?
- Classification error rate: the fraction of training observations in a region that do not belong to the most common class in that region.
- Gini index: a measure of total variance across the K classes, and also a measure of node purity; a small value indicates that the node is pure. G = sum_k ( p_mk * (1 - p_mk) )
- Entropy: like the Gini index, entropy is sensitive to node purity. D = - sum_k ( p_mk * log(p_mk) )
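The two purity measures above can be sketched directly from their formulas (natural log used for entropy; the example class proportions are made up):

```python
import numpy as np

def gini(p):
    # G = sum_k p_mk * (1 - p_mk)
    p = np.asarray(p)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    # D = -sum_k p_mk * log(p_mk), with 0 * log(0) taken as 0
    p = np.asarray(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

pure = [1.0, 0.0]    # all observations in one class: minimal impurity
mixed = [0.5, 0.5]   # evenly split classes: maximal impurity for K = 2
```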
Why is node purity important?
High node purity indicates higher confidence in the predictions of the model.
Why are random forest, bagging and boosting more commonly used than decision trees?
Decision trees tend to have high variance. Bagging and random forests average the predictions of many trees, which reduces the variance of the final model; boosting instead combines many weak learners fit sequentially.
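A simulation sketch of why averaging reduces variance. Each "tree" is modeled as the true value plus independent noise (a stand-in for a high-variance base learner, not a real tree fit); averaging B of them shrinks the variance of the ensemble prediction by roughly a factor of B:

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 5.0
B = 200          # number of base learners averaged per ensemble

# 10,000 single-learner predictions vs 10,000 bagged (averaged) predictions
single_preds = true_value + rng.normal(0, 1.0, size=10_000)
bagged_preds = (true_value + rng.normal(0, 1.0, size=(10_000, B))).mean(axis=1)

var_single = single_preds.var()   # about 1
var_bagged = bagged_preds.var()   # about 1/B
```

In practice the trees' errors are correlated, so the reduction is smaller than 1/B; random forests decorrelate the trees by restricting the predictors considered at each split.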
Why is recursive binary splitting a greedy approach?
At each step, the best split is chosen at that particular step, rather than looking ahead and picking a split that might lead to a better tree at some future step.
What are the advantages and disadvantages of decision trees?
Advantages:
1. easy to interpret.
2. can capture non-linear relationships between predictor variables and the response.
3. can handle qualitative predictors without creating dummy variables.
Disadvantages:
1. high variance
2. non-robust: small changes in the training data can cause large changes in the final tree.