Week 8 - ensemble and AutoML Flashcards
why can ensemble methods be useful?
- sometimes we don’t know in advance whether a complex non-linear model or a simple linear model will do best
- They can protect against overfitting
What is a decision tree?
A tree whose nodes each make a yes/no decision based on a specific question about a feature; following the answers from the root down to a leaf gives a prediction
what is Gini impurity in the context of decision trees?
If we randomly pick a datapoint in our dataset and label it at random according to the class distribution of the dataset, the Gini impurity is the probability that this label is incorrect
For any split, the Gini impurity is computed from the proportion of data belonging to each class within each branch of the ‘question’ or split
To calculate the Gini impurity of a node, you iterate over classes, take the proportion p_k of data belonging to each class k, multiply it by (1 − p_k), and sum: Gini = Σ_k p_k(1 − p_k) = 1 − Σ_k p_k²
For a node, a lower Gini impurity indicates that the node contains mostly elements from a single class (more pure), while a higher Gini impurity indicates a more mixed node (less pure).
We can combine the Gini impurities of the child nodes (weighted by the number of datapoints in each node) to calculate the Gini impurity of a split on each feature.
We split the tree where the Gini impurity is lowest, so the first (root) split in the decision tree uses the feature with the lowest Gini impurity (a sketch of the calculation follows below).
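As an illustration (not from the course materials, and assuming NumPy), a minimal sketch of the node and weighted-split Gini calculations described above:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of one node: 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(labels_left, labels_right):
    """Gini impurity of a split: child-node impurities weighted by child size."""
    n_left, n_right = len(labels_left), len(labels_right)
    n = n_left + n_right
    return (n_left / n) * gini_impurity(labels_left) + (n_right / n) * gini_impurity(labels_right)
```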
How does gini impurity work with continuous variables?
You try out every possible split/threshold for the feature and calculate the Gini impurity of the resulting split
The threshold with the lowest Gini impurity determines our split (see the threshold-search sketch below)
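Continuing the sketch above (again illustrative, reusing the Gini helpers; `values` is one feature column and `labels` the class labels, both NumPy arrays):

```python
def best_threshold(values, labels):
    """Try a 'value <= t' split at every observed value and keep the threshold with the lowest split Gini."""
    best_t, best_g = None, np.inf
    for t in np.unique(values)[:-1]:          # the largest value would send every point left
        mask = values <= t
        g = split_gini(labels[mask], labels[~mask])
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g
```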
what is a problem with decision trees?
Decision trees can be very large and very complex
As a result, they can be very prone to overfitting
Deep trees can be very problematic as they make predictions based on very specific combinations of features. You can mitigate this by limiting max_depth.
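For example, if scikit-learn is used (an assumption on my part; the card only names the max_depth parameter), the depth can be capped when building the tree:

```python
from sklearn.tree import DecisionTreeClassifier

# A shallow tree cannot memorise highly specific combinations of features.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
# tree.fit(X_train, y_train)   # X_train / y_train are placeholder names for your data
```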
what is the random sample method of making random forest models from decision trees?
You randomly sample the data with replacement (bootstrapping) and then fit a decision tree to each random sample of the data
This gives us several different decision trees that make slightly different predictions because they’re trained on slightly different samples of the data (a sketch follows below)
this can help protect against overfitting
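A minimal sketch of this bootstrap-and-fit idea, assuming NumPy arrays and scikit-learn (neither is named in the cards):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bootstrap_trees(X, y, n_trees=3, random_state=0):
    """Fit one decision tree per bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sample row indices with replacement
        trees.append(DecisionTreeClassifier(random_state=random_state).fit(X[idx], y[idx]))
    return trees
```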
What is the random feature method of making a random forest from decision trees?
You take random samples (subsets) of the features in the data
you then train a tree on each different subset of features to make an ensemble model
How might a random forest make a prediction?
It might use a majority voting system
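In practice, scikit-learn's RandomForestClassifier (assumed here; the cards don't name a library) combines both ideas: it bootstraps the rows, considers a random subset of features at each split via max_features, and aggregates the trees' predictions:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
# forest.fit(X_train, y_train)
# forest.predict(X_test)   # aggregates the trees' votes (sklearn averages their predicted probabilities)
```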
how can you assess feature importance?
You can look at the average location (distance from the root) of each feature’s splits across the trees (see the sketch below).
This is because features whose splits give the lowest Gini impurity (the biggest gain in purity) tend to lie near the top of the trees, because they tend to be the most powerful in predicting the outcome
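A sketch of that depth-based measure for a fitted scikit-learn forest (an illustration under that assumption; note that sklearn's built-in feature_importances_ instead uses mean decrease in impurity, a related but different measure):

```python
import numpy as np

def average_split_depth(forest, n_features):
    """Average depth at which each feature is used to split, over all trees (lower = closer to the root)."""
    depth_sum = np.zeros(n_features)
    n_splits = np.zeros(n_features)
    for est in forest.estimators_:
        tree = est.tree_
        depths = np.zeros(tree.node_count, dtype=int)
        stack = [(0, 0)]                                 # (node id, depth), starting from the root
        while stack:
            node, d = stack.pop()
            depths[node] = d
            if tree.children_left[node] != -1:           # internal (split) node
                stack.append((tree.children_left[node], d + 1))
                stack.append((tree.children_right[node], d + 1))
        for node in range(tree.node_count):
            f = tree.feature[node]
            if f >= 0:                                   # leaf nodes store feature = -2
                depth_sum[f] += depths[node]
                n_splits[f] += 1
    return depth_sum / np.maximum(n_splits, 1)           # per-feature mean split depth
```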
what is bagging vs. boosting of ensembles?
BAGGING
- Combining randomised versions of many strong classifiers, each trained on a random (bootstrap) sample of the data
- e.g. random forests of full-depth decision trees
- this also works for regression problems
BOOSTING
- Combining many weak classifiers to make a strong one
- This often uses short decision ‘stumps’, where each new stump added is chosen adaptively to correct the errors of the ensemble so far
- e.g. LightGBM (LGBM)
bagging takes a random selection of strong classifiers, whereas boosting chooses each new classifier in an intelligent, adaptive way (see the sketch below)
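A hedged side-by-side using scikit-learn (an assumption; the cards only name LGBM): BaggingClassifier fits full-depth trees on random bootstrap samples, while AdaBoostClassifier adds depth-1 stumps one at a time, each chosen adaptively by reweighting the points the current ensemble gets wrong:

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: many strong learners (full-depth trees), each fit on a bootstrap sample.
bagger = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: many weak learners (depth-1 stumps), each added to correct the ensemble's current errors.
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)

# bagger.fit(X_train, y_train); booster.fit(X_train, y_train)   # X_train / y_train are placeholders
```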
why can decision trees outperform neural networks?
They are able to model very detailed non-linear decision boundaries that MLPs struggle to capture
what is an example of AutoML?
predicting brain age from cortical anatomy