Block 5: Tree-based methods, bootstrap, bagging, data ethics Flashcards

1
Q

Explain Boosting

A

Aim: reduce the error rate by putting more weight on previously misclassified observations.

  • Initial weights are wi = 1/n
  • For m = 1, …, M (loop to build M classifiers):
  • fit Gm(x) to the observations (xi, yi) weighted by the wi
  • errm = sum_i( wi 1{yi ≠ Gm(xi)} ) / sum_i( wi )
  • αm = log((1 − errm)/errm) and update wi ← wi exp{αm 1{yi ≠ Gm(xi)}}
  • Final classifier: G*(x) = sgn{ sum(m=1 to M) αm Gm(x) }
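
A minimal Python sketch of this loop, assuming hypothetical helpers fit_stump(X, y, w) and predict_stump(model, X) for a weighted weak learner whose labels are in {−1, +1}:

    import numpy as np

    def adaboost_train(X, y, M, fit_stump, predict_stump):
        # y is assumed to take values in {-1, +1}; assumes 0 < errm < 1
        n = X.shape[0]
        w = np.full(n, 1.0 / n)              # initial weights wi = 1/n
        models, alphas = [], []
        for m in range(M):
            G = fit_stump(X, y, w)           # fit Gm to the weighted observations
            miss = predict_stump(G, X) != y  # indicator 1{yi != Gm(xi)}
            err = w[miss].sum() / w.sum()
            alpha = np.log((1 - err) / err)
            w = w * np.exp(alpha * miss)     # up-weight the misclassified points
            models.append(G)
            alphas.append(alpha)
        return models, alphas

    def adaboost_predict(models, alphas, X, predict_stump):
        # G*(x) = sgn( sum_m alpha_m Gm(x) )
        scores = sum(a * predict_stump(G, X) for a, G in zip(alphas, models))
        return np.sign(scores)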
2
Q

Explain regression trees (CART)

A

Build the tree using training data:
- decide which variable Xj to split on and at which value s (binary or ternary split) by minimising
  min(j,s) { min(c1) sum(xi ∈ R1(j,s)) (yi − c1)² + min(c2) sum(xi ∈ R2(j,s)) (yi − c2)² }
  where the minimisers ĉ1 and ĉ2 are the centroids (mean responses) of R1 and R2
- stop when no further split reduces the sum of squares, giving the final M leaves
- or stop using cost-complexity pruning (see the dedicated card)

Predict value:
- the predicted value ŷi is the centroid ĉm of the leaf Rm for which xi ∈ Rm
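
A minimal sketch of the exhaustive split search above, assuming a numeric feature matrix X and response vector y (illustrative, not an optimised implementation):

    import numpy as np

    def best_split(X, y):
        # search over every variable j and candidate threshold s for the
        # binary split minimising the two-region sum of squares; for fixed
        # regions the optimal constants c1, c2 are the region means (centroids)
        n, p = X.shape
        best_j, best_s, best_sse = None, None, np.inf
        for j in range(p):
            for s in np.unique(X[:, j]):
                left, right = y[X[:, j] <= s], y[X[:, j] > s]
                if len(left) == 0 or len(right) == 0:
                    continue
                sse = ((left - left.mean()) ** 2).sum() \
                    + ((right - right.mean()) ** 2).sum()
                if sse < best_sse:
                    best_j, best_s, best_sse = j, s, sse
        return best_j, best_s, best_sse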

3
Q

Pros and cons of Classification and Regression Tree (CART)

A

Pros: fast, simple and interpretable method
Cons: lack of continuity (the fitted function is piecewise constant, so predictions are very volatile) and statistically inefficient in some cases (e.g. smooth or additive relationships)

4
Q

Explain cost-complexity pruning

A
  • grow an over-fitted tree T0, stopping for example when each leaf has 5 or fewer data points, or at some minimal node size
  • find the subtree T ⊂ T0 (T0 with internal nodes collapsed, i.e. leaves regrouped) minimising the cost-complexity criterion:
    Cα(T) = sum(m=1 to |T|) nm Qm(T) + α|T|, where Qm is the within-leaf sum of squares for leaf m (wrt its centroid), nm is the number of observations in leaf m, |T| the number of leaves, and α ≥ 0 is a hyperparameter (typically chosen by cross-validation)
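
In practice this criterion is what scikit-learn's ccp_alpha parameter implements; a sketch, with random placeholder data standing in for a real training set:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    X, y = np.random.rand(200, 3), np.random.rand(200)   # placeholder data

    # grow an over-fitted tree T0 with small leaves, then compute the
    # sequence of alphas at which nodes would be collapsed
    t0 = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
    path = t0.cost_complexity_pruning_path(X, y)

    # choose alpha by cross-validation and refit the pruned tree
    scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                              X, y, cv=5).mean() for a in path.ccp_alphas]
    best_alpha = path.ccp_alphas[int(np.argmax(scores))]
    pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)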
5
Q

Explain Bootstrap and Bagging

A

Aim: obtain information about a statistic's sampling distribution (mean, variance, …) without making strong assumptions on the Xi or on F.

  • create B random samples of size n from X, drawn with replacement
  • e.g. estimate the statistic θ^ via the plug-in estimator based on the empirical CDF F^(x) = 1/n sum(i=1 to n) 1{Xi ≤ x}

Bagging: fit the model to each bootstrap sample and average the predictions,
f^bag(x) = 1/B sum(b=1 to B) f^b(x)
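
A minimal bootstrap sketch in Python, here estimating the standard error of the sample median (the choice of statistic is illustrative):

    import numpy as np

    def bootstrap(x, stat, B=1000, seed=0):
        # draw B samples of size n from x with replacement and evaluate the
        # statistic on each, approximating its sampling distribution
        rng = np.random.default_rng(seed)
        n = len(x)
        return np.array([stat(rng.choice(x, size=n, replace=True))
                         for _ in range(B)])

    x = np.random.default_rng(42).exponential(size=100)  # placeholder data
    reps = bootstrap(x, np.median, B=2000)
    se_hat = reps.std(ddof=1)    # bootstrap estimate of the standard error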

6
Q

Explain Random Forest

A

Random forest is bagging of trees (fit one tree per bootstrap sample and average) with an added step: at each node, only a random subset of the variables is considered as split candidates, which decorrelates the trees.
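
A short usage sketch with scikit-learn, where X_train, y_train, X_test are hypothetical arrays; max_features controls how many variables are sampled as split candidates at each node:

    from sklearn.ensemble import RandomForestRegressor

    # 500 bagged trees; at each split only sqrt(p) of the p variables are
    # considered, which decorrelates the individual trees
    rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                               random_state=0)
    # rf.fit(X_train, y_train); y_hat = rf.predict(X_test)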
