Decision trees Flashcards
explain underfitting
describes the inability of the model to capture the complexity of the input-output relation
possible cause: the model is too simple
explain overfitting
the model is over-specialized on the given dataset; possible causes: the model is too complex or has too many features to optimize
what might be the cause of bad predictions
noisy data and lack of training data
what is the bayesian classifier
minimizes the classification error probability
the idea consists of computing the posterior probability of class wj knowing x, hence
C(x) = argmaxj [P(wj) P(x | wj)]
We can either directly compute the posterior probability P(wj | x)
or indirectly estimate the class-conditional density P(x | wj)
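As a minimal sketch of the rule C(x) = argmaxj [P(wj) P(x | wj)], assuming two classes with univariate Gaussian class-conditional densities (the priors and Gaussian parameters below are made up for illustration):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density, used here as P(x | wj)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

priors = [0.5, 0.5]                 # P(w0), P(w1) (assumed)
params = [(0.0, 1.0), (3.0, 1.0)]   # (mu, sigma) per class (assumed)

def bayes_classify(x):
    """C(x) = argmaxj P(wj) * P(x | wj)."""
    scores = [p * gaussian_pdf(x, mu, s) for p, (mu, s) in zip(priors, params)]
    return scores.index(max(scores))

# Points near mu=0 go to class 0, points near mu=3 go to class 1.
```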
provide some approach to compute densities
logistic regression: a direct, semi-parametric method; we need to estimate the wj parameters
the indirect methods rely on estimating the unknown density function :
parametric: assume a multivariate Gaussian distribution, with µk and Sk estimated from the training data
non-parametric: rely on a hypercube H(x) with volume V centred on x; the issue lies in setting the size of H
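A minimal 1-D sketch of the non-parametric hypercube estimate, where the "cube" is an interval of side h centred on x (the sample values and bandwidth h are made up; choosing h is exactly the issue mentioned above):

```python
def parzen_density(x, samples, h):
    """Estimate p(x): fraction of samples inside a cube of side h
    centred on x, divided by the cube's volume (1-D case)."""
    inside = sum(1 for s in samples if abs(s - x) <= h / 2)
    return inside / (len(samples) * h)

samples = [0.0, 0.1, 0.2, 0.9, 1.0]  # toy training data
# Larger h smooths the estimate; too small an h makes it spiky.
est = parzen_density(0.1, samples, h=0.5)
```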
what are decision trees
they rely on successive splitting of the dataset, such that the classification rules form a tree whose extremities (leaves) indicate the class
what are the types of feature we may find in decision trees
Quantitative feature: numbers
Qualitative feature: categories such as blue, brown, etc.
Ordinal feature: ordered categories such as small, medium, large
What measures can we use to quantify class heterogeneity
Entropy:
Gini impurity index
misclassification index
Explain briefly the entropy
Measures the uniformity of the class histogram in a node
explain Briefly the misclassification index
The classification error probability obtained when predicting the majority class observed in node N
compare the Gini to misclassification error and entropy
Gini is simply a smooth approximation of the misclassification error
entropy is a further refinement
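The three impurity measures above can be sketched as functions of the class proportions pj observed in a node (a minimal version, for illustration):

```python
from math import log2

def entropy(p):
    """Entropy of a class-proportion vector; 0 for a pure node."""
    return -sum(pj * log2(pj) for pj in p if pj > 0)

def gini(p):
    """Gini impurity index: 1 - sum of squared proportions."""
    return 1 - sum(pj ** 2 for pj in p)

def misclassification(p):
    """Error probability when predicting the majority class."""
    return 1 - max(p)

pure = [1.0, 0.0]    # all samples in one class
mixed = [0.5, 0.5]   # maximally heterogeneous two-class node
```

All three measures are zero on the pure node and maximal on the balanced one.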
what is the homogeneity gain
the idea consists of selecting the test that minimizes the impurity of the resulting subnodes
what is the homogeneity gain and its bias
Given a test T providing m possible alternatives which split node N (of size n)
into m subsets / subnodes Nj,
the gain selects the test that minimizes the weighted impurity of the subnodes
the issue with the gain is that it favors tests with a large number of alternatives
hence, to overcome that, we can use the gain ratio or binary tests
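A minimal sketch of the gain (using entropy as the impurity) and of the gain ratio that corrects its bias toward many-alternative tests; the toy labels are made up:

```python
from math import log2

def entropy(p):
    return -sum(pj * log2(pj) for pj in p if pj > 0)

def proportions(labels):
    """Class proportions observed in a node."""
    return [labels.count(c) / len(labels) for c in set(labels)]

def gain(parent, subsets):
    """Homogeneity gain: parent impurity minus the size-weighted
    impurity of the subnodes Nj."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(proportions(s)) for s in subsets)
    return entropy(proportions(parent)) - weighted

def gain_ratio(parent, subsets):
    """Gain divided by the split information, penalizing tests
    that split N into many alternatives."""
    n = len(parent)
    split_info = entropy([len(s) / n for s in subsets])
    return gain(parent, subsets) / split_info if split_info else 0.0

parent = ['a', 'a', 'b', 'b']
split = [['a', 'a'], ['b', 'b']]   # a perfect binary split
```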
what is the issue with the gain ratio
favors imbalanced partitions between the different subnodes (Nj)