Decision Trees Flashcards
Decision Tree - decision making
- Start with the whole dataset and create all possible binary decisions based on each feature:
- discrete feature: does the example belong to this category or not?
- continuous feature: is the feature value below or above a threshold?
- calculate the Gini impurity for every candidate decision
- pick the decision that reduces the impurity the most (see the sketch below)
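A minimal sketch of this greedy split search, assuming NumPy and a single continuous feature; the helpers gini and best_threshold_split and the toy arrays are made up for illustration:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold_split(x, y):
    """Score every threshold on one continuous feature and return the
    split with the lowest weighted Gini impurity."""
    best = (None, np.inf)
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if weighted < best[1]:
            best = (t, weighted)
    return best

x = np.array([2.0, 3.5, 1.0, 4.2, 3.0])
y = np.array([0, 1, 0, 1, 1])
print(best_threshold_split(x, y))  # best threshold and its weighted impurity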
classification trees vs regression trees
classification: outputs are discrete. Leaf values are set to the most common outcome in the leaf
regression: outputs are numerical. Leaf values are set to the mean of the outcomes in the leaf. Use MSE or RSS instead of Gini impurity
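A minimal sketch contrasting the two tree types, assuming scikit-learn and a made-up toy dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification: leaves predict the most common class; splits use Gini impurity.
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y_class)
print(clf.predict([[2.5], [5.5]]))  # -> [0 1]

# Regression: leaves predict the mean target; splits use squared error.
y_reg = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=2).fit(X, y_reg)
print(reg.predict([[2.5], [5.5]]))  # mean of the training targets in each leaf
```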
Decision tree - how to avoid overfitting
Prepruning (prune while you build the tree)
- leaf size: stop splitting when the number of examples in a node gets small enough
- depth: stop splitting at a certain depth
- purity: stop splitting if a large enough fraction of the examples are the same class
- gain threshold: stop splitting when the information gain becomes too small
Postpruning (prune after you’ve finished building the tree)
- merge leaves if doing so decreases validation-set error (a scikit-learn mapping is sketched below)
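A minimal sketch, assuming scikit-learn and its built-in breast-cancer dataset, of how these pruning ideas map onto DecisionTreeClassifier hyperparameters; the values are illustrative, and ccp_alpha does post-pruning via cost-complexity pruning rather than literal leaf merging:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # leaf size: no leaf smaller than 5 examples
    max_depth=4,                 # depth: stop splitting at depth 4
    min_impurity_decrease=0.01,  # gain threshold: require a minimum impurity decrease
    ccp_alpha=0.005,             # post-pruning: prune back after the tree is built
    random_state=0,
).fit(X_train, y_train)

print(tree.get_depth(), tree.score(X_test, y_test))
```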
ensemble methods
combining many weak models to form a strong model.
We train multiple models on the data, and each model is different: they could be trained on different subsets of the data, trained in different ways, or even be completely different types of models.
In order for an ensemble to work, each model has to capture something new and different so it can add incremental insight
Decision Tree - bagging
creating each model from a bootstrap sample (sampling rows with replacement) and aggregating the results. Can be used with any sort of model, but is generally used with decision trees
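A minimal sketch of bagging decision trees, assuming scikit-learn and its built-in breast-cancer dataset; note the estimator argument was called base_estimator in older scikit-learn versions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each of the 100 trees is trained on a bootstrap sample of the rows;
# predictions are aggregated by majority vote.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=0,
)
print(cross_val_score(bag, X, y, cv=5).mean())
```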
Random Forest
It takes bagging further: it doesn't just bootstrap the rows, it also considers only a random subset of the features at each split. So some trees get to split on the more important features while others are forced to split on less important ones, which makes the trees less correlated with each other
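A minimal sketch, assuming scikit-learn, of a random forest that bootstraps rows and also restricts each split to a random subset of features via max_features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bootstrap the rows and consider only a random subset of features at each
# split (max_features="sqrt"), which decorrelates the individual trees.
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    bootstrap=True,
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```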
Random Forest - pros and cons
pros
- no feature scaling needed
- good performance
- models nonlinear relationships
cons
- can be expensive to train
- harder to interpret than a single tree (no simple inference)
Gradient Boosting Regressor
Goal: to minimize sum of square errors.
start with the mean, subtract it from y, then fit a tree to the residuals (target = residuals, inputs = features). Each new tree's prediction is added to the running prediction, scaled by a learning rate, and the process repeats on the new residuals. The learning rate slows down the reduction in residuals so we can be more precise in our prediction.
good for capturing non-linearity
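A minimal from-scratch sketch of boosting on residuals with a learning rate, assuming NumPy, scikit-learn trees, and a made-up 1-D non-linear dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # non-linear target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start with the mean
trees = []

for _ in range(100):
    residuals = y - prediction                      # what the model still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # small step toward the residuals
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training MSE shrinks as trees are added
```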
Gradient Boosting Regressor Hyperparameters
- loss - controls the loss function to minimize
- n_estimators - how many decision trees to grow
- learning_rate - how much each tree's contribution is shrunk; start with 0.1 and go down
- max_depth - how deep to grow each tree
- subsample - similar to bagging in random forest: the fraction of rows used for each tree. 1 = use 100% of the data, 0.5 = 50%, etc. (wired up in the sketch below)
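A minimal sketch wiring these hyperparameters into GradientBoostingRegressor, assuming scikit-learn and its built-in diabetes dataset; the values are illustrative, not tuned:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

gbr = GradientBoostingRegressor(
    loss="squared_error",  # loss function to minimize
    n_estimators=500,      # how many trees to grow
    learning_rate=0.05,    # shrink each tree's contribution
    max_depth=3,           # depth of each individual tree
    subsample=0.5,         # train each tree on a random 50% of the rows
    random_state=0,
)
print(cross_val_score(gbr, X, y, cv=5).mean())  # mean R^2 across folds
```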
Gradient Boosting Classifier
Goal: minimize the residual between y and the predicted probability of class y (what predict_proba returns)
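A minimal sketch, assuming scikit-learn and its built-in breast-cancer dataset, showing the classifier and the class probabilities it boosts toward:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
).fit(X_train, y_train)

# Each boosting stage nudges the predicted class probabilities toward the labels.
print(gbc.predict_proba(X_test[:3]))  # class probabilities for the first 3 rows
print(gbc.score(X_test, y_test))      # accuracy
```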
Optimization
Throughout machine learning we have a constant goal: find the model that best predicts the target from the features. We generally define "best" as minimizing some cost function or maximizing a score function.
derivative
slope of the line - when our graph is non-linear and we want to know the slope at a specific point on the curve, we can find it by calculating the derivative
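A minimal sketch of estimating the slope at a point numerically with a finite difference; the helper derivative is made up for illustration:

```python
def derivative(f, x, h=1e-6):
    """Approximate the slope of f at the point x with a finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

print(derivative(lambda x: x ** 2, 3.0))  # ~6, since d/dx x^2 = 2x
```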
gradient descent
The gradient gives us the direction of steepest increase, so its negative points in the direction of steepest decrease. Gradient descent repeatedly steps in that downhill direction until we reach the bottom.
We apply a learning rate to make the steps smaller.
If our learning rate is small enough, gradient descent should lead us to a minimum (the global minimum when the cost function is convex)
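A minimal sketch of the gradient descent loop with a learning rate, assuming NumPy and a made-up quadratic cost:

```python
import numpy as np

def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Follow the negative gradient from x0 in small steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

# Minimize f(x, y) = (x - 3)^2 + (y + 1)^2, whose gradient is (2(x - 3), 2(y + 1)).
grad = lambda v: np.array([2 * (v[0] - 3), 2 * (v[1] + 1)])
print(gradient_descent(grad, x0=[0.0, 0.0]))  # approaches [3, -1]
```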
Neural networks - forward propagation
We calculate the output from the feature values and weights by passing them through each layer (a weighted sum followed by an activation function) until we arrive at the output neuron
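A minimal sketch of a forward pass, assuming NumPy, made-up weights, and a tiny 2-input, 2-hidden-unit, 1-output network with sigmoid activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights for a tiny 2-2-1 network.
W1, b1 = np.array([[0.5, -0.3], [0.8, 0.2]]), np.array([0.1, -0.1])
W2, b2 = np.array([[1.2], [-0.7]]), np.array([0.05])

def forward(x):
    hidden = sigmoid(x @ W1 + b1)       # hidden layer: weighted sum + activation
    output = sigmoid(hidden @ W2 + b2)  # output layer
    return hidden, output

x = np.array([0.6, 0.9])
hidden, output = forward(x)
print(output)  # the network's prediction for this input
```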
Neural networks - backward propagation
Moving from the output back to the beginning, we work out how the error changes with respect to each weight and apply gradient descent to find the weights that minimize the error at the end
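A minimal sketch of backpropagation, assuming NumPy, the same made-up 2-2-1 sigmoid network as above, and a squared-error loss; the error is pushed from the output layer back to the input layer and every weight is updated by gradient descent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same made-up 2-2-1 network as the forward-propagation sketch.
W1, b1 = np.array([[0.5, -0.3], [0.8, 0.2]]), np.array([0.1, -0.1])
W2, b2 = np.array([[1.2], [-0.7]]), np.array([0.05])
x, y = np.array([0.6, 0.9]), np.array([1.0])
learning_rate = 0.5

for _ in range(1000):
    # Forward pass.
    hidden = sigmoid(x @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: push the error gradient from the output back to the hidden layer.
    delta2 = (output - y) * output * (1 - output)   # error signal at the output neuron
    delta1 = (W2 @ delta2) * hidden * (1 - hidden)  # error signal at the hidden layer

    # Gradient-descent updates on every weight and bias.
    W2 -= learning_rate * np.outer(hidden, delta2)
    b2 -= learning_rate * delta2
    W1 -= learning_rate * np.outer(x, delta1)
    b1 -= learning_rate * delta1

print(sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2))  # prediction moves toward the target y
```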