Week 9 - Model ensembles Flashcards
What are the three types of tree-based models
- Decision trees
- Random forest
- Gradient boosting
What are model ensembles
The wisdom of crowds for machines. Combinations of models are known as model ensembles
If a learner's accuracy p > 1/2, this implies
Weak learnability: the learner performs better than random guessing (on a binary task)
What 2 things do ensemble methods have in common
- They construct multiple, diverse predictive models from adapted versions of the training data.
- They combine the predictions of these models in some way, often by simple averaging or voting
What is bootstrapping
Bootstrapping is any test or metric that uses random sampling with replacement, and falls under the broader class of re-sampling methods. Bootstrapping assigns measures of accuracy to sample estimates.
What is the 3-step process for creating a bootstrap sample
- Randomly sample one observation from the set
- Write it down
- Put it back in the set
Repeating this n times yields a bootstrap sample of the same size as the original set.
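A minimal NumPy sketch of this procedure; `choice` with `replace=True` does the sample-and-put-back loop in one call:

```python
import numpy as np

def bootstrap_sample(data, rng=np.random.default_rng(0)):
    """Draw a bootstrap sample: n draws with replacement from n observations."""
    n = len(data)
    indices = rng.choice(n, size=n, replace=True)  # sample, write down, put back
    return data[indices]

data = np.array([2.1, 3.5, 4.8, 5.0, 6.2])
print(bootstrap_sample(data))  # same size as the original, likely with duplicates
```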
Bootstrapping contains duplicates for
Diversity: duplicates (and omitted instances) make each bootstrap sample differ from the original data and from the other samples
What is the bootstrap aggregating (bagging) ensemble method
- Create multiple random samples from the original data using bootstrapping
- Train a different model (learner) on each random sample and aggregate their predictions
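A minimal sketch with scikit-learn, whose `BaggingClassifier` does the bootstrap sampling and vote aggregation internally (the dataset is synthetic, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each of the 50 trees is trained on its own bootstrap sample of the data.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0)
bagging.fit(X, y)
print(bagging.predict(X[:5]))  # majority vote of the 50 trees
```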
What is subspace sampling?
Encouraging diversity in ensembles by building each model from a different random subset of the features instead of all features
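A sketch of the idea with hypothetical numbers (10 features, 4-feature subsets); in scikit-learn's bagging this corresponds to the `max_features` option:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 10

# Each model in the ensemble sees a different random subset of the features.
for model_id in range(3):
    subset = rng.choice(n_features, size=4, replace=False)
    print(f"model {model_id} trained on features {sorted(subset)}")
```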
What is a random forest
A random forest is an ensemble learning approach for various tasks (classification, regression) that creates many decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean/average prediction (regression) of the individual trees.
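A minimal scikit-learn example; a random forest combines bagging with subspace sampling at each tree split:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each grown on a bootstrap sample with random feature subsets.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))  # classification: the mode of the trees' votes
```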
What is boosting
Boosting is an ensemble method for reducing error in supervised learning by converting weak learners into strong ones. The goal is that each learner does a bit better than the previous one, by iteratively considering the error rate and focusing on the points the previous learners did not perform well on.
How can we improve learning with boosting?
By giving the misclassified instances a higher weight, and modifying the classifier to take these weights into account
How can we assign weights (boosting)
We want to assign half of the total weight to the misclassified items and the other half to the correctly classified items (see the sketch below):
* Initially, every item has weight 1/|D|
* The total weight of all misclassified items is then the error rate ε
* The total weight of all correctly classified items is 1 - ε
* So we multiply each misclassified item's weight by 1/(2ε) and each correctly classified item's weight by 1/(2(1 - ε))
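A minimal sketch of this reweighting, assuming ε is the total weight of the misclassified items (defined in the next card); the function name is illustrative:

```python
import numpy as np

def reweight(weights, misclassified, eps):
    """Split the total weight evenly between misclassified and correct items."""
    new = weights.copy()
    new[misclassified] *= 1 / (2 * eps)          # misclassified now sum to 1/2
    new[~misclassified] *= 1 / (2 * (1 - eps))   # correct items also sum to 1/2
    return new

weights = np.full(4, 1 / 4)                      # every item starts at 1/|D|
misclassified = np.array([True, False, False, False])
eps = weights[misclassified].sum()               # weighted error rate = 0.25
print(reweight(weights, misclassified, eps))     # [0.5, 1/6, 1/6, 1/6]
```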
What is epsilon (ε) in boosting
The error rate: ε = (FP + FN) / Total, i.e. the fraction (or total weight) of misclassified instances.
What are 3 sources of misclassification errors
- Unavoidable bias
- Bias due to low expressiveness of models
- High variance
What is unavoidable bias
Bias that arises when instances from different classes are described by the same feature vectors, so no model can tell them apart
What is bias due to low expressiveness of models
If the data is not linearly separable, then even the best linear classifier will make mistakes.
What is high variance
A model has high variance if its decision boundary is highly dependent on the training data.
What is bagging predominantly for?
Bagging is predominantly a variance-reduction technique. It is often used in combination with high-variance models such as tree models.
What is boosting predominantly for?
Boosting is primarily a bias-reduction technique. It is typically used with high-bias models such as linear classifiers or univariate decision trees.
What is a meta-model
A model that best combines the predictions of the base models
What is stacking
Stacking involves training a learning algorithm to combine the predictions of several other learning algorithms.
* several models are trained using the available data
* a learning algorithm is trained to make a final prediction using the predictions of the other algorithms.
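A minimal scikit-learn sketch; `StackingClassifier` trains the meta-model (`final_estimator`) on the base models' predictions (the base models here are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

base_models = [("tree", DecisionTreeClassifier(random_state=0)),
               ("svm", SVC(random_state=0))]
# The logistic regression is the meta-model that combines base predictions.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print(stack.predict(X[:5]))
```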
Name 2 types of experiments we would conduct on ML models
- on one specific dataset
- on a varied set of datasets
what is cross-validation
Randomly partition the data into k folds, set one fold aside for testing, train a model on the remaining k - 1 folds and evaluate it on the test fold. This process is repeated k times until each fold has been used for testing once.
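A minimal scikit-learn example with k = 5:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Five folds: each fold is used once for testing and four times for training.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())
```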
What does cross-validation accomplish
By averaging over training sets we get a sense of the variance of the learning algorithm. Once we are satisfied with the performance of our learning algorithm, we can run it over the entire data set to obtain a single model.
What cross-validation do we use if we have very few training instances?
Leave-one-out cross-validation
what is leave one out cross validation
Alternatively, we can set k = n and train on all instances but one, repeated n times. This means that in each single-instance ‘fold’ our accuracy estimate is 0 or 1, but by averaging n of those we get an approximately normal distribution by the central limit theorem.
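The same idea via scikit-learn's `LeaveOneOut` splitter (equivalent to cv = n):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# n folds of a single instance each: every per-fold score is 0 or 1.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())
print(scores.mean())  # average of n 0/1 outcomes
```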
What is the null hypothesis
The hypothesis that there is no significant difference between specified distributions
what is significance test
A test of significance is a formal procedure for comparing observed distributions of data with a hypothesis
what is a p-value
The probability of obtaining a measurement of a certain value or higher, given the null hypothesis
what is a t-test
A t-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related in certain features.
When conducting an experiment with 1 dataset
Use a paired t-test (the cross-validation folds pair up the two algorithms' measurements)
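A minimal SciPy example; the per-fold accuracies are made up for illustration:

```python
from scipy.stats import ttest_rel

# Per-fold accuracies of two algorithms on the same 10 folds (illustrative).
acc_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
acc_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.82, 0.77, 0.78]

stat, p = ttest_rel(acc_a, acc_b)  # paired: the folds match up the scores
print(f"t = {stat:.3f}, p = {p:.4f}")
```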
When comparing the performance of a pair of algorithms over multiple datasets use
Wilcoxon’s signed-rank test
What is Wilcoxon's signed-rank test
- The idea is to rank the performance differences in absolute value, from smallest to largest
- We then calculate the sum of ranks for positive and negative differences separately and take the smaller of these sums as our test statistic
- Null hypothesis: two algorithms perform equally on multiple data sets
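A minimal SciPy example with made-up per-dataset accuracies:

```python
from scipy.stats import wilcoxon

# Accuracies of two algorithms on the same 8 data sets (illustrative).
acc_a = [0.85, 0.80, 0.70, 0.90, 0.75, 0.88, 0.92, 0.60]
acc_b = [0.82, 0.79, 0.72, 0.85, 0.74, 0.86, 0.90, 0.61]

stat, p = wilcoxon(acc_a, acc_b)  # ranks the paired differences
print(f"W = {stat}, p = {p:.4f}")
```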
What is a critical value in Wilcoxon's signed-rank test
The critical value (the value of the test statistic at which the p-value equals alpha) can be found in a statistical table and used for rejecting the null hypothesis.
How to compare multiple algorithms over multiple data sets
The Friedman test
What is the Friedman test
- The idea is to rank the performance of all k algorithms per data set, from best performance to worst performance
- R_ij -> the rank of the j-th algorithm on the i-th data set
What 3 quantities do we need to calculate in the Friedman test
- The average rank of each algorithm
- The sum of squared differences between the algorithms' average ranks and the overall average rank (the spread between the rank centroids)
- The sum of squared differences between all individual ranks and the overall average rank (the spread over all ranks)
What is the Friedman statistic?
The ratio of the second quantity to the third.
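A minimal SciPy example; note that `friedmanchisquare` reports the chi-squared form of the Friedman statistic rather than the ratio above (scores are made up for illustration):

```python
from scipy.stats import friedmanchisquare

# Accuracies of three algorithms on the same 6 data sets (illustrative).
acc_a = [0.85, 0.80, 0.70, 0.90, 0.75, 0.88]
acc_b = [0.82, 0.79, 0.72, 0.85, 0.74, 0.86]
acc_c = [0.78, 0.75, 0.68, 0.80, 0.70, 0.81]

stat, p = friedmanchisquare(acc_a, acc_b, acc_c)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```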
What is a post-hoc test
- The idea is to calculate the critical difference (CD): two algorithms differ significantly if their average ranks differ by more than the CD
- The Nemenyi test calculates the critical difference
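A sketch of the critical-difference calculation, assuming the standard Nemenyi formula CD = q_alpha * sqrt(k(k+1)/(6n)); q_alpha must be looked up in a table of the studentized range statistic, and the value used below is only illustrative:

```python
import math

def critical_difference(q_alpha, k, n):
    """Nemenyi critical difference for k algorithms on n data sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# Two algorithms differ significantly if their average ranks
# differ by more than the critical difference.
print(critical_difference(q_alpha=2.343, k=3, n=10))  # q_alpha from a table
```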