Lecture 6 - Support Vector Machines, Decision Tree, Ensemble, Hypothesis Testing Flashcards
Are Support Vector Machines (SVM) used for supervised or unsupervised learning?
SVMs are a set of related supervised learning methods
Is SVM used for Regression or Classification?
TRICK QUESTION: It’s used for both Regression and Classification.
How does SVM work on nonlinear data?
It maps the data into a higher-dimensional space (adds a dimension), where the two groups become linearly separable.
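A minimal sketch of this idea, assuming scikit-learn is available (not from the lecture): an RBF-kernel SVM separating two classes that are not linearly separable in the original 2-D space.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data: two concentric circles, impossible to separate with a straight line.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the points into a higher-dimensional space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))  # close to 1.0 on this toy data
```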
What are some of the strengths and weaknesses of SVM?
Strengths:
- Easy training
- No local optima (the optimization problem is convex)
- Scales well
- Trade-off between classifier complexity and error can be controlled explicitly
Weaknesses:
- Efficiency depends on choosing a good kernel function
True or False: Changing the kernel will not give different results; the kernel only affects the speed of the computation.
FALSE: Changing the kernel will give different results!
True or False: Decision trees can perform both classification and regression tasks
TRUE
True or False: Decision trees can be understood as a lot of “if/else” statements
TRUE
Explain how decision trees are structured
It starts with a root node that splits into decision nodes (each node represents a question that splits the data). These form different branches/sub-trees. After a number of decision nodes, each path ends in a terminal node that represents an output.
What are some of the advantages of decision trees?
Simple to understand and interpret
It implicitly performs feature selection
It can handle both numerical and categorical data
Requires relatively little effort for data preparation
Non-linear relationships do not impact the model’s performance
Has many use cases
What are some of the disadvantages of decision trees?
Can overfit data
Decision trees can be unstable: small variations in the data can result in a completely different tree being generated
Greedy algorithms cannot guarantee to return the globally optimal tree (can be mitigated by training multiple trees)
Describe the Decision Tree Training Algorithm
Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit.
2. ELSE pick the best test to split the data on
3. Split the training set according to the value of the outcome of the test
4. Recursively repeat steps 1-3 on each subset of the training data
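Below is a minimal, self-contained Python sketch of these four steps, assuming numeric features, integer class labels, and Gini as the impurity measure; it is an illustration, not a production implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y):
    # 1. If all training instances have the same class, create a leaf and exit.
    if len(np.unique(y)) == 1:
        return {"leaf": int(y[0])}
    # 2. ELSE pick the best test (feature, threshold) = lowest weighted Gini.
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:          # candidate thresholds
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:                               # no split possible: majority-class leaf
        return {"leaf": int(np.bincount(y).argmax())}
    _, f, t = best
    mask = X[:, f] <= t
    # 3.-4. Split the training set and recurse on each subset.
    return {"feature": f, "threshold": float(t),
            "left": build_tree(X[mask], y[mask]),
            "right": build_tree(X[~mask], y[~mask])}

# Tiny illustrative dataset: class is determined by the first feature.
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 8.0]])
y = np.array([0, 0, 1, 1])
print(build_tree(X, y))
```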
What is Gini used for?
Gini is a measure of impurity: it shows how pure a node is, i.e. how similar the classes of the observations in a given node are
So, if there is a test and one branch leads to two options, the ideal scenario would be to have one entire class in one option, and the other class in the other option. In this case, the GINI index would be 0.
What is Entropy?
Entropy is used as an impurity measure (pretty similar to the GINI index, only the calculations differ)
- A set’s entropy is zero when it contains instances of only one class.
- Reduction of entropy is called information gain (Shannon's information theory)
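A small sketch of how entropy and information gain could be computed from class counts (toy numbers, just for illustration):

```python
import math

def entropy(class_counts):
    """Shannon entropy of a node, from the counts of each class."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)

print(entropy([10, 0]))  # 0.0 -> only one class present
print(entropy([5, 5]))   # 1.0 -> maximally impure for two classes

# Information gain = entropy(parent) - weighted entropy(children)
parent, left, right = [5, 5], [5, 1], [0, 4]
gain = entropy(parent) - (6 / 10) * entropy(left) - (4 / 10) * entropy(right)
print(gain)              # about 0.61
```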
What does “Ensemble” mean?
A group of something (Musicians, actors, decision trees;)
Why is Random Forest an Ensemble model?
A random forest model combines multiple decision trees (therefore, an ENSEMBLE of decision trees)
How does a Random Forest work?
In classification:
It creates multiple decision trees. Each tree votes for one class, and the class with the most votes is the one selected
In regression:
It creates multiple decision trees. Each tree predicts one value, and the forest returns the average of the values predicted by all the trees.
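A hedged sketch using scikit-learn (the library and datasets are assumptions, not from the lecture) showing both modes:

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: each of the 100 trees votes for a class; the majority wins.
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:3]))

# Regression: the forest returns the average of the 100 trees' predictions.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:3]))
```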
What are the advantages of using Random Forests?
- Can be used for both classification and regression tasks
- Handles missing values and maintains accuracy for missing data
- Won't overfit the model (not fully true: hyperparameter tuning can be used to avoid overfitting)
- Handles large datasets with higher dimensionality
What are the disadvantages of using Random Forests?
- Does better at classification than regression, since regression predictions are averages of training values and cannot extrapolate beyond them
- You have very little control over what the model does (it is largely a black box)
Walk me through the Random Forest algorithm (pseudocode)
- Assume the number of cases in the training set is N. For each tree, a sample of these N cases is taken at random but with replacement (a bootstrap sample).
- If there are M input variables (features), a number m << M is specified such that at each node, m features are selected at random out of the M and the best split on these m is used to split the node; m is held constant while the forest is grown.
- Each tree is grown as large as possible; new data is predicted by aggregating the trees' predictions (majority vote for classification, average for regression).
What is Bagging?
Bagging is a machine learning ensemble meta-algorithm designed to improve stability and accuracy, as well as reduce variance
Bagging builds many independent predictors and combines them using some model averaging technique (average, majority vote)
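A minimal sketch of bagging with scikit-learn's BaggingClassifier (assumed library and dataset, just for illustration): 50 trees trained on bootstrap samples and combined by majority vote.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 50 independent trees, each fit on a bootstrap sample of the training data;
# their predictions are combined by majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())
```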
What is Boosting?
Boosting is a machine learning ensemble meta-algorithm primarily for reducing bias and also variance in supervised learning
Boosting builds predictors not independently but sequentially, each new predictor compensating for the errors of its predecessors
What are the steps (conceptually) of Gradient Boosting?
- Fit an additive model (ensemble) in a forward stage-wise manner
- In each stage, introduce a weak learner to compensate for the shortcomings of the existing weak learners
* In Gradient Boosting, “shortcomings” are identified by gradients
* Gradients tell us how to improve the model
Here’s a very simple conceptualisation of how gradient boosting works
Diana is good at guessing ages of people (She’s the base Predictor)
Simon has noticed that when Diana guesses the ages of men, she usually undershoots by around 3 years, and when she guesses women, she overshoots by around 3 years. (Therefore, Simon is a gradient that corrects the shortcoming of the predecessor)
Yannic has noticed that Diana and Simon are guessing wrong if the person is from Sweden. He found that they usually overshoot by around one year (Yannic is another gradient that helps the shortcoming of the predecessors)
Now, they are trying to guess the age of a woman called Olivia.
Diana guesses 23, Simon notices that Olivia is a woman and tells us to subtract 3, and Yannic notices that Olivia is Swedish and tells us to subtract 1
Therefore, the ensemble's guess for Olivia's age is:
23 - 3 - 1 = 19
What is a support vector in SVM?
Support vectors are the datapoints that the margin of the hyperplane pushes up against (so, the points that are closest to the opposite class)
What Kernels can you use in SVM?
linear, polynomial, RBF and sigmoid
Define a Kernel function
A Kernel function is a function that takes as inputs vectors in the original space and returns the dot product of their images in the feature space
A Kernel function is used in SVM to map every point into a higher dimensional space via a transformation
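A small worked example of this idea (toy vectors, degree-2 polynomial kernel as the assumed example): the kernel value computed in the original space equals the dot product of the explicit feature maps.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 kernel on 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print((x @ z) ** 2)     # kernel K(x, z) = (x . z)^2, computed in the original space -> 16.0
print(phi(x) @ phi(z))  # same value, computed as a dot product in the feature space -> 16.0
```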
How to choose which Kernel type to use?
You can visually inspect the data with the different kernel types and see which one sets proper boundaries between the groups.
Or, usually, RBF is a good start if you don't have domain expertise about the data.
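Besides visual inspection, another option is to compare kernels by cross-validated accuracy; the sketch below assumes scikit-learn and an illustrative dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try each kernel and report its mean 5-fold cross-validation accuracy.
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    score = cross_val_score(SVC(kernel=kernel, gamma="scale"), X, y, cv=5).mean()
    print(f"{kernel:8s} {score:.3f}")
```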
Define a decision tree.
A decision tree is a flow-chart like structure where each internal node denotes a test on an attribute, each branch represents an outcome of a test, and each leaf (terminal node) holds a class label
What methods do decision trees use to decide where to split the nodes?
Gini Index, Chi-square, Information gain, Entropy…
For each candidate test, the algorithm calculates the weighted Gini (the Gini of each resulting node weighted by the fraction of observations that falls into it), explores different values and features to serve as tests in the nodes, and eventually chooses the split with the lowest weighted Gini score
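A tiny sketch of the weighted Gini for one candidate test (toy class counts, just for illustration):

```python
def gini(counts):
    """Gini impurity from a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts falling into the two branches of a candidate test.
left, right = [8, 2], [1, 9]
n = sum(left) + sum(right)
weighted_gini = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(weighted_gini)   # 0.25 -- the test with the lowest weighted Gini is chosen
```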
What are the steps (programmatically) of GB?
- The ensemble starts with one tree; its predictions are that first tree's predictions
- A new tree is trained on the residual errors of the first tree; the ensemble's prediction is now the sum of all trees' predictions so far
- Another tree is trained on the residuals of the previous ensemble, and so on
- The learning rate corresponds to how quickly the error is corrected from each tree to the next; it is a simple multiplier on each new tree's contribution
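A minimal sketch of these steps for regression, assuming scikit-learn's DecisionTreeRegressor as the weak learner (the dataset and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

learning_rate, trees = 0.1, []
prediction = np.full(len(y), y.mean())             # regression: start from the average

for _ in range(100):
    residuals = y - prediction                     # the current ensemble's shortcomings
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)  # add a scaled correction

print(np.mean((y - prediction) ** 2))              # training error shrinks stage by stage
```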
When doing Gradient Boosting, the initial “predictions” are calculated differently in Classification and Regression problems. What is the difference?
Classification: calculates the log(odds) and then transforms it into a probability to compute residuals
Regression: calculates the average and then uses it to compute residuals
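A small numeric sketch of the two starting points (toy numbers only):

```python
import math

# Regression: the initial prediction is the average of the targets.
targets = [12.0, 15.0, 18.0]
print(sum(targets) / len(targets))            # 15.0

# Classification: the initial prediction is log(odds) of the positive class,
# converted to a probability (sigmoid) before residuals are computed.
positives, negatives = 6, 4
log_odds = math.log(positives / negatives)    # about 0.405
probability = 1 / (1 + math.exp(-log_odds))   # 0.6
print(log_odds, probability)
```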
What is a Hypothesis space?
The set of all hypotheses that can be produced by the learning algorithm
Hypothesis space is a set of all possible finite discrete functions
Every finite discrete function can be represented by some decision tree
What kinds of functions can decision trees express?
- Boolean functions can be fully expressed
- Some functions are harder to encode compactly (the tree can become very large)
- Functions with real-valued features are handled through threshold tests on those features
What is a Null Hypothesis(H0)?
A null hypothesis is a statement about a population parameter that is assumed to be true unless there is convincing evidence to the contrary
What is an Alternative Hypothesis(Ha)?
An alternative hypothesis is a statement about a population parameter that contradicts H0 and is accepted as true only if there is convincing evidence in favor of it.
Hypothesis testing is a…
statistical procedure in which a choice is made between H0 and Ha, based on information in a sample.
Result: Reject H0 (and therefore accept Ha) or fail to reject H0 (and therefore fail to accept Ha)
True or False? When we fail to reject the null hypothesis, we consider it true.
False. When you fail to reject the null hypothesis, you are not able to make any inferences about the population mean
What are the two types of error in hypothesis testing?
Type I error: we decide to reject the null hypothesis, but the null hypothesis is in fact true
Type II error: we decide not to reject the null hypothesis, but the null hypothesis is in fact false
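A hedged illustration with SciPy (an assumption, not from the lecture): a one-sample t-test of H0: "the population mean is 50", using made-up sample values.

```python
from scipy import stats

# H0: the population mean is 50; Ha: it is not.
sample = [51.2, 49.8, 52.5, 50.9, 53.1, 51.7, 52.0, 50.4]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05
if p_value < alpha:
    print("Reject H0 (and therefore accept Ha)")
else:
    print("Fail to reject H0")   # no inference about the population mean is made
```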