Lecture 6 - Support Vector Machines, Decision Tree, Ensemble, Hypothesis Testing Flashcards
Are Support Vector Machines (SVMs) used for supervised or unsupervised learning?
SVMs are a set of related supervised learning methods.
Is SVM used for Regression or Classification?
TRICK QUESTION: It’s used for both Regression and Classification.
How does SVM work on nonlinear data?
It maps the data into a higher-dimensional space (it adds a dimension), where the two classes can be separated linearly by a hyperplane. The mapping is defined by a kernel function.
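A minimal sketch of the "add a dimension" idea, assuming a toy 1-D dataset. The data, the x**2 feature map, and the helper names are illustrative assumptions, not part of the lecture. In one dimension no single threshold separates the classes, but after adding x**2 as a second dimension one threshold does.

```python
# Toy 1-D data (assumed for illustration): class 1 sits in the middle,
# class 0 on both sides, so no single threshold on x separates them.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1, 1, 0, 0]

def lift(x):
    """Hypothetical feature map: keep x and add x**2 as a second dimension."""
    return (x, x * x)

def separable_by_threshold(values, labels):
    """True if one threshold on `values` puts each class on its own side."""
    sorted_labels = [label for _, label in sorted(zip(values, labels))]
    # Separable iff the label changes at most once along the sorted axis.
    changes = sum(a != b for a, b in zip(sorted_labels, sorted_labels[1:]))
    return changes <= 1

original_ok = separable_by_threshold(xs, ys)
lifted_ok = separable_by_threshold([lift(x)[1] for x in xs], ys)
print(original_ok, lifted_ok)  # False True
```

A real SVM does not build the extra dimension by hand like this; the kernel function handles the mapping implicitly.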
What are some of the strengths and weaknesses of SVM?
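Strengths:
- Training is relatively easy
- No local optima (training finds a global optimum)
- Scales relatively well to high-dimensional data
- Trade-off between classifier complexity and error can be controlled explicitly
Weaknesses:
- Effectiveness depends on choosing a suitable kernel function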
True or False: Changing the kernel will not give different results. It only affects the speed of the computation.
FALSE: Changing the kernel will give different results!
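One way to see why the kernel changes the result: a kernel is the similarity function the SVM works with, and different kernels assign different similarities to the same pair of points. The sketch below uses three common kernels; the function names, the sample points, and the parameter values (degree, coef0, gamma) are assumptions chosen for illustration.

```python
import math

def linear_kernel(a, b):
    """Plain dot product."""
    return sum(x * y for x, y in zip(a, b))

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """(a . b + coef0) ** degree"""
    return (linear_kernel(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """exp(-gamma * squared Euclidean distance)"""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# The same two points, three different similarity values.
p, q = (1.0, 2.0), (3.0, 4.0)
similarities = {
    "linear": linear_kernel(p, q),
    "polynomial": polynomial_kernel(p, q),
    "rbf": rbf_kernel(p, q),
}
print(similarities)
```

Because the SVM sees the data only through these similarity values, a different kernel generally leads to a different decision boundary, not just a different running time.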
True or False: Decision trees can perform both classification and regression tasks
TRUE
True or False: Decision trees can be understood as a lot of “if/else” statements
TRUE
Explain how decision trees are structured
A tree starts with a root node, which splits into decision nodes. Each decision node represents a question that splits the data, and each outcome of the question leads to a branch (sub-tree). After a number of decision nodes, every path ends at a terminal (leaf) node, which represents an output.
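The structure can be sketched as nested if/else statements. The tree below is hand-written and purely hypothetical (the features, thresholds, and outputs are invented for illustration); a trained tree has the same shape, with the questions learned from data.

```python
def should_play_outside(outlook, humidity, windy):
    """Hypothetical hand-written decision tree."""
    # Root node: the first question asked of every observation.
    if outlook == "sunny":
        # Decision node on the "sunny" branch.
        if humidity > 70:
            return "stay home"  # terminal (leaf) node
        return "play"           # terminal (leaf) node
    if outlook == "rainy":
        # Decision node on the "rainy" branch.
        if windy:
            return "stay home"
        return "play"
    # Any other outlook goes straight to a leaf.
    return "play"

print(should_play_outside("sunny", 80, False))  # stay home
print(should_play_outside("rainy", 50, False))  # play
```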
What are some of the advantages of decision trees?
Simple to understand and interpret
It implicitly performs feature selection
It can handle both numerical and categorical data
Requires relatively little effort for data preparation
Non-linear relationships do not impact the model’s performance
Has many use cases
What are some of the disadvantages of decision trees?
Can overfit the data
Can be unstable: small variations in the data can produce a very different tree
Greedy training algorithms cannot guarantee the globally optimal tree (can be mitigated by training multiple trees)
Describe the Decision Tree Training Algorithm
Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit.
2. Else, pick the best test to split the data on.
3. Split the training set according to the outcome of the test.
4. Recursively repeat steps 1-3 on each subset of the training data.
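The four steps can be sketched in plain Python as below. This is a minimal illustration, not a production implementation: it assumes numeric features, tests of the form "feature <= threshold", Gini impurity as the criterion for "best test", and a majority-class leaf when no test improves purity (a stopping rule added here so the recursion always terminates). All function names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels):
    # Step 1: all instances share one class -> create a leaf.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Step 2: pick the best test.
    split = best_split(rows, labels)
    if split is None:
        # Added stopping rule: no test helps, so use the majority class.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    f, t = split
    # Step 3: split the training set by the outcome of the test.
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    # Step 4: recurse on each subset.
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left]),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right]),
    }

def predict(tree, row):
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
tree = build_tree(rows, labels)
print(predict(tree, (1.5, 0.0)), predict(tree, (3.5, 9.0)))  # a b
```

Picking the locally best test at each node is what makes the algorithm greedy.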
What is Gini used for?
Gini is a measure of impurity: it shows how pure a node is, i.e. how similar the classes of the observations in that node are.
So, if a test splits the data into two branches, the ideal scenario is for one branch to contain all observations of one class and the other branch all observations of the other class. In this case, the Gini index of each branch would be 0.
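For reference, the usual definition is Gini = 1 - sum of p_i squared, where p_i is the share of class i in the node. A small sketch (the function name and sample labels are illustrative):

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum(p_i ** 2) over the classes present in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini_impurity(["a", "a", "a", "a"])   # only one class in the node
mixed = gini_impurity(["a", "a", "b", "b"])  # two classes, evenly mixed
print(pure, mixed)  # 0.0 0.5
```

For two classes, 0.5 is the highest possible value, reached when the node is an even mix.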
What is Entropy?
Entropy is an impurity measure (very similar to the Gini index; only the calculation differs)
- A set’s entropy is zero when it contains instances of only one class.
- A reduction in entropy is called information gain (from Shannon’s information theory)
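A short sketch of both quantities, using the standard definition of entropy as the sum of -p_i * log2(p_i) over the classes. Information gain is the parent's entropy minus the size-weighted entropy of the child nodes. The function names and the toy labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Sum of -p_i * log2(p_i) over the classes present in the set."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["a", "a", "b", "b"]
perfect_gain = information_gain(parent, [["a", "a"], ["b", "b"]])
useless_gain = information_gain(parent, [["a", "b"], ["a", "b"]])
print(entropy(parent), perfect_gain, useless_gain)  # 1.0 1.0 0.0
```

A split that separates the classes perfectly removes all of the entropy, while a split that leaves both children as mixed as the parent gains nothing.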
What does “Ensemble” mean?
A group of something (Musicians, actors, decision trees;)
Why is Random Forest an Ensemble model?
A random forest trains multiple decision trees and combines their predictions (therefore, an ENSEMBLE of decision trees)
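The combining step can be sketched as a majority vote. The three "trees" below are hypothetical hand-written rules, used only to show the vote; in an actual random forest each tree is trained on a random bootstrap sample of the data and considers a random subset of features at each split.

```python
from collections import Counter

# Three hypothetical "trees" (invented rules, not trained models).
def tree_a(x):
    return "spam" if x["links"] > 2 else "ham"

def tree_b(x):
    return "spam" if x["caps_ratio"] > 0.5 else "ham"

def tree_c(x):
    return "spam" if x["links"] > 0 and x["caps_ratio"] > 0.3 else "ham"

def forest_predict(trees, x):
    """Classification by majority vote over the individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
email = {"links": 1, "caps_ratio": 0.6}
votes = [tree(email) for tree in trees]
prediction = forest_predict(trees, email)
print(votes, prediction)  # ['ham', 'spam', 'spam'] spam
```

One tree disagrees, but the ensemble still returns the majority answer, which is why averaging many trees tends to be more stable than relying on a single one.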