Lecture 6 - Support Vector Machines, Decision Tree, Ensemble, Hypothesis Testing Flashcards

1
Q

Are Support Vector Machines (SVM) used for supervised or unsupervised learning?

A

SVM is a set of related supervised learning methods

2
Q

Is SVM used for Regression or Classification?

A

TRICK QUESTION: It’s used for both Regression and Classification.

3
Q

How does SVM work on nonlinear data?

A

It maps the data into a higher-dimensional space (adds a dimension), where the two classes become linearly separable, and then finds the separating hyperplane there.
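A tiny Python sketch of this "add a dimension" idea (the toy data and threshold are made up for illustration):

```python
# 1-D points labeled by whether they are "outer" (|x| large) cannot be
# split by a single threshold on x, but after mapping x -> (x, x**2)
# a threshold on the new x**2 axis separates them.
points = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]           # 1 = "outer" class, 0 = "inner"

lifted = [(x, x * x) for x in points]    # add a second dimension

# In the lifted space, the hyperplane x2 = 2.25 separates the classes:
predicted = [1 if x2 > 2.25 else 0 for (_, x2) in lifted]
print(predicted == labels)               # True: now linearly separable
```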

4
Q

What are some of the strengths and weaknesses of SVM?

A

Strengths:

  • Easy training
  • No local optima
  • Scales well
  • Trade-off between classifier complexity and error can be controlled explicitly

Weaknesses:

  • Efficiency depends on choosing a good kernel function

5
Q

True or False: Changing Kernel will not give different results. Changing kernel is only related to the speed of the functions.

A

FALSE: Changing Kernel will give different results!

6
Q

True or False: Decision trees can perform both classification and regression tasks

A

TRUE

7
Q

True or False: Decision trees can be understood as a lot of “if/else” statements

A

TRUE

8
Q

Explain how decision trees are structured

A

It starts with a root node that splits into decision nodes (each node represents a question that splits the data). These can be seen as branches/sub-trees. After a number of decision nodes, each path ends at a terminal node (leaf), which represents an output.
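A minimal sketch of this structure in Python, treating the tree as nested if/else tests (the features and outputs are hypothetical):

```python
# Root node, decision nodes (questions about features), and terminal
# nodes (outputs), written as plain if/else statements.
def play_outside(temperature_c, raining):
    if raining:                   # root node: "is it raining?"
        return "stay in"          # terminal node
    else:
        if temperature_c >= 15:   # decision node on a numeric feature
            return "play"         # terminal node
        else:
            return "stay in"      # terminal node

print(play_outside(20, raining=False))  # play
print(play_outside(20, raining=True))   # stay in
```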

9
Q

What are some of the advantages of decision trees?

A

  • Simple to understand and interpret
  • Implicitly performs feature selection
  • Can handle both numerical and categorical data
  • Requires relatively little effort for data preparation
  • Non-linear relationships do not impact the model’s performance
  • Has many use cases

10
Q

What are some of the disadvantages of decision trees?

A

  • Can overfit the data
  • Can be unstable: small variations in the data can produce a completely different tree
  • Greedy algorithms cannot guarantee returning the globally optimal tree (can be mitigated by training multiple trees)

11
Q

Describe the Decision Tree Training Algorithm

A

Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit.
2. ELSE, pick the best test to split the data on.
3. Split the training set according to the value of the outcome of the test.
4. Recursively repeat steps 1-3 on each subset of the training data.
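The algorithm above can be sketched in Python for a single numeric feature (toy data; Gini impurity stands in for "the best test" criterion):

```python
# Minimal recursive tree builder: leaf if pure, else pick the threshold
# with the lowest weighted Gini and recurse on the two subsets.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def build(data):                                  # data: list of (x, label)
    labels = [y for _, y in data]
    if len(set(labels)) == 1:                     # step 1: pure -> leaf
        return labels[0]
    xs = sorted({x for x, _ in data})
    best = None
    for lo, hi in zip(xs, xs[1:]):                # step 2: best split
        t = (lo + hi) / 2
        left = [(x, y) for x, y in data if x <= t]
        right = [(x, y) for x, y in data if x > t]
        score = (len(left) * gini([y for _, y in left])
                 + len(right) * gini([y for _, y in right])) / len(data)
        if best is None or score < best[0]:
            best = (score, t, left, right)
    _, t, left, right = best                      # step 3: split the set
    return (t, build(left), build(right))         # step 4: recurse

def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

tree = build([(1, "a"), (2, "a"), (8, "b"), (9, "b")])
print(predict(tree, 1.5), predict(tree, 8.5))     # a b
```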

12
Q

What is Gini used for?

A

Gini is a measure of impurity: it shows how pure a node is, i.e., how similar the classes of the observations in that node are

So, if a test splits a node into two branches, the ideal scenario is that one class ends up entirely in one branch and the other class entirely in the other. In that case, the Gini index is 0.
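A minimal sketch of the computation, assuming the class counts in a node are known:

```python
# Gini impurity of a node from its class counts: G = 1 - sum(p_k^2).
# Zero means the node is pure.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))   # 0.0 -> perfectly pure node (the ideal case)
print(gini([5, 5]))    # 0.5 -> maximally mixed for two classes
```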

13
Q

What is Entropy?

A

Entropy is used as an impurity measure (pretty similar to the Gini index; only the calculations differ)

  • A set’s entropy is zero when it contains instances of only one class.
  • Reduction of entropy is called an information gain (Shannon’s information theory)
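A quick sketch of entropy and the resulting information gain (the class counts are made up):

```python
import math

# Entropy H = -sum(p_k * log2(p_k)); zero for a set with one class.
def entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Information gain = entropy(parent) - weighted entropy(children)
parent = entropy([5, 5])                                   # 1.0 bit, maximally mixed
children = 0.5 * entropy([5, 0]) + 0.5 * entropy([0, 5])   # 0.0, both pure
print(parent - children)                                   # 1.0 -> a perfect split
```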
14
Q

What does “Ensemble” mean?

A

A group of something (musicians, actors, decision trees…)

15
Q

Why is Random Forest an Ensemble model?

A

A random forest combines multiple decision trees (therefore, an ENSEMBLE of decision trees)

16
Q

How does a Random Forest work?

A

In classification:
It creates multiple decision trees. Each tree votes for one class, and the class with the most votes is selected.

In regression:
It creates multiple decision trees. Each tree predicts one value, and the ensemble outputs the average of all the trees' predictions.
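A minimal sketch of the aggregation step, assuming each tree's prediction has already been computed:

```python
from collections import Counter
from statistics import mean

# Classification: majority vote over the trees' class votes
tree_votes = ["cat", "dog", "cat", "cat", "dog"]
majority = Counter(tree_votes).most_common(1)[0][0]
print(majority)               # cat

# Regression: average of the trees' predicted values
tree_values = [3.0, 3.5, 4.0, 2.5]
average = mean(tree_values)
print(average)                # 3.25
```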

17
Q

What are the advantages of using Random Forests?

A
  1. Can be used for both classification and regression tasks
  2. Handles missing values and maintains accuracy for missing data
  3. Won't overfit the model (not fully true: hyperparameter tuning may still be needed to avoid overfitting)
  4. Handles large datasets with higher dimensionality
18
Q

What are the disadvantages of using Random Forests?

A
  1. Does better at regression than classification (not sure if this is true?)
  2. You have very little control over what the model does
19
Q

Walk me through the Random Forest algorithm (pseudocode)

A
  1. Assume the number of cases in the training set is N. A sample of these N cases is taken at random, but with replacement (a bootstrap sample).
  2. If there are M input variables (features), a number m < M is specified such that, at each node, m variables are selected at random out of the M and the best split on these m is used to split the node.
20
Q

What is Bagging?

A

Bagging (bootstrap aggregating) is a machine learning ensemble meta-algorithm designed to improve stability and accuracy, as well as reduce variance

Bagging builds many independent predictors and combines them using some model averaging technique (average, majority vote)
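A toy sketch of bagging, using the sample mean as a stand-in for a real predictor:

```python
import random

# Bagging: train many predictors on bootstrap samples (drawn with
# replacement), then average their outputs. The "predictor" here is
# just the sample mean, to keep the illustration tiny.
random.seed(0)
data = [2.0, 4.0, 6.0, 8.0, 10.0]

predictors = []
for _ in range(200):
    bootstrap = random.choices(data, k=len(data))   # sample w/ replacement
    predictors.append(sum(bootstrap) / len(bootstrap))

ensemble_prediction = sum(predictors) / len(predictors)
print(round(ensemble_prediction, 1))   # close to the true mean, 6.0
```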

21
Q

What is Boosting?

A

Boosting is a machine learning ensemble meta-algorithm used primarily for reducing bias, and also variance, in supervised learning

In boosting, predictors are not built independently but sequentially, each one learning from the mistakes of the previous ones

22
Q

What are the steps (conceptually) of Gradient Boosting?

A
  1. Fit an additive model (ensemble) in a forward stage-wise manner
  2. In each stage, introduce a weak learner to compensate for the shortcomings of the existing weak learners
    * In Gradient Boosting, “shortcomings” are identified by gradients
    * Gradients tell us how to improve the model
23
Q

Here’s a very simple conceptualisation of how gradient boosting works

A

Diana is good at guessing people's ages (she's the base predictor)

Simon has noticed that when Diana guesses the ages of men, she usually undershoots by around 3 years, and when she guesses women, she overshoots by around 3 years. (Therefore, Simon is a "gradient" that corrects the shortcoming of his predecessor)

Yannic has noticed that Diana and Simon guess wrong if the person is from Sweden: they usually overshoot by around one year. (Yannic is another "gradient" that corrects the shortcomings of his predecessors)

Now, they are trying to guess the age of a woman called Olivia.
Diana guesses 23; Simon notices that Olivia is a woman and tells us to subtract 3; Yannic notices that Olivia is Swedish and tells us to subtract 1.

Therefore, the ensemble's guess for Olivia's age is:
23 - 3 - 1 = 19
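The story translates directly into code (names and corrections exactly as in the example above):

```python
# Each "person" is a weak learner correcting the ensemble's residual.
def diana(person):                 # base predictor
    return 23

def simon(person):                 # corrects the gender bias
    return -3 if person["gender"] == "woman" else +3

def yannic(person):                # corrects the Sweden bias
    return -1 if person["country"] == "Sweden" else 0

olivia = {"gender": "woman", "country": "Sweden"}
prediction = diana(olivia) + simon(olivia) + yannic(olivia)
print(prediction)                  # 23 - 3 - 1 = 19
```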

24
Q

What is a support vector in SVM?

A

Support vectors are the datapoints that the margin of the hyperplane pushes up against (so, the points that are closest to the opposite class)

25
Q

What Kernels can you use in SVM?

A

linear, polynomial, RBF (radial basis function), and sigmoid

26
Q

Define a Kernel function

A

A Kernel function is a function that takes as inputs vectors in the original space and returns the dot product of the vectors in the feature space

A kernel function is used in SVM to implicitly map every point into a higher-dimensional space, without ever computing the transformation explicitly
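A quick numeric check of the definition, using the polynomial kernel K(x, y) = (x·y)² whose explicit feature map is known:

```python
import math

# For 2-D inputs, K(x, y) = (x . y)^2 equals the dot product of the
# explicit feature map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2),
# without ever computing phi.
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def kernel(x, y):                       # works in the original space
    return dot(x, y) ** 2

def phi(x):                             # explicit higher-dim mapping
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, 4.0)
print(kernel(x, y), dot(phi(x), phi(y)))   # both 121 (up to rounding)
```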

27
Q

How to choose which Kernel type to use?

A

You can try to visually inspect the data with the different types of kernel and see which one sets proper boundaries for the groups.

Or, as a rule of thumb, RBF is a good start if you don't have expert knowledge of the data

28
Q

Define a decision tree.

A

A decision tree is a flow-chart like structure where each internal node denotes a test on an attribute, each branch represents an outcome of a test, and each leaf (terminal node) holds a class label

29
Q

What methods do decision trees use to decide where to split the nodes?

A

Gini index, chi-square, information gain, entropy…

The algorithm calculates the weighted Gini (the total Gini for a tree), then explores other values and features to serve as tests in the nodes, and eventually chooses the tree that has the lowest Gini score
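A minimal sketch of the weighted Gini computation for one candidate split (the class counts are made up):

```python
# Each child's impurity is weighted by the fraction of samples it
# receives; the split with the lowest weighted Gini is chosen.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):            # children: list of class counts
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

print(weighted_gini([[4, 0], [0, 4]]))  # 0.0 -> perfect split
print(weighted_gini([[2, 2], [2, 2]]))  # 0.5 -> useless split
```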

30
Q

What are the steps (programmatically) of GB?

A
  1. The ensemble has one tree; its predictions are the same as the first tree's predictions
  2. A new tree is trained on the residual errors of the first tree. The ensemble's prediction is equal to the sum of all predictions so far
  3. Another tree is trained on the residuals of the previous ensemble, and so on
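The steps above can be sketched with a trivial stand-in "tree" (a one-split stump; the data and split point are made up):

```python
# Each stump predicts the mean residual on each side of a fixed split.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 9.0, 11.0]

def fit_stump(xs, residuals, split=2.5):
    left = [r for x, r in zip(xs, residuals) if x <= split]
    right = [r for x, r in zip(xs, residuals) if x > split]
    lmean, rmean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lmean if x <= split else rmean

# Step 1: the ensemble starts as a single constant "tree" (the mean)
pred = [sum(ys) / len(ys)] * len(xs)            # [7.0, 7.0, 7.0, 7.0]
# Steps 2-3: each new tree fits the current residuals; the ensemble's
# prediction is the running sum of all the trees' predictions
for _ in range(3):
    residuals = [y - p for y, p in zip(ys, pred)]
    stump = fit_stump(xs, residuals)
    pred = [p + stump(x) for p, x in zip(pred, xs)]

print(pred)   # [4.0, 4.0, 10.0, 10.0] -> residuals shrunk toward the data
```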
31
Q

There is this learning rate in GB that we wrote nothing about. Remember to read more on that. (comes from the learning rate of gradient boosting I think)

A

OK. The learning rate corresponds to how quickly the error is corrected from each tree to the next; it is a simple multiplier applied to each tree's contribution

32
Q

When doing Gradient Boosting, the initial “predictions” are calculated differently in Classification and Regression problems. What is the difference?

A

Classification: calculates the log(odds) and then transforms it into a probability to compute residuals

Regression: calculates the average and then uses it to compute residuals
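A quick numeric sketch of both initializations (the labels and targets are made up):

```python
import math

# Regression: the initial prediction is the mean of the targets
ys = [10.0, 20.0, 30.0]
init_reg = sum(ys) / len(ys)
print(init_reg)                       # 20.0

# Classification: the initial prediction is log(odds) of the positive
# class, converted to a probability (sigmoid) to compute residuals
labels = [1, 1, 1, 0]                 # 3 positives, 1 negative
log_odds = math.log(3 / 1)
prob = 1 / (1 + math.exp(-log_odds))
print(round(prob, 2))                 # 0.75
residuals = [y - prob for y in labels]
```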

33
Q

What is a Hypothesis space?

A

The set of all hypotheses that can be produced by the learning algorithm

For decision trees, the hypothesis space is the set of all possible finite discrete functions

Every finite discrete function can be represented by some decision tree

34
Q

What kinds of functions can decision trees express?

A
  • Boolean functions can be fully expressed
  • Some functions are harder to encode (they require exponentially large trees)
  • Functions with real-valued features can be handled via threshold tests
35
Q

What is a Null Hypothesis(H0)?

A

The null hypothesis is a statement about a population parameter that is assumed to be true unless there is convincing evidence to the contrary

36
Q

What is an Alternative Hypothesis(Ha)?

A

The alternative hypothesis is a statement about a population parameter that contradicts H0 and is accepted as true only if there is convincing evidence in favor of it.

37
Q

Hypothesis testing is a…

A

statistical procedure in which a choice is made between H0 and Ha, based on information in a sample.

Result: Reject H0 (and therefore accept Ha) or fail to reject H0 (and therefore fail to accept Ha)
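A sketch of the procedure as a one-sample z-test (σ assumed known; the sample, H0, and α are made up):

```python
import math

# H0: population mean = 100; Ha: mean != 100; alpha = 0.05.
sample = [104.0, 106.0, 103.0, 105.0, 107.0, 104.0, 106.0, 105.0, 103.0]
mu0, sigma, alpha = 100.0, 3.0, 0.05

n = len(sample)
xbar = sum(sample) / n
z = (xbar - mu0) / (sigma / math.sqrt(n))
# two-sided p-value from the standard normal CDF (via math.erf)
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(round(z, 2), decision)
```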

38
Q

True or False? When we fail to reject the null hypothesis, we consider it true.

A

False. When you fail to reject the null hypothesis, you are not able to make any inferences about the population mean

39
Q

What are the two types of error in hypothesis testing?

A

Type I error: we decide to reject the null hypothesis, but the null is in fact true (a false positive)

Type II error: we decide not to reject the null hypothesis, but the null is in fact false (a false negative)