Lecture 6 - Support Vector Machines, Decision Tree, Ensemble, Hypothesis Testing Flashcards
Are Support Vector Machines (SVM) used for supervised or unsupervised learning?
SVMs are a set of related supervised learning methods
Is SVM used for Regression or Classification?
TRICK QUESTION: It’s used for both Regression and Classification.
How does SVM work on nonlinear data?
It maps the data into a higher-dimensional space (adds a dimension), where the two groups become linearly separable.
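A minimal sketch of this idea, assuming scikit-learn is available (not from the lecture): an RBF-kernel SVM separating two classes that are not linearly separable in the original 2-D space.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data: two concentric circles, impossible to separate with a straight line.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the points into a higher-dimensional space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))  # close to 1.0 on this toy data
```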
What are some of the strengths and weaknesses of SVM?
Strengths:
- Easy training
- No local optima (the optimization problem is convex)
- Scales well
- Trade-off between classifier complexity and error can be controlled explicitly
Weaknesses:
- Efficiency depends on choosing a good kernel function
True or False: Changing the kernel will not give different results; the kernel only affects the speed of the computation.
FALSE: Changing the kernel will give different results!
True or False: Decision trees can perform both classification and regression tasks
TRUE
True or False: Decision trees can be understood as a lot of “if/else” statements
TRUE
Explain how decision trees are structured
It starts with a root node that splits into decision nodes (each node represents a question that splits the data). These form different branches/sub-trees. After a number of decision nodes, each path ends in a terminal node that represents an output.
What are some of the advantages of decision trees?
Simple to understand and interpret
It implicitly performs feature selection
It can handle both numerical and categorical data
Requires relatively little effort for data preparation
Non-linear relationships do not impact the model’s performance
Has many use cases
What are some of the disadvantages of decision trees?
Can overfit data
Decision trees can be unstable: small variations in the data can result in a completely different tree being generated
Greedy algorithms cannot guarantee to return the globally optimal tree (can be mitigated by training multiple trees)
Describe the Decision Tree Training Algorithm
Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit.
2. ELSE pick the best test to split the data on
3. Split the training set according to the value of the outcome of the test
4. Recursively repeat steps 1-3 on each subset of the training data
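Below is a minimal, self-contained Python sketch of these four steps, assuming numeric features, integer class labels, and Gini as the impurity measure; it is an illustration, not a production implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y):
    # 1. If all training instances have the same class, create a leaf and exit.
    if len(np.unique(y)) == 1:
        return {"leaf": int(y[0])}
    # 2. ELSE pick the best test (feature, threshold) = lowest weighted Gini.
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:          # candidate thresholds
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:                               # no split possible: majority-class leaf
        return {"leaf": int(np.bincount(y).argmax())}
    _, f, t = best
    mask = X[:, f] <= t
    # 3.-4. Split the training set and recurse on each subset.
    return {"feature": f, "threshold": float(t),
            "left": build_tree(X[mask], y[mask]),
            "right": build_tree(X[~mask], y[~mask])}

# Tiny illustrative dataset: class is determined by the first feature.
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 8.0]])
y = np.array([0, 0, 1, 1])
print(build_tree(X, y))
```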
What is Gini used for?
Gini is a measure of impurity: it shows how pure a node is, i.e. how similar the classes of the observations in a given node are
So, if there is a test and one branch leads to two options, the ideal scenario would be to have one entire class in one option, and the other class in the other option. In this case, the GINI index would be 0.
What is Entropy?
Entropy is used as an impurity measure (pretty similar to the GINI index, only the calculations differ)
- A set’s entropy is zero when it contains instances of only one class.
- Reduction of entropy is called information gain (Shannon's information theory)
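A small sketch of how entropy and information gain could be computed from class counts (toy numbers, just for illustration):

```python
import math

def entropy(class_counts):
    """Shannon entropy of a node, from the counts of each class."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)

print(entropy([10, 0]))  # 0.0 -> only one class present
print(entropy([5, 5]))   # 1.0 -> maximally impure for two classes

# Information gain = entropy(parent) - weighted entropy(children)
parent, left, right = [5, 5], [5, 1], [0, 4]
gain = entropy(parent) - (6 / 10) * entropy(left) - (4 / 10) * entropy(right)
print(gain)              # about 0.61
```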
What does “Ensemble” mean?
A group of something (Musicians, actors, decision trees;)
Why is Random Forest an Ensemble model?
A random forest model combines multiple decision trees (therefore, an ENSEMBLE of decision trees)
How does a Random Forest work?
In classification:
It creates multiple decision trees. Each tree votes for one class, and the class with the most votes is the one selected
In regression:
It creates multiple decision trees. Each tree predicts one value, and the forest returns the average of the values predicted by all the trees.
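A hedged sketch using scikit-learn (the library and datasets are assumptions, not from the lecture) showing both modes:

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: each of the 100 trees votes for a class; the majority wins.
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:3]))

# Regression: the forest returns the average of the 100 trees' predictions.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:3]))
```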
What are the advantages of using Random Forests?
- Can be used for both classification and regression tasks
- Handles missing values and maintains accuracy for missing data
- Won't overfit the model (not fully true: hyperparameter tuning can be used to avoid overfitting)
- Handles large datasets with higher dimensionality
What are the disadvantages of using Random Forests?
- Does better at classification than regression, since regression predictions are averages of training values and cannot extrapolate beyond them
- You have very little control over what the model does (it is largely a black box)
Walk me through the Random Forest algorithm (pseudocode)
- Assume the number of cases in the training set is N. For each tree, a sample of these N cases is taken at random but with replacement (a bootstrap sample).
- If there are M input variables (features), a number m << M is specified such that at each node, m features are selected at random out of the M and the best split on these m is used to split the node; m is held constant while the forest is grown.
- Each tree is grown as large as possible; new data is predicted by aggregating the trees' predictions (majority vote for classification, average for regression).
What is Bagging?
Bagging is a machine learning ensemble meta-algorithm designed to improve stability and accuracy, as well as reduce variance
Bagging builds many independent predictors and combines them using some model averaging technique (average, majority vote)
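A minimal sketch of bagging with scikit-learn's BaggingClassifier (assumed library and dataset, just for illustration): 50 trees trained on bootstrap samples and combined by majority vote.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 50 independent trees, each fit on a bootstrap sample of the training data;
# their predictions are combined by majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())
```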
What is Boosting?
Boosting is a machine learning ensemble meta-algorithm primarily for reducing bias and also variance in supervised learning
Boosting builds predictors not independently but sequentially, each new predictor compensating for the errors of its predecessors
What are the steps (conceptually) of Gradient Boosting?
- Fit an additive model (ensemble) in a forward stage-wise manner
- In each stage, introduce a weak learner to compensate for the shortcomings of the existing weak learners
* In Gradient Boosting, “shortcomings” are identified by gradients
* Gradients tell us how to improve the model
Here’s a very simple conceptualisation of how gradient boosting works
Diana is good at guessing ages of people (She’s the base Predictor)
Simon has noticed that when Diana guesses the ages of men, she usually undershoots by around 3 years, and when she guesses women, she overshoots by around 3 years. (Therefore, Simon is a gradient that corrects the shortcoming of the predecessor)
Yannic has noticed that Diana and Simon are guessing wrong if the person is from Sweden. He found that they usually overshoot by around one year (Yannic is another gradient that helps the shortcoming of the predecessors)
Now, they are trying to guess the age of a woman called Olivia.
Diana guesses 23, Simon notices that Olivia is a woman and tells us to subtract 3, and Yannic notices that Olivia is Swedish and tells us to subtract 1
Therefore, the ensemble's guess for Olivia's age is:
23 - 3 - 1 = 19
What is a support vector in SVM?
Support vectors are the datapoints that the margin of the hyperplane pushes up against (so, the points that are closest to the opposite class)
What Kernels can you use in SVM?
linear, polynomial, RBF and sigmoid
Define a Kernel function
A Kernel function is a function that takes as inputs vectors in the original space and returns the dot product of their images in the feature space
A Kernel function is used in SVM to map every point into a higher dimensional space via a transformation
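A small worked example of this idea (toy vectors, degree-2 polynomial kernel as the assumed example): the kernel value computed in the original space equals the dot product of the explicit feature maps.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 kernel on 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print((x @ z) ** 2)     # kernel K(x, z) = (x . z)^2, computed in the original space -> 16.0
print(phi(x) @ phi(z))  # same value, computed as a dot product in the feature space -> 16.0
```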
How to choose which Kernel type to use?
You can visually inspect the data with the different kernel types and see which one sets proper boundaries between the groups.
Or, usually, RBF is a good start if you don't have domain expertise about the data.
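Besides visual inspection, another option is to compare kernels by cross-validated accuracy; the sketch below assumes scikit-learn and an illustrative dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try each kernel and report its mean 5-fold cross-validation accuracy.
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    score = cross_val_score(SVC(kernel=kernel, gamma="scale"), X, y, cv=5).mean()
    print(f"{kernel:8s} {score:.3f}")
```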
Define a decision tree.
A decision tree is a flow-chart like structure where each internal node denotes a test on an attribute, each branch represents an outcome of a test, and each leaf (terminal node) holds a class label
What methods do decision trees use to decide where to split the nodes?
Gini Index, Chi-square, Information gain, Entropy…
For each candidate test, the algorithm calculates the weighted Gini (the Gini of each resulting node weighted by the fraction of observations that falls into it), explores different values and features to serve as tests in the nodes, and eventually chooses the split with the lowest weighted Gini score
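A tiny sketch of the weighted Gini for one candidate test (toy class counts, just for illustration):

```python
def gini(counts):
    """Gini impurity from a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts falling into the two branches of a candidate test.
left, right = [8, 2], [1, 9]
n = sum(left) + sum(right)
weighted_gini = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(weighted_gini)   # 0.25 -- the test with the lowest weighted Gini is chosen
```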
What are the steps (programmatically) of GB?
- The ensemble starts with one tree; its predictions are that first tree's predictions
- A new tree is trained on the residual errors of the first tree; the ensemble's prediction is now the sum of all trees' predictions so far
- Another tree is trained on the residuals of the previous ensemble, and so on
- The learning rate corresponds to how quickly the error is corrected from each tree to the next; it is a simple multiplier on each new tree's contribution
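A minimal sketch of these steps for regression, assuming scikit-learn's DecisionTreeRegressor as the weak learner (the dataset and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

learning_rate, trees = 0.1, []
prediction = np.full(len(y), y.mean())             # regression: start from the average

for _ in range(100):
    residuals = y - prediction                     # the current ensemble's shortcomings
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)  # add a scaled correction

print(np.mean((y - prediction) ** 2))              # training error shrinks stage by stage
```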
When doing Gradient Boosting, the initial “predictions” are calculated differently in Classification and Regression problems. What is the difference?
Classification: calculates the log(odds) and then transforms it into a probability to compute residuals
Regression: calculates the average and then uses it to compute residuals
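A small numeric sketch of the two starting points (toy numbers only):

```python
import math

# Regression: the initial prediction is the average of the targets.
targets = [12.0, 15.0, 18.0]
print(sum(targets) / len(targets))            # 15.0

# Classification: the initial prediction is log(odds) of the positive class,
# converted to a probability (sigmoid) before residuals are computed.
positives, negatives = 6, 4
log_odds = math.log(positives / negatives)    # about 0.405
probability = 1 / (1 + math.exp(-log_odds))   # 0.6
print(log_odds, probability)
```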
What is a Hypothesis space?
The set of all hypotheses that can be produced by the learning algorithm
Hypothesis space is a set of all possible finite discrete functions
Every finite discrete function can be represented by some decision tree
What kinds of functions can decision trees express?
- Boolean functions can be fully expressed
- Some functions are harder to encode compactly (the tree can become very large)
- Functions with real-valued features are handled through threshold tests on those features
What is a Null Hypothesis(H0)?
A null hypothesis is a statement about a population parameter that is assumed to be true unless there is convincing evidence to the contrary
What is an Alternative Hypothesis(Ha)?
An alternative hypothesis is a statement about a population parameter that contradicts H0 and is accepted as true only if there is convincing evidence in favor of it.
Hypothesis testing is a…
statistical procedure in which a choice is made between H0 and Ha, based on information in a sample.
Result: Reject H0 (and therefore accept Ha) or fail to reject H0 (and therefore fail to accept Ha)
True or False? When we fail to reject the null hypothesis, we consider it true.
False. When you fail to reject the null hypothesis, you are not able to make any inferences about the population mean
What are the two types of error in hypothesis testing?
Type I error: we decide to reject the null hypothesis, but the null hypothesis is in fact true
Type II error: we decide not to reject the null hypothesis, but the null hypothesis is in fact false
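A hedged illustration with SciPy (an assumption, not from the lecture): a one-sample t-test of H0: "the population mean is 50", using made-up sample values.

```python
from scipy import stats

# H0: the population mean is 50; Ha: it is not.
sample = [51.2, 49.8, 52.5, 50.9, 53.1, 51.7, 52.0, 50.4]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05
if p_value < alpha:
    print("Reject H0 (and therefore accept Ha)")
else:
    print("Fail to reject H0")   # no inference about the population mean is made
```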