ML Flashcards

1
Q

Decision Trees

A
Training
– Entropy-based Methods
Strengths
– Easy to Understand
– Easy to Implement
Weaknesses
– Straightforward only for categorical data; continuous attributes require discretization
– Tendency to overfit
2
Q

TDIDT

A

Top Down Induction of Decision Trees
Also known as ID3 (Quinlan)
• To construct decision tree T from learning set S:
– If all examples in S belong to some class C Then make leaf labeled C
– Otherwise
• select the “most informative” attribute A
• partition S according to A’s values
• recursively construct subtrees T1, T2, …, for the subsets of S (one per each value of A)
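
A minimal Python sketch of this recursion (illustrative only: the entropy helper, the dict-of-dicts tree representation, and the data format are assumptions, not part of the original card):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())

def id3(examples, labels, attributes):
    """examples: list of dicts {attribute: value}; labels: parallel list of classes."""
    # If all examples in S belong to some class C, make a leaf labeled C
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: fall back to a majority-class leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Select the "most informative" attribute: lowest residual entropy after the split
    def residual(a):
        res = 0.0
        for v in set(ex[a] for ex in examples):
            subset = [c for ex, c in zip(examples, labels) if ex[a] == v]
            res += len(subset) / len(labels) * entropy(subset)
        return res

    best = min(attributes, key=residual)

    # Partition S according to best's values and recursively build one subtree per value
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for v in set(ex[best] for ex in examples):
        sub_ex = [ex for ex in examples if ex[best] == v]
        sub_lab = [c for ex, c in zip(examples, labels) if ex[best] == v]
        tree[best][v] = id3(sub_ex, sub_lab, remaining)
    return tree
```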

3
Q

TDIDT Attribute Selection Criteria

A
• Main principle
– Select attribute which partitions the learning set into subsets as
“pure” as possible (i.e. homogeneous in terms of class label)
• Various measures of purity
– Information-theoretic 
– Gini index
– χ² (chi-squared)
– ReliefF
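
As a small illustration (the functions and toy label sets below are assumptions, not part of the original card), two of these purity measures computed on a homogeneous and an evenly mixed subset:

```python
import math
from collections import Counter

def entropy(labels):
    """Information-theoretic impurity: -sum_c p(c) * log2 p(c)."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())

def gini(labels):
    """Gini index: 1 - sum_c p(c)^2 (0 for a perfectly pure subset)."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

pure  = ["+"] * 8
mixed = ["+"] * 4 + ["-"] * 4
print(entropy(pure), gini(pure))    # 0.0 0.0  (homogeneous: minimum impurity)
print(entropy(mixed), gini(mixed))  # 1.0 0.5  (evenly mixed: maximum impurity)
```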
4
Q

Training error

A

classification error estimated on the training data; a measure of how well the model fits the training set

5
Q

Test error

A

classification error estimated on test data that is independent of the data the model was trained on; a measure of how well the model fits new data

6
Q

Generalization

A

how well does a learned model generalize from the data it was trained on to a completely new dataset?

Its two components are bias (how far the average learned model is from the true model) and variance (how much models estimated from different training sets differ from each other).

7
Q

Bias

A

how much does the average model, over all training sets, differ from the true model?
• Error due to inaccurate assumptions/simplifications made by the model

8
Q

Variance

A

how much do models estimated from different training sets differ from each other?

To reduce variance:
• Choose a simpler classifier
• Regularize the parameters
• Get more training data (if you can…)

9
Q

Underfitting

A
the learned model is too “simple” to represent all the relevant class characteristics
– High training error and high test error 
– High bias and low variance
10
Q

Overfitting

A

model is too “complex” and fits irrelevant characteristics (noise) in the data
– Low training error and high test error
– Low bias and high variance

11
Q

Naïve Bayes Classifier

A

X = (x1, x2, …, xn) := attribute vector
Suppose there are m classes C1, C2, …, Cm
By Bayes’ theorem: P(Ci | X) = P(X | Ci) P(Ci) / P(X)
Assumption: attributes are conditionally independent given the class, so
P(X | Ci) = P(x1 | Ci) · P(x2 | Ci) · … · P(xn | Ci)
Since P(X) is the same for every class, X is assigned to the class Ci that maximizes P(X | Ci) P(Ci).

• Advantages
– Easy to implement, especially for categorical data
– Very fast to train/test
• Disadvantages
– Assumes class-conditional independence, which can cause a loss of accuracy
– In practice, dependencies exist among variables that the classifier cannot model
• E.g., hospital patient records: profile (age, family history, …), symptoms (fever, cough, …), disease (lung cancer, diabetes, …); these variables are typically dependent on each other
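
A minimal sketch of a categorical Naïve Bayes classifier along these lines (the Laplace smoothing, the toy weather-style data, and all names are illustrative assumptions):

```python
from collections import defaultdict, Counter

def train_nb(X, y):
    """Estimate class priors and per-class value counts from categorical data."""
    priors = Counter(y)                 # class counts -> P(Ci)
    cond = defaultdict(Counter)         # (attribute index, class) -> value counts
    values = defaultdict(set)           # attribute index -> observed values
    for xs, c in zip(X, y):
        for k, v in enumerate(xs):
            cond[(k, c)][v] += 1
            values[k].add(v)
    return priors, cond, values, len(y)

def predict_nb(priors, cond, values, n, xs, alpha=1.0):
    """Assign xs to the class maximizing P(Ci) * prod_k P(xk | Ci) (Laplace-smoothed)."""
    best, best_score = None, -1.0
    for c, nc in priors.items():
        score = nc / n                  # P(Ci)
        for k, v in enumerate(xs):
            score *= (cond[(k, c)][v] + alpha) / (nc + alpha * len(values[k]))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy usage: attributes = (outlook, windy), class label = "play?"
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"), ("overcast", "no")]
y = ["yes", "no", "no", "yes"]
model = train_nb(X, y)
print(predict_nb(*model, ("sunny", "no")))   # expected: "yes"
```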

12
Q

SVM

A

Support Vector Machine
• non-probabilistic binary linear classifier
• constructs a hyperplane or set of hyperplanes in a high-dimensional space
• can be used for either classification or regression

• Pros
– Many publicly available SVM packages (e.g. LibSVM)
– Kernel-based framework is very powerful, flexible
– SVMs work very well in practice, even with very small training sample sizes
• Cons
– No “direct” multi-class SVM, must combine two-class SVMs
– Computation, memory
• Learning can take a very long time for large-scale problems
– Tuning kernel’s parameters might be tricky
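
As an illustration, a minimal usage sketch with scikit-learn, whose SVC class is built on LibSVM (the synthetic dataset and all parameter values are arbitrary assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class problem, split into train and test sets
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF-kernel SVM; the kernel parameters (C, gamma) usually need tuning
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Multi-class problems are handled by internally combining two-class SVMs
# (one-vs-one in LibSVM / scikit-learn).
```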

13
Q

SVM: The maximum margin linear classifier

A

The maximum margin linear classifier is the linear classifier able to separate the training data points with the maximum margin.
Hard Margin:
• Requires all the training data points to be classified correctly
- No training error
Soft Margin:
• Slack variables ξi can be added to allow the misclassification of difficult or noisy examples (which reduces overfitting).
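
In practice the hard/soft trade-off is exposed through a penalty parameter, C in LibSVM and scikit-learn; the sketch below (synthetic overlapping data, arbitrary values) only illustrates the assumed behaviour, it is not part of the original card:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian classes: no hard-margin separation exists
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

hard_like = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C: slack heavily penalized
soft = SVC(kernel="linear", C=0.1).fit(X, y)       # small C: slack (xi_i) tolerated

# A softer margin typically keeps more support vectors and overfits noisy points less
print(len(hard_like.support_), len(soft.support_))
```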

14
Q

Non linear SVM

A

General Idea: The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.
A dataset that is not linearly separable in R^N may be linearly separable in a higher-dimensional space R^M (where M > N).
However, consider the computational and memory consequences of increasing the dimensionality from N to M!
SVM has no need to explicitly work in the higher-dimensional space at training or testing time.
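
To make the computational point concrete, a small sketch of how fast an explicit polynomial feature map grows (scikit-learn's PolynomialFeatures and the chosen sizes are purely illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(5, 20)  # N = 20 original features
for degree in (2, 3, 4):
    M = PolynomialFeatures(degree=degree).fit_transform(X).shape[1]
    print(f"degree {degree}: mapped from R^20 to R^{M}")

# The kernel trick lets the SVM behave as if it worked in R^M
# without ever materializing these feature vectors.
```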

15
Q

SVM kernel trick:

A

By using a kernel function K(xi, xj) we can implicitly transform datasets to a higher-dimensional space using no extra memory, and with a minimal effect on computation time.

  1. The only effect on computation is the extra time required to compute K(xi, xj)
  2. By virtue of (1), we can efficiently learn nonlinear decision boundaries for SVMs simply by replacing all dot products xi · xj in the optimization problem with K(xi, xj)

Unfortunately, choosing the ‘correct’ kernel is a nontrivial task, and may depend on the specific task at hand.
No matter which kernel is chosen, its parameters still need to be tuned to get good performance
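
A small numerical sketch of the trick for the homogeneous quadratic kernel K(x, z) = (x · z)^2, whose implicit feature map contains all pairwise products of the inputs (the code and the example vectors are assumptions for illustration):

```python
import numpy as np

def phi(x):
    """Explicit feature map of the quadratic kernel: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

def K(x, z):
    """Kernel evaluation: the same value, computed in the original low-dimensional space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(np.dot(phi(x), phi(z)))  # dot product in the 9-dimensional feature space: 20.25
print(K(x, z))                 # identical value from a single 3-dimensional dot product
```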

16
Q

Decision Trees: Information-Theoretic Approach for Attribute Selection

A

Classifying an object requires a certain amount of information, I.
After we have learned the value of attribute A, the remaining amount of information needed to classify the object is Ires(A).
The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes the gain: Gain(A) = I - Ires(A).

Average amount of information needed to classify an object is given by the entropy: I = - ∑c p(c) log2 p(c), where p(c) is the proportion of examples of class c.
If all examples belong to the same class, I = 0; if the examples are equally mixed between two classes (maximum entropy), I = 1.

After applying attribute A, S is partitioned into subsets according to the values v of A
• Ires(A) is the weighted sum of the amounts of information (entropies) of the subsets: Ires(A) = ∑v p(v) I(Sv)
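
A short sketch computing Ires and the resulting gain for a single attribute (the toy outlook/play data is an illustrative assumption):

```python
import math
from collections import Counter

def I(labels):
    """Entropy: -sum_c p(c) * log2 p(c)."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain(attr_values, labels):
    """Gain(A) = I(S) - Ires(A), where Ires(A) = sum_v p(v) * I(S_v)."""
    n = len(labels)
    i_res = 0.0
    for v in set(attr_values):
        subset = [c for a, c in zip(attr_values, labels) if a == v]
        i_res += len(subset) / n * I(subset)
    return I(labels) - i_res

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(gain(outlook, play))  # the most informative attribute maximizes this value
```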

17
Q

Ensemble classifiers

A

Majority voting: the final decision corresponds to the decision of the majority of the base learners

The main challenge is to promote diversity: base classifiers need not be highly accurate, but they should make different kinds of mistakes.

Some popular ensemble approaches:
– Bootstrap aggregating (bagging): create different training sets by repeatedly resampling the original training data with replacement, then use them to train the base learners; decreases the error especially of unstable learners (see the sketch after this list)
– Boosting: build a strong learner by combining several weak learners; instead of resampling, re-weight the examples; at each iteration a new weak learner is trained and the training samples are re-weighted to focus the system on examples that the most recently learned classifier got wrong; the final classification is based on a weighted vote of the weak classifiers
• Example: Adaboost
– Random Forests
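
A minimal bagging sketch with decision trees as base learners and majority voting (the scikit-learn components, the synthetic data, and all parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: train each base learner on a bootstrap resample of the training data
rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # sample with replacement
    trees.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

# Majority voting over the base learners' predictions
votes = np.array([t.predict(X_te) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("bagged accuracy:", (ensemble_pred == y_te).mean())
```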