ML Flashcards
Decision Trees
Training – entropy-based methods
• Strengths
– Easy to understand
– Easy to implement
• Weaknesses
– Straightforward only for categorical data; continuous attributes require discretization
– Tendency to overfitting
TDIDT
Top Down Induction of Decision Trees
Also known as ID3 (Quinlan)
• To construct decision tree T from learning set S:
– If all examples in S belong to some class C Then make leaf labeled C
– Otherwise
• select the “most informative” attribute A
• partition S according to A’s values
• recursively construct subtrees T1, T2, … for the subsets of S (one for each value of A)
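A minimal Python sketch of this recursion, assuming examples are given as (attribute-dict, label) pairs and using information gain as the “most informative” criterion (see the purity measures below); the names tdidt, entropy and information_gain are illustrative, not from any particular library:

```python
# Minimal TDIDT / ID3 sketch (illustrative helper names, not Quinlan's original code).
from collections import Counter
from math import log2

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attr):
    total = len(examples)
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def tdidt(examples, attributes):
    labels = {y for _, y in examples}
    if len(labels) == 1:                              # all examples belong to one class C -> leaf C
        return labels.pop()
    if not attributes:                                # no attributes left -> majority-class leaf
        return Counter(y for _, y in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))   # most informative attribute A
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:      # partition S according to A's values
        subset = [(x, y) for x, y in examples if x[best] == value]
        tree[best][value] = tdidt(subset, [a for a in attributes if a != best])
    return tree

# Tiny usage example on hypothetical data:
data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "rainy", "windy": "yes"}, "stay"),
        ({"outlook": "sunny", "windy": "yes"}, "play")]
print(tdidt(data, ["outlook", "windy"]))   # e.g. {'outlook': {'sunny': 'play', 'rainy': 'stay'}}
```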
TDIDT Attribute Selection Criteria
• Main principle
– Select the attribute which partitions the learning set into subsets as “pure” as possible (i.e. homogeneous in terms of class label)
• Various measures of purity
– Information-theoretic
– Gini index
– χ²
– ReliefF
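As a sketch of the “purity” idea, two of the listed measures computed on a subset’s class labels (helper names are hypothetical); both are minimal on a homogeneous subset and maximal on an even class mixture:

```python
# Illustrative purity measures over a list of class labels.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A 50/50 mixture scores the maximum (1.0 for entropy, 0.5 for Gini with two classes);
# a pure subset scores 0 for both.
print(entropy(["+", "-", "+", "-"]), gini(["+", "-", "+", "-"]))
print(entropy(["+", "+", "+", "+"]), gini(["+", "+", "+", "+"]))
```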
Training error
classification error estimated on the training data. Measure of how well the model fits the training set
Test error
classification error estimated on the test data, which is independent of the data the model was trained on. Measure of how well the model fits new data
Generalization
how well does a learned model generalize from the data it was trained on to a completely new dataset?
Bias
how much does the average model over all training sets differ from the true model?
• Error due to inaccurate assumptions/simplifications made by the model
Variance
how much do models estimated from different training sets differ from each other?
To reduce variance:
• Choose a simpler classifier
• Regularize the parameters
• Get more training data (if you can…)
Underfitting
the learned model is too “simple” to represent all the relevant class characteristics
– High training error and high test error
– High bias and low variance
Overfitting
model is too “complex” and fits irrelevant characteristics (noise) in the data
– Low training error and high test error
– Low bias and high variance
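A hedged sketch of how these two regimes show up in practice, assuming scikit-learn is available: on synthetic data, a depth-1 decision tree underfits (high training and test error), while an unconstrained tree tends to overfit (near-zero training error, noticeably higher test error):

```python
# Sketch: comparing training vs. test error to spot under-/overfitting (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, None):                        # very shallow tree vs. unlimited depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = 1 - model.score(X_tr, y_tr)    # training error
    test_err = 1 - model.score(X_te, y_te)     # test error, estimated on held-out data
    print(f"max_depth={depth}: train error {train_err:.2f}, test error {test_err:.2f}")
```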
Naïve Bayes Classifier
X = (x1, x2, …, xn) := attribute vector
Suppose there are m classes C1, C2, …, Cm
Bayes’ rule: P(Ci|X) = P(X|Ci) P(Ci) / P(X); since P(X) is the same for all classes, predict the class Ci that maximizes P(X|Ci) P(Ci)
Assumption: attributes are conditionally independent given the class, i.e. P(X|Ci) = P(x1|Ci) · P(x2|Ci) · … · P(xn|Ci)
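A minimal sketch of this computation for categorical attributes, assuming examples are (attribute-tuple, label) pairs; the names train_nb and predict_nb are illustrative, and no smoothing is applied:

```python
# Minimal Naïve Bayes sketch for categorical attributes (illustrative, no smoothing).
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    priors = Counter(y for _, y in examples)             # class counts for P(Ci)
    cond = defaultdict(Counter)                          # (class, attribute index) -> value counts
    for x, y in examples:
        for i, v in enumerate(x):
            cond[(y, i)][v] += 1
    return priors, cond, len(examples)

def predict_nb(model, x):
    priors, cond, n = model
    best_class, best_score = None, -1.0
    for c, count_c in priors.items():
        score = count_c / n                              # P(Ci)
        for i, v in enumerate(x):
            score *= cond[(c, i)][v] / count_c           # P(xk | Ci), conditional independence
        if score > best_score:                           # argmax of P(X|Ci) P(Ci); P(X) is a common factor
            best_class, best_score = c, score
    return best_class

# Tiny usage example on hypothetical data:
data = [(("sunny", "hot"), "no"), (("rain", "cool"), "yes"), (("sunny", "cool"), "yes")]
print(predict_nb(train_nb(data), ("sunny", "cool")))     # -> 'yes'
```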
• Advantages
– Easy to implement, especially for categorical data
– Very fast to train/test
• Disadvantages
– The assumption of class-conditional independence causes a loss of accuracy
– Practically, dependencies exist among variables, which cannot be modeled by the classifier
• E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
SVM
Support Vector Machine
• non-probabilistic binary linear classifier
• constructs a hyperplane or set of hyperplanes in a high-dimensional space
• can be used for either classification or regression
• Pros
– Many publicly available SVM packages (e.g. LibSVM)
– Kernel-based framework is very powerful, flexible
– SVMs work very well in practice, even with very small training sample sizes
• Cons
– No “direct” multi-class SVM, must combine two-class SVMs
– Computation, memory
• Learning can take a very long time for large-scale problems
– Tuning the kernel’s parameters can be tricky
SVM: The maximum margin linear classifier
The maximum margin linear classifier is the linear classifier that separates the training data points with the largest possible margin.
Hard Margin:
• Requires all the training data points to be classified correctly
- No training error
Soft Margin:
• Slack variables ξi can be added to allow the misclassification of difficult or noisy examples (reducing overfitting).
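A small sketch of this trade-off, assuming scikit-learn (whose SVC is backed by LibSVM): the penalty parameter C controls how strongly slack is penalized, so a large C approximates the hard margin while a small C tolerates more misclassified points:

```python
# Sketch: the soft-margin trade-off via the regularization parameter C (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping classes, so a hard margin is not achievable without slack.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):                   # small C: wider margin, more slack; large C: ~hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```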
Non linear SVM
General Idea: The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.
A dataset that is not linearly separable in R𝑁 may be linearly separable in a higher-dimensional space R𝑀 (where M > N).
However, consider the computational and memory consequences of increasing the dimensionality from N to M!
SVM has no need to explicitly work in the higher-dimensional space at training or testing time.
SVM kernel trick:
By using a kernel function K(𝒙𝒊 , 𝒙𝒋) we can implicitly transform datasets to a higher-dimensional space using no extra memory, and with a minimal effect on computation time
- The only effect on computation is the extra time required to compute K(𝒙𝒊 , 𝒙𝒋)
- As a consequence, we can efficiently learn nonlinear decision boundaries for SVMs simply by replacing all dot products 𝒙𝒊 · 𝒙𝒋 in the optimization problem with K(𝒙𝒊 , 𝒙𝒋)
Unfortunately, choosing the ‘correct’ kernel is a nontrivial task, and may depend on the specific task at hand.
No matter which kernel is chosen, it is necessary to tune the kernel parameters to get good performance.
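A possible illustration of both points, assuming scikit-learn: a linear kernel struggles on data that is not linearly separable in R², while an RBF kernel, with its parameters (C, gamma) tuned by cross-validated grid search, handles it; only K(𝒙𝒊 , 𝒙𝒋) is ever computed, never the explicit high-dimensional mapping:

```python
# Sketch: nonlinear SVM via the kernel trick, with kernel-parameter tuning (assumes scikit-learn).
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D input space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
print("linear kernel accuracy:", round(linear.score(X, y), 2))       # poor fit

# RBF kernel K(xi, xj) = exp(-gamma * ||xi - xj||^2): the mapping to a higher-dimensional
# space is implicit; C and gamma are tuned by cross-validated grid search.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print("best RBF parameters:", grid.best_params_)
print("RBF kernel accuracy:", round(grid.best_estimator_.score(X, y), 2))
```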