ML Flashcards
Decision Trees
Training – entropy-based methods
• Strengths
– Easy to understand
– Easy to implement
• Weaknesses
– Straightforward only for categorical data; continuous attributes require discretization
– Tendency to overfitting
TDIDT
Top Down Induction of Decision Trees
Also known as ID3 (Quinlan)
• To construct decision tree T from learning set S:
– If all examples in S belong to some class C Then make leaf labeled C
– Otherwise
• select the “most informative” attribute A
• partition S according to A’s values
• recursively construct subtrees T1, T2, … for the subsets of S (one for each value of A)
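A minimal Python sketch of this recursion, assuming examples are given as (attribute-dict, label) pairs and using information gain as the “most informative” criterion (see the purity measures below); the names tdidt, entropy and information_gain are illustrative, not from any particular library:

```python
# Minimal TDIDT / ID3 sketch (illustrative helper names, not Quinlan's original code).
from collections import Counter
from math import log2

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attr):
    total = len(examples)
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def tdidt(examples, attributes):
    labels = {y for _, y in examples}
    if len(labels) == 1:                              # all examples belong to one class C -> leaf C
        return labels.pop()
    if not attributes:                                # no attributes left -> majority-class leaf
        return Counter(y for _, y in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))   # most informative attribute A
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:      # partition S according to A's values
        subset = [(x, y) for x, y in examples if x[best] == value]
        tree[best][value] = tdidt(subset, [a for a in attributes if a != best])
    return tree

# Tiny usage example on hypothetical data:
data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "rainy", "windy": "yes"}, "stay"),
        ({"outlook": "sunny", "windy": "yes"}, "play")]
print(tdidt(data, ["outlook", "windy"]))   # e.g. {'outlook': {'sunny': 'play', 'rainy': 'stay'}}
```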
TDIDT Attribute Selection Criteria
• Main principle
– Select the attribute which partitions the learning set into subsets as “pure” as possible (i.e. homogeneous in terms of class label)
• Various measures of purity
– Information-theoretic
– Gini index
– χ²
– ReliefF
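As a sketch of the “purity” idea, two of the listed measures computed on a subset’s class labels (helper names are hypothetical); both are minimal on a homogeneous subset and maximal on an even class mixture:

```python
# Illustrative purity measures over a list of class labels.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A 50/50 mixture scores the maximum (1.0 for entropy, 0.5 for Gini with two classes);
# a pure subset scores 0 for both.
print(entropy(["+", "-", "+", "-"]), gini(["+", "-", "+", "-"]))
print(entropy(["+", "+", "+", "+"]), gini(["+", "+", "+", "+"]))
```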
Training error
classification error estimated on the training data. Measure of how well the model fits the training set
Test error
classification error estimated on the test data, which is independent of the data the model was trained on. Measure of how well the model fits new data
Generalization
how well does a learned model generalize from the data it was trained on to a completely new dataset?
Bias
how much does the average model over all training sets differ from the true model?
• Error due to inaccurate assumptions/simplifications made by the model
Variance
how much do models estimated from different training sets differ from each other?
To reduce variance:
• Choose a simpler classifier
• Regularize the parameters
• Get more training data (if you can…)
Underfitting
the learned model is too “simple” to represent all the relevant class characteristics
– High training error and high test error
– High bias and low variance
Overfitting
model is too “complex” and fits irrelevant characteristics (noise) in the data
– Low training error and high test error
– Low bias and high variance
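A hedged sketch of how these two regimes show up in practice, assuming scikit-learn is available: on synthetic data, a depth-1 decision tree underfits (high training and test error), while an unconstrained tree tends to overfit (near-zero training error, noticeably higher test error):

```python
# Sketch: comparing training vs. test error to spot under-/overfitting (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, None):                        # very shallow tree vs. unlimited depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = 1 - model.score(X_tr, y_tr)    # training error
    test_err = 1 - model.score(X_te, y_te)     # test error, estimated on held-out data
    print(f"max_depth={depth}: train error {train_err:.2f}, test error {test_err:.2f}")
```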
Naïve Bayes Classifier
X = (x1, x2, …, xn) := attribute vector
Suppose there are m classes C1, C2, …, Cm
Bayes’ rule: P(Ci|X) = P(X|Ci) P(Ci) / P(X); since P(X) is the same for all classes, predict the class Ci that maximizes P(X|Ci) P(Ci)
Assumption: attributes are conditionally independent given the class, i.e. P(X|Ci) = P(x1|Ci) · P(x2|Ci) · … · P(xn|Ci)
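A minimal sketch of this computation for categorical attributes, assuming examples are (attribute-tuple, label) pairs; the names train_nb and predict_nb are illustrative, and no smoothing is applied:

```python
# Minimal Naïve Bayes sketch for categorical attributes (illustrative, no smoothing).
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    priors = Counter(y for _, y in examples)             # class counts for P(Ci)
    cond = defaultdict(Counter)                          # (class, attribute index) -> value counts
    for x, y in examples:
        for i, v in enumerate(x):
            cond[(y, i)][v] += 1
    return priors, cond, len(examples)

def predict_nb(model, x):
    priors, cond, n = model
    best_class, best_score = None, -1.0
    for c, count_c in priors.items():
        score = count_c / n                              # P(Ci)
        for i, v in enumerate(x):
            score *= cond[(c, i)][v] / count_c           # P(xk | Ci), conditional independence
        if score > best_score:                           # argmax of P(X|Ci) P(Ci); P(X) is a common factor
            best_class, best_score = c, score
    return best_class

# Tiny usage example on hypothetical data:
data = [(("sunny", "hot"), "no"), (("rain", "cool"), "yes"), (("sunny", "cool"), "yes")]
print(predict_nb(train_nb(data), ("sunny", "cool")))     # -> 'yes'
```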
• Advantages
– Easy to implement, especially for categorical data
– Very fast to train/test
• Disadvantages
– The assumption of class-conditional independence causes a loss of accuracy
– Practically, dependencies exist among variables, which cannot be modeled by the classifier
• E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
SVM
Support Vector Machine
• non-probabilistic binary linear classifier
• constructs a hyperplane or set of hyperplanes in a high-dimensional space
• can be used for either classification or regression
• Pros
– Many publicly available SVM packages (e.g. LibSVM)
– Kernel-based framework is very powerful, flexible
– SVMs work very well in practice, even with very small training sample sizes
• Cons
– No “direct” multi-class SVM, must combine two-class SVMs
– Computation, memory
• Learning can take a very long time for large-scale problems
– Tuning the kernel’s parameters can be tricky
SVM: The maximum margin linear classifier
The maximum margin linear classifier is the linear classifier that separates the training data points with the largest possible margin.
Hard Margin:
• Requires all the training data points to be classified correctly
- No training error
Soft Margin:
• Slack variables ξi can be added to allow the misclassification of difficult or noisy examples (reducing overfitting).
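A small sketch of this trade-off, assuming scikit-learn (whose SVC is backed by LibSVM): the penalty parameter C controls how strongly slack is penalized, so a large C approximates the hard margin while a small C tolerates more misclassified points:

```python
# Sketch: the soft-margin trade-off via the regularization parameter C (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping classes, so a hard margin is not achievable without slack.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):                   # small C: wider margin, more slack; large C: ~hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```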
Non linear SVM
General Idea: The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.
A dataset that is not linearly separable in R𝑁 may be linearly separable in a higher-dimensional space R𝑀 (where M > N).
However, consider the computational and memory consequences of increasing the dimensionality from N to M!
SVM has no need to explicitly work in the higher-dimensional space at training or testing time.
SVM kernel trick:
By using a kernel function K(𝒙𝒊 , 𝒙𝒋) we can implicitly transform datasets to a higher-dimensional space using no extra memory, and with a minimal effect on computation time
- The only effect on computation is the extra time required to compute K(𝒙𝒊 , 𝒙𝒋)
- As a consequence, we can efficiently learn nonlinear decision boundaries for SVMs simply by replacing all dot products 𝒙𝒊 · 𝒙𝒋 in the optimization problem with K(𝒙𝒊 , 𝒙𝒋)
Unfortunately, choosing the ‘correct’ kernel is a nontrivial task, and may depend on the specific task at hand.
No matter which kernel is chosen, it is necessary to tune the kernel parameters to get good performance.
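A possible illustration of both points, assuming scikit-learn: a linear kernel struggles on data that is not linearly separable in R², while an RBF kernel, with its parameters (C, gamma) tuned by cross-validated grid search, handles it; only K(𝒙𝒊 , 𝒙𝒋) is ever computed, never the explicit high-dimensional mapping:

```python
# Sketch: nonlinear SVM via the kernel trick, with kernel-parameter tuning (assumes scikit-learn).
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D input space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
print("linear kernel accuracy:", round(linear.score(X, y), 2))       # poor fit

# RBF kernel K(xi, xj) = exp(-gamma * ||xi - xj||^2): the mapping to a higher-dimensional
# space is implicit; C and gamma are tuned by cross-validated grid search.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print("best RBF parameters:", grid.best_params_)
print("RBF kernel accuracy:", round(grid.best_estimator_.score(X, y), 2))
```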