Chapter 6: Classification - Alternative Techniques Flashcards
Types of Classifiers
5 Distinguishing Factors
- Binary vs Multiclass
- Deterministic vs Probabilistic
- Linear vs Nonlinear
- Global vs Local
- Generative vs Discriminative
Types of Classifiers
Binary vs Multiclass
Binary classifiers assign each data instance to one of two possible labels, typically denoted as +1 and -1.
If there are more than two possible labels available, then the technique is known as a multiclass classifier.
Types of Classifiers
Deterministic vs Probabilistic
A deterministic classifier produces a discrete-valued label for each data instance it classifies,
whereas a probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, where the probability scores for all classes sum up to 1.
Types of Classifiers
Linear vs Nonlinear
A linear classifier uses a linear separating hyperplane to discriminate instances from different classes whereas a nonlinear classifier enables the construction of more complex, nonlinear decision surfaces.
Types of Classifiers
Global vs Local
A global classifier fits a single model to the entire dataset.
Unless the model is highly nonlinear, this one-size-fits-all strategy may not be effective when the relationship between the attributes and the class labels varies over the input space.
In contrast, a local classifier partitions the input space into smaller regions and fits a distinct model to training instances in each region.
While local classifiers are more flexible in terms of fitting complex decision boundaries, they are also more susceptible to the model overfitting problem.
Types of Classifiers
Generative vs Discriminative
Given a data instance x, the primary objective of any classifier is to predict the class label, y, of the data instance.
However, apart from predicting the class label, we may also be interested in describing the underlying mechanism that generates the instances belonging to every class label.
Classifiers that learn a generative model of every class in the process of predicting class labels are known as generative classifiers.
In contrast, discriminative classifiers directly predict the class labels without explicitly describing the distribution of every class label.
Rule-based classifier
A rule-based classifier uses a collection of “if … then …” rules (a rule set) to classify data instances.
Each classification rule in the rule set can be expressed in the following way:
rᵢ: (Conditionᵢ) → yᵢ
Rule-based classifiers
Classification rule
*rᵢ*: (*Conditionᵢ*) → *yᵢ*
The left-hand side of the rule is called the rule antecedent or precondition.
It contains a conjunction of attribute test conditions:
*Conditionᵢ* = (A₁ op *v*₁) ∧ (A₂ op *v*₂) ∧ … ∧ (Aₖ op *v*ₖ)
where (Aⱼ, *v*ⱼ) is an attribute-value pair and op is a comparison operator chosen from the set {=, ≠, >, <, ≥, ≤}.
The right-hand side of the rule is called the rule consequent, which contains the predicted class yᵢ.
A rule r covers a data instance x if the precondition of r matches the attributes of x.
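As a rough illustration, a rule can be represented as a list of (attribute, operator, value) tests plus a predicted class, and the covers check simply evaluates every test in the antecedent. The rule and attribute names below are hypothetical, and this is only a minimal sketch.

```python
import operator

# Hypothetical rule: (Give Birth = no) AND (Can Fly = yes) -> Birds
OPS = {"=": operator.eq, "!=": operator.ne,
       ">": operator.gt, "<": operator.lt,
       ">=": operator.ge, "<=": operator.le}

rule = {"antecedent": [("Give Birth", "=", "no"), ("Can Fly", "=", "yes")],
        "consequent": "Birds"}

def covers(rule, instance):
    """A rule covers an instance if every attribute test in its antecedent is satisfied."""
    return all(OPS[op](instance[attr], value)
               for attr, op, value in rule["antecedent"])

instance = {"Give Birth": "no", "Can Fly": "yes", "Live in Water": "no"}
print(covers(rule, instance))  # True, so the rule would predict class "Birds"
```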
Rule-based classifiers
Measuring rule quality
The quality of a rule can be evaluated using measures such as coverage and accuracy.
Rule-based classifiers
Rule coverage
(to measure rule quality)
Given a data set D and a classification rule r: A → y,
the coverage of the rule is the fraction of instances in D that trigger the rule r.
Coverage(r) = |A| / |D|
where |A| is the number of instances that satisfy the rule antecedent and |D| is the total number of instances.
Rule-based classifiers
Rule Accuracy
(to measure rule quality)
The accuracy or confidence factor of a rule r is the fraction of instances triggered by r whose class labels are equal to y.
Accuracy(r) = |A ∩ y| / |A|
where |A| is the number of instances that satisfy the rule antecedent and |A ∩ y| is the number of instances that satisfy both the antecedent and the consequent.
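A minimal sketch of how coverage and accuracy might be computed, assuming a rule is given as a predicate over instances; the toy data set and rule below are made up for illustration.

```python
def coverage_and_accuracy(rule_covers, predicted_class, dataset):
    """Coverage(r) = |A| / |D|; Accuracy(r) = |A ∩ y| / |A|.

    rule_covers: function returning True if the rule antecedent matches an instance
    dataset: list of (instance, class_label) pairs
    """
    covered = [(x, y) for x, y in dataset if rule_covers(x)]
    if not covered:
        return 0.0, 0.0
    coverage = len(covered) / len(dataset)
    accuracy = sum(1 for _, y in covered if y == predicted_class) / len(covered)
    return coverage, accuracy

# Toy data set: one attribute "Home Owner" and a class label
data = [({"Home Owner": "yes"}, "No"), ({"Home Owner": "no"}, "Yes"),
        ({"Home Owner": "no"}, "No"), ({"Home Owner": "yes"}, "No")]
rule_covers = lambda x: x["Home Owner"] == "no"   # antecedent: (Home Owner = no)
print(coverage_and_accuracy(rule_covers, "Yes", data))  # (0.5, 0.5)
```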
Definition 6.1
Mutually Exclusive Rule Set
The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same instance.
This ensures that every instance is covered by at most one rule in R.
Definition 6.2
Exhaustive Rule Set
A rule set R has exhaustive coverage if there is a rule for each combination of attribute values.
This ensures that every instance is covered by at least one rule in R.
Definition 6.3
Ordered Rule Set
The rules in an ordered rule set R are ranked in decreasing order of their priority.
An ordered rule set is also known as a decision list.
RIPPER
Sequential Covering Algorithm
The algorithm starts with an empty decision list, R, and extracts rules for each class based on the ordering specified by the class prevalence.
It iteratively extracts the rules for a given class y using the Learn-One-Rule function.
Once such a rule is found, all the training instances covered by the rule are eliminated.
The new rule is added to the bottom of the decision list R.
This procedure is repeated until the stopping criterion is met.
The algorithm then proceeds to generate rules for the next class.
Algorithm 6.1
Sequential Covering Algorithm
- Let E be the training instances and A be the set of attribute-value pairs, {(Aⱼ, vⱼ)}.
- Let Y₀ be an ordered set of classes {y₁, y₂, …, yₖ}.
- Let R = {} be the initial rule list.
- for each class y ∈ Y₀ - {yₖ} do:
- . while stopping condition is not met do
- . . r ← Learn-One-Rule (E, A, y)
- . . Remove training instances from E that are covered by r.
- . . Add r to the bottom of the rule list: R ← R ∨ r
- . end while
- end for
- Insert the default rule, {} → yₖ, to the bottom of the rule list R.
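The sketch below mirrors the control flow of Algorithm 6.1 in Python; the Learn-One-Rule step and the stopping condition are passed in as stubs, so this is an outline of the procedure rather than a full RIPPER implementation.

```python
def sequential_covering(E, A, classes, learn_one_rule, stop):
    """Sketch of the sequential covering algorithm (Algorithm 6.1).

    E: list of (instance, label) training pairs
    A: candidate attribute-value pairs for rule antecedents
    classes: ordered list of class labels [y1, ..., yk]
    learn_one_rule(E, A, y): returns a rule as (covers_fn, y) for class y
    stop(E, y): stopping condition, e.g. no positive examples of y remain
    """
    R = []                                   # ordered rule list (decision list)
    for y in classes[:-1]:                   # every class except the default class yk
        while not stop(E, y):
            rule = learn_one_rule(E, A, y)
            covers_fn, _ = rule
            # remove training instances covered by the new rule
            E = [(x, lbl) for x, lbl in E if not covers_fn(x)]
            R.append(rule)                   # add rule to the bottom of the list
    R.append((lambda x: True, classes[-1]))  # default rule {} -> yk
    return R
```

To classify a test instance, we would walk down the returned decision list and return the consequent of the first rule whose antecedent covers the instance.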
RIPPER
Rule Pruning
The rules generated by the Learn-One-Rule function can be pruned to improve their generalization errors.
RIPPER prunes the rules based on their performance on the validation set.
The following metric is computed to determine whether pruning is needed:
(p − n) / (p + n)
This metric is monotonically related to the rule’s accuracy on the validation set. If the metric improves after pruning, then the conjunct is removed.
where p is the number of positive examples and n is the number of negative examples covered by the rule in the validation set.
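A hedged sketch of the pruning check: compute (p − n)/(p + n) on the validation set with and without the last conjunct and keep the shorter rule if the metric does not decrease. The helper names and the rule representation (a list of test predicates) are assumptions made for illustration.

```python
def rule_value_metric(rule_covers, target_class, validation_set):
    """(p - n) / (p + n), where p and n count the positive and negative
    validation examples covered by the rule."""
    p = sum(1 for x, y in validation_set if rule_covers(x) and y == target_class)
    n = sum(1 for x, y in validation_set if rule_covers(x) and y != target_class)
    return (p - n) / (p + n) if (p + n) > 0 else float("-inf")

def prune_last_conjunct(antecedent, target_class, validation_set):
    """Drop the final conjunct if doing so does not decrease the metric.

    antecedent: list of predicates, each taking an instance and returning a bool
    """
    full = lambda x: all(test(x) for test in antecedent)
    pruned = lambda x: all(test(x) for test in antecedent[:-1])
    if (rule_value_metric(pruned, target_class, validation_set)
            >= rule_value_metric(full, target_class, validation_set)):
        return antecedent[:-1]
    return antecedent
```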
Eager vs lazy learners
Decision tree and rule-based classifiers are examples of eager learners because they are designed to learn a model that maps the input attributes to the class labels as soon as the training data becomes available.
An opposite strategy would be to delay the process of modelling the training data until it is needed to classify the test instances.
Techniques that employ this strategy are known as lazy learners.
Nearest-neighbour classifier
A nearest neighbour classifier represents each example as a datapoint in a d-dimensional space, where d is the number of attributes.
Given a test instance, we compute its proximity to the training instances.
The k-nearest neighbours of a given test instance z refer to the k training examples that are closest to z.
The instance is classified based on the class labels of its neighbours.
If the neighbours do not all have the same class label, the test instance is assigned to the majority class of its nearest neighbours.
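A minimal k-nearest-neighbour sketch assuming numeric attributes, Euclidean distance, and simple majority voting; the training points below are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, z, k=3):
    """Classify test instance z by majority vote among its k nearest training examples.

    train: list of (point, label) pairs, each point a tuple of d numeric attributes
    z: test point (tuple of d numeric attributes)
    """
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbours = sorted(train, key=lambda pair: dist(pair[0], z))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((5.0, 5.0), "-"), ((5.5, 4.5), "-")]
print(knn_predict(train, (1.1, 1.0), k=3))  # "+": two of its three nearest neighbours are positive
```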
8 Characteristics of Nearest Neighbour Classifiers
- They are an example of instance-based learning.
- They are lazy learners.
- Predictions are based on local information.
- Arbitrarily shaped decision boundaries.
- They have difficulty handling missing values.
- They can handle the presence of interacting attributes.
- Irrelevant attributes can distort commonly-used proximity measures.
- They require an appropriate proximity measure and data preprocessing steps.
8 Characteristics of Nearest Neighbour Classifiers
Instance-based learning
Instance-based learning does not build a global model, but rather uses the training examples to make predictions for a test instance.
Such algorithms require a proximity measure to determine the similarity or distance between instances and a classification function that returns the predicted class of a test instance based on its proximity to other instances.
8 Characteristics of Nearest Neighbour Classifiers
Lazy learners
Although they do not require model building, classifying a test instance can be quite expensive, because we need to compute the proximity values individually between the test and training examples.
In contrast, eager learners often spend the bulk of their computing resources for model building. Once a model has been built, classifying a test instance is extremely fast.
8 Characteristics of Nearest Neighbour Classifiers
Local predictions
Nearest neighbour classifiers make their predictions based on local information.
By contrast, decision tree and rule-based classifiers attempt to find a global model that fits the entire input space.
Because the classification decisions are made locally, nearest neighbour classifiers (with small values of k) are quite susceptible to noise.
8 Characteristics of Nearest Neighbour Classifiers
Arbitrarily-shaped decision boundaries
Nearest neighbour classifiers can produce decision boundaries of arbitrary shape.
Such boundaries provide a more flexible model representation compared to decision tree and rule-based classifiers that are often constrained to rectilinear decision boundaries.
The decision boundaries of nearest neighbour classifiers also have high variability because they depend on the composition of training examples in the local neighbourhood.
Increasing the number of nearest neighbours may reduce such variability.