Chapter 6: Classification - Alternative Techniques Flashcards

1
Q

Types of Classifiers

5 Distinguishing Factors

A
  • Binary vs Multiclass
  • Deterministic vs Probabilistic
  • Linear vs Nonlinear
  • Global vs Local
  • Generative vs Discriminative
2
Q

Types of Classifiers

Binary vs Multiclass

A

Binary classifiers assign each data instance to one of two possible labels, typically denoted as +1 and -1.

If there are more than two possible labels available, then the technique is known as a multiclass classifier.

3
Q

Types of Classifiers

Deterministic vs Probabilistic

A

A deterministic classifier assigns a discrete-valued label to each data instance it classifies,

whereas a probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, where the probability scores for all classes sum up to 1.

4
Q

Types of Classifiers

Linear vs Nonlinear

A

A linear classifier uses a linear separating hyperplane to discriminate instances from different classes, whereas a nonlinear classifier enables the construction of more complex, nonlinear decision surfaces.
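
As a standard illustration (the notation is not from this card): a linear decision rule can be written as ŷ = sign(w · x + b), where w is a weight vector and b a bias term, so the decision boundary w · x + b = 0 is a hyperplane; a nonlinear classifier replaces w · x + b with a nonlinear function of x.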

5
Q

Types of Classifiers

Global vs Local

A

A global classifier fits a single model to the entire dataset.

Unless the model is highly nonlinear, this one-size-fits-all strategy may not be effective when the relationship between the attributes and the class labels varies over the input space.

In contrast, a local classifier partitions the input space into smaller regions and fits a distinct model to training instances in each region.

While local classifiers are more flexible in terms of fitting complex decision boundaries, they are also more susceptible to the model overfitting problem.

6
Q

Types of Classifiers

Generative vs Discriminative

A

Given a data instance x, the primary objective of any classifier is to predict the class label, y, of the data instance.

However, apart from predicting the class label, we may also be interested in describing the underlying mechanism that generates the instance belonging to every class label.

Classifiers that learn a generative model of every class in the process of predicting class labels are known as generative classifiers.

In contrast, discriminative classifiers directly predict the class labels without explicitly describing the distribution of every class label.
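
Expressed in probabilistic terms (a standard framing, not stated on this card): a generative classifier models the class-conditional distribution P(x | y) together with the class prior P(y) and obtains P(y | x) via Bayes' theorem, whereas a discriminative classifier models P(y | x), or the decision boundary itself, directly.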

7
Q

Rule-based classifier

A

A rule-based classifier uses a collection of “if … then …” rules (a rule set) to classify data instances.

Each classification rule in the rule set can be expressed in the following way:

rᵢ: (Conditionᵢ) → yᵢ

8
Q

Rule-based classifiers

Classification rule

A

*rᵢ*: (*Conditionᵢ*) → *yᵢ*

The left-hand side of the rule is called the rule antecedent or precondition.

It contains a conjunction of attribute test conditions:

*Conditionᵢ* = (A₁ op *v*₁) ∧ (A₂ op *v*₂) ∧ ⋯ ∧ (Aₖ op *v*ₖ)

where (Aⱼ, *v*ⱼ) is an attribute-value pair and op is a comparison operator chosen from the set {=, ≠, >, <, ≥, ≤}.

The right-hand side of the rule is called the rule consequent which contains the predicted class yᵢ.

A rule r covers a data instance x if the precondition of r matches the attributes of x.
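
For illustration, a minimal Python sketch of this coverage check (the dictionary-based instance representation and the example rule are assumptions, not from the text):

    import operator

    # Map comparison-operator symbols to Python functions.
    OPS = {'=': operator.eq, '!=': operator.ne, '>': operator.gt,
           '<': operator.lt, '>=': operator.ge, '<=': operator.le}

    def covers(antecedent, instance):
        # The rule covers the instance if every attribute test in the
        # conjunction is satisfied by the instance's attribute values.
        return all(OPS[op](instance[attr], value)
                   for (attr, op, value) in antecedent)

    # Hypothetical rule: (Home Owner = No) AND (Annual Income > 80) -> Defaulted = Yes
    rule = [('Home Owner', '=', 'No'), ('Annual Income', '>', 80)]
    print(covers(rule, {'Home Owner': 'No', 'Annual Income': 90}))  # True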

9
Q

Rule-based classifiers

Measuring rule quality

A

The quality of a rule can be evaluated using measures such as coverage and accuracy.

10
Q

Rule-based classifiers

Rule coverage

(to measure rule quality)

A

Given a data set D and a classification rule r: A →y,

the coverage of the rule is the fraction of instances in D that trigger the rule r.

Coverage(r) = |A| / |D|

where |A| is the number of instances that satisfy the rule antecedent and |D| is the total number of instances.
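
As a hypothetical worked example (the counts are assumed for illustration only): if the antecedent of r is satisfied by 4 of the 10 instances in D, then Coverage(r) = 4 / 10 = 0.4.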

11
Q

Rule-based classifiers

Rule Accuracy

(to measure rule quality)

A

The accuracy or confidence factor is the fraction of instances triggered by r whose class labels are equal to y.

Accuracy(r) = |A ∩ y| / |A|

where |A| is the number of instances that satisfy the rule antecedent and |A ∩ y| is the number of instances that satisfy both the antecedent and the consequent.
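
Continuing the hypothetical coverage example above: if 3 of the 4 instances that satisfy the antecedent also have class label y, then Accuracy(r) = 3 / 4 = 0.75.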

12
Q

Definition 6.1

Mutually Exclusive Rule Set

A

The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same instance.

This ensures that every instance is covered by at most one rule in R.

13
Q

Definition 6.2

Exhaustive Rule Set

A

A rule set R has exhaustive coverage if there is a rule for each combination of attribute values.

This ensures that every instance is covered by at least one rule in R.

14
Q

Definition 6.3

Ordered Rule Set

A

The rules in an ordered rule set R are ranked in decreasing order of their priority.

An ordered rule set is also known as a decision list.

15
Q

RIPPER

Sequential Covering Algorithm

A

The algorithm starts with an empty decision list, R, and extracts rules for each class based on the ordering specified by the class prevalence.

It iteratively extracts the rules for a given class y using the Learn-One-Rule function.

Once such a rule is found, all the training instances covered by the rule are eliminated.

The new rule is added to the bottom of the decision list R.

This procedure is repeated until the stopping criterion is met.

The algorithm then proceeds to generate rules for the next class.

16
Q

Algorithm 6.1

Sequential Covering Algorithm

A
  1. Let E be the training instances and A be the set of attribute-value pairs, {(Aⱼ, vⱼ)}.
  2. Let Y₀ be an ordered set of classes {y₁, y₂, …, yₖ}.
  3. Let R = {} be the initial rule set
  4. for each class y ∈ Y₀ - {yₖ} do:
  5. . while stopping condition is not met do
  6. . . r ← Learn-One-Rule (E, A, y)
  7. . . Remove training instances from E that are covered by r.
  8. . . Add r to the bottom of the rule list: R ← R ∨ r.
  9. . end while
  10. end for
  11. Insert the default rule, {} → yₖ, to the bottom of the rule list R.
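
A minimal Python sketch of this loop, for illustration only. The rule representation, the learn_one_rule and covers helpers, and the stopping condition (stop when no instances of class y remain) are assumptions, not part of Algorithm 6.1:

    def sequential_covering(E, A, classes, learn_one_rule, covers):
        # E: training instances (each a dict with a 'class' key, an assumed representation)
        # A: set of attribute-value pairs; classes: ordered list y1, ..., yk
        R = []
        for y in classes[:-1]:                      # every class except the default class yk
            # assumed stopping condition: no uncovered instances of class y remain
            while any(inst['class'] == y for inst in E):
                r = learn_one_rule(E, A, y)         # assumed helper returning a rule or None
                if r is None:
                    break
                E = [inst for inst in E if not covers(r, inst)]  # remove covered instances
                R.append((r, y))                    # add r to the bottom of the decision list
        R.append(([], classes[-1]))                 # default rule {} -> yk
        return R
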
17
Q

RIPPER

Rule Pruning

A

The rules generated by the Learn-One-Rule function can be pruned to reduce their generalization errors.

RIPPER prunes the rules based on their performance on the validation set.

The following metric is computed to determine whether pruning is needed:

(p-n)/(p+n)

This metric is monotonically related to the rule’s accuracy on the validation set. If the metric improves after pruning, then the conjunct is removed.

p: number of positive examples in the validation set covered by the rule; n: number of negative examples in the validation set covered by the rule.
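
A hypothetical illustration (numbers assumed): if a rule covers p = 20 positive and n = 10 negative validation examples, the metric is (20 − 10)/(20 + 10) ≈ 0.33; if removing a conjunct changes the counts to p = 22 and n = 8, the metric rises to (22 − 8)/(22 + 8) ≈ 0.47, so the conjunct would be removed.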

18
Q

Eager vs lazy learners

A

Decision tree and rule-based classifiers are examples of eager learners because they are designed to learn a model that maps the input attributes to the class labels as soon as the training data becomes available.

An opposite strategy would be to delay the process of modelling the training data until it is needed to classify the test instances.

Techniques that employ this strategy are known as lazy learners.

19
Q

Nearest-neighbour classifier

A

A nearest neighbour classifier represents each example as a datapoint in a d-dimensional space, where d is the number of attributes.

Given a test instance, we compute its proximity to the training instances.

The k-nearest neighbours of a given test instance z refer to the k training examples that are closest to z.

The instance is classified based on the class labels of its neighbours.

If the neighbours belong to more than one class, the test instance is assigned to the majority class of its nearest neighbours.
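
A minimal Python sketch of majority-vote k-nearest-neighbour classification (Euclidean distance and the tiny data set are assumptions for illustration):

    from collections import Counter
    import math

    def euclidean(a, b):
        # Euclidean distance between two attribute vectors of equal length.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_predict(train, z, k=3):
        # train: list of (attribute_vector, class_label); z: test attribute vector.
        # Returns the majority class among the k training examples closest to z.
        neighbours = sorted(train, key=lambda pair: euclidean(pair[0], z))[:k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

    # Hypothetical usage
    train = [((1.0, 1.0), '+1'), ((1.2, 0.8), '+1'), ((5.0, 5.0), '-1'), ((4.8, 5.1), '-1')]
    print(knn_predict(train, (1.1, 0.9), k=3))  # '+1'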

20
Q

8 Characteristics of Nearest Neighbour Classifiers

A
  1. They are an example of instance-based learning.
  2. They are lazy learners.
  3. Predictions are based on local information.
  4. Arbitrarily shaped decision boundaries.
  5. They have difficulty handling missing values.
  6. They can handle the presence of interacting attributes.
  7. Irrelevant attributes can distort commonly-used proximity measures.
  8. They require an appropriate proximity measure and data preprocessing steps.
21
Q

8 Characteristics of Nearest Neighbour Classifiers

Instance-based learning

A

Instance-based learning does not build a global model, but rather uses the training examples to make predictions for a test instance.

Such algorithms require a proximity measure to determine the similarity or distance between instances and a classification function that returns the predicted class of a test instance based on its proximity to other instances.

22
Q

8 Characteristics of Nearest Neighbour Classifiers

Lazy learners

A

Although they do not require model building, classifying a test instance can be quite expensive, because we need to compute the proximity values individually between the test and training examples.

In contrast, eager learners often spend the bulk of their computing resources for model building. Once a model has been built, classifying a test instance is extremely fast.

23
Q

8 Characteristics of Nearest Neighbour Classifiers

Local predictions

A

Nearest neighbour classifiers make their predictions based on local information.

By contrast, decision tree and rule-based classifiers attempt to find a global model that fits the entire input space.

Because the classification decisions are made locally, nearest neighbour classifiers (with small values of k) are quite susceptible to noise.

24
Q

8 Characteristics of Nearest Neighbour Classifiers

Arbitrarily-shaped decision boundaries

A

Nearest neighbour classifiers can produce decision boundaries of arbitrary shape.

Such boundaries provide a more flexible model representation compared to decision tree and rule-based classifiers that are often constrained to rectilinear decision boundaries.

The decision boundaries of nearest neighbour classifiers also have high variability because they depend on the composition of training examples in the local neighbourhood.

Increasing the number of nearest neighbours may reduce such variability.

25
Q

8 Characteristics of Nearest Neighbour Classifiers

Missing Values

A

Nearest neighbour classifiers have difficulty handling missing values in both the training and test sets since proximity computations normally require the presence of all attributes.

Although a reduced subset of attributes in two instances can be used to compute a proximity, such an approach may not produce good results since the proximity measures may be different for each pair of instances and thus hard to compare.

26
Q

8 Characteristics of Nearest Neighbour Classifiers

Irrelevant attributes

A

The presence of irrelevant attributes can distort commonly-used proximity measures, especially when the number of irrelevant attributes is large.

If there are a large number of redundant attributes that are highly correlated with each other, then the proximity measure can be overly biased toward such attributes, resulting in improper estimates of distance.

Hence, the presence of irrelevant and redundant attributes can adversely affect the performance of nearest neighbour classifiers.

27
Q

8 Characteristics of Nearest Neighbour Classifiers

Preprocessing steps

A

Nearest neighbour classifiers can produce wrong predictions unless the appropriate proximity measure and data preprocessing steps are taken.

If the scales of the attributes are not taken into consideration, the proximity measure may be dominated by differences in the attributes with the widest range.
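
A common remedy (a standard preprocessing step, not prescribed by this card) is to rescale each attribute before computing proximity, e.g. min-max normalisation x′ = (x − min) / (max − min), which maps every attribute into the range [0, 1].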

28
Q

Rote classification

A

A rote classifier is an example of lazy classification.

The classifier memorises the entire training data and performs no explicit model building.

Then, when a prediction is required, the rote classifier searches the memorised data for a training instance that exactly matches the test instance, and classifies the test instance only if such an exact match is found.
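
A minimal Python sketch of this behaviour (the tuple-keyed dictionary is an assumed representation):

    def rote_fit(train):
        # Memorise the training data: map each attribute tuple to its class label.
        return {tuple(x): y for x, y in train}

    def rote_predict(memory, z):
        # Classify z only if an exact match was memorised; otherwise return None.
        return memory.get(tuple(z))

    # Hypothetical usage
    memory = rote_fit([((1, 0), '+1'), ((0, 1), '-1')])
    print(rote_predict(memory, (1, 0)))   # '+1'
    print(rote_predict(memory, (1, 1)))   # None (no exact match)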