Classification - Part 1 Flashcards

1
Q

What is the Goal of Classification

A

Previously unseen records should be assigned a class from a given set of classes as accurately as possible

2
Q

Explain the Approach of Classification

A
  • You have a training set consisting of
    • attributes
    • a class attribute (label) with positive and negative examples
  • The class attribute should be predicted based on the other attributes
  • Learn a model as a function of the values of the other attributes to predict the class attribute
3
Q

Which variants of Classification exist

A
  • Binary classification (fraud / no fraud)
  • Multi-class classification (low, medium, high)
  • Multi-label classification (more than one class per record e.g. user interests)
4
Q

What are the steps in the Model Learning and Model Application Process?

A

Training Set -> Induction (Learning Algorithm) -> Learn Model -> Model

Model + Unseen Records -> Apply Model (Deduction) -> Classified Records

5
Q

Give some classification examples

A
  • Credit Risk Assessment
    • Attributes: age, income, debts
    • Class: Credit (yes/no)
  • Marketing:
    • Attributes: previously bought products
    • Class: Target Customer (yes/no)
  • SPAM detection:
    • Attributes: words and header fields of e-mail
    • Class: Spam (yes/no)
6
Q

List seven classification techniques

A
  • K-nearest-neighbours
  • Decision Trees
  • Rule Learning
  • Naive Bayes
  • Support Vector Machines
  • Artificial Neural Networks
  • Deep Neural Networks
7
Q

Explain the approach of K-nearest-neighbor

A

It requires:

  • A set of stored records
  • Distance measure
  • k value (number of nearest neighbors to consider)

Approach:
For each unknown record:

  1. Compute the distance to each training record
  2. Identify the k nearest neighbors
  3. Use the class labels of the nearest neighbors to determine the class of the unknown record
    - majority vote, or
    - vote weighted according to distance (see the sketch below)

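A minimal sketch of this approach in Python; the arrays X_train and y_train and the choice of Euclidean distance are assumptions for illustration, not part of the card:

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Classify one unknown record x by a majority vote of its k nearest neighbors."""
    # 1. Compute the (Euclidean) distance to each training record
    distances = np.linalg.norm(X_train - x, axis=1)
    # 2. Identify the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote over the class labels of the nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```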
8
Q

What is a k-nearest neighbor ?

A

For an unknown record x, the k-nearest neighbors are the k data points with the smallest distance to x

9
Q

How to choose a good value for K?

A

Rule of thumb: test k values between 1 and 20 (see the sketch below)

If k is too small: the result is sensitive to noise points
If k is too large: the neighborhood may include points from other classes

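A hedged sketch of such a test using scikit-learn cross-validation; the data set X, y and the 5-fold setting are assumptions:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X, y: feature matrix and class labels (assumed to be available)
# Try k = 1..20 and keep the value with the best cross-validated accuracy
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 21)
}
best_k = max(scores, key=scores.get)
```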
10
Q

What are the pros and cons of k-nearest-neighbor classification

A

+ Often very accurate (e.g. for optical character recognition)
- But slow (unseen records are compared to all training examples)

Neutral aspects:

  • Results depend on choosing a good proximity measure
  • KNN can handle decision boundaries which are not parallel to the axes
11
Q

Describe Lazy Learning

A
  • Instance-based learning approach
  • No explicit model is learned
  • -> Less time to learn, but a long time to classify

Single Goal: Classify unseen records as accurately as possible

Example: KNN

12
Q

Describe Eager Learning

A
  • Eager learning generates models that might be interpretable by humans
  • -> Longer time for learning, but less time to classify

Compared to lazy learning, eager learning has two goals:

  • classify unseen records
  • help humans understand the application domain

Examples: decision tree learning, rule learning

13
Q

What are the components of a decision tree classifier

A
  • Root of tree
  • Attribute tests (splitting attributes)
  • Leaf nodes (decision - no tests / splits)
14
Q

Why are the decision boundaries of a decision tree parallel to the axes?

A

Because each test condition involves a single attribute at a time

-> It is a step-by-step division of the space into areas whose borders are parallel to the axes

15
Q

How to learn a decision tree from training data?

A
  1. Training data with attributes and class attribute
  2. Apply Learning Algorithm
  3. Decision Tree Model
  • Finding an optimal decision tree is NP-hard
  • Algorithms use greedy, top-down, recursive partitioning to induce a reasonable solution

Example Algorithms

  • Hunt's Algorithm
  • ID3
  • CHAID
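None of these exact algorithms ships with scikit-learn, but the same greedy, top-down, recursive partitioning can be sketched with its CART-style learner; X_train and y_train are assumed to be the training attributes and class labels:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Greedy, top-down, recursive partitioning (CART-style implementation)
tree = DecisionTreeClassifier(criterion="gini")
tree.fit(X_train, y_train)   # 1.-2. learn the model from the training data
print(export_text(tree))     # 3. inspect the resulting decision tree model
```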
16
Q

Explain Hunt’s algorithm (decision tree)

A
  • You have a training set and need to generate either leaf nodes or attribute tests
    Let Dt be the set of training records that reach a node t
  1. If Dt only contains records of the same class, then t is a leaf node.
  2. If Dt contains records of more than one class, use an attribute test to split the data into subsets with higher purity.

Recursively apply the procedure to each subset

17
Q

What are the design issues for learning decision trees?

A
  1. How should training records be split?
    How to specify attribute test condition
    - Depends on number of splits: Binary, Multi-way split
    - Depends on attribute data type: nominal, ordinal, continuous
    How to determine the best split
    - Can use different purity measures
  2. When should the splitting procedure stop?
    - Shallow trees might generalize better to unseen records
    - Fully grown trees might overfit training data
18
Q

How can you split nominal attributes in decision trees

A
  • Multi-way split: use as many partitions as distinct values
  • Binary split: divides the values into two subsets

19
Q

How can you split ordinal attributes in decision trees

A
  • Multi-way split: use as many partitions as distinct values
  • Binary split: divides the values into two subsets while keeping the order

20
Q

How can you split continuous attributes in decision trees

A
  • Discretization: form an ordinal categorical attribute
    • equal-interval binning
    • equal-frequency binning
    • binning with user-provided boundaries
  • Binary decision: A > v or A <= v
    • often sufficient in practice
    • find the best splitting border v based on a purity measure
    • computationally intensive
21
Q

Explain equal-interval binning (decision tree splitting based on continuous attributes)

For values of the attribute age of a person:
0, 4, 12, 16, 16, 18, 24, 26, 28

A

Specify a bin width, e.g. 10:
Bin 1: [-,10)
Bin 2: [10,20)
Bin 3: [20,+)

22
Q

Explain equal-frequency binning (decision tree splitting based on continuous attributes)

For values of the attribute age of a person:
0, 4, 12, 16, 16, 18, 24, 26, 28

A

Specify a bin frequency (number of values per bin), e.g. 3:

Bin 1: [-,14)
Bin 2: [14,21)
Bin 3: [21,+)

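A small sketch of both binning variants using pandas; the bin boundaries passed to pd.cut mirror the card, while pd.qcut picks its own quantile-based borders:

```python
import pandas as pd

ages = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-interval binning: fixed bin width of 10
equal_width = pd.cut(ages, bins=[0, 10, 20, 30], right=False)

# Equal-frequency binning: three bins with roughly the same number of values
equal_freq = pd.qcut(ages, q=3)
```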
23
Q

How to find the best attribute split for decision trees?

A

Greedy approach: test all splits and choose the one with the most homogeneous (pure) nodes (i.e. the one with the lowest impurity after splitting)

Requires: a measure of node impurity

Common measures:

  • GINI Index
  • Entropy
24
Q

Which bucket has the higher degree of impurity?

Bucket 1: C0: 5, C1: 5
Bucket 2: C0: 9, C1: 1

A

The left one (Bucket 1): its records are evenly split between the two classes, which is maximally impure

25
Q

How is the purity gain defined?

A

Gain = P - M

P = impurity measure before splitting
M = impurity measure after splitting (weighted over the resulting partitions)
26
Q

How is the GINI Index calculated?

A

GINI(t) = 1 - Σ_j [p(j|t)]², i.e. 1 minus the sum of the squared relative frequencies p(j|t) of each class j at node t

27
Q

How to interpret the GINI Index

A

Minimum: 0.0 (all records belong to one class)
Maximum: 1 - (1/nc) (records are equally distributed among all classes)
nc = number of classes

For two classes: 1 - 1/2 = 1/2

Gini increases if the buckets are less pure

28
Q

What is the difference between the GINI index and the GINI split

A

GINI Index: measure for the impurity of a single node

GINI Split: measures the quality of the overall split; the GINI Index of each partition is weighted according to its size:
GINI_split = Σ_i (n_i / n) · GINI(i), where n_i is the number of records in partition i and n the total number of records

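A hedged Python sketch of both computations; the class counts in the example calls are made up for illustration:

```python
import numpy as np

def gini(counts):
    """GINI index of one node: 1 - sum of squared relative class frequencies."""
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def gini_split(partitions):
    """GINI split: GINI index of each partition, weighted by its relative size."""
    total = sum(sum(part) for part in partitions)
    return sum(sum(part) / total * gini(part) for part in partitions)

gini([5, 5])                   # maximally impure two-class node (0.5)
gini_split([[5, 1], [0, 4]])   # weighted impurity of the whole split
```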
29
Q

How to compute the GINI Index for categorical attributes

A

For each distinct attribute value, count the records of each class (count matrix); from these counts the GINI index of each partition and the weighted GINI split can be computed

30
Q

How to find the best binary split for continuous attributes by using the GINI index

A
  1. Sort the records on the attribute values
  2. Linearly scan the values, updating the count matrix and computing the GINI index at each candidate split position
  3. Choose the split position with the smallest GINI index
31
Q

What is Entropy?

A

An alternative impurity measure.

Entropy measures the homogeneity of a node:
Entropy(t) = - Σ_j p(j|t) · log2 p(j|t)

Minimum: 0.0 when all records belong to one class
Maximum: log2(nc) when records are equally distributed among all classes (nc = number of classes)
32
Q

How to split nodes based on information gain (with entropy)?

A
  • Information gain measures the entropy reduction achieved by a split
  • Choose the split with the largest reduction (maximal GAIN -> the partitioning with the smallest weighted entropy)

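A corresponding sketch for entropy and information gain, using the same class-count representation as the GINI sketch above; the counts are illustrative:

```python
import numpy as np

def entropy(counts):
    """Entropy of one node: -sum over the classes of p * log2(p)."""
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]                     # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

def information_gain(parent, partitions):
    """Entropy reduction achieved by splitting the parent node into partitions."""
    total = sum(sum(part) for part in partitions)
    weighted = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - weighted

information_gain([10, 10], [[9, 1], [1, 9]])   # large gain -> good split
```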
33
Q

What is the disadvantage of splitting based on information gain (entropy)

A

It tends to prefer splits that result in a large number of partitions; each partition is small but pure

e.g. splitting on an ID attribute

34
Q

What is GainRATIO

A
  • Designed to overcome the tendency of information gain to generate a large number of small partitions
  • GainRatio adjusts the information gain by the entropy of the partitioning (SplitINFO):
    GainRatio = GAIN_split / SplitINFO, with SplitINFO = - Σ_i (n_i/n) · log2(n_i/n)

This penalizes a large number of small partitions (they have a higher entropy of the partitioning)

35
Q

What is the difference between GINI Index and Entropy

A

The GINI Index measures node impurity directly.

Entropy measures the homogeneity of a node and is used to compute the information gain of a split.

36
Q

What does overfitting mean

A

A model learns patterns that are specific to the training data set; thus the model performs poorly on unseen (test) data.

Goal: find a compromise between a specific and a general model

37
Q

What are symptoms and causes of an overfitted tree?

A

Symptoms:

  • Decision tree is too deep
  • Too many branches
  • Model works well on the training set but performs badly on the test set

Causes:

  1. Too little training data
  2. Noise / outliers in training data
  3. High model complexity

An overfitted model does not generalize well to unseen data.

38
Q

Describe the characteristics of underfitting and overfitting.

A

Underfitting: the model is too simple; both training and test errors are large
Overfitting: the model is too complex; the training error is small but the test error is large

39
Q

How can you prevent Overfitting?

A
  • Use more training data
  • Pre-Pruning
  • Post-Pruning
  • Ensembles
40
Q

How can you prevent overfitting with more training data?

A
  • If the training data is under-representative, the testing error increases with an increasing number of nodes
  • With more training data, the difference between training and testing errors at a given number of nodes can be reduced

This is an expensive prevention method

41
Q

How can you prevent overfitting with pre-pruning?

A

Stop the algorithm before the tree becomes fully grown, because shallower trees potentially generalize better (Occam's razor)

Early stopping conditions (= pre-pruning):

  • Number of instances in a leaf node below a threshold
  • Impurity improvement below a threshold
42
Q

How can you prevent overfitting with post-pruning?

A
  1. Grow the decision tree fully
  2. Trim the nodes of the decision tree bottom-up
  3. Estimate the generalization error before and after trimming (using test data)
  4. If the generalization error improves after trimming, replace the subtree (by a leaf node or its most frequently used branch)
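How pruning is configured depends on the library; a hedged sketch with scikit-learn's DecisionTreeClassifier, where the threshold values and the training data X_train, y_train are assumptions (ccp_alpha enables cost-complexity pruning, one specific post-pruning strategy):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: early stopping conditions while growing the tree
pre_pruned = DecisionTreeClassifier(
    min_samples_leaf=10,          # instances in a leaf node below threshold
    min_impurity_decrease=0.01,   # impurity improvement below a threshold
)

# Post-pruning: grow fully, then trim bottom-up (cost-complexity pruning)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02)

pre_pruned.fit(X_train, y_train)    # X_train, y_train: training data (assumed)
post_pruned.fit(X_train, y_train)
```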
43
Q

How to prevent overfitting with ensembles

A
  • Learn different models
  • Each model votes on the final classification decision

Idea: Wisdom of the crowds

  • A single classifier might focus too much on one aspect
  • Multiple classifiers can focus on different aspects
44
Q

What are random forests?

A

Ensemble consisting of a large number of different decision trees. They usually outperform single decision trees.

45
Q

How can you achieve independence of trees at a random forest

A

You need to use randomness in the learning process (see the sketch below):

  • Randomly use different attributes of the training set for each tree
  • Randomly sample the training records with replacement for each tree (bagging)
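Both sources of randomness are what typical random forest implementations provide; a sketch with scikit-learn, where X_train, y_train and X_test are assumed:

```python
from sklearn.ensemble import RandomForestClassifier

# bootstrap=True: each tree sees a random sample of the records (bagging)
# max_features="sqrt": each split considers a random subset of the attributes
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)
```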
46
Q

What are the advantages of decision trees?

A

Advantages:

  • Inexpensive to construct
  • Very fast at classifying unseen records
  • Easy to interpret for small trees
  • Can handle redundant or irrelevant attributes (feature selection is built into the learning process)
  • Good accuracy for low-dimensional data sets (less so for texts and images)
47
Q

What are the disadvantages of decision trees?

A

Disadvantages:

  • Space of possible decision trees is exponentially large (Greedy approaches are often unable to find the best tree)
  • Trees don’t take into account interactions between attributes