Classification - Part 3 Flashcards

1
Q

What is Naive Bayes classification? Name the method and the goal

A
  • Probabilistic classification technique that considers each attribute and class label as random variables

Goal: Find the class C that maximizes the conditional probability

P(C|A) -> Probability of class C given Attribute A

2
Q

When is the application of Bayes' Theorem useful?

A

Bayes' Theorem: P(C|A) = (P(A|C) * P(C)) / P(A)

Useful situations:

  • P(C|A) is unknown
  • P(A|C), P(A) and P(C) are known or easy to estimate
3
Q

What's the difference between prior and posterior probability (Bayes' Theorem)?

A
  • Prior probability describes the probability of an event before evidence is seen
  • Posterior probability describes the probability of an event after evidence is seen
4
Q

How do you apply Bayes Theorem to the classification task?

A
  1. Compute the probability P(C|A) for all values of C using Bayes' Theorem
  2. Normalize the likelihoods of the classes
  3. Choose the value of C that maximizes P(C|A)
  • P(A) is the same for all classes (so you can neglect it when comparing the class probabilities)
  • Only need to estimate P(C) and P(A|C)
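The procedure above can be sketched in Python; the probability tables and attribute values below are hypothetical illustration numbers, not from the source:

```python
# Naive Bayes classification sketch: choose the class C that maximizes
# P(C|A) ~ P(C) * P(A|C); P(A) is identical for all classes and is dropped.
# All probability values below are made-up illustration numbers.

priors = {"yes": 0.6, "no": 0.4}                   # P(C)
conditionals = {                                   # P(Ai|C)
    "yes": {"sunny": 0.2, "windy": 0.5},
    "no":  {"sunny": 0.6, "windy": 0.3},
}

def classify(attribute_values):
    scores = {}
    for c in priors:
        score = priors[c]                          # start with the prior P(C)
        for a in attribute_values:
            score *= conditionals[c][a]            # multiply in each P(Ai|C)
        scores[c] = score
    return max(scores, key=scores.get)             # argmax over classes

print(classify(["sunny", "windy"]))  # "no": 0.4*0.6*0.3 > 0.6*0.2*0.5
```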
5
Q

How to estimate the prior probability P(C)?

A
  • Count the records in training set that are labeled with class C
  • Divide the count by overall number of records in training data
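Estimating the prior from class counts can be sketched as follows (the toy labels are made up):

```python
from collections import Counter

# Estimate P(C): count records per class label, divide by the total count.
labels = ["yes", "yes", "no", "yes", "no"]   # hypothetical training labels
counts = Counter(labels)
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors)  # {'yes': 0.6, 'no': 0.4}
```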
6
Q

Explain the independence assumption and its implications for estimating P(A|C) in Naive Bayes

A
  • Naive Bayes assumes that all attributes are statistically independent given the class
    • This assumption is almost never correct
  • → This assumption allows the joint probability P(A|C) to be reformulated as the product of the individual probabilities P(Ai|Cj)
  • → The probabilities P(Ai|Cj) can then be estimated directly from the training data
7
Q

How to estimate the probabilities P(Ai|Cj)?

A
  1. Count how often an attribute value appears together with class Cj
  2. Divide the count by the overall number of records belonging to class Cj
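This counting scheme can be sketched as follows (toy records with a hypothetical attribute):

```python
# Estimate P(Ai = v | Cj) by counting co-occurrences in the training data.
# The records and the "outlook" attribute are made-up illustration data.
records = [({"outlook": "sunny"}, "no"),
           ({"outlook": "sunny"}, "yes"),
           ({"outlook": "rainy"}, "yes"),
           ({"outlook": "rainy"}, "yes")]

def conditional(attr, value, cls):
    in_class = [r for r, c in records if c == cls]       # records of class Cj
    hits = [r for r in in_class if r[attr] == value]     # ... with Ai = v
    return len(hits) / len(in_class)

print(conditional("outlook", "sunny", "yes"))  # 1 of 3 records -> 0.333...
```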
8
Q

What are the names of the parts of the Bayes Theorem?

A
  • P(A|C) Class conditional probability of evidence
  • P(C) Prior probability of class
  • P(A) Prior probability of evidence
  • P(C|A) Posterior probability of class C
9
Q

How to normalize the likelihoods of the classes P(C|A)?

A
  • Divide each class's likelihood by the sum of the likelihoods of all classes
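A minimal sketch of this normalization (the unscaled likelihood values are made up):

```python
# Normalize unscaled class likelihoods P(C)*P(A|C) so they sum to 1.
likelihoods = {"yes": 0.06, "no": 0.072}   # hypothetical unscaled values
total = sum(likelihoods.values())
posterior = {c: p / total for c, p in likelihoods.items()}
print(posterior)  # proper probabilities summing to 1
```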
10
Q

How should you handle numerical attributes when applying Naive Bayes?

A

Option 1) Discretize the numerical attributes (map the numerical values to categories)

Option 2) Assume that the numerical attributes follow a normal distribution given the class

  • estimate the distribution parameters from the training data
  • with the fitted distribution you can estimate the conditional probability P(A|C)
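Option 2 can be sketched as follows, assuming a normal distribution and made-up attribute values for one class:

```python
import math

# Fit a normal distribution to the attribute values of one class
# (hypothetical numbers) and use its density as the estimate for P(A|C).
values_for_class = [66, 70, 68, 72]
mu = sum(values_for_class) / len(values_for_class)
var = sum((v - mu) ** 2 for v in values_for_class) / (len(values_for_class) - 1)

def gaussian_density(x):
    # Normal density with the estimated sample mean and variance
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(mu, gaussian_density(69))  # density is highest at the mean
```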
11
Q

Which distribution parameters can be estimated from the training data?

A
  • Sample mean
  • Standard deviation

12
Q

How to handle missing values in the training data?

A
  • Don't include the record in the frequency counts for the attribute value-class combinations (just pretend that this record does not exist)
13
Q

How to handle missing values in the test data?

A

The attribute is omitted from the calculation

14
Q

Explain the zero-frequency problem

A
  • If an attribute value never occurs together with a class value, its conditional probability is zero, so the whole posterior probability becomes zero as well!
Solution: Laplace Estimator
Add 1 to the count of every attribute value-class combination

Laplace: P(Ai|C) = (Nic + 1) / (Nc + |Vi|)

|Vi| = number of distinct values of attribute i in the training set
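The Laplace estimator as a small sketch (the counts are hypothetical):

```python
# Laplace estimator: P(Ai|C) = (N_ic + 1) / (N_c + |V_i|)
def laplace(n_ic, n_c, num_values):
    return (n_ic + 1) / (n_c + num_values)

# An attribute value never seen with the class still gets a non-zero estimate:
print(laplace(0, 10, 3))   # 1/13, not 0
print(laplace(4, 10, 3))   # 5/13 instead of the raw 4/10
```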

15
Q

What are the characteristics of Naive Bayes?

A
  • Works well because classification only requires that the maximum probability is assigned to the correct class (even if the violated independence assumption leads to inaccurate probability estimates)
  • Robust to isolated noise points (averaged out)
  • Robust to irrelevant attributes (P(Ai|C) distributed uniformly for Ai)
  • Redundant attributes can cause problems -> use subset of attributes
16
Q

What is the technical advantage of Naive Bayes?

A
  • Learning is computationally cheap because the probabilities can be calculated by one pass over the training data
  • Storing the probabilities does not require a lot of memory
17
Q

For which problems can you use Support Vector Machines?

A
  • Two-class problems
  • Examples described by continuous attributes

18
Q

When do SVMs achieve good results?

A
  • For high dimensional data
19
Q

How do SVMs work?

A
  • They find a linear hyperplane (decision boundary) that separates the data
20
Q

How does an SVM find the best hyperplane?

A
  • To avoid overfitting and to generalize better to unseen data, the hyperplane that maximizes the margin to the closest points (the support vectors) is chosen
21
Q

How to deal with noise points in SVMs?

A
  • Use slack variables in the margin computation
  • Slack variables indicate whether a record is used or ignored; they result in a penalty for each data point that violates the decision boundary

Goal: Have a large margin without ignoring too many data points
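The trade-off described above is usually written as the standard soft-margin objective, where C weighs the slack penalties ξi against the margin size:

```latex
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

A large C punishes violations heavily (few ignored points, smaller margin); a small C allows more violations in exchange for a larger margin.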

22
Q

How to handle decision boundaries that are not linear with SVMs?

A
  • Transform the data into a higher-dimensional space in which a linear separation exists
  • Different kernel functions can be used for this transformation
23
Q

What are the characteristics of SVMs?

A
  • Most successful classification technique for high dimensional data before DNNs appeared
  • Hyperparameter selection often has a high impact on the performance of SVMs
24
Q

What are the application areas for SVMs?

A
  • Text classification
  • Computer vision
  • Handwritten digit recognition
  • SPAM detection
  • Bioinformatics
25
Q

How do you tune an SVM?

A

1) Transform the attributes to a numeric scale
2) Normalize all value ranges (0 to 1)
3) Use the RBF kernel function
4) Use nested cross-validation to find the best values for the parameters
- C (weight of the slack variables): 0.03 to 30000
- gamma (kernel parameter): 0.00003 to 8
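A sketch of generating the log-scale search grid for C and gamma over the ranges above (the step factor of 10 is an assumption; finer grids are common in practice):

```python
import itertools

# Build log-scale candidate values within [lo, hi]; each step multiplies
# by `factor` (the factor of 10 is an assumed, fairly coarse grid).
def log_grid(lo, hi, factor=10.0):
    values, v = [], lo
    while v <= hi * 1.0001:        # small tolerance for float rounding
        values.append(v)
        v *= factor
    return values

cs = log_grid(0.03, 30000)         # 0.03, 0.3, ..., 30000
gammas = log_grid(0.00003, 8)      # 0.00003, 0.0003, ..., 3
combinations = list(itertools.product(cs, gammas))
print(len(cs), len(gammas), len(combinations))
```

Each (C, gamma) pair would then be evaluated via nested cross-validation.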

26
Q

What is the inspiration for ANNs?

A

The human brain

27
Q

How do Artificial Neural Networks classify records?

A
  • The model consists of inter-connected nodes (neurons) and weighted links
  • The output node sums up the weighted input values

Classification decision: compare the output node's value against a threshold t

Y = I(SUM(wi * Xi) - t > 0)
Y = true -> class 1
Y = false -> class 0
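The decision rule above as a minimal sketch (weights and threshold are made-up numbers):

```python
# Single-node classification: Y = I(sum(w_i * x_i) - t > 0).
# The weights and threshold below are hypothetical illustration values.
def classify(xs, ws, t):
    total = sum(w * x for w, x in zip(ws, xs))   # weighted sum of inputs
    return 1 if total - t > 0 else 0             # indicator vs. threshold

print(classify([1, 0, 1], [0.3, 0.3, 0.3], 0.4))  # 0.6 - 0.4 > 0 -> class 1
print(classify([1, 0, 0], [0.3, 0.3, 0.3], 0.4))  # 0.3 - 0.4 < 0 -> class 0
```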
28
Q

Describe the multi-layer ANN

A
  • Input layer
  • Hidden layer(s) (training an ANN means learning the weights of the neurons in the hidden layers)
    • For each neuron, an activation function (threshold) determines the output
29
Q

Describe the algorithm for training ANNs

A
  1. Initialize the weights (to 1 or randomly)
  2. Adjust the weights so that the output of the ANN is as consistent as possible with the class labels of the training set
    • Find the weights that minimize the error
    • Back-propagation algorithm
    • Adjustment factor: learning rate
30
Q

Describe Deep Neural Networks

A
  • They differ from ANNs in the number of layers (deep)
  • Require lots of training data and GPUs to determine the weights and to test different network architectures

31
Q

What are the characteristics of ANNs?

A
  • Can be used for classification & numerical regression
  • Multi-layer neural networks are universal approximators
  • Model building is time consuming; application is fast
  • Can handle redundant attributes, difficult to handle missing values (ANN does not know which input should be assigned to missing values)
32
Q

What is a difficult task when applying ANNs?

A
  • Choose the right network topology
  • The expressive hypothesis space often leads to overfitting
    • Use more training data (a lot more)
    • Step-by-step simplification of the topology (regularization)

Regularization:

1) Start with several hidden layers and larger number of nodes
2) Estimate generalization error using validation dataset
3) Step by step remove nodes as long as generalization error improves

33
Q

What is the difference between a hyperparameter and a parameter?

A

Hyperparameter: influences the learning process; its value is set before learning begins (e.g. pruning thresholds for trees, k for k-nearest neighbors)

Parameter: is learned from the training data (e.g. weights in an ANN, splits in a tree)

34
Q

How can you determine good hyperparameters?

A
  • Manually play around with different settings
  • Have your machine automatically test many different settings (hyperparameter optimization)

35
Q

What is the goal of hyperparameter optimization?

A
  • Find combination of hyperparameters that results in learning the model with the lowest generalization error

Select the model that is expected to generalize best on unseen records

36
Q

How to determine the hyperparameter value combinations to be tested?

A
  • Grid Search (test all combinations in user-defined range)
  • Random Search (test combinations of random values)
  • Evolutionary Search (keep specific values that worked well)
37
Q

How to select the model with a validation set?

A
  • Keep data used for model selection strictly separate from data used for model evaluation (otherwise overfit)
  1. Split training set into validation and training set
  2. Learn multiple models with different hyperparameter values
  3. Select best parameter values by testing each model on the validation set
  4. Learn final model on complete training data (before split)
  5. Evaluate best model on test set
38
Q

Why do we need cross-validation for model selection?

A
  1. We want all examples to be used for validation once
  2. We want to use as much labeled data as possible for training

39
Q

How many models are learned using cross-validation for model selection?

A

Model selection = |folds| × |parameter value sets|
+
best model learned on the complete training data
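A worked example with hypothetical numbers (5 folds, 10 parameter value sets):

```python
# Models learned during model selection with cross-validation:
folds = 5
parameter_value_sets = 10              # hypothetical search-space size
selection_models = folds * parameter_value_sets
total = selection_models + 1           # + best model on complete training data
print(total)  # 5 * 10 + 1 = 51 models
```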

40
Q

Why do we need nested cross-validation?

A

1) To find the best hyperparameter setting (model selection)
2) To get a reliable estimate of the generalization error (model evaluation)

Cross-validation for model selection does not incorporate an estimation of the generalization performance

41
Q

How does nested cross-validation work?

A

Outer Cross-Validation (Model evaluation)

  • estimates generalization error of best model
  • training set is passed on to inner cross-validation in each iteration

Inner cross-validation (Model selection)

  • searches for best parameter combination
  • splits outer training set into inner training and validation set
  • the best model is then learned using all of the outer training data
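The two nested loops can be sketched in pure Python; the dummy score function stands in for real training and evaluation:

```python
# Nested cross-validation structure. The score function is a hypothetical
# stand-in for "train a model with these parameters and evaluate it".
def score(params, train, test):
    return len(train) - abs(params)

data = list(range(10))
outer_folds, inner_folds, param_grid = 2, 3, [-1, 0, 1]

def split(items, k, i):
    test = items[i::k]                           # every k-th item is fold i
    train = [x for x in items if x not in test]
    return train, test

outer_errors = []
for o in range(outer_folds):                     # outer CV: model evaluation
    outer_train, outer_test = split(data, outer_folds, o)
    # inner CV: model selection on the outer training data only
    best = max(param_grid, key=lambda p: sum(
        score(p, *split(outer_train, inner_folds, i))
        for i in range(inner_folds)))
    outer_errors.append(score(best, outer_train, outer_test))
print(best, sum(outer_errors) / len(outer_errors))
```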
42
Q

Which classifiers select relevant features automatically?

A
  • Decision trees, Random Forests, ANNs, SVMs
43
Q

Which classifiers' performance depends on the selected feature subset?

A
  • KNN, Naive Bayes
44
Q

Which automated feature selection approaches do you know?

A
  • Forward selection: start with the best single feature, then add further features and test again
  • Backward selection: start with all features, then remove features and test again

Use nested cross-validation to estimate the generalization error
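Forward selection can be sketched as follows; the per-feature "usefulness" scores are made up and stand in for cross-validated performance:

```python
# Forward selection sketch. In practice score() would run (nested)
# cross-validation; here it uses hypothetical per-feature gains.
def score(features):
    usefulness = {"a": 3.0, "b": 2.0, "c": -1.0}
    return sum(usefulness[f] for f in features)

candidates = {"a", "b", "c"}
selected, best_score = [], float("-inf")
while candidates:
    # pick the candidate whose addition gives the best score
    feature = max(candidates, key=lambda f: score(selected + [f]))
    new_score = score(selected + [feature])
    if new_score <= best_score:
        break                      # stop when adding features no longer helps
    selected.append(feature)
    best_score = new_score
    candidates.remove(feature)
print(selected)  # ['a', 'b'] - adding 'c' would lower the score
```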

45
Q

When should you use hyperparameter selection?

A
  • Always!

Otherwise you can't say that a method does not work for a task

46
Q

When should you apply feature selection?

A
  • Check if the classification method requires feature selection
  • If yes, run automated feature selection

47
Q

Which set-up should be used for model selection?

A
  • Nested cross-validation
  • If computation takes too long:
    • better hardware
    • reduce number of folds
    • reduce parameter search space
    • sample data to reduce size
  • If exact replicability of results is required: single train, validation, test split