7: Machine Learning 2 Flashcards

1
Q

What are two models for supervised learning?

A
  • KNN classification
  • Decision trees
2
Q

What is the K-nearest neighbours (KNN) algorithm?

A

A supervised machine learning algorithm that classifies a new data point into a target class based on the features of its neighbouring data points.

  • Assumption: the new observation is likely from the same group as its closest neighbours, i.e. the data points with the most similar characteristics.
  • Distance is measured with the Euclidean distance (see the sketch below).
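
A minimal sketch of KNN classification, assuming scikit-learn is available (the toy data and class labels are made up for illustration):

from sklearn.neighbors import KNeighborsClassifier

# Two features per observation; labels are the target classes (toy data).
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = ["A", "A", "B", "B"]

# The default metric is Minkowski with p=2, i.e. the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[5.5, 8.5]]))  # ['B']: most of the 3 nearest neighbours are B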
3
Q

What are the steps in using the KNN algorithm?

A
  1. Choose the number K of nearest neighbours (rule of thumb: K = sqrt(n), where n is the sample size).
  2. For an unlabelled point (an observation whose feature values are known but whose label is not), identify the K nearest labelled points (shortest distance).
  3. Assign to the unlabelled point the most frequent label among its K nearest labelled points (see the sketch below).
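
These steps can be sketched from scratch, assuming NumPy is available (knn_predict and the toy data are hypothetical, for illustration only):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # Step 2: Euclidean distance from the new point to every labelled point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    # Step 3: assign the most frequent label among the k nearest neighbours.
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = ["A", "A", "B", "B"]
k = int(np.sqrt(len(X_train)))  # Step 1: rule of thumb K = sqrt(n) = 2 here
print(knn_predict(X_train, y_train, np.array([5.5, 8.5]), k))  # 'B'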
4
Q

What is Kappa in the KNN output?

A

Kappa = (accuracy - random accuracy) / (1 - random accuracy)

Kappa measures how much better the classifier performs than chance: 1 is perfect agreement, 0 is no better than random, and a negative value means that random assignment would do better than our prediction.
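
A worked example of the formula (the accuracy values here are made up):

# Suppose the classifier is right 80% of the time and chance would be right 50%.
accuracy = 0.80
random_accuracy = 0.50

kappa = (accuracy - random_accuracy) / (1 - random_accuracy)
print(kappa)  # 0.6 -> clearly better than chance; a negative kappa would be worse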

5
Q

What are decision trees used for?

A

Trees are a way to split data into purer subsets. We iteratively try to find the simplest way to group the data into the most homogeneous categories.

6
Q

What does the process of decision trees look like?

A
  1. Find the best split of the data: the split leading to the purest subgroups.
  2. Split the data into branches.
  3. Repeat at the next level (see the sketch after this list).
  • Start with the root node (containing all the training data).
  • Aim to stop at leaf nodes (final decisions; pure or almost pure classes).
  • Pure class = only contains data from the same group.
  • Internal nodes = intermediary nodes between the root and the leaves.
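
A minimal sketch of this process, assuming scikit-learn and its bundled iris dataset (max_depth=2 is an arbitrary choice to keep the tree small):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" (the default) picks the split that minimises Gini impurity.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# Prints the root node, internal nodes, and leaf nodes described above.
print(export_text(tree, feature_names=load_iris().feature_names))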
7
Q

How do we find the best split for decision trees?

A

Using the Gini impurity:

  1. Try all possible splits of the data at this level.
  2. For each possible split, calculate the average level of impurity (Gini).
  3. Best split = lowest impurity; always minimise the Gini impurity (see the sketch below).
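
A short sketch of the calculation, with hypothetical class labels (gini and split_gini are illustrative helper names):

def gini(labels):
    # Gini impurity of one node: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(left, right):
    # Average impurity of a split, weighted by the size of each child node.
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["A", "A", "A", "B"]))                    # 0.375
print(split_gini(["A", "A", "A", "B"], ["B", "B"]))  # 0.25 (lower = purer split)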
8
Q

What is a limitation of the Gini impurity?

A

It is computed only over the possible splits of one node, so the algorithm can only reach a local minimum (it might not identify the globally optimal tree).

9
Q

What are some limitations of decision tree classifiers?

A
  1. The tree is only locally optimal.
  2. Small changes in the training dataset can lead to large changes in the classification rules.
    -> Decision trees are unstable! (See the sketch below.)
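
One way to see the instability, assuming scikit-learn (the seeds and depth are arbitrary): train the same tree on two slightly different samples and compare the rules.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

for seed in (0, 1):
    # A different random 70% sample of the data each time.
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)
    print(f"--- training sample {seed} ---")
    print(export_text(tree))

Depending on the sample, the chosen split variables or thresholds can shift, even though the underlying data is the same.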