7: Machine Learning 2 Flashcards
What are two models for supervised learning?
- KNN classification
- Decision trees
What is the K-nearest neighbours (KNN) algorithm?
A supervised machine learning algorithm that classifies a new data point into a target class based on the features of its neighbouring data points.
- Assumption: the new observation is likely from the same group as its closest neighbours, i.e. the data points with the most similar characteristics.
- Distance is measured with the Euclidean distance: d(x, y) = sqrt(sum of (x_i - y_i)^2 over all features).
What are the steps in using the KNN algorithm?
- Choose the number K of nearest neighbours (rule of thumb: K = sqrt(n), where n is the sample size).
- For any unlabelled point (an observation with known feature values but no label), identify the K nearest labelled points (shortest distance).
- Assign to the unlabelled point the most frequent label among its K nearest labelled points (see the sketch below).
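A minimal from-scratch sketch of these three steps, assuming NumPy is available; the toy data and the knn_predict helper are hypothetical illustrations, not from the flashcards.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Classify one point by majority vote of its k nearest labelled neighbours."""
    # Euclidean distance from the new point to every labelled point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k nearest labelled points.
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those k neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical labelled data: two features, two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array(["A", "A", "B", "B"])

k = int(np.sqrt(len(X_train)))  # rule of thumb: K = sqrt(n), here 2
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k))  # -> "A"
```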
What is Kappa in the KNN output?
Kappa = (accuracy - random accuracy) / (1 - random accuracy), where random accuracy is the accuracy expected if labels were assigned at random with the same class proportions.
A negative value means that random assignment would do better than our prediction.
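A small sketch of this formula on hypothetical labels, checked against scikit-learn's cohen_kappa_score (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical true and predicted labels.
y_true = np.array(["A", "A", "A", "B", "B", "B"])
y_pred = np.array(["A", "A", "B", "B", "B", "A"])

accuracy = (y_true == y_pred).mean()
# Random accuracy: chance that true and predicted labels agree if
# predictions were assigned at random with the same class proportions.
random_accuracy = sum(
    (y_true == c).mean() * (y_pred == c).mean() for c in np.unique(y_true)
)
kappa = (accuracy - random_accuracy) / (1 - random_accuracy)
print(kappa, cohen_kappa_score(y_true, y_pred))  # both ~0.333
```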
What are decision trees used for?
Trees are a way to split data into purer subsets: we iteratively look for the simplest way to group the data into the most homogeneous categories.
What does the process of decision trees look like?
- Start with the root node (containing all the training data).
- Find the best split of the data: the split leading to the purest subgroups.
- Split the data into branches.
- Repeat at the next level (the whole process is sketched below).
- Aim to stop at leaf nodes (final decision; pure or almost pure classes).
- Pure node = only contains data from the same group.
- Internal nodes = intermediary nodes between the root and the leaves.
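A minimal sketch of this process using scikit-learn's DecisionTreeClassifier (an assumption: the flashcards don't name a library); the iris dataset is just a convenient stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini": each split is chosen to minimise Gini impurity.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# Prints the tree: root at the top, internal nodes as nested
# conditions, leaf nodes as the final class decisions.
print(export_text(tree))
```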
How do we find the best split for decision trees?
The Gini impurity: for one node, Gini = 1 - sum(p_i^2), where p_i is the proportion of class i in the node (0 = perfectly pure).
- Try all possible splits of the data at this level.
- For each possible split, calculate the weighted average impurity (Gini) of the resulting subgroups.
- Best split = lowest impurity. Always try to minimise the Gini impurity (see the sketch below).
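A from-scratch sketch of this search on a hypothetical one-feature dataset; gini and best_split are illustrative helpers, not a standard API.

```python
import numpy as np

def gini(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

def best_split(x, y):
    """Try every threshold on feature x; return the split with the lowest
    weighted average impurity of the two child nodes."""
    best = (None, np.inf)
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if impurity < best[1]:
            best = (t, impurity)
    return best

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array(["A", "A", "A", "B", "B", "B"])
print(best_split(x, y))  # threshold 3.0 gives two pure children (impurity 0.0)
```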
What is a limitation of the Gini impurity?
It is computed only over the possible splits of one node at a time (a greedy search), so it can only reach a local minimum: the algorithm might not identify the optimal tree overall.
What are some limitations of decision trees classifiers?
- The tree is only locally optimal.
- Small changes in the training dataset might lead to large changes in the classification rules (see the sketch below).
→ Decision trees are unstable!
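A small sketch of that instability, assuming scikit-learn: fit the same tree on two bootstrap samples of the same data and print the learned rules, which can differ noticeably.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

for seed in (1, 2):
    # Bootstrap sample: same data, slightly different composition.
    idx = np.random.default_rng(seed).integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[idx], y[idx])
    print(export_text(tree))  # split features/thresholds can change between samples
```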