L7: Supervised machine learning Flashcards
What is machine learning?
A branch of AI
What does a machine learn from?
* Learning from data
* Discovering hidden patterns
* Essential for data-driven decisions
What questions can we ask?
Predictive
Examples of applied ML
Examples:
* Credit card fraud detection in financial institutions
* Recommendation systems on websites for personalization
* Customer segmentation for marketing strategies
* Customer churn prediction to foresee service cancellations
* Predictive maintenance in manufacturing companies
* Sentiment analysis of social media data
* Health diagnosis to aid doctors
ML Pipeline
Acquire → Prepare → Analyze → Report → Act
ML Pipeline step 1: Acquire data
- Identify data sources: Check the question that needs to be addressed
- Collect data: Record the necessary data
- Integrate data (data wrangling): Merge/join data, if needed
ML Pipeline step 2: Prepare Data
- Explore: Understand your data e.g.,
- Check the structure and variable types
- Check for outliers, missing values etc.
- Pre-process: Prepare your data for analysis e.g.,
- Clean (missing values, mistakes etc.)
- Feature selection (e.g., combine, remove, add)
- Feature transformation (e.g., scaling, dimensionality reduction, filtering)
ML Pipeline step 3: Analyze data
- Select analytical techniques
- Build models
- Assess results
ML Pipeline step 4: Report results
- Communicate results
- Recommend actions
ML Pipeline step 5: Act
- Apply the results
- Implement, maintain, and assess the impact
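To make the five steps concrete, here is a minimal sketch in Python with pandas and scikit-learn; the customer table and its columns (age, income, churn) are invented for illustration and are not from the flashcards.

```python
# A minimal sketch of Acquire -> Prepare -> Analyze -> Report
# on a synthetic (hypothetical) customer churn table.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Acquire: in practice, read from a file or database; here, a synthetic table
df = pd.DataFrame({
    "age":    [25, 52, 37, 45, 29, 61, 33, 48] * 10,
    "income": [40, 85, 60, 72, 38, 90, 55, 67] * 10,
    "churn":  [0, 1, 0, 1, 0, 1, 0, 1] * 10,
})

# Prepare: clean the data and select features
df = df.dropna()
X, y = df[["age", "income"]], df["churn"]

# Analyze: build and assess a model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Report: communicate the results (Act would mean deploying the model)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```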
Supervised ML
The goal is to estimate a model from a selection of input variables that gives
the best estimate of the target (i.e., the outcome variable). It predicts something
we have seen before (i.e., data labels guide the learning process).
Requires:
* A range of input variables
* An outcome variable
DATA LABELS
The process of adding informative labels or tags to our data.
* Think of it as the “ground truth” for the target variable/outcome variable
* Necessary for a supervised ML algorithm
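As a small illustration of labeled data, a sketch with invented numbers: X holds the input variables and y holds the data labels (the ground truth).

```python
# A minimal sketch of labeled data for supervised ML: each row of X is an
# observation (input variables), each entry of y is its label (ground truth).
import numpy as np

X = np.array([[25, 40_000],   # age, income (illustrative input variables)
              [52, 85_000],
              [37, 60_000]])
y = np.array([0, 1, 0])       # data labels, e.g., 0 = "no churn", 1 = "churn"
```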
Types of supervised ML
Regression and classification
Regression
Given input variables, predict a numeric (continuous) value.
Examples:
* Estimate average house price for a region
* Determine demand for a new product
* Predict power usage
CLASSIFICATION
Given input variables, predict a categorical variable.
Examples:
* Predict if it will rain tomorrow
* Determine if a loan application is high-, medium-, or low-risk
* Identify sentiment as positive, negative, or neutral
MACHINE LEARNING ALGORITHMS
Some examples:
* Linear regression
* Logistic regression
* K-nearest neighbor
* Decision trees
* Support vector machines
PARAMETRIC/NON-PARAMETRIC ALGORITHMS
PARAMETRIC: Pre-known functional form f()
* Restrictive assumptions
* Non-flexible
* Pre-determined number of parameters
NON-PARAMETRIC: Any functional form f()
* No assumptions
* Flexible
* Parameters learned from data
LINEAR REGRESSION
Linear regression models the linear relationship between a dependent
variable and one or more independent variables based on a fixed
functional form f(x). The simplest form uses a single independent variable
(simple linear regression); with more than one independent variable, it is
called multiple linear regression.
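A minimal sketch of simple linear regression with scikit-learn on synthetic data; the true slope (3.0) and intercept (5.0) are invented so that the learned parameters can be checked against them.

```python
# A minimal sketch: fit a linear functional form f(x) to noisy synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one independent variable
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, 100)    # linear relationship + noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # estimates should be close to 3.0 and 5.0
```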
LOGISTIC REGRESSION
Logistic regression predicts the probability of an event occurring (binary
outcome variable) based on a number of input variables. It has a fixed
functional form for f(x), and can accommodate a range of input variables.
✓ Classification
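A minimal sketch of logistic regression with scikit-learn; the built-in breast cancer dataset stands in for any problem with a binary outcome variable.

```python
# A minimal sketch: predict class probabilities for a binary outcome.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(model.predict_proba(X_test[:3]))  # probability of each class per observation
```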
K-NEAREST NEIGHBOUR (KNN)
KNN is an algorithm that works locally because it uses a pre-specified
number of observations (k = the number of nearest neighbours) to make
the prediction. For regression, the average of the neighbours' values is
used, whereas in classification the majority class wins.
✓ Regression
✓ Classification
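A minimal sketch of KNN with scikit-learn, showing both the classification (majority vote) and regression (neighbour average) variants; the iris dataset and k = 5 are arbitrary illustrative choices.

```python
# A minimal sketch of KNN: k nearest training observations make the prediction.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X, y = load_iris(return_X_y=True)

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # classification: majority vote
print(clf.predict(X[:2]))

# Regression: average of the neighbours (here predicting the 4th feature
# from the first three, purely for illustration)
reg = KNeighborsRegressor(n_neighbors=5).fit(X[:, :3], X[:, 3])
print(reg.predict(X[:2, :3]))
```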
DECISION TREES
Decision trees are a global approach that uses all observations to make a
prediction. The tree-like structure shows that the functional form f(x) is
approximated in a step-wise manner by means of recursive binary splitting.
✓ Regression
✓ Classification
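A minimal sketch of a decision tree classifier with scikit-learn; the depth limit of 3 is an arbitrary illustration of how the recursive binary splitting is controlled.

```python
# A minimal sketch: fit a depth-limited tree and print its learned splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
print(export_text(tree))  # one binary split (rule) per line of the tree
```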
SUPPORT VECTOR MACHINES
Support vector machines estimate the optimal decision boundary (i.e., the
line/plane/hyperplane that separates our data) by applying the kernel trick
(i.e., placing the data in higher dimensions). The data points nearest the
decision boundary are referred to as support vectors, and they form
important margins.
✓ Regression
✓ Classification
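A minimal sketch of a support vector classifier with a radial (RBF) kernel in scikit-learn; the values of C and gamma are illustrative defaults, not tuned.

```python
# A minimal sketch: fit an RBF-kernel SVM and inspect its support vectors.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("Support vectors per class:", svm.n_support_)  # points defining the margins
print("Test accuracy:", svm.score(X_test, y_test))
```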
Unsupervised ML
The goal is to derive associations and patterns based on a selection of input
variables without knowing the target (outcome variable) i.e., we have no
ground truth.
Requires:
* A range of input variables
* No outcome variable
What is a model?
A simplified representation of reality created for a specific purpose based
on some assumptions.
Example: Customer churn
* Create a “formula” for predicting the probability of customer attrition at
contract expiration
How to build a model
- Consider the domain and your problem statement
- Consider the requirement for explainability
- Choose the type of algorithm
- Establish success criteria i.e., definition of success
- Train models
- Model selection
Curse of dimensionality
The curse of dimensionality refers to the situation where we keep adding
more input variables to our data, which creates high-dimensional data.
High-dimensional data = # of input variables ≥ # of observations
The amount of training data needs to grow exponentially to maintain the
same coverage!
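A back-of-the-envelope sketch of the exponential growth: if each input variable is divided into 10 bins (an arbitrary resolution), the number of cells the training data must cover grows as 10^d with the number of dimensions d.

```python
# A minimal sketch of the curse of dimensionality: cells to cover at a fixed
# per-variable resolution grow exponentially with the number of variables.
for d in [1, 2, 5, 10]:
    print(f"{d} input variable(s) -> {10 ** d:,} cells at the same resolution")
```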
Black box ML model
“Black box” ML models are too complex for humans to understand or
interpret. A limitation some ML algorithms suffer from, but not all!
- A complex decision process made by the algorithm
- Difficult to trace back from the predictions to the origin
- Hard to determine why an action was taken
- Model parameters that are non-interpretable
Think carefully about explainability (Can your stakeholders understand the
results of the chosen model?).
In general, it is good practice to use simpler and more interpretable models
when there is no significant benefit gained from choosing a more complex
alternative, an idea also known as Occam’s Razor.
Overfitting
When you learn patterns in the training data that are only there by
chance, i.e., patterns that are not present in new, unseen data.
Non-parametric and non-linear models are prone to overfitting
because they have more flexibility when they approximate the
functional form of f(x).
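A minimal sketch of overfitting with scikit-learn: a fully grown (very flexible) decision tree memorizes the training data but scores noticeably worse on held-out test data.

```python
# A minimal sketch: an unconstrained tree fits training data (almost)
# perfectly but generalizes worse to unseen test data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)  # no depth limit
print("Train accuracy:", tree.score(X_train, y_train))  # ~1.0
print("Test accuracy:", tree.score(X_test, y_test))     # noticeably lower
```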
Underfitting
When you fail to learn important patterns in the training data and
therefore also miss generalizable patterns in new, unseen data. It
will be obvious from the chosen performance metric (on the training
data), and the remedy is to move on and try to estimate
alternative models.
BIAS-VARIANCE TRADE-OFF
Prediction error as a function of model complexity: increasing complexity
reduces bias but increases variance, so the error on unseen data is typically
U-shaped and minimized at an intermediate level of complexity.
Data splitting
The goal: Split the data into a training data set and a test data set.
Why?
- Training a model and predicting with it are two separate things.
- Avoid prediction bias when assessing the accuracy of the model.
Requirements:
* Independent (observations are independent of each other)
* Mutually exclusive (an observation appears in only one of the two sets)
* Completely exhaustive (all observations are allocated)
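A minimal sketch of an 80/20 random split with scikit-learn's train_test_split, which is mutually exclusive and completely exhaustive by construction.

```python
# A minimal sketch: split 150 observations into 120 train and 30 test.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # random 80/20 split
print(len(X_train), len(X_test))  # 120 + 30 = all 150 observations allocated
```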
Data splitting strategies
Random split with 80% train (in-sample) and 20% test data (out-of-sample)
* Stratified random splitting
* Train data set / Validation (tuning) data set / Test data set
* Cross-validation
* Leave-One-Out
* K-fold
Not enough data?
- Use a resampling technique, e.g., bootstrapping
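A minimal sketch of two of these strategies in scikit-learn, 5-fold cross-validation and a bootstrap resample; the dataset and model are illustrative choices.

```python
# A minimal sketch of k-fold cross-validation and bootstrap resampling.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each observation is used for testing exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Accuracy per fold:", np.round(scores, 3))

# Bootstrapping: sample with replacement when there is not enough data
X_boot, y_boot = resample(X, y, replace=True, n_samples=len(X), random_state=42)
```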
Objective functions
How successful is the chosen algorithm? To measure this, you need to
choose an objective (loss) function that represents your goal.
Examples:
* Mean Squared Error (MSE): The average of the squared differences
between your predictions and the actual observations.
* Mean Absolute Error (MAE): The average of the absolute differences
between your predictions and the actual observations.
* Misclassification rate: The number of incorrect predictions out of the
total number of predictions.
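A minimal sketch computing the three losses on invented toy predictions, using scikit-learn and NumPy.

```python
# A minimal sketch of MSE, MAE, and misclassification rate on toy data.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 9.0])
print("MSE:", mean_squared_error(y_true, y_pred))   # mean of squared differences
print("MAE:", mean_absolute_error(y_true, y_pred))  # mean of absolute differences

labels_true = np.array([0, 1, 1, 0])
labels_pred = np.array([0, 1, 0, 0])
print("Misclassification rate:", np.mean(labels_true != labels_pred))
```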
Model tuning
The goal is to establish different versions (candidate models) of the basic
model by tuning the hyperparameters. Hyperparameters are parameters
that are not part of the model itself but impact the training of the model (e.g.,
the k in KNN, the depth of a decision tree, or C and γ in a radial kernel for
SVMs).
How to fine-tune the hyperparameters?
* Run a grid search & do k-fold cross-validation, or use the validation set
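A minimal sketch of a grid search with 5-fold cross-validation in scikit-learn, tuning C and gamma for an RBF-kernel SVM; the grid values are arbitrary illustrations.

```python
# A minimal sketch: search a small grid of C and gamma values, scoring each
# candidate model with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))  # best candidate model
```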
FINAL MODEL SELECTION
When selecting the final model (model selection), we look at the fitted
candidate models and choose the best one based on the out-of-sample error,
i.e., the error calculated on data points that were not used in the training
process (e.g., the validation set or cross-validation folds).