Lecture 2 - Machine Learning Project Flashcards
What is NumPy Vectorisation?
Replacing explicit Python loops with NumPy functions that operate on whole arrays at once.
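As a rough illustration, the sketch below compares a loop with its vectorised equivalent. The `prices` array and the `rate` value are made-up example data, not something from the lecture.

```python
import numpy as np

# Hypothetical data: five prices and a made-up conversion rate.
prices = np.array([1.0, 2.5, 4.0, 5.5, 7.0])
rate = 2.0

# Loop version: one Python-level multiplication per element.
looped = np.empty_like(prices)
for i in range(len(prices)):
    looped[i] = prices[i] * rate

# Vectorised version: a single expression applied to the whole array.
vectorised = prices * rate

# Both approaches produce the same values.
print(np.array_equal(looped, vectorised))
```

The vectorised form is shorter and lets NumPy run the element-wise work in compiled code instead of the Python interpreter.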
What is an End-to-End Machine Learning Project?
1 Understand the problem and check assumptions.
2 Visualise and explore the data (also to support step 1).
3 Prepare the data for a ML algorithm (works in conjunction with step 2).
4 Select a model, train and validate it (can include fine-tuning).
5 Present your solution.
6 Launch, monitor, and keep checking assumptions.
End-to-End Machine Learning Project Diagram
REFER TO THE SLIDES - understand problem specification
What are common options to select as performance measures?
Mean Squared Error (MSE) and Mean Absolute Error (MAE), both used for regression problems.
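A minimal sketch of both measures in plain Python, assuming small made-up lists of true and predicted values. The function names are illustrative only.

```python
def mean_squared_error(y_true, y_pred):
    # Average of the squared differences; large errors are penalised more.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    # Average of the absolute differences; every error counts linearly.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical example values.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4 + 1) / 4
mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2 + 1) / 4
print(mse, mae)
```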
Classification Example (Using MNIST dataset)
REFER TO SLIDES
What is the formula for accuracy?
Number of correct predictions / Total number of predictions
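The formula can be sketched in a few lines of Python. The label lists are made-up example data.

```python
def accuracy(y_true, y_pred):
    # Count predictions that match the true label, then divide by the total.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 3 of the 5 predictions are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

acc = accuracy(y_true, y_pred)
print(acc)
```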
What is a confusion matrix?
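Summarises the outcome of a classification as a matrix, with the true labels along one axis and the predicted labels along the other. For binary classification it is a 2 by 2 matrix of true negatives, false positives, false negatives, and true positives.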
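A small sketch of how the four cells of a binary confusion matrix could be counted. The layout used here (rows are true labels, columns are predicted labels, negative class first) is an assumption chosen for the example, and the labels are made up.

```python
def binary_confusion_matrix(y_true, y_pred):
    # Rows: true label (0, then 1). Columns: predicted label (0, then 1).
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels, with 1 as the positive class.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

cm = binary_confusion_matrix(y_true, y_pred)
tn, fp = cm[0]
fn, tp = cm[1]
print(cm)
```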
What is precision and what is its formula?
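The proportion of positive predictions that are actually positive.
Formula: True positives / (True positives + False positives)
Where:
A true positive is a positive instance that is correctly predicted as positive
A false positive is a negative instance that is incorrectly predicted as positive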
What is recall and what is its formula?
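The proportion of actual positive instances that are correctly predicted as positive.
Formula: True positives / (True positives + False negatives)
Where:
A true positive is a positive instance that is correctly predicted as positive
A false negative is a positive instance that is incorrectly predicted as negative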
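To make the precision and recall formulas concrete, here is a minimal sketch that counts true positives, false positives, and false negatives from made-up labels (1 is the positive class). Returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero (an assumption made for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: TP = 3, FP = 2, FN = 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```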
What are some trade-offs between precision and recall?
In some scenarios, false positives can be costly, so precision is more important.
- Predicting that it is safe to change lanes while driving, when it is not.
In some scenarios, false negatives can be costly, so recall is more important.
- Predicting that a patient does not have cancer when they do.
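NOTE: you can adjust the classifier's decision threshold to favour one over the other. Raising the threshold generally increases precision and lowers recall; lowering it does the opposite.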
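The effect of the threshold can be sketched with a handful of made-up classifier scores. The scores, labels, and the two threshold values below are hypothetical, and an instance is predicted positive when its score is at or above the threshold.

```python
# Hypothetical classifier scores and their true labels (1 = positive).
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]


def precision_recall_at(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


low = precision_recall_at(0.3)   # lenient threshold: more positives predicted
high = precision_recall_at(0.7)  # strict threshold: fewer positives predicted
print(low, high)
```

On this toy data the lower threshold gives perfect recall but lower precision, while the higher threshold gives perfect precision but misses one positive.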
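What is the F1 score (harmonic mean)?
A single metric that combines precision and recall as their harmonic mean, so it is only high when both are high.
Formula: F1 = 2 / ((1/precision) + (1/recall)) = 2 × precision × recall / (precision + recall)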
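A short sketch of the F1 formula, using made-up precision and recall values. The second call illustrates why a harmonic mean is used: one very low value drags the score down far more than an arithmetic mean would.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 if either is 0.
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / ((1 / precision) + (1 / recall))


# Hypothetical values.
balanced = f1_score(0.6, 0.75)   # both fairly high
lopsided = f1_score(1.0, 0.1)    # perfect precision, very poor recall

print(balanced, lopsided)
```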
What is a Receiver Operating Characteristic (ROC) curve?
A plot of the true positive rate (TPR, i.e. recall) against the false positive rate (FPR) for varying threshold settings.
Made up of:
TPR (also known as sensitivity and recall) = proportion of positive instances that are correctly classified as positives
Formula: TP / (TP+FN)
TNR (also known as specificity) = proportion of negative instances that are correctly classified as negatives
Formula: TN / (FP+TN)
FPR = proportion of negative instances that are incorrectly classified as positives
Formula: FP / (FP+TN) = 1 − specificity
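The points of a ROC curve can be sketched by computing (FPR, TPR) at several thresholds. The scores, labels, and threshold list below are made-up example data, and an instance is predicted positive when its score is at or above the threshold.

```python
# Hypothetical classifier scores and their true labels (1 = positive).
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]


def rates_at(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 0)
    tpr = tp / (tp + fn)  # sensitivity / recall
    tnr = tn / (fp + tn)  # specificity
    fpr = fp / (fp + tn)  # equals 1 - specificity
    return fpr, tpr, tnr


# One (FPR, TPR) point per threshold; plotting them gives the ROC curve.
thresholds = [0.0, 0.3, 0.5, 0.7, 1.0]
roc_points = [rates_at(th)[:2] for th in thresholds]
print(roc_points)
```

With a threshold of 0.0 everything is predicted positive, giving the point (1, 1); with a threshold above every score nothing is predicted positive, giving (0, 0).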
What is multiclass classification?
Multiclass classifiers are for discriminating between multiple classes (N > 2).
NOTE: Some algorithms (such as Softmax Regression, Random Forest classifiers, or naive Bayes classifiers) are capable of handling multiple classes directly.
Others (such as Support Vector Machine classifiers) are strictly binary classifiers.
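As a rough illustration of handling several classes directly, the sketch below shows only the softmax step that Softmax Regression is named after: it turns one score per class into probabilities and picks the most likely class. It is not a full model, and the class scores are made up.

```python
import math


def softmax(class_scores):
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(class_scores)
    exps = [math.exp(s - top) for s in class_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes (N = 3).
class_scores = [2.0, 1.0, 0.1]

probabilities = softmax(class_scores)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```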