Machine learning Flashcards
What are the variable types? Draw the diagram
        Discrete              Continuous
       /        \                  |
 Categorical   Numerical       Numerical
  /      \         |               |
Nominal  Ordinal Interval        Ratio
Properties of attribute types
                Nominal  Ordinal  Interval  Ratio
Distinct           ✅       ✅       ✅       ✅
Order              ❌       ✅       ✅       ✅
Addition           ❌       ❌       ✅       ✅
Multiplication     ❌       ❌       ❌       ✅
What is global standardisation?
Feature scaling so that all features have the same mean and standard deviation, preventing features with large values from dominating features with smaller ones
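A minimal sketch of standardisation (z-scoring) in pure Python; the `standardise` helper and the example income values are illustrative, not from the cards:

```python
import statistics

def standardise(values):
    """Rescale a feature to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# A large-valued feature (e.g. income) is rescaled so it no longer
# dominates small-valued features in distance-based models.
incomes = [20_000, 35_000, 50_000, 65_000, 80_000]
z = standardise(incomes)
```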
What is dimensionality reduction?
A technique to reduce the number of features in a dataset while preserving the relevant information (PCA)
What is feature selection?
Selects a subset of original features that are the most relevant which acts as feature removal. Reduces dimensionality of dataset
What is feature extraction?
Derives a reduced set of transformed features from the original ones, which contributes to a reduction of dimensionality in the dataset (e.g. PCA)
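A sketch of feature extraction via PCA, assuming NumPy is available; the `pca` helper is a hand-rolled illustration using SVD, not a library API:

```python
import numpy as np

def pca(X, n_components):
    """Project data onto the top principal components (directions of max variance)."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T         # the transformed (extracted) features

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, n_components=2)                  # 5 original features reduced to 2
```

Note the extracted features are combinations of the originals, unlike feature selection, which keeps a subset of them unchanged.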
Benefits of feature selection
Simplifies the model, reduces size of dataset, improves model accuracy, more efficient in training and easier to interpret
Most important things to check for in data cleaning
Outliers, missing values and duplicates
Benefits of sampling
It allows for quicker analysis when analysis on whole dataset is not feasible
Types of sampling
Subsampling, sampling, resampling, random sampling and stratified random sampling
What is subsampling
Used for data reduction by selecting a subset of original dataset
What is sampling?
Creation of training and testing datasets
What is resampling?
Repeatedly drawing samples to estimate the characteristics of the whole dataset. Used for bias removal (bootstrapping)
What is random sampling?
Without replacement (picking balls out of the bag)
With replacement (picking balls but putting them back)
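The two variants map directly onto Python's standard library; a small sketch with an assumed bag of 10 items:

```python
import random

random.seed(0)
bag = list(range(10))

without = random.sample(bag, 5)      # without replacement: no item repeats
with_rep = random.choices(bag, k=5)  # with replacement: the same item may repeat
```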
What is stratified random sampling?
Random samples are taken from each variable based on relevant features. Used to contribute to model performance
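A sketch of stratified random sampling, assuming the stratum is given by a key function; the `stratified_sample` helper and the spam/ham rows are illustrative:

```python
import random

def stratified_sample(rows, key, frac, seed=0):
    """Sample the same fraction from each stratum so class proportions are preserved."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * frac))
        sample.extend(rng.sample(members, k))
    return sample

# 90:10 class imbalance; a 20% stratified sample keeps the 90:10 ratio.
rows = [("spam", i) for i in range(90)] + [("ham", i) for i in range(10)]
s = stratified_sample(rows, key=lambda r: r[0], frac=0.2)
```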
What is the holdout method?
Splitting the dataset into 2 parts: hold out one part for testing and use the other part for training (train/test split)
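A minimal holdout split in pure Python, assuming a 70/30 train/test ratio for illustration:

```python
import random

def holdout_split(data, test_frac=0.3, seed=0):
    """Shuffle, then hold out a fraction for testing; train on the rest."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_frac)
    return shuffled[cut:], shuffled[:cut]   # (train, test)

train, test = holdout_split(list(range(100)))
```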
What is k fold cross validation?
Evaluates model performance by dividing the dataset into k folds; each fold is used once as the test set while the remaining folds train the model, so the process is repeated k times with a different combination each time
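The fold rotation can be sketched as an index generator; a simplified version that assumes the dataset size divides evenly by k:

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for each of k folds; every point is tested exactly once."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

folds = list(k_fold_indices(10, k=5))   # 5 train/test combinations
```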
What is bootstrapping?
Create repetitions of the dataset using random sampling with replacement, generally > 1000 repetitions, which is used to reduce bias and make the estimates more robust
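A sketch of bootstrapping the mean with the stdlib; the `bootstrap_means` helper and the sample data are illustrative:

```python
import random
import statistics

def bootstrap_means(data, n_reps=1000, seed=0):
    """Resample with replacement n_reps times; the spread of the means
    estimates how much the statistic varies across samples."""
    rng = random.Random(seed)
    return [statistics.mean(rng.choices(data, k=len(data))) for _ in range(n_reps)]

data = [4, 8, 15, 16, 23, 42]
means = bootstrap_means(data)   # 1000 bootstrap estimates of the mean
```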
Types of model performance evaluation
Regression, classification and binary classification
How do you do regression performance evaluation?
Mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE)
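The three metrics in a short sketch; the `regression_metrics` helper and the toy values are illustrative:

```python
import math

def regression_metrics(actual, predicted):
    """MSE, RMSE and MAE for a set of predictions."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / len(errors)
    return {"mse": mse,
            "rmse": math.sqrt(mse),                          # same units as the target
            "mae": sum(abs(e) for e in errors) / len(errors)}

m = regression_metrics([3.0, 5.0, 2.0], [2.5, 5.0, 3.0])
```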
How do you do classification performance evaluation?
Actual class value vs predicted class value, confidence on prediction and confusion matrix
How do you do binary classification performance evaluation?
Pick a class of interest: instances of that class count as positive, and instances of all other classes count as negative
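Counting the four confusion-matrix cells for a chosen positive class can be sketched as follows; the helper name and labels are illustrative:

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP/FP/FN/TN with respect to the chosen class of interest."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
tp, fp, fn, tn = confusion_counts(actual, predicted)
```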
How to evaluate model performance when there’s a class imbalance for both regression and classification?
Kappa statistic
Roc curve
What does the kappa statistic do?
It adjusts accuracy for correct predictions by chance and partially mitigates the effects of class imbalance
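The chance correction can be made concrete with Cohen's kappa: observed agreement minus the agreement expected from the label frequencies alone, scaled to at most 1. A sketch (the `kappa` helper is illustrative):

```python
def kappa(actual, predicted):
    """Cohen's kappa: accuracy corrected for agreement expected by chance."""
    n = len(actual)
    labels = set(actual) | set(predicted)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    expected = sum((actual.count(c) / n) * (predicted.count(c) / n)
                   for c in labels)
    return (observed - expected) / (1 - expected)

actual    = ["a", "a", "a", "b", "b", "b"]
predicted = ["a", "a", "b", "b", "b", "a"]
k = kappa(actual, predicted)   # 4/6 observed vs 1/2 expected by chance
```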
K nearest neighbour benefits and disadvantages
Simple and easily explained and interpreted
Fast in training
Both classification and regression
Doesn’t produce a model
Slow in classification
Difficult to select k
SVM benefits and disadvantages
High accuracy in prediction
Compact model representation
Not prone to overfitting
Difficult to select kernel/parameters
Slow to train for large datasets
Not easily explained
Decision trees benefits and disadvantages
Human readable structural pattern
Transparent model for numeric and nominal features
Easily explained and interpreted
Inefficient for high dimensional data
Tendencies to overfit
Sensitive to small perturbations
Random forests benefits and disadvantages
Efficient for high dimensional data
Insensitive to noise/missing data
Directly selects important features for both numerical/nominal features
Not easily explained or interpreted
More difficult to tune
Biased for features with many levels
Neural networks benefits and disadvantages
Model complex patterns in data
Both classification and regression
No assumption of data relationships
Not easily interpreted
Prone to overfitting in training
Computationally intense to train
Centre based clustering k means benefits and disadvantages
Efficient (linear complexity)
Automatic and no cut off required
Depends on initial seeds
Often trapped in local minima
Inefficient for high dimensional data