Machine Learning Flashcards

1
Q

What are the variable types? Draw the diagram

A

Discrete
├── Categorical
│   ├── Nominal
│   └── Ordinal
└── Numerical
    └── Interval

Continuous
└── Numerical
    └── Ratio

2
Q

Properties of attribute types

A

Property         Nominal  Ordinal  Interval  Ratio
Distinct         ✅       ✅       ✅        ✅
Order            ❌       ✅       ✅        ✅
Addition         ❌       ❌       ✅        ✅
Multiplication   ❌       ❌       ❌        ✅

3
Q

What is global standardisation?

A

Feature scaling that transforms every feature to the same mean and standard deviation, so that features with larger values do not dominate features with smaller values.
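A minimal sketch of global standardisation (z-scoring) with NumPy; the values in X are made up for illustration:

```python
import numpy as np

# Toy feature matrix: one small-valued feature, one large-valued feature (made-up values)
X = np.array([[1.2, 5000.0],
              [0.8, 7000.0],
              [1.5, 6500.0],
              [1.0, 8000.0]])

# Global standardisation: subtract the column mean and divide by the column SD,
# so every feature ends up with mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 for each feature
print(X_std.std(axis=0))   # ~1 for each feature
```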

4
Q

What is dimensionality reduction?

A

A technique to reduce the number of features in a dataset while preserving the relevant information (e.g. PCA).
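A minimal sketch, assuming scikit-learn is available; the random data and the choice of 3 components are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 samples, 10 original features (made up)

# Keep only 3 principal components: same samples, fewer (transformed) features
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```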

5
Q

What is feature selection?

A

Selects a subset of the original features that are most relevant, effectively removing the rest. Reduces the dimensionality of the dataset.
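A minimal sketch of filter-style feature selection, assuming scikit-learn is available; the synthetic dataset and the choice of k=4 are made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 20 features, only a few of which are informative
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=0)

# Keep the 4 original features that score highest on an ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (200, 4)
print(selector.get_support(indices=True))  # indices of the kept original features
```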

6
Q

What is feature extraction?

A

Derives a reduced set of transformed features from the original features, which reduces the dimensionality of the dataset (e.g. PCA).

7
Q

Benefits of feature selection

A

Simplifies the model, reduces the size of the dataset, improves model accuracy, makes training more efficient and makes the model easier to interpret.

8
Q

Most important things to check for in data cleaning

A

Outliers, missing values and duplicates

9
Q

Benefits of sampling

A

It allows for quicker analysis when analysing the whole dataset is not feasible.

10
Q

Types of sampling

A

Subsampling, sampling, resampling, random sampling and stratified random sampling

11
Q

What is subsampling?

A

Used for data reduction by selecting a subset of the original dataset.

12
Q

What is sampling?

A

Creation of training and testing datasets

13
Q

What is resampling?

A

Repeatedly drawing samples to estimate the characteristics of the whole dataset. Used for bias removal (bootstrapping)

14
Q

What is random sampling?

A

Without replacement: balls are picked out of the bag and not returned, so each item can be selected at most once.
With replacement: each ball is put back after it is picked, so the same item can be selected more than once.
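A minimal sketch with NumPy; the "bag" of ten items is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
bag = np.arange(10)  # ten "balls" numbered 0-9

# Without replacement: each ball can be drawn at most once
without = rng.choice(bag, size=5, replace=False)

# With replacement: a ball is put back after each draw, so repeats are possible
with_repl = rng.choice(bag, size=5, replace=True)

print(without)    # five distinct values
print(with_repl)  # may contain duplicates
```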

15
Q

What is stratified random sampling?

A

Random samples are taken from each subgroup (stratum) defined by a relevant feature, so the sample keeps the same proportions as the whole dataset. Used to help model performance, especially with imbalanced classes.
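A minimal sketch, assuming scikit-learn is available, using train_test_split with stratify so the random sample is drawn per class; the imbalanced synthetic data is made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced synthetic dataset: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y samples each class (stratum) separately,
# so the 90/10 class ratio is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(np.bincount(y_train) / len(y_train))  # ~[0.9, 0.1]
print(np.bincount(y_test) / len(y_test))    # ~[0.9, 0.1]
```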

16
Q

What is the holdout method?

A

Splitting the dataset into two parts: one part is held out for testing and the other part is used for training (train/test split).
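A minimal sketch of a holdout split with NumPy; the data and the 70/30 ratio are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # made-up feature matrix
y = rng.integers(0, 2, size=100)     # made-up labels

# Shuffle the row indices, then hold out 30% of the rows for testing
idx = rng.permutation(len(X))
n_test = int(0.3 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))  # 70 30
```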

17
Q

What is k-fold cross-validation?

A

Evaluates model performance by dividing the dataset into k folds. Each fold is held out once as the test set while the remaining folds are used for training, so the process is repeated k times with a different fold each time.
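A minimal sketch, assuming scikit-learn is available; the Iris dataset, the k-NN model and cv=5 are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 folds, each fold is used
# once as the test set while the other 4 folds are used for training
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average performance across the 5 folds
```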

18
Q

What is bootstrapping?

A

Creates many repeated samples of the dataset using random sampling with replacement, generally > 1000 repetitions. Used to reduce bias and make estimates from the dataset more robust.
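A minimal sketch of bootstrapping the mean with NumPy; the sample and the 1000 repetitions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # made-up sample

# Bootstrap: draw many resamples of the same size *with replacement*
# and recompute the statistic of interest on each one
n_boot = 1000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_boot)
])

print(boot_means.mean())                       # bootstrap estimate of the mean
print(np.percentile(boot_means, [2.5, 97.5]))  # rough 95% confidence interval
```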

19
Q

Types of model performance evaluation

A

Regression, classification and binary classification

20
Q

How do you do regression performance evaluation?

A

Mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE)
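A minimal sketch, assuming scikit-learn is available; y_true and y_pred are made-up values for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up actual targets
y_pred = np.array([2.8, 5.4, 2.0, 6.5])   # made-up predictions

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # back in the units of the target
mae = mean_absolute_error(y_true, y_pred)  # average absolute error

print(mse, rmse, mae)
```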

21
Q

How do you do classification performance evaluation?

A

Comparing actual class values with predicted class values, the confidence of each prediction, and the confusion matrix
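A minimal sketch of accuracy and a confusion matrix, assuming scikit-learn is available; the labels are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up actual vs predicted class labels
y_true = ["cat", "cat", "dog", "dog", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat"]

print(accuracy_score(y_true, y_pred))           # fraction predicted correctly
print(confusion_matrix(y_true, y_pred,
                       labels=["cat", "dog"]))  # rows = actual, cols = predicted
```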

22
Q

How do you do binary classification performance evaluation?

A

The class of interest is treated as the positive class and instances of all other classes as negative, so each prediction can be counted as a true/false positive or negative.

23
Q

How to evaluate model performance when there’s a class imbalance for both regression and classification?

A

Kappa statistic
ROC curve

24
Q

What does the kappa statistic do?

A

It adjusts accuracy to account for correct predictions made by chance, which partially mitigates the effects of class imbalance.
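A minimal sketch contrasting accuracy with Cohen's kappa on imbalanced made-up labels, assuming scikit-learn is available:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Imbalanced made-up labels: the "classifier" just predicts the majority class
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))     # 0.90 despite learning nothing
print(cohen_kappa_score(y_true, y_pred))  # 0.0 once chance agreement is removed
```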

25
Q

K-nearest neighbour benefits and disadvantages

A

Benefits:
Simple and easily explained and interpreted
Fast in training
Both classification and regression

Disadvantages:
Doesn't produce a model
Slow in classification
Difficult to select k

26
Q

SVM benefits and disadvantages

A

Benefits:
High accuracy in prediction
Compact model representation
Not prone to overfitting

Disadvantages:
Difficult to select kernel/parameters
Slow to train for large datasets
Not easily explained

27
Q

Decision trees benefits and disadvantages

A

Benefits:
Human-readable structural pattern
Transparent model for numeric and nominal features
Easily explained and interpreted

Disadvantages:
Inefficient for high-dimensional data
Tendency to overfit
Sensitive to small perturbations

28
Q

Random forests benefits and disadvantages

A

Benefits:
Efficient for high-dimensional data
Insensitive to noise/missing data
Directly selects important features, for both numerical and nominal features

Disadvantages:
Not easily explained or interpreted
More difficult to tune
Biased towards features with many levels

29
Q

Neural networks benefits and disadvantages

A

Benefits:
Model complex patterns in data
Both classification and regression
No assumption of data relationships

Disadvantages:
Not easily interpreted
Prone to overfitting in training
Computationally intense to train

30
Q

Centre-based clustering (k-means) benefits and disadvantages

A

Benefits:
Efficient (linear complexity)
Automatic and no cut-off required

Disadvantages:
Depends on initial seeds
Often trapped in local minima
Inefficient for high-dimensional data