Machine Learning Flashcards

1
Q

What are the variable types? Draw the diagram

A

Discrete
├── Categorical
│   ├── Nominal
│   └── Ordinal
└── Numerical
    └── Interval

Continuous
└── Numerical
    └── Ratio

2
Q

Properties of attribute types

A

Property         Nominal  Ordinal  Interval  Ratio
Distinct         ✅       ✅       ✅        ✅
Order            ❌       ✅       ✅        ✅
Addition         ❌       ❌       ✅        ✅
Multiplication   ❌       ❌       ❌        ✅

3
Q

What is global standardisation?

A

Feature scaling that transforms every feature to the same mean and standard deviation, so that features with larger values do not dominate features with smaller values.
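A minimal sketch of global standardisation (z-scoring) with NumPy; the values in X are made up for illustration:

```python
import numpy as np

# Toy feature matrix: one small-valued feature, one large-valued feature (made-up values)
X = np.array([[1.2, 5000.0],
              [0.8, 7000.0],
              [1.5, 6500.0],
              [1.0, 8000.0]])

# Global standardisation: subtract the column mean and divide by the column SD,
# so every feature ends up with mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 for each feature
print(X_std.std(axis=0))   # ~1 for each feature
```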

4
Q

What is dimensionality reduction?

A

A technique to reduce the number of features in a dataset while preserving the relevant information (e.g. PCA).
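A minimal sketch, assuming scikit-learn is available; the random data and the choice of 3 components are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 samples, 10 original features (made up)

# Keep only 3 principal components: same samples, fewer (transformed) features
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```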

5
Q

What is feature selection?

A

Selects a subset of the original features that are most relevant, effectively removing the rest. Reduces the dimensionality of the dataset.
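A minimal sketch of filter-style feature selection, assuming scikit-learn is available; the synthetic dataset and the choice of k=4 are made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 20 features, only a few of which are informative
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=0)

# Keep the 4 original features that score highest on an ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (200, 4)
print(selector.get_support(indices=True))  # indices of the kept original features
```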

6
Q

What is feature extraction?

A

Derives a reduced set of transformed features from the original features, which reduces the dimensionality of the dataset (e.g. PCA).

7
Q

Benefits of feature selection

A

Simplifies the model, reduces the size of the dataset, improves model accuracy, makes training more efficient and makes the model easier to interpret.

8
Q

Most important things to check for in data cleaning

A

Outliers, missing values and duplicates

9
Q

Benefits of sampling

A

It allows for quicker analysis when analysing the whole dataset is not feasible.

10
Q

Types of sampling

A

Subsampling, sampling, resampling, random sampling and stratified random sampling

11
Q

What is subsampling?

A

Used for data reduction by selecting a subset of the original dataset.

12
Q

What is sampling?

A

Creation of training and testing datasets

13
Q

What is resampling?

A

Repeatedly drawing samples to estimate the characteristics of the whole dataset. Used for bias removal (bootstrapping)

14
Q

What is random sampling?

A

Without replacement: balls are picked out of the bag and not returned, so each item can be selected at most once.
With replacement: each ball is put back after it is picked, so the same item can be selected more than once.
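A minimal sketch with NumPy; the "bag" of ten items is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
bag = np.arange(10)  # ten "balls" numbered 0-9

# Without replacement: each ball can be drawn at most once
without = rng.choice(bag, size=5, replace=False)

# With replacement: a ball is put back after each draw, so repeats are possible
with_repl = rng.choice(bag, size=5, replace=True)

print(without)    # five distinct values
print(with_repl)  # may contain duplicates
```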

15
Q

What is stratified random sampling?

A

Random samples are taken from each subgroup (stratum) defined by a relevant feature, so the sample keeps the same proportions as the whole dataset. Used to help model performance, especially with imbalanced classes.
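A minimal sketch, assuming scikit-learn is available, using train_test_split with stratify so the random sample is drawn per class; the imbalanced synthetic data is made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced synthetic dataset: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y samples each class (stratum) separately,
# so the 90/10 class ratio is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(np.bincount(y_train) / len(y_train))  # ~[0.9, 0.1]
print(np.bincount(y_test) / len(y_test))    # ~[0.9, 0.1]
```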

16
Q

What is the holdout method?

A

Splitting the dataset into two parts: one part is held out for testing and the other part is used for training (train/test split).
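A minimal sketch of a holdout split with NumPy; the data and the 70/30 ratio are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # made-up feature matrix
y = rng.integers(0, 2, size=100)     # made-up labels

# Shuffle the row indices, then hold out 30% of the rows for testing
idx = rng.permutation(len(X))
n_test = int(0.3 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))  # 70 30
```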

17
Q

What is k-fold cross-validation?

A

Evaluates model performance by dividing the dataset into k folds. Each fold is held out once as the test set while the remaining folds are used for training, so the process is repeated k times with a different fold each time.
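A minimal sketch, assuming scikit-learn is available; the Iris dataset, the k-NN model and cv=5 are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 folds, each fold is used
# once as the test set while the other 4 folds are used for training
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average performance across the 5 folds
```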

18
Q

What is bootstrapping?

A

Creates many repeated samples of the dataset using random sampling with replacement, generally > 1000 repetitions. Used to reduce bias and make estimates from the dataset more robust.
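A minimal sketch of bootstrapping the mean with NumPy; the sample and the 1000 repetitions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # made-up sample

# Bootstrap: draw many resamples of the same size *with replacement*
# and recompute the statistic of interest on each one
n_boot = 1000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_boot)
])

print(boot_means.mean())                       # bootstrap estimate of the mean
print(np.percentile(boot_means, [2.5, 97.5]))  # rough 95% confidence interval
```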

19
Q

Types of model performance evaluation

A

Regression, classification and binary classification

20
Q

How do you do regression performance evaluation?

A

Mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE)
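A minimal sketch, assuming scikit-learn is available; y_true and y_pred are made-up values for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up actual targets
y_pred = np.array([2.8, 5.4, 2.0, 6.5])   # made-up predictions

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # back in the units of the target
mae = mean_absolute_error(y_true, y_pred)  # average absolute error

print(mse, rmse, mae)
```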

21
Q

How do you do classification performance evaluation?

A

Comparing actual class values with predicted class values, the confidence of each prediction, and the confusion matrix
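A minimal sketch of accuracy and a confusion matrix, assuming scikit-learn is available; the labels are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up actual vs predicted class labels
y_true = ["cat", "cat", "dog", "dog", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat"]

print(accuracy_score(y_true, y_pred))           # fraction predicted correctly
print(confusion_matrix(y_true, y_pred,
                       labels=["cat", "dog"]))  # rows = actual, cols = predicted
```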

22
Q

How do you do binary classification performance evaluation?

A

The class of interest is treated as the positive class and instances of all other classes as negative, so each prediction can be counted as a true/false positive or negative.

23
Q

How to evaluate model performance when there’s a class imbalance for both regression and classification?

A

Kappa statistic
ROC curve

24
Q

What does the kappa statistic do?

A

It adjusts accuracy to account for correct predictions made by chance, which partially mitigates the effects of class imbalance.
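A minimal sketch contrasting accuracy with Cohen's kappa on imbalanced made-up labels, assuming scikit-learn is available:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Imbalanced made-up labels: the "classifier" just predicts the majority class
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))     # 0.90 despite learning nothing
print(cohen_kappa_score(y_true, y_pred))  # 0.0 once chance agreement is removed
```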

25
Q

K-nearest neighbour benefits and disadvantages

A

Benefits:
Simple and easily explained and interpreted
Fast in training
Both classification and regression

Disadvantages:
Doesn't produce a model
Slow in classification
Difficult to select k

26
Q

SVM benefits and disadvantages

A

Benefits:
High accuracy in prediction
Compact model representation
Not prone to overfitting

Disadvantages:
Difficult to select kernel/parameters
Slow to train for large datasets
Not easily explained

27
Q

Decision trees benefits and disadvantages

A

Benefits:
Human-readable structural pattern
Transparent model for numeric and nominal features
Easily explained and interpreted

Disadvantages:
Inefficient for high-dimensional data
Tendency to overfit
Sensitive to small perturbations

28
Q

Random forests benefits and disadvantages

A

Benefits:
Efficient for high-dimensional data
Insensitive to noise/missing data
Directly selects important features, for both numerical and nominal features

Disadvantages:
Not easily explained or interpreted
More difficult to tune
Biased towards features with many levels

29
Q

Neural networks benefits and disadvantages

A

Benefits:
Model complex patterns in data
Both classification and regression
No assumption of data relationships

Disadvantages:
Not easily interpreted
Prone to overfitting in training
Computationally intense to train

30
Q

Centre-based clustering (k-means) benefits and disadvantages

A

Benefits:
Efficient (linear complexity)
Automatic and no cut-off required

Disadvantages:
Depends on initial seeds
Often trapped in local minima
Inefficient for high-dimensional data