Machine learning Flashcards
What are the variable types? Draw the diagram
             Discrete                Continuous
            /         \                   |
   Categorical        Numerical       Numerical
     /     \              |               |
Nominal   Ordinal     Interval          Ratio
Properties of attribute types
Property        Nominal  Ordinal  Interval  Ratio
Distinct        ✅       ✅       ✅        ✅
Order           ❌       ✅       ✅        ✅
Addition        ❌       ❌       ✅        ✅
Multiplication  ❌       ❌       ❌        ✅
What is global standardisation?
Feature scaling that gives every feature the same mean and standard deviation (SD), so that features with large values do not dominate features with small values
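A minimal sketch of standardisation, assuming the usual z-score form (subtract the mean, divide by the SD) so that each feature ends up with mean 0 and SD 1. The function name `standardise` and the `income` and `age` values are made up for illustration.

```python
from statistics import mean, pstdev

def standardise(column):
    """Rescale one feature column to mean 0 and SD 1 (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)  # population SD; assumed to be non-zero
    return [(x - mu) / sigma for x in column]

# Hypothetical features on very different scales
income = [20000, 35000, 50000, 65000, 80000]
age = [20, 30, 40, 50, 60]

income_z = standardise(income)
age_z = standardise(age)

# After scaling, both features live on the same scale
print(income_z)
print(age_z)
```

Before scaling, any distance-based model would be driven almost entirely by income. After scaling, both columns contribute on equal terms.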
What is dimensionality reduction?
A technique that reduces the number of features in a dataset while preserving the relevant information (e.g. PCA)
What is feature selection?
Selects the subset of the original features that are most relevant and removes the rest. The kept features are unchanged, and the dimensionality of the dataset is reduced
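A small sketch of feature selection. It assumes one simple relevance rule, dropping features whose variance is at or below a threshold. That rule is only one of many possible criteria and is not named in these cards. The function `select_by_variance` and the column names are hypothetical.

```python
from statistics import pvariance

def select_by_variance(dataset, threshold):
    """Keep only the features (columns) whose variance exceeds the threshold."""
    return {
        name: values
        for name, values in dataset.items()
        if pvariance(values) > threshold
    }

# Hypothetical dataset stored as column name -> values
data = {
    "height_cm": [150, 160, 170, 180],
    "constant_flag": [1, 1, 1, 1],  # zero variance, carries no information
    "weight_kg": [50, 65, 70, 90],
}

selected = select_by_variance(data, threshold=0.0)
print(list(selected))  # names of the features that were kept
```

The surviving columns are original features, not transformed ones, which is the difference from feature extraction.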
What is feature extraction?
Builds a reduced set of new features by transforming the original ones, which reduces the dimensionality of the dataset (e.g. PCA)
Benefits of feature selection
Simplifies the model, reduces the size of the dataset, can improve model accuracy, makes training more efficient and makes the model easier to interpret
Most important things to check for in data cleaning
Outliers, missing values and duplicates
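A rough sketch of the three checks on a tiny made-up dataset. The outlier rule used here (more than 5 median absolute deviations from the median) is an assumption chosen for illustration, and the records are hypothetical.

```python
from statistics import median

# Hypothetical records: (id, reading)
rows = [
    ("a", 12.0), ("b", 11.5), ("c", None), ("d", 12.3),
    ("b", 11.5), ("e", 95.0), ("f", 12.1), ("g", 11.9),
]

# Missing values: drop records without a reading
complete = [r for r in rows if r[1] is not None]

# Duplicates: dict.fromkeys keeps the first copy of each identical record
deduped = list(dict.fromkeys(complete))

# Outliers: flag readings far from the median (assumed rule: > 5 * MAD)
values = [v for _, v in deduped]
med = median(values)
mad = median([abs(v - med) for v in values])
outliers = [r for r in deduped if abs(r[1] - med) > 5 * mad]

print(len(rows), len(complete), len(deduped), outliers)
```

Whether a flagged outlier should be removed or kept depends on the data, so the sketch only reports it.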
Benefits of sampling
It allows quicker analysis when analysing the whole dataset is not feasible
Types of sampling
Subsampling, sampling, resampling, random sampling and stratified random sampling
What is subsampling?
Used for data reduction by selecting a subset of the original dataset
What is sampling?
Splitting the dataset to create the training and testing datasets
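One way to sketch the creation of training and testing sets is to shuffle the rows and cut them at a chosen fraction. The function name `train_test_split`, the 80/20 split and the fixed seed are illustrative assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then split it into (train, test)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # hypothetical dataset of 10 rows
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, so the model is never tested on rows it was trained on.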
What is resampling?
Repeatedly drawing samples from the data to estimate the characteristics of the whole dataset. Used to estimate and reduce bias (e.g. bootstrapping)
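A minimal bootstrapping sketch: resample the data with replacement many times and look at how the mean varies across resamples. The sample values, the 1000 resamples and the percentile interval are assumptions made for illustration.

```python
import random
from statistics import mean

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the mean of each resample drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

sample = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical observations
means = sorted(bootstrap_means(sample))

# Approximate 95% percentile interval for the mean
lower, upper = means[25], means[974]
print(lower, upper)
```

The spread of the resampled means gives an estimate of how uncertain the sample mean is, without collecting any new data.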
What is random sampling?
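Every item has an equal chance of being picked. Two variants:
Without replacement: each item can be picked at most once (picking balls out of a bag and keeping them out)
With replacement: an item can be picked more than once (picking balls but putting them back)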
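The two variants map onto two standard-library calls, `random.sample` (without replacement) and `random.choices` (with replacement). The bag contents and seed below are made up.

```python
import random

rng = random.Random(0)
bag = ["red", "blue", "green", "yellow"]

# Without replacement: each ball can appear at most once
without_replacement = rng.sample(bag, k=4)

# With replacement: the ball goes back in the bag, so repeats are possible
with_replacement = rng.choices(bag, k=10)

print(without_replacement)
print(with_replacement)
```

Drawing 10 balls is only possible with replacement here, because the bag holds just 4.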
What is stratified random sampling?
The dataset is divided into groups (strata) based on a relevant feature, and random samples are taken from each group so that every group is represented. Used to improve model performance
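A sketch of stratified random sampling, assuming the strata are defined by a class label and that the same fraction is drawn from each stratum. The function `stratified_sample`, the labels and the 10% fraction are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)  # group rows by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical imbalanced dataset: 20 "spam" rows and 80 "ham" rows
rows = [("spam", i) for i in range(20)] + [("ham", i) for i in range(80)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.1)

n_spam = sum(1 for label, _ in sample if label == "spam")
n_ham = sum(1 for label, _ in sample if label == "ham")
print(n_spam, n_ham)
```

The sample keeps the 20:80 class ratio of the full dataset, which plain random sampling does not guarantee.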