[1] Machine Learning Fundamentals Flashcards
What are the stages of the machine learning lifecycle?
(1) Process data
(2) Split the data
(3) Training
(4) Test
How does the ‘process data’ stage of the ML lifecycle work?
Data is put into a machine-readable format and undergoes feature engineering and/or dimensionality reduction
How does the ‘split data’ stage of the ML lifecycle work?
Data is separated into the training data to train the weights, validation data to guide the training process, and testing data to evaluate the model
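A minimal scikit-learn sketch of a three-way split (the 60/20/20 ratio, the synthetic dataset, and the variable names are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 20% of the data as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=42)

# Split the remaining 80% into train and validation (60/20/20 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, shuffle=True, random_state=42)
```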
How does the ‘training’ stage of the ML lifecycle work?
The training data is used directly to train the model parameters, guided by the validation data
How does the ‘test’ stage of the ML lifecycle work?
The test data is used to evaluate how well the model is likely to perform in the real world
What kinds of summary statistics are considered during EDA?
Overall statistics - these describe the overall dataset e.g. how many instances and features
Attribute statistics - describe individual features, e.g. their average
Multivariate statistics - describe relationships between features
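A short pandas sketch of how these three levels of statistics might be inspected (the toy columns are made up for illustration):

```python
import pandas as pd

# Illustrative toy dataset
df = pd.DataFrame({
    "height_cm": [170, 182, 165, 174],
    "weight_kg": [68, 85, 59, 72],
    "age":       [34, 41, 29, 38],
})

print(df.shape)       # overall statistics: number of instances and features
print(df.describe())  # attribute statistics: mean, std, min, max per feature
print(df.corr())      # multivariate statistics: pairwise correlations
```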
What is the difference between semantic segmentation and instance segmentation?
Semantic segmentation classifies each pixel by class, while instance segmentation finds distinct objects of that class as separate pixel groups
Why is unsupervised learning useful for finding relationships within the data?
It doesn’t require knowing the classes in the dataset up-front
What is the purpose of regularisation?
It makes the model less sensitive to the training data, helping it avoid overfitting and handle outliers better
What are the key kinds of regularisation?
L1 regularisation (Lasso regression) penalises the sum of the absolute coefficient values, tending to drive some coefficients to exactly zero
L2 regularisation (Ridge regression) penalises the sum of the squared coefficients, shrinking them smoothly towards zero
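A minimal scikit-learn sketch contrasting the two (the alpha value and synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Illustrative data: 5 features, only the first two actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: tends to zero out irrelevant coefficients
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: shrinks all coefficients towards zero

print(lasso.coef_)  # sparse - irrelevant features driven to (near) zero
print(ridge.coef_)  # small but generally non-zero coefficients
```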
What does ‘stochastic batch learning’ refer to?
Using only 1 sample in each batch
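A minimal numpy sketch of stochastic updates for a one-feature linear model (the learning rate, epoch count, and synthetic data are illustrative assumptions):

```python
import numpy as np

# Illustrative data for y ~ 2x + 1 with a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

w, b, lr = 0.0, 0.0, 0.01
for epoch in range(20):
    for i in rng.permutation(len(x)):  # one sample per weight update = stochastic
        err = (w * x[i] + b) - y[i]
        w -= lr * err * x[i]           # gradient of squared error w.r.t. w
        b -= lr * err                  # gradient of squared error w.r.t. b

print(w, b)  # should approach 2.0 and 1.0
```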
What is cross-validation?
The data is split into folds and each fold takes a turn as the validation set while the rest is used for training, so no data is permanently withheld from the training phase
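A minimal scikit-learn sketch of 5-fold cross-validation (the dataset and model are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds: each fold takes one turn as the validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```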
How should features be selected?
- Use domain knowledge to drop irrelevant information
- Drop features with low correlation to the response (but be careful of correlations)
- Drop features with very low or very high variance (a short sketch of these checks follows this list)
- Drop features with lots of missing values or errors, unless this is relevant
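A short sketch of the variance and correlation checks, assuming an illustrative toy DataFrame (the column names and threshold are made up):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Illustrative data with one near-constant feature
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "useful":   rng.normal(size=100),
    "constant": np.full(100, 1.0),   # (near) zero variance
    "noisy":    rng.normal(size=100),
})
y = 2 * df["useful"] + rng.normal(scale=0.1, size=100)

# Drop features whose variance falls below a threshold
selector = VarianceThreshold(threshold=1e-3)
selector.fit(df)
print(selector.get_support())        # [ True False  True]

# Check each feature's correlation with the response
print(df.corrwith(y))
```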
What steps are there to feature engineering?
- Simplify features, e.g. derive BMI from height and weight
- Scale the data to a common range, e.g. [0, 1]
- Transform the features to suit the problem, e.g. converting timestamps to time of day (a short sketch of these steps follows this list)
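A short pandas sketch of these steps (the toy data and the min-max scaler choice are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative raw features
df = pd.DataFrame({
    "height_cm": [170, 182, 165],
    "weight_kg": [68, 85, 59],
    "timestamp": pd.to_datetime(["2024-01-01 08:30",
                                 "2024-01-01 13:05",
                                 "2024-01-02 22:45"]),
})

# Simplify: combine height and weight into BMI
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Transform: convert timestamps into time of day
df["hour_of_day"] = df["timestamp"].dt.hour

# Scale numeric features to [0, 1]
df[["bmi", "hour_of_day"]] = MinMaxScaler().fit_transform(df[["bmi", "hour_of_day"]])
print(df)
```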
How can unbalanced data be addressed?
- Source more data
- Oversample the minority class or weight it more strongly (see the sketch after this list)
- Synthesise new data - consider what can be varied without changing the label
- Try different algorithms - some are less susceptible to class imbalance than others
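A minimal sketch of random oversampling using scikit-learn's resample utility (the 95/5 class split is an illustrative assumption); weighting the minority class more strongly could instead be done with a model's class_weight option:

```python
import numpy as np
from sklearn.utils import resample

# Illustrative imbalanced dataset: 95 negatives, 5 positives
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

# Oversample the minority class (with replacement) until the classes are balanced
X_min_up, y_min_up = resample(X[y == 1], y[y == 1],
                              replace=True, n_samples=95, random_state=0)

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_balanced))  # [95 95]
```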
What is it important to always do before splitting?
Shuffle the data, so that any ordering or clumping in the source data does not bias the splits
How should categorical features be encoded?
Label encoding with a look-up table if they are ordinal, otherwise one-hot encoding
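A minimal pandas sketch of both encodings (the feature names and look-up table are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "size":   ["small", "large", "medium"],  # ordinal: has a natural order
    "colour": ["red", "green", "blue"],      # nominal: no natural order
})

# Label encoding with an explicit look-up table for the ordinal feature
size_lookup = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_lookup)

# One-hot encoding for the nominal feature
df = pd.concat([df, pd.get_dummies(df["colour"], prefix="colour")], axis=1)
print(df)
```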
How is dimensionality reduction performed?
PCA or t-distributed stochastic neighbour embedding (t-SNE)
Note: clustering does NOT help - it groups instances rather than reducing the number of features
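A minimal scikit-learn sketch of both techniques (the digits dataset and two output dimensions are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional digit images

# Linear reduction: keep the two directions of greatest variance
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear reduction: preserves local neighbourhood structure
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(X.shape, X_pca.shape, X_tsne.shape)  # (1797, 64) (1797, 2) (1797, 2)
```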
What are the steps for performing PCA?
(1) Find the centroid
(2) Draw the minimum bounding box around the data (in general its sides will not be parallel to the original axes)
(3) Take the direction of the box's longest side as the direction of greatest variance (PC1), the next longest orthogonal direction as PC2, and so on (the formal computation is sketched below)
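Formally, the principal components are the eigenvectors of the data's covariance matrix, ordered by eigenvalue (the variance along each direction). A minimal numpy sketch on illustrative synthetic data:

```python
import numpy as np

# Illustrative data with very different spreads along each axis
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

# (1) Centre the data on its centroid
X_centred = X - X.mean(axis=0)

# (2) Covariance matrix of the centred data
cov = np.cov(X_centred, rowvar=False)

# (3) Eigenvectors ordered by decreasing eigenvalue = PC1, PC2, ...
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]  # PC1 is components[:, 0], and so on

print(eigvals[order])           # variance captured by each component
```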