Python ML Principles Flashcards
Learn the main steps and sub-steps of ML.
What are the four main steps in ML?
Visualisation
Cleaning and Transformation
Construction of ML model
Evaluation of ML model
What are the two main sub-steps of the Cleaning and Transformation step?
1) Data Preparation & Cleaning
2) Feature Engineering
What should you do before starting Preparation & Cleaning?
Explore the data to understand the issues that are present.
What are six sub-steps of the Data Preparation & Cleaning step?
- Recode character strings to eliminate unrecognised characters
- Find & treat missing values
- Set the correct data type for each column
- Transform categorical features so that each category has more cases
- Apply transformation to numerics to improve distributions
- Duplicate management
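A minimal sketch of three of these sub-steps (data types, missing values, duplicates), assuming pandas is available. The column names and the "?" placeholder for a missing value are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: prices stored as strings, "?" marks a missing value.
raw = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "bmw"],
    "price": ["100", "?", "100", "300"],
})

clean = raw.copy()
# Set the correct data type; anything unparseable becomes NaN.
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")
# Treat missing values (here: fill with the column median).
clean["price"] = clean["price"].fillna(clean["price"].median())
# Duplicate management: drop exact duplicate rows.
clean = clean.drop_duplicates().reset_index(drop=True)
```

Filling with the median is only one possible treatment; dropping the affected rows is another.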
What’s another name for “transformation to improve distribution”?
Feature engineering
Name a common transformation.
Log
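A small sketch of a log transformation with NumPy. The values are made up; they span several orders of magnitude, which is the kind of skew a log can tame.

```python
import numpy as np

# Made-up right-skewed values spanning several orders of magnitude.
values = np.array([1.0, 10.0, 100.0, 1000.0])

# After a base-10 log the values are evenly spaced.
logged = np.log10(values)
```

For data that contains zeros, np.log1p (log of 1 + x) is a common alternative.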
What is the main thing we’re trying to achieve with Feature Engineering?
We’re trying to create features that give distinct separation of the labelled cases, which indicates better prediction.
What is a test used to evaluate linear regression accuracy?
Sum of squared errors.
Why is linear regression sometimes called Least Squares Regression?
Because it fits the line that minimises the sum of squared errors (residuals) between the observed values and the line.
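A rough illustration of the least squares idea, using made-up data. The helper `sse` is hypothetical; `np.polyfit` with `deg=1` returns the least-squares slope and intercept, so nudging either one should not lower the sum of squared errors.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])  # made-up data, roughly y = 2x + 1

def sse(slope, intercept):
    """Sum of squared errors of the line y = slope * x + intercept."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

# Least-squares fit of a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
best = sse(slope, intercept)
```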
With linear regression in scikit-learn, what Python package should you use for your arrays?
NumPy
What are the 4 main steps for linear regression with scikit-learn?
Lay out NumPy arrays
Scale
Specify model object
Fit
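The four steps might look like this in scikit-learn. This is a sketch on made-up data, assuming `StandardScaler` for the scaling step; the flashcard does not say which scaler to use.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# 1) Lay out NumPy arrays: features are 2-D (rows x columns), label is 1-D.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # made-up data, exactly y = 2x + 1

# 2) Scale the features to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3) Specify the model object.
model = LinearRegression()

# 4) Fit, then predict on the scaled features.
model.fit(X_scaled, y)
predictions = model.predict(X_scaled)
```

With a train/test split, the scaler would normally be fitted on the training data only and then applied to the test data.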
What visualisation could you use to evaluate residuals of a regression model?
A histogram of residuals.
What two residuals histogram patterns indicate an accurate model?
1) Clustering of residuals around zero.
2) Normal distribution.
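A sketch of the numbers behind a residuals histogram, using made-up residuals. `np.histogram` only computes the bin counts; a plotting library would draw them.

```python
import numpy as np

# Made-up residuals (actual minus predicted) from a hypothetical model.
residuals = np.array([-2, -1, -1, 0, 0, 0, 0, 1, 1, 2], dtype=float)

# Five unit-wide bins centred on -2, -1, 0, 1, 2.
counts, edges = np.histogram(residuals, bins=5, range=(-2.5, 2.5))

# Index of the tallest bin; for a good model it should contain zero.
peak_bin = int(np.argmax(counts))
```

Here the counts rise to a single peak at zero and fall away symmetrically, which is the pattern the card describes.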
If you see a multimodal residuals histogram what should you do about the non-zero modes?
Investigate what’s creating them and consider adding these features to your model.
In scikit-learn, what is one-hot encoding?
Conversion of a categorical feature into a NumPy array with one column per category, where each record’s row shows a “1” only in the column for its category and “0” elsewhere.
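A short sketch with scikit-learn's `OneHotEncoder` on a made-up colour feature. The encoder returns a sparse matrix by default, so `toarray()` is used to get a dense NumPy array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical feature with three possible values.
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(colours).toarray()

# One column per category, in sorted order.
categories = encoder.categories_[0].tolist()
```

Each row of `encoded` contains exactly one 1, in the column for that record's category.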
In classification ML, what is the dividing line between the two outcomes called?
The decision boundary.
The model’s score is zero on the boundary; negative scores fall on one side (No) and positive scores on the other (Yes).
What is a loss function?
A function used in classification that sets how heavily incorrect labels are penalised.
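As an illustration, here is log loss for a single case. Log loss is one common choice and is an assumption here; the card does not name a specific loss function.

```python
import math

def log_loss(y_true, p_positive):
    """Log loss for one case: y_true is 0 or 1, p_positive is the predicted P(y = 1)."""
    p_correct = p_positive if y_true == 1 else 1.0 - p_positive
    return -math.log(p_correct)

# The true label is 1 in both cases; only the predicted probability differs.
confident_right = log_loss(1, 0.9)
confident_wrong = log_loss(1, 0.1)
```

A confident wrong prediction is penalised far more heavily than a confident right one.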
Name four ways to check if your model is CRAP.
Confusion matrix (TP,FP,TN,FN)
ROC curves
Accuracy/misclassification error
Precision/Recall/F1
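A plain-Python sketch of the confusion matrix counts and the metrics derived from them, on made-up labels.

```python
# Made-up true and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
```

Misclassification error is simply 1 minus the accuracy.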
What is on the y-axis and what on the x-axis of an ROC curve?
y-axis = TPR (true positive rate); x-axis = FPR (false positive rate)
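A sketch of how the points on an ROC curve come about: sweep a threshold over the classifier scores and record FPR and TPR at each setting. The helper `roc_point`, the labels and the scores are all made up for illustration.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are labelled positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Made-up labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Sweep the threshold from strict to lenient to trace the curve.
curve = [roc_point(y_true, scores, t) for t in (1.0, 0.5, 0.3, 0.0)]
```

The curve runs from (0, 0), where nothing is labelled positive, to (1, 1), where everything is.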
Name 4 techniques to address imbalanced data.
1) Undersample majority
2) Oversample minority
3) Case weights
4) Impute new minority cases
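A minimal sketch of one of these techniques, oversampling the minority, using NumPy on made-up labels. The feature column is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up imbalanced labels: 8 majority cases (0) and 2 minority cases (1).
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X = np.arange(10).reshape(-1, 1)  # placeholder feature column

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Oversample the minority: draw with replacement until the classes match.
resampled_minority = rng.choice(minority_idx, size=len(majority_idx), replace=True)
balanced_idx = np.concatenate([majority_idx, resampled_minority])

X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]
```

Resampling would normally be applied to the training data only, so that the test data keeps its real class balance.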
Name one method of dimensionality reduction.
Principal component analysis.
What data does Principal Component Analysis produce on the dataset features?
The share of the variance in the features explained by each principal component. The label is not used.
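A sketch with scikit-learn's `PCA` on made-up features. Note that only the feature matrix is passed to `fit`; `explained_variance_ratio_` then reports the share of feature variance captured by each component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up features: the second column is nearly a copy of the first,
# the third is independent noise.
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

pca = PCA(n_components=3)
pca.fit(X)

# Share of feature variance per component, largest first.
ratios = pca.explained_variance_ratio_
```

Because two of the three columns are almost redundant, the last component explains very little and could be dropped.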
What is the goal of regularisation?
Prevent overfitting of ML models.
What is the tradeoff dilemma in Regularisation?
Bias vs Variance
Regularisation reduces variance, but can introduce what?
Bias
A model that has high variance and low bias is what?
Overfit
The diagonal line on a Q-Q plot of residuals represents what?
A perfect Normal distribution. Residuals that follow the line are approximately Normal.
What’s another name for L2 regularisation?
Ridge Regression
What does L2 regularisation do to the coefficients?
Constrains them, by shrinking coefficients towards zero (usually without making them exactly zero).
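A sketch comparing ordinary least squares with scikit-learn's `Ridge` on made-up data. The `alpha=10.0` penalty strength is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Made-up data: 30 cases, 5 features, label depends on the first two.
X = rng.normal(size=(30, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=30)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the penalty strength

# Overall size of each coefficient vector.
ols_norm = float(np.linalg.norm(ols.coef_))
ridge_norm = float(np.linalg.norm(ridge.coef_))
```

The ridge coefficient vector is smaller overall than the unregularised one, and a larger `alpha` shrinks it further.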
What are two other names for L1 regularisation?
Lasso method.
Manhattan norm.
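For contrast with L2, a sketch of scikit-learn's `Lasso` on made-up data where only the first feature matters. The L1 penalty can set the coefficient of an unhelpful feature to exactly zero; `alpha=0.5` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Made-up data: the label depends only on the first feature.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)

# First coefficient is shrunk but kept; second is removed entirely.
coefficients = lasso.coef_
```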
What is k-fold cross-validation?
Splitting the data into k random subsets (folds), then repeating training k times, each time holding a different fold back for testing.
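A sketch of the splitting mechanics with scikit-learn's `KFold`, on ten made-up cases. No model is trained here; the loop only records which cases are held back in each round.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten made-up cases

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
held_out = []
for train_idx, test_idx in kfold.split(X):
    # A model would be trained on X[train_idx] and scored on X[test_idx] here.
    held_out.extend(test_idx.tolist())
```

Across the five rounds every case is held back for testing exactly once.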
With decision trees, what kind of split creates less entropy?
A split where the probabilities of the outcomes are very uneven.
For two outcomes, entropy is at its maximum of 1 when p = 0.5.
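A small sketch of the entropy calculation for a node with two outcomes, measured in bits. The function name `binary_entropy` is made up for illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome node where one outcome has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

even_split = binary_entropy(0.5)    # outcomes equally likely
uneven_split = binary_entropy(0.9)  # one outcome dominates
```

The uneven node has lower entropy than the even one, which is why splits that produce uneven nodes are preferred.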