Python ML Principles Flashcards by George Gray

What are the four main steps in ML?

Visualisation
Cleaning and Transformation
Construction of ML model
Evaluation of ML model

What are the two main sub-steps of the Cleaning and Transformation step?

1) Data Preparation & Cleaning

2) Feature Engineering

What should you do before starting Preparation & Cleaning?

Explore the data to understand the issues that are present.

What are the six sub-steps of the Data Preparation & Cleaning step?

Recode character strings to eliminate unrecognised characters
Find and treat missing values
Set the correct data type for each column
Transform categorical features to increase the number of cases per category
Apply transformations to numeric features to improve their distributions
Manage duplicates
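
Several of the sub-steps above can be sketched with pandas. This is only an illustration on a made-up table: the column names, the values, and the choice of a median fill are assumptions, not part of the original cards.

```python
import pandas as pd

# Hypothetical toy data showing the kinds of problems listed above.
raw = pd.DataFrame({
    "city": ["Leeds ", "leeds", "York", "York", None],
    "price": ["10.5", "12.0", "n/a", "n/a", "9.0"],
})

df = raw.copy()
# Recode strings: trim whitespace and normalise case.
df["city"] = df["city"].str.strip().str.title()
# Set the correct data type; unparseable entries become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
# Treat missing values: fill missing prices with the median price.
df["price"] = df["price"].fillna(df["price"].median())
# Drop rows with no city, then manage duplicates.
df = df.dropna(subset=["city"]).drop_duplicates().reset_index(drop=True)
```

Whether to fill, drop, or flag missing values depends on the dataset; the median fill here is just one common option.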

What’s another name for “transformation to improve distribution”?

Feature engineering

Name a common transformation.

Log
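
A minimal sketch of a log transform on a right-skewed feature, using NumPy. The data is synthetic (log-normal draws) and the `skewness` helper is written here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature (log-normal draws).
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

def skewness(values):
    """Sample skewness: the third standardised moment."""
    centred = values - values.mean()
    return (centred ** 3).mean() / centred.std() ** 3

# The log pulls in the long right tail.
# Use np.log1p(x) instead if the feature can contain zeros.
x_log = np.log(x)
```

Comparing `skewness(x)` with `skewness(x_log)` should show the transformed feature is far closer to symmetric.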

What is the main thing we’re trying to achieve with Feature Engineering?

We’re trying to achieve distinct separation of the labelled cases, which makes the label easier to predict.

What is a metric used to evaluate linear regression accuracy?

Sum of squared errors.
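
As a small worked example, the sum of squared errors can be computed directly with NumPy. The function name and the numbers are made up for illustration.

```python
import numpy as np

def sum_squared_errors(y_true, y_pred):
    """Sum over all cases of (actual - predicted) squared."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]
sse = sum_squared_errors(y_true, y_pred)  # 0.25 + 0.0 + 1.0 = 1.25
```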

Why is linear regression sometimes called Least Squares Regression?

Because it fits the line that minimises the sum of the squared errors (residuals) between the observed values and the line.
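
A minimal sketch of the least-squares idea, assuming synthetic data around a known line. `np.linalg.lstsq` finds the slope and intercept with the smallest sum of squared errors, so nudging either value should never lower it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # hypothetical noisy line

# Design matrix with a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

def sse(m, c):
    """Sum of squared errors of the line y = m*x + c on this data."""
    return float(np.sum((y - (m * x + c)) ** 2))

best = sse(slope, intercept)
```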

With linear regression in scikit-learn, what Python package should you use for your arrays?

NumPy

What are the 4 main steps for linear regression with scikit-learn?

Lay out NumPy arrays
Scale
Specify model object
Fit
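
The four steps above can be sketched as follows, assuming scikit-learn is installed. The data is made up (y = 2x + 1), and `StandardScaler` is one possible choice of scaler rather than the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# 1. Lay out NumPy arrays: X is 2-D (n_samples, n_features), y is 1-D.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # hypothetical data, y = 2x + 1

# 2. Scale the features to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Specify the model object.
model = LinearRegression()

# 4. Fit.
model.fit(X_scaled, y)

# New data must pass through the same fitted scaler before predicting.
pred = model.predict(scaler.transform(np.array([[6.0]])))
```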

What visualisation could you use to evaluate residuals of a regression model?

A histogram of residuals.

What two patterns in a residuals histogram indicate an accurate model?

1. Clustering of residuals around zero.

2. Normal distribution.
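
A minimal sketch of checking residuals, on synthetic data with Normal noise. It uses `np.histogram` to get the bin counts; with matplotlib available, `plt.hist(residuals)` would draw the same histogram.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 500)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)  # hypothetical data

# Fit a straight line and compute residuals (actual minus predicted).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Histogram counts per bin; for a good fit the tallest bin sits near zero.
counts, edges = np.histogram(residuals, bins=11, range=(-4, 4))
peak_bin = int(np.argmax(counts))
peak_centre = (edges[peak_bin] + edges[peak_bin + 1]) / 2
```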

If you see a multimodal residuals histogram what should you do about the non-zero modes?

Investigate what’s creating them and consider adding these features to your model.

In scikit-learn, what is one-hot encoding?

Conversion of a categorical feature into a NumPy array with one column per category, where each record shows a “1” in exactly one column and “0” elsewhere.
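
A minimal sketch with scikit-learn's `OneHotEncoder`, assuming scikit-learn is installed. The colour feature is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature, laid out as one column.
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; convert it to a dense array to inspect.
encoded = encoder.fit_transform(colours).toarray()
categories = encoder.categories_[0]  # sorted: blue, green, red
```

Each row of `encoded` contains a single 1, in the column of that record's category.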

In classification ML, what is the dividing line between the two outcomes called?

The decision boundary.

The model's score should be zero on the boundary, with the two outcomes (Y/N) falling on the negative and positive sides.
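
A minimal sketch of a linear decision boundary. The weights and bias are made-up values standing in for a trained classifier, and mapping positive scores to "Y" is an assumption for illustration.

```python
import numpy as np

# Hypothetical weights and bias of an already-trained linear classifier.
w = np.array([1.0, -2.0])
b = 0.5

def score(X):
    """Signed score: zero on the boundary, positive on one side, negative on the other."""
    return X @ w + b

X = np.array([[3.0, 1.0],    # score =  3.0 - 2.0 + 0.5 =  1.5
              [0.0, 1.0],    # score =  0.0 - 2.0 + 0.5 = -1.5
              [1.5, 1.0]])   # score =  1.5 - 2.0 + 0.5 =  0.0 (on the boundary)
scores = score(X)
labels = np.where(scores > 0, "Y", "N")
```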

What is a loss function?

A function that describes how heavily incorrect predictions are penalised. Training a model means finding the parameters that minimise it.
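
As one example of a loss function, here is a sketch of log loss (binary cross-entropy) in NumPy. It is a common choice for classification, not necessarily the one the original course had in mind, and the probabilities are made up.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; p_pred is the predicted probability of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_true = [1, 0, 1]
confident_right = log_loss(y_true, [0.9, 0.1, 0.9])
confident_wrong = log_loss(y_true, [0.1, 0.9, 0.1])
```

Confident wrong predictions are penalised far more heavily than confident correct ones.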

Name four ways to check if your model is CRAP.

Confusion matrix (TP,FP,TN,FN)
ROC curves
Accuracy/misclassification error
Precision/Recall/F1
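
A minimal sketch of the confusion-matrix counts and the metrics derived from them, using NumPy on made-up labels and predictions.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # hypothetical labels
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])  # hypothetical predictions

# Confusion matrix cells.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / (tp + fp + tn + fn)
misclassification_error = 1 - accuracy
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
```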

What is on the y-axis and what on the x-axis of an ROC curve?

y-axis = TPR
x-axis = FPR
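
A minimal sketch of how the points of an ROC curve are produced: sweep a threshold over the classifier scores and record the false positive rate (x) and true positive rate (y) at each one. The `roc_points` helper and the scores are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays, one point per threshold, from strictest to loosest."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Start above the highest score so the curve begins at (0, 0).
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t
        tpr.append(np.sum(pred & (y_true == 1)) / pos)
        fpr.append(np.sum(pred & (y_true == 0)) / neg)
    return np.array(fpr), np.array(tpr)

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)

# Area under the curve by the trapezoid rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
```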

Name 4 techniques to address imbalanced data.

1) Undersample majority
2) Oversample minority
3) Case weights
4) Impute (synthesise) new minority cases, for example with SMOTE
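
The first two techniques can be sketched with NumPy index sampling. The data is synthetic (90 majority cases, 10 minority cases), and resampling would normally be applied to the training set only.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical imbalanced data: 90 majority (0) cases, 10 minority (1) cases.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Oversample the minority: draw minority rows with replacement up to the majority count.
extra = rng.choice(minority_idx, size=majority_idx.size - minority_idx.size, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Undersample the majority: keep a random subset the size of the minority.
kept = rng.choice(majority_idx, size=minority_idx.size, replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]
```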

Name one method of dimensionality reduction.

Principal component analysis.
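
A minimal sketch of PCA using NumPy's SVD, on synthetic data with two strongly correlated features and one low-variance feature. The computation uses only the features and never the label.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: two strongly correlated features plus one low-variance feature.
t = rng.normal(size=300)
X = np.column_stack([
    t,
    2.0 * t + rng.normal(scale=0.1, size=300),
    rng.normal(scale=0.1, size=300),
])

# Centre the features, then take the SVD; rows of Vt are the principal directions.
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Share of the total feature variance captured by each component.
explained_variance = S ** 2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project onto the first component to reduce three features to one.
X_reduced = X_centred @ Vt[:1].T
```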

What data does Principal Component Analysis produce on the dataset features?

The proportion of the variance in the features (not the label) that is explained by each principal component.

What is the goal of regularisation?

Prevent overfitting of ML models.

What is the tradeoff dilemma in Regularisation?

Bias vs Variance

Regularisation reduces variance, but can introduce what?

Bias

A model that has high variance and low bias is what?

Overfit

The diagonal line on a Q-Q plot of residuals represents what?

A perfect Normal Distribution.

What's another name for L2 regularisation?

Ridge Regression

What does L2 regularisation do to the coefficients?

Constrains them by shrinking the coefficients towards zero, without setting them exactly to zero.
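
A minimal sketch of the shrinking effect, using the closed-form ridge solution in NumPy on synthetic data. The helper name, the data, and the penalty strengths are assumptions for illustration, and the intercept is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=50)  # hypothetical data

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Size of the coefficient vector for increasing penalty strength.
norms = [float(np.linalg.norm(ridge_coefficients(X, y, a))) for a in (0.0, 10.0, 1000.0)]
```

The norms fall as `alpha` grows, but the coefficients do not reach exactly zero.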

What are two other names for L1 regularisation?

1) Lasso method
2) Manhattan norm

What is k-fold cross-validation?

Splitting the data into k random subsets (folds), then repeating training k times, each time holding one fold back for testing and training on the remaining k-1 folds.
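
A minimal sketch of how k-fold splits can be generated with NumPy. The `k_fold_indices` helper is hypothetical; scikit-learn provides its own `KFold` class for real use.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
```

In practice a model is trained on each `train_idx`, scored on the matching `test_idx`, and the k scores are averaged.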

With decision trees, what kind of split creates less entropy?

A split where the outcome probabilities are very uneven. For a binary outcome, entropy is at its maximum of 1 when p = 0.5 and falls to 0 as p approaches 0 or 1.
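
A small worked example of binary entropy in bits, using NumPy. The function name is made up for illustration.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a two-outcome split where one outcome has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

even = binary_entropy(0.5)     # 1.0 bit, the maximum
uneven = binary_entropy(0.9)   # about 0.469 bits
```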

Learn the main steps and sub-steps of ML. (32 cards)