Data Preparation Concepts Flashcards
Feature selection
Selecting a subset of features for the model. Can be optimised by looking at correlations between individual features and labels:
- Using domain knowledge to drop irrelevant features
- Drop features with low correlation to the label, low variance, or lots of missing data
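A minimal sketch of correlation- and variance-based selection with pandas and scikit-learn (the column names and values are made up for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "age": [22, 35, 58, 41, 30],
    "constant_flag": [1, 1, 1, 1, 1],  # zero variance, carries no information
    "label": [0, 1, 1, 1, 0],
})

# Correlation of each feature with the label (NaN for the constant feature)
print(df.corr()["label"].drop("label"))

# Drop features whose variance does not exceed the threshold (here: zero-variance features)
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(df.drop(columns="label"))
```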
Feature engineering
Simplifying features and removing irrelevant information to improve model accuracy and speed. Examples include standardisation, normalisation, PCA, label encoding, one hot encoding
FE: Standardisation
Also known as z-score normalisation: subtracts the mean and divides by the standard deviation, so each scaled feature has mean 0 and standard deviation 1. Because the scaled values are not bounded to a fixed range, different features can still have different ranges.
- Reduces the effect of outliers in the features.
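A minimal sketch with scikit-learn's StandardScaler (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Each column is rescaled to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
```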
FE: Normalisation
Also known as min-max normalisation, scales all values in a fixed range between 0 and 1.
- Doesn’t change the distribution of the feature
- Because all values are squeezed into a fixed range, the effect of outliers increases (handle outliers before scaling)
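A minimal sketch with scikit-learn's MinMaxScaler, showing how an outlier squeezes the other values (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])  # 100 is an outlier

# All values end up in [0, 1]; the outlier pushes the rest close to 0
X_scaled = MinMaxScaler().fit_transform(X)
```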
Principal Component Analysis (PCA)
- Unsupervised ML algorithm commonly used in the data investigation stage
- A form of dimensionality reduction, typically used when we have 4 or more dimensions (too many to visualise directly)
- Takes a snapshot of the data while retaining the most important structure (i.e. the principal components)
- Centre the data on the mean of each feature, then find the directions along which the data are most spread out (the principal components)
- Components are ordered by how much of the spread (variance) of the data they explain
- Plotting the first few components lets us see the relationships between points
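A minimal sketch with scikit-learn's PCA, reducing 4 dimensions to 2 (the random data is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # 100 samples, 4 original dimensions

# Project onto the two directions with the largest spread (variance)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance explained by each component
```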
Dealing with missing data
- Imputing missing data points with the mean. Even though this is likely incorrect, it can suffice for model purposes
- Can remove an entire row or sample
- If there is a lot of data missing from one feature, we may remove the entire feature/column
- Be careful not to remove anomalies that look like errors but are actually correct
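A minimal sketch with pandas showing the imputation and removal options (the tiny DataFrame is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47],
    "height": [170.0, np.nan, 180.0],
})

df_imputed = df.fillna(df.mean())   # impute missing values with the column mean
df_no_rows = df.dropna()            # or remove any sample (row) with missing values
df_no_cols = df.dropna(axis=1)      # or remove a feature (column) with missing values
```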
Unbalanced data
Occurs when one class is heavily under-represented and risks being lost within the data (e.g. 3 points classified 'x', 97 classified 'y')
We can:
- Try and source more real data
- Oversample the minority class, but this may give those data points too much importance
- Synthesise data: what can vary in the data that won’t impact the classification? Domain knowledge needed
- Try different types of algorithms
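A minimal sketch of oversampling the minority class with scikit-learn's resample utility (class sizes match the 3/97 example above):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(100), "label": ["y"] * 97 + ["x"] * 3})

minority = df[df["label"] == "x"]
majority = df[df["label"] == "y"]

# Sample the minority class with replacement until it matches the majority
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
```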
Label encoding
- Replacing names/strings/categories with integers, having a lookup table elsewhere
- Problem: the algorithm might interpret the integers as a ranking (in that case, use One Hot Encoding)
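A minimal sketch with scikit-learn's LabelEncoder (the colour values are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

colours = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(colours)  # [2, 1, 0, 1]
print(encoder.classes_)                   # the lookup table: ['blue' 'green' 'red']
```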
One Hot Encoding
- Introduce new features into the data
- One feature for each of the labels we want to represent
- 1s or 0s under each column as flags
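A minimal sketch with pandas get_dummies (the colour column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue"]})

# One new 0/1 flag column per category: colour_blue, colour_green, colour_red
one_hot = pd.get_dummies(df, columns=["colour"])
```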
Splitting and randomisation
- Randomise the selection of the test data and the ordering of the training data, even if we know there is no clumping in the data
- Clumping can occur where values have been collected over time
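A minimal sketch with scikit-learn's train_test_split (shapes and split ratio are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
y = np.arange(50)

# shuffle=True randomises the ordering before splitting off the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
```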
RecordIO
- Contains all values within one file (e.g. instead of separate image files)
- Pipe mode streams data to the training instance (as opposed to File mode, which copies the whole dataset to local disk before training)
- Faster training time and throughput
- SageMaker works well with RecordIO formats (e.g. stream directly from S3, don’t need a local disk copy)
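A hedged sketch of converting NumPy arrays to RecordIO-protobuf; it assumes the `write_numpy_to_dense_tensor` helper in `sagemaker.amazon.common` from the SageMaker Python SDK, so check your SDK version:

```python
import numpy as np
import sagemaker.amazon.common as smac  # helper module assumed from the SageMaker Python SDK

X = np.random.rand(100, 3).astype("float32")
y = np.random.randint(0, 2, size=100).astype("float32")

# Serialise features and labels into a single RecordIO-protobuf file,
# which can then be uploaded to S3 and streamed to training in Pipe mode
with open("train.recordio", "wb") as f:
    smac.write_numpy_to_dense_tensor(f, X, y)
```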
FE: MaxAbs scaling
- Divide all values by the maximum of the absolute value for that feature
- Doesn't destroy sparsity, because the data are not centred (zeros stay zero)
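A minimal sketch with scikit-learn's MaxAbsScaler (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[1.0, -50.0],
              [0.0,  25.0],
              [2.0, 100.0]])

# Each column is divided by its maximum absolute value; zeros stay zero
X_scaled = MaxAbsScaler().fit_transform(X)
```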
FE: Robust scaling
- Find the median and the 25th and 75th percentiles (Q25 and Q75) of the feature
- Subtract the median, then divide by the interquartile range (Q75 minus Q25)
- Robust to outliers, as they have minimal impact on the median and quartiles
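A minimal sketch with scikit-learn's RobustScaler (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is an outlier

# Subtract the median and divide by the IQR (Q75 - Q25) for each feature
X_scaled = RobustScaler().fit_transform(X)
```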
FE: Normaliser
- Applied to ROWs not features
- Rescales an observation
- Widely used in text analysis
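A minimal sketch with scikit-learn's Normalizer (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[4.0, 3.0],
              [1.0, 2.0]])

# Each ROW is rescaled to unit L2 norm, e.g. [4, 3] -> [0.8, 0.6]
X_norm = Normalizer(norm="l2").fit_transform(X)
```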
FE: Numeric transformations
- Polynomial transformations to better fit (e.g. x^2, x^3, etc.)
- Higher-order polynomial transformations of features, as well as interactions between features, can lead to overfitting
- Non-linear transformations: Log, Sigmoid
- Transformed features can behave unpredictably when extrapolating beyond the range of the training data
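A minimal sketch of polynomial and log transformations with scikit-learn and NumPy (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0]])

# Adds squared terms and the x1*x2 interaction: [x1, x2, x1^2, x1*x2, x2^2]
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# A non-linear log transformation (log1p handles zeros safely)
X_log = np.log1p(X)
```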