Data Preparation Concepts Flashcards

1
Q

Feature selection

A

Selecting a subset of the available features to train the model on. Can be guided by the correlation between individual features and the label:

  • Use domain knowledge to drop irrelevant features
  • Drop features with low correlation to the label, low variance, or lots of missing data (see the sketch below)
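
A minimal sketch of this filtering with pandas (toy data and thresholds are illustrative assumptions, not from the source):

```python
import pandas as pd

# Toy numeric data with a "label" column (hypothetical values)
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [0.9, 0.9, 0.9, 0.9],   # near-zero variance: a drop candidate
    "f3": [4.0, 3.1, 2.0, 1.1],
    "label": [1.1, 2.0, 3.2, 3.9],
})

# Absolute correlation of each feature with the label
corr = df.corr()["label"].drop("label").abs()
print("Low correlation:", corr[corr < 0.1].index.tolist())

# Features with very low variance (threshold is arbitrary here)
var = df.drop(columns="label").var()
print("Low variance:", var[var < 1e-3].index.tolist())
```
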
2
Q

Feature engineering

A

Transforming and simplifying features, and removing irrelevant information, to improve model accuracy and training speed. Examples include standardisation, normalisation, PCA, label encoding and one-hot encoding.

3
Q

FE: Standardisation

A

Also known as z-score normalisation: subtract the mean and divide by the standard deviation, so each feature is rescaled to mean 0 and standard deviation 1. Without this, features with different standard deviations would also have very different ranges.
- Reduces the effect of outliers in the features (see the sketch below)
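
A minimal sketch using scikit-learn’s StandardScaler on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0))  # ~[0, 0]: each feature now has mean 0
print(scaled.std(axis=0))   # ~[1, 1]: and standard deviation 1
```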

4
Q

FE: Normalisation

A

Also known as min-max normalisation, scales all values into a fixed range between 0 and 1.

  • Doesn’t change the shape of the feature’s distribution
  • Because the standard deviation shrinks, the effect of outliers increases (handle outliers first); see the sketch below
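
A minimal sketch using scikit-learn’s MinMaxScaler on toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Each column is mapped to [0, 1] via (x - min) / (max - min)
print(MinMaxScaler().fit_transform(X))
```
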
5
Q

Principal Component Analysis (PCA)

A
  • Unsupervised ML algorithm commonly used in the data-investigation stage
  • A form of dimensionality reduction, used when we have 4 or more dimensions (too many to visualise directly)
  • Takes a lower-dimensional snapshot of the data while retaining the important structure (i.e. the principal components)
  • Take the mean of each feature, centre the data around this point, then find the directions with the greatest spread: these are the principal components
  • The components are ordered by how much of the data’s variance they explain
  • From there we can see the relationships between points (see the sketch below)
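
A minimal sketch using scikit-learn’s PCA on toy random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # 4 features: hard to visualise directly

pca = PCA(n_components=2)      # keep the 2 strongest directions
X2 = pca.fit_transform(X)      # data is centred internally

print(X2.shape)                       # (100, 2)
print(pca.explained_variance_ratio_)  # spread explained by each component
```
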
6
Q

Dealing with missing data

A
  • Impute missing data points with the mean; even though this is likely incorrect, it can suffice for model purposes (see the sketch below)
  • Remove the entire row/sample
  • If a lot of data is missing from one feature, we may remove the entire feature/column
  • Must be careful not to remove anomalies that may look wrong but are actually correct
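
A minimal sketch of mean imputation using scikit-learn’s SimpleImputer (toy data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Each NaN is replaced by its column's mean
print(SimpleImputer(strategy="mean").fit_transform(X))
# [[1.  2. ]
#  [4.  3. ]
#  [7.  2.5]]
```
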
7
Q

Unbalanced data

A

Occurs where certain classes are underrepresented in the data (e.g. 3 points classified ‘x’, 97 classified ‘y’), so the model can effectively ignore the minority.
We can:
- Try to source more real data
- Oversample the minority class, though this may give those points too much weight (see the sketch below)
- Synthesise data: ask what can vary in the data without changing the classification (domain knowledge needed)
- Try different types of algorithms
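
A minimal sketch of random oversampling with sklearn.utils.resample, using the 3-vs-97 class sizes from the example:

```python
import numpy as np
from sklearn.utils import resample

X_min = np.arange(3).reshape(-1, 1)    # 3 minority samples
X_maj = np.arange(97).reshape(-1, 1)   # 97 majority samples

# Sample the minority with replacement until it matches the majority
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj),
                    random_state=0)
print(len(X_min_up))  # 97
```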

8
Q

Label encoding

A
  • Replace names/strings/categories with integers, keeping a lookup table elsewhere
  • Problem: the algorithm might interpret the integers as a ranking; in that case use One Hot Encoding instead (see the sketch below)
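
A minimal sketch using scikit-learn’s LabelEncoder (toy categories):

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["red", "green", "blue", "green"])
print(codes)         # [2 1 0 1]: integers in place of strings
print(enc.classes_)  # ['blue' 'green' 'red']: the lookup table
```
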
9
Q

One Hot Encoding

A
  • Introduces new features into the data: one for each of the labels/categories we want to represent
  • Each new column holds a 1 or 0 as a flag for that category (see the sketch below)
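
A minimal sketch using pandas get_dummies (toy categories):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue"]})

# One 0/1 flag column per category
print(pd.get_dummies(df, columns=["colour"]))
```
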
10
Q

Splitting and randomisation

A
  • Shuffle the data before splitting, and randomise the ordering of the training data, even if we believe there is no clumping in the data
  • Clumping can occur where values have been collected over time (see the sketch below)
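
A minimal sketch of a shuffled split using scikit-learn’s train_test_split (toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(20).reshape(10, 2), np.arange(10)

# shuffle=True (the default) randomises the split order
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
print(len(X_train), len(X_test))  # 8 2
```
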
11
Q

RecordIO

A
  • Packs all values into one file (e.g. instead of thousands of separate image files)
  • Pipe mode streams the data to the algorithm (as opposed to File mode, which copies files to disk and opens them one at a time)
  • Gives faster training times and higher throughput
  • SageMaker works well with RecordIO formats (e.g. it can stream directly from S3, so no local disk copy is needed); see the sketch below
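
A minimal sketch of writing a NumPy array as RecordIO-protobuf with the SageMaker Python SDK (assumes the sagemaker package is installed; uploading the buffer to S3 is left out):

```python
import io
import numpy as np
import sagemaker.amazon.common as smac

X = np.random.rand(100, 5).astype("float32")             # toy features
y = np.random.randint(0, 2, size=100).astype("float32")  # toy labels

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)  # features + labels, one stream
buf.seek(0)  # ready to upload, e.g. with boto3's upload_fileobj
```
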
12
Q

FE: MaxAbs scaling

A
  • Divide all values by the maximum absolute value of that feature
  • Doesn’t destroy sparsity, because the values are not centred: zeros stay zero (see the sketch below)
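
A minimal sketch using scikit-learn’s MaxAbsScaler (toy data):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[0.0, -4.0],
              [2.0,  0.0],
              [4.0,  2.0]])

# Each column divided by its max absolute value: zeros stay zero
print(MaxAbsScaler().fit_transform(X))
```
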
13
Q

FE: Robust scaling

A
  • Find the median, 25th percentile (Q25) and 75th percentile (Q75) of the feature
  • Subtract the median, then divide by the interquartile range (Q75 − Q25)
  • Robust to outliers, as they have minimal impact on the median and quartiles (see the sketch below)
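
A minimal sketch using scikit-learn’s RobustScaler (toy data with one outlier):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

# (x - median) / (Q75 - Q25); the outlier barely shifts the scale
print(RobustScaler().fit_transform(X).ravel())
# [-1.  -0.5  0.   0.5 48.5]
```
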
14
Q

FE: Normaliser

A
  • Applied to ROWS, not features/columns
  • Rescales each observation to unit length
  • Widely used in text analysis (see the sketch below)
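
A minimal sketch using scikit-learn’s Normalizer (toy data):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Each ROW rescaled to unit L2 norm (unlike the column-wise scalers)
print(Normalizer(norm="l2").fit_transform(X))
# [[0.6 0.8]
#  [1.  0. ]]
```
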
15
Q

FE: Numeric transformations

A
  • Polynomial transformations (e.g. x^2, x^3) can give a better fit
  • Higher-order polynomial transformations, and interaction terms between features, can lead to overfitting
  • Non-linear transformations: log, sigmoid
  • Beware extrapolation: the fitted curve can behave very differently beyond the range of the training data (see the sketch below)
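
A minimal sketch of polynomial and log transforms using scikit-learn and NumPy (toy data):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))  # columns: x1, x2, x1^2, x1*x2, x2^2
print(np.log1p(X))            # log(1 + x): a common non-linear transform
```
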
16
Q

FE: Text

A

Feature engineering for text-based data:
- Bag-of-words model: turn each document into a vector of numbers, one per word, holding the count of that word in the document
- TF-IDF reduces the impact of common, uninformative words (e.g. ‘the’); see the sketch below
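
A minimal sketch comparing bag-of-words counts with TF-IDF weights using scikit-learn (toy corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

# Bag of words: raw counts, one column per vocabulary word
print(CountVectorizer().fit_transform(docs).toarray())

# TF-IDF: words appearing in every document (e.g. "the") are down-weighted
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))
```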