Data Preparation Concepts Flashcards
Feature selection
Selecting a subset of features for the model. Can be optimised by looking at correlations between individual features and labels:
- Using domain knowledge to drop irrelevant features
- Drop features with low correlation to the label, low variance, or lots of missing data
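A minimal sketch of correlation- and variance-based selection with pandas and scikit-learn (the column names and values are made up for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "age": [22, 35, 58, 41, 30],
    "constant_flag": [1, 1, 1, 1, 1],  # zero variance, carries no information
    "label": [0, 1, 1, 1, 0],
})

# Correlation of each feature with the label (NaN for the constant feature)
print(df.corr()["label"].drop("label"))

# Drop features whose variance does not exceed the threshold (here: zero-variance features)
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(df.drop(columns="label"))
```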
Feature engineering
Simplifying features and removing irrelevant information to improve model accuracy and speed. Examples include standardisation, normalisation, PCA, label encoding, one hot encoding
FE: Standardisation
Also known as z-score normalisation: subtracts the mean and divides by the standard deviation, so each scaled feature has mean 0 and standard deviation 1. Because the scaled values are not bounded to a fixed range, different features can still have different ranges.
- Reduces the effect of outliers in the features.
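A minimal sketch with scikit-learn's StandardScaler (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Each column is rescaled to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
```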
FE: Normalisation
Also known as min-max normalisation, scales all values in a fixed range between 0 and 1.
- Doesn’t change the distribution of the feature
- Because all values are squeezed into a fixed range, the effect of outliers increases (handle outliers before scaling)
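A minimal sketch with scikit-learn's MinMaxScaler, showing how an outlier squeezes the other values (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])  # 100 is an outlier

# All values end up in [0, 1]; the outlier pushes the rest close to 0
X_scaled = MinMaxScaler().fit_transform(X)
```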
Principal Component Analysis (PCA)
- Unsupervised ML algorithm commonly used in the data investigation stage
- A form of dimensionality reduction, typically used when we have 4 or more dimensions (too many to visualise directly)
- Takes a snapshot of the data while retaining the most important structure (i.e. the principal components)
- Centre the data on the mean of each feature, then find the directions along which the data are most spread out (the principal components)
- Components are ordered by how much of the spread (variance) of the data they explain
- Plotting the first few components lets us see the relationships between points
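A minimal sketch with scikit-learn's PCA, reducing 4 dimensions to 2 (the random data is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # 100 samples, 4 original dimensions

# Project onto the two directions with the largest spread (variance)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance explained by each component
```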
Dealing with missing data
- Imputing missing data points with the mean. Even though this is likely incorrect, it can suffice for model purposes
- Can remove an entire row or sample
- If there is a lot of data missing from one feature, we may remove the entire feature/column
- Be careful not to remove anomalies that look like errors but are actually correct
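A minimal sketch with pandas showing the imputation and removal options (the tiny DataFrame is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47],
    "height": [170.0, np.nan, 180.0],
})

df_imputed = df.fillna(df.mean())   # impute missing values with the column mean
df_no_rows = df.dropna()            # or remove any sample (row) with missing values
df_no_cols = df.dropna(axis=1)      # or remove a feature (column) with missing values
```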
Unbalanced data
Occurs when one class is heavily under-represented and risks being lost within the data (e.g. 3 points classified 'x', 97 classified 'y')
We can:
- Try and source more real data
- Oversample the minority class, but this may give those data points too much importance
- Synthesise data: what can vary in the data that won’t impact the classification? Domain knowledge needed
- Try different types of algorithms
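A minimal sketch of oversampling the minority class with scikit-learn's resample utility (class sizes match the 3/97 example above):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(100), "label": ["y"] * 97 + ["x"] * 3})

minority = df[df["label"] == "x"]
majority = df[df["label"] == "y"]

# Sample the minority class with replacement until it matches the majority
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
```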
Label encoding
- Replacing names/strings/categories with integers, having a lookup table elsewhere
- Problem: the algorithm might interpret the integers as a ranking (in that case, use One Hot Encoding)
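A minimal sketch with scikit-learn's LabelEncoder (the colour values are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

colours = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(colours)  # [2, 1, 0, 1]
print(encoder.classes_)                   # the lookup table: ['blue' 'green' 'red']
```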
One Hot Encoding
- Introduce new features into the data
- One feature for each of the labels we want to represent
- 1s or 0s under each column as flags
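A minimal sketch with pandas get_dummies (the colour column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue"]})

# One new 0/1 flag column per category: colour_blue, colour_green, colour_red
one_hot = pd.get_dummies(df, columns=["colour"])
```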
Splitting and randomisation
- Randomise the selection of the test data and the ordering of the training data, even if we know there is no clumping in the data
- Clumping can occur where values have been collected over time
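A minimal sketch with scikit-learn's train_test_split (shapes and split ratio are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
y = np.arange(50)

# shuffle=True randomises the ordering before splitting off the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
```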
RecordIO
- Contains all values within one file (e.g. instead of separate image files)
- Pipe mode streams data to the training instance (as opposed to File mode, which copies the whole dataset to local disk before training)
- Faster training time and throughput
- SageMaker works well with RecordIO formats (e.g. stream directly from S3, don’t need a local disk copy)
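A hedged sketch of converting NumPy arrays to RecordIO-protobuf; it assumes the `write_numpy_to_dense_tensor` helper in `sagemaker.amazon.common` from the SageMaker Python SDK, so check your SDK version:

```python
import numpy as np
import sagemaker.amazon.common as smac  # helper module assumed from the SageMaker Python SDK

X = np.random.rand(100, 3).astype("float32")
y = np.random.randint(0, 2, size=100).astype("float32")

# Serialise features and labels into a single RecordIO-protobuf file,
# which can then be uploaded to S3 and streamed to training in Pipe mode
with open("train.recordio", "wb") as f:
    smac.write_numpy_to_dense_tensor(f, X, y)
```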
FE: MaxAbs scaling
- Divide all values by the maximum of the absolute value for that feature
- Doesn't destroy sparsity, because the data are not centred (zeros stay zero)
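A minimal sketch with scikit-learn's MaxAbsScaler (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[1.0, -50.0],
              [0.0,  25.0],
              [2.0, 100.0]])

# Each column is divided by its maximum absolute value; zeros stay zero
X_scaled = MaxAbsScaler().fit_transform(X)
```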
FE: Robust scaling
- Find the median and the 25th and 75th percentiles (Q25 and Q75) of the feature
- Subtract the median, then divide by the interquartile range (Q75 minus Q25)
- Robust to outliers, as they have minimal impact on the median and quartiles
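A minimal sketch with scikit-learn's RobustScaler (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is an outlier

# Subtract the median and divide by the IQR (Q75 - Q25) for each feature
X_scaled = RobustScaler().fit_transform(X)
```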
FE: Normaliser
- Applied to ROWs not features
- Rescales an observation
- Widely used in text analysis
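A minimal sketch with scikit-learn's Normalizer (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[4.0, 3.0],
              [1.0, 2.0]])

# Each ROW is rescaled to unit L2 norm, e.g. [4, 3] -> [0.8, 0.6]
X_norm = Normalizer(norm="l2").fit_transform(X)
```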
FE: Numeric transformations
- Polynomial transformations to better fit (e.g. x^2, x^3, etc.)
- Higher-order polynomial transformations of features, as well as interactions between features, can lead to overfitting
- Non-linear transformations: Log, Sigmoid
- Transformed features can behave unpredictably when extrapolating beyond the range of the training data
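A minimal sketch of polynomial and log transformations with scikit-learn and NumPy (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0]])

# Adds squared terms and the x1*x2 interaction: [x1, x2, x1^2, x1*x2, x2^2]
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# A non-linear log transformation (log1p handles zeros safely)
X_log = np.log1p(X)
```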