Data Pre-Processing & Feature Analysis Flashcards
Key details from the corresponding lecture
What is a boxplot good at visualising?
It’s good for visualising continuous variables, showing their median, quartiles and spread.
It’s particularly useful for checking for outliers.
What is a histogram good at visualising?
Good for visualising how often each value occurs, i.e. counts per bin or per category
It’s also good for checking the shape of the data’s distribution
What is a scatter plot good at visualising?
It’s good for exploring the relationship between two variables, e.g. spotting correlations or clusters
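As a rough illustration (not from the lecture), here is a minimal matplotlib sketch of all three plot types on randomly generated data; the variables and values are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # a continuous feature
y = 0.5 * x + rng.normal(scale=1.0, size=200)  # a second, correlated feature

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(x)                 # spread and outliers of one variable
axes[0].set_title("Boxplot")
axes[1].hist(x, bins=20)           # distribution of one variable
axes[1].set_title("Histogram")
axes[2].scatter(x, y, s=10)        # relationship between two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```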
How would you explain one-hot encoding in regards to categorical variables?
One-hot encoding is the process of converting categorical variables into a binary format so that algorithms can interpret them, whilst preserving each category’s uniqueness and avoiding an artificial ordering
How do you convert categorical variables using one-hot encoding?
- Assign each categorical value an index
- Convert each value to a binary vector with a 1 at its own index and a 0 at every other index, i.e. dog = 10, cat = 01, etc. (see the sketch below)
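A minimal sketch of one-hot encoding done by hand with NumPy; the category list (dog, cat, bird) and its ordering are illustrative assumptions, and in practice a library encoder such as scikit-learn’s OneHotEncoder would usually be used.

```python
import numpy as np

values = ["dog", "cat", "cat", "bird"]          # illustrative categorical data
categories = sorted(set(values))                 # fix an index for each category
index = {c: i for i, c in enumerate(categories)} # {'bird': 0, 'cat': 1, 'dog': 2}

one_hot = np.zeros((len(values), len(categories)), dtype=int)
for row, v in enumerate(values):
    one_hot[row, index[v]] = 1                   # 1 at the value's index, 0 elsewhere

print(one_hot)
# [[0 0 1]
#  [0 1 0]
#  [0 1 0]
#  [1 0 0]]
```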
What is data normalisation?
Data normalisation is a pre-processing technique used to rescale numerical values so that they all fit within a specific range, such as [0, 1] or [-1, 1].
What are the benefits of data normalisation?
The common normalisation methods are all linear scalings, so they don’t change the shape of the original data’s distribution
It improves the numerical stability of the machine learning model, e.g. when using gradient descent to optimise a network: if different features sit in very different value ranges, a fixed learning rate will likely overshoot the optimum for some features
What is the equation for Z-Normalisation?
X_(norm) = (X - μ)/σ
Where:
X is the feature vector of original values
μ is the mean of vector X
σ is the standard deviation of vector X
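A minimal sketch of Z-normalisation on an illustrative NumPy vector:

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])
X_norm = (X - X.mean()) / X.std()   # subtract the mean μ, divide by the standard deviation σ
print(X_norm)                       # roughly [-1.34, -0.45, 0.45, 1.34]; mean 0, std 1
```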
What is the equation for Min-Max Normalisation?
X_(norm) = (X - min(X))/ (max(X) - min(X))
Where:
min(X) and max(X) are the minimum and maximum values of vector X, respectively
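A minimal sketch of Min-Max normalisation on the same kind of illustrative vector:

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])
X_norm = (X - X.min()) / (X.max() - X.min())   # rescale to the [0, 1] range
print(X_norm)                                  # [0.  0.333...  0.666...  1.]
```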
What is the equation for Vector Normalisation?
X_(norm) = X/||X||
Where:
||X|| is the length (Euclidean/L2 norm) of vector X
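A minimal sketch of vector normalisation using the Euclidean (L2) norm; the example vector is illustrative:

```python
import numpy as np

X = np.array([3.0, 4.0])
X_norm = X / np.linalg.norm(X)   # divide by the vector length ||X|| (here 5.0)
print(X_norm)                    # [0.6 0.8], a unit-length vector
```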
What is Data Imputation?
Data Imputation is a pre-processing step in which you fill in (impute) missing values based on the other samples and features in the dataset.
What are the most common methods for Data Imputation?
- Use mean or median values
- Use the most frequent values
- Use k-nearest neighbours based on feature similarity (a short sketch of mean and k-NN imputation follows this list)
- Use multivariate imputation by chained equations (MICE), i.e. fill the missing data several times using models fitted on the other features, then pool the results
- Estimate the missing value using ML models trained on the other features
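A minimal sketch of mean and k-nearest-neighbour imputation using scikit-learn’s SimpleImputer and KNNImputer; the tiny matrix is illustrative, not lecture data.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

mean_imputer = SimpleImputer(strategy="mean")   # replace NaNs with the column mean
knn_imputer = KNNImputer(n_neighbors=2)         # replace NaNs using the most similar rows

print(mean_imputer.fit_transform(X))   # the NaN becomes (1 + 7) / 2 = 4
print(knn_imputer.fit_transform(X))
```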
What is the ‘Curse of Dimensionality’?
The curse of dimensionality occurs when the data samples are too sparse in the feature space, i.e. there are not enough instances to be densely distributed across the feature space.
When can the curse of dimensionality appear?
As the number of dimensions/features in the dataset increases, the data becomes increasingly sparse, and many algorithms begin to struggle to analyse and generalise from the data effectively.
What are two methods that can be used to solve the ‘Curse of Dimensionality’?
- Increase the number of data samples exponentially with each linear increase in feature dimensions (usually impractical)
- Reduce the number of features (typically more feasible) by using feature selection methods and dimensionality reduction methods (see the sketch below)
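The lecture doesn’t name a specific method here, but as an assumed example, a minimal sketch of dimensionality reduction with PCA from scikit-learn on illustrative random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))     # 100 samples with 50 features (illustrative)

pca = PCA(n_components=10)         # keep the 10 directions of highest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)             # (100, 10)
```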