Data Pre-Processing & Feature Analysis Flashcards
Key details from the corresponding lecture
What is a boxplot good at visualising?
It’s good for visualising continuous variables, showing their median, quartiles and spread.
It’s particularly useful for checking for outliers.
What is a histogram good at visualising?
Good for visualising how often each value occurs, i.e. counts per bin or per category
It’s also good for checking the shape of the data’s distribution
What is a scatter plot good at visualising?
It’s good for exploring the relationship between two variables, e.g. spotting correlations or clusters
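As a rough illustration (not from the lecture), here is a minimal matplotlib sketch of all three plot types on randomly generated data; the variables and values are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # a continuous feature
y = 0.5 * x + rng.normal(scale=1.0, size=200)  # a second, correlated feature

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(x)                 # spread and outliers of one variable
axes[0].set_title("Boxplot")
axes[1].hist(x, bins=20)           # distribution of one variable
axes[1].set_title("Histogram")
axes[2].scatter(x, y, s=10)        # relationship between two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```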
How would you explain one-hot encoding in regards to categorical variables?
One-hot encoding is the process of converting categorical variables into a binary format so that algorithms can interpret them, whilst preserving each category’s uniqueness and avoiding an artificial ordering
How do you convert categorical variables using one-hot encoding?
- Assign each categorical value an index
- Convert each value to a binary vector with a 1 at its own index and a 0 at every other index, i.e. dog = 10, cat = 01, etc. (see the sketch below)
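A minimal sketch of one-hot encoding done by hand with NumPy; the category list (dog, cat, bird) and its ordering are illustrative assumptions, and in practice a library encoder such as scikit-learn’s OneHotEncoder would usually be used.

```python
import numpy as np

values = ["dog", "cat", "cat", "bird"]          # illustrative categorical data
categories = sorted(set(values))                 # fix an index for each category
index = {c: i for i, c in enumerate(categories)} # {'bird': 0, 'cat': 1, 'dog': 2}

one_hot = np.zeros((len(values), len(categories)), dtype=int)
for row, v in enumerate(values):
    one_hot[row, index[v]] = 1                   # 1 at the value's index, 0 elsewhere

print(one_hot)
# [[0 0 1]
#  [0 1 0]
#  [0 1 0]
#  [1 0 0]]
```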
What is data normalisation?
Data normalisation is a pre-processing technique used to rescale numerical values so that they all fit within a specific range, such as [0, 1] or [-1, 1].
What are the benefits of data normalisation?
The common normalisation methods are all linear scalings, so they don’t change the shape of the original data’s distribution
It improves the numerical stability of the machine learning model, e.g. when using gradient descent to optimise a network: if different features sit in very different value ranges, a fixed learning rate will likely overshoot the optimum for some features
What is the equation for Z-Normalisation?
X_(norm) = (X - μ)/σ
Where:
X is the feature vector of original values
μ is the mean of vector X
σ is the standard deviation of vector X
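A minimal sketch of Z-normalisation on an illustrative NumPy vector:

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])
X_norm = (X - X.mean()) / X.std()   # subtract the mean μ, divide by the standard deviation σ
print(X_norm)                       # roughly [-1.34, -0.45, 0.45, 1.34]; mean 0, std 1
```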
What is the equation for Min-Max Normalisation?
X_(norm) = (X - min(X))/ (max(X) - min(X))
Where:
min(X) and max(X) are the minimum and maximum values of vector X, respectively
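A minimal sketch of Min-Max normalisation on the same kind of illustrative vector:

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])
X_norm = (X - X.min()) / (X.max() - X.min())   # rescale to the [0, 1] range
print(X_norm)                                  # [0.  0.333...  0.666...  1.]
```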
What is the equation for Vector Normalisation?
X_(norm) = X/||X||
Where:
||X|| is the length (Euclidean/L2 norm) of vector X
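A minimal sketch of vector normalisation using the Euclidean (L2) norm; the example vector is illustrative:

```python
import numpy as np

X = np.array([3.0, 4.0])
X_norm = X / np.linalg.norm(X)   # divide by the vector length ||X|| (here 5.0)
print(X_norm)                    # [0.6 0.8], a unit-length vector
```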
What is Data Imputation?
Data Imputation is a pre-processing step in which you fill in (impute) missing values based on the other samples and features in the dataset.
What are the most common methods for Data Imputation?
- Use mean or median values
- Use the most frequent values
- Use k-nearest neighbours based on feature similarity (a short sketch of mean and k-NN imputation follows this list)
- Use multivariate imputation by chained equations (MICE), i.e. fill the missing data several times using models fitted on the other features, then pool the results
- Estimate the missing value using ML models trained on the other features
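A minimal sketch of mean and k-nearest-neighbour imputation using scikit-learn’s SimpleImputer and KNNImputer; the tiny matrix is illustrative, not lecture data.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

mean_imputer = SimpleImputer(strategy="mean")   # replace NaNs with the column mean
knn_imputer = KNNImputer(n_neighbors=2)         # replace NaNs using the most similar rows

print(mean_imputer.fit_transform(X))   # the NaN becomes (1 + 7) / 2 = 4
print(knn_imputer.fit_transform(X))
```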
What is the ‘Curse of Dimensionality’?
The curse of dimensionality occurs when the data samples are too sparse in the feature space, i.e. there are not enough instances to be densely distributed across the feature space.
When can the curse of dimensionality appear?
As the number of dimensions/features in the dataset increases, the data becomes increasingly sparse, and many algorithms begin to struggle to analyse and generalise from the data effectively.
What are two methods that can be used to solve the ‘Curse of Dimensionality’?
- Increase the number of data samples exponentially with each linear increase in feature dimensions (usually impractical)
- Reduce the number of features (typically more feasible) by using feature selection methods and dimensionality reduction methods (see the sketch below)
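The lecture doesn’t name a specific method here, but as an assumed example, a minimal sketch of dimensionality reduction with PCA from scikit-learn on illustrative random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))     # 100 samples with 50 features (illustrative)

pca = PCA(n_components=10)         # keep the 10 directions of highest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)             # (100, 10)
```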