Data Pre-Processing & Feature Analysis Flashcards

Key details from the corresponding lecture

1
Q

What is a boxplot good at visualising?

A

It’s good at visualising the spread of continuous variables
It’s also good for checking for outliers, which appear as individual points beyond the whiskers
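As a quick illustration, a minimal matplotlib sketch with made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up continuous data with a couple of injected outliers
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 200), [95, 102]])

# Points beyond the whiskers (1.5 * IQR by default) are drawn individually
plt.boxplot(values)
plt.ylabel("Value")
plt.show()
```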

2
Q

What is a histogram good at visualising?

A

Good for checking the distribution of the data, by binning values into frequency counts
Note that histograms are designed for continuous variables; the bar chart is the categorical counterpart
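A minimal matplotlib sketch with made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up numeric sample; bins group the values into frequency counts
rng = np.random.default_rng(0)
values = rng.normal(0, 1, 1000)

plt.hist(values, bins=30)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
```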

3
Q

What is a scatter plot good at visualising?

A

It’s good at exploring the relationship between two variables, e.g. revealing correlations or clusters
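A minimal matplotlib sketch with made-up data showing a roughly linear relationship:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two made-up variables with a linear relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + rng.normal(0, 2, 100)

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```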

4
Q

How would you explain one-hot encoding in regards to categorical variables?

A

One-hot encoding is the process of converting categorical variables into a binary vector format, which allows them to be interpreted by algorithms whilst preserving their uniqueness and avoiding any artificial ordering.

5
Q

How do you convert categorical variables using one-hot encoding?

A
  1. Assign each category in the categorical variable an index
  2. Convert each value to a binary vector with a 1 at its category’s index and a 0 everywhere else, e.g. dog = [1, 0], cat = [0, 1], etc. (see the sketch below)
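A minimal numpy sketch of these two steps (category names and data are made up):

```python
import numpy as np

# Step 1: assign each category an index
values = ["dog", "cat", "dog", "bird"]
categories = sorted(set(values))              # ['bird', 'cat', 'dog']
index = {c: i for i, c in enumerate(categories)}

# Step 2: one binary vector per value, with a 1 at its category's index
one_hot = np.zeros((len(values), len(categories)), dtype=int)
for row, v in enumerate(values):
    one_hot[row, index[v]] = 1

print(one_hot)  # dog=[0 0 1], cat=[0 1 0], bird=[1 0 0]
```

In practice, library helpers such as sklearn.preprocessing.OneHotEncoder or pandas.get_dummies are usually used instead of hand-rolling this.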
6
Q

What is data normalisation?

A

Data normalisation is a pre-processing technique used to adjust numerical values so that they all fit within a specific range, such as [0, 1] or [-1, 1].

7
Q

What are the benefits of data normalisation?

A

The common normalisation methods are all linear scaling methods, so they don’t change the shape of the original data’s distribution
It improves the numerical stability of the machine learning model (e.g. when using gradient descent to optimise a network, if different features span very different value ranges, a fixed learning rate will likely overshoot the optimum for some features)

8
Q

What is the equation for Z-Normalisation?

A

X_norm = (X - μ) / σ
Where:
X is the feature vector of original values
μ is the mean of vector X
σ is the standard deviation of vector X
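A one-line numpy sketch of the formula (made-up data):

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])
X_norm = (X - X.mean()) / X.std()  # subtract the mean, divide by the standard deviation
print(X_norm)  # result has mean ~0 and standard deviation ~1
```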

9
Q

What is the equation for Min-Max Normalisation?

A

X_norm = (X - min(X)) / (max(X) - min(X))
Where:
min(X) and max(X) are the minimum and maximum values of vector X respectively
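The same formula as a numpy sketch (made-up data):

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])
X_norm = (X - X.min()) / (X.max() - X.min())  # rescales into [0, 1]
print(X_norm)  # [0, 1/3, 2/3, 1]
```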

10
Q

What is the equation for Vector Normalisation?

A

X_norm = X / ||X||
Where:
||X|| is the Euclidean length (L2 norm) of vector X
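And a numpy sketch of this one (made-up data):

```python
import numpy as np

X = np.array([3.0, 4.0])
X_norm = X / np.linalg.norm(X)  # divide by the Euclidean length, here 5.0
print(X_norm)  # [0.6 0.8], a unit-length vector
```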

11
Q

What is Data Imputation?

A

Data imputation is a pre-processing step where you fill in missing values in the data, based on the other samples in the dataset(s).

12
Q

What are the most common methods for Data Imputation?

A

Use mean or median values
Use the most frequent values
Use k-nearest neighbours based on feature similarity (see the sketch after this list)
Use multivariate imputation by chained equations (MICE), i.e. fill the missing data with multiple candidate values, then pool the final result
Estimate the missing values using ML models trained on the other features
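A minimal scikit-learn sketch of two of these methods, mean and k-nearest-neighbour imputation, on a made-up array with missing entries:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Made-up data with missing entries marked as NaN
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Mean imputation: replace each NaN with its column's mean
print(SimpleImputer(strategy="mean").fit_transform(X))

# KNN imputation: fill each NaN from the most similar rows
print(KNNImputer(n_neighbors=2).fit_transform(X))
```

scikit-learn also ships an experimental IterativeImputer for the MICE-style chained-equations approach.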

13
Q

What is the ‘Curse of Dimensionality’?

A

The Curse of Dimensionality is when the data samples are too sparse in the feature space, i.e. the number of instances is not large enough for them to be densely distributed in the feature space.

14
Q

When can the curse of dimensionality appear?

A

As the number of dimensions/features in the dataset increases, the data becomes increasingly sparse, and many algorithms begin to struggle to analyse and generalise from it effectively.
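One way to see this (a small numpy sketch with made-up uniform data): with a fixed sample budget, the fraction of points landing within a fixed distance of the centre of the feature space collapses as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # fixed number of samples

for d in [1, 2, 5, 10, 20]:
    # n points drawn uniformly from the unit hypercube [0, 1]^d
    points = rng.uniform(0, 1, size=(n, d))
    # Fraction of points within Euclidean distance 0.5 of the centre
    near = np.linalg.norm(points - 0.5, axis=1) < 0.5
    print(f"d={d:2d}: {near.mean():.4f} of points are 'near' the centre")
```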

15
Q

What are two methods that can be used to solve the ‘Curse of Dimensionality’?

A

Increase the number of data samples exponentially with each linear increase in feature dimensions
Reduce the number of features (typically more feasible) using feature selection methods and dimensionality reduction methods, as the sketch below illustrates
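A minimal scikit-learn sketch of the second option, using PCA (one common dimensionality reduction method) on made-up, largely redundant features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up dataset: 100 samples, 50 features built from only 5 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(100, 50))

# Project the 50 features down to 5 principal components
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (100, 5)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this data
```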
