Data Preprocessing Flashcards

1
Q

8 Steps of Data Preprocessing

A

Data cleaning
Sampling
Aggregation
Discretization and Binarization
Feature Transformation & Scaling
Dimensionality Reduction
Feature subset selection
Feature Creation

2
Q

5 Types of Data Errors

A

Missing data
Structural errors - Typographical errors and other inconsistencies.
Duplicate data
Irrelevant data
Outliers
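
A minimal pandas sketch (the DataFrame, column names, and z-score threshold are made up for illustration) of how some of these error types can be spotted:

```python
import numpy as np
import pandas as pd

# hypothetical records with deliberate problems
df = pd.DataFrame({
    "age":  [25, np.nan, 30, 30, 28, 27, 130],
    "city": ["NY", "NY", "LA", "LA", "la", "SF", "SF"],
})

print(df.isna().sum())                  # missing data per column
print(df.duplicated().sum())            # exact duplicate rows
print(df["city"].str.upper().unique())  # structural errors: inconsistent casing
z = (df["age"] - df["age"].mean()) / df["age"].std()
print(df[z.abs() > 2])                  # crude z-score check flags the outlier age
```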

3
Q

Sampling

A

The process of selecting a representative subset of individuals, items, or events from a larger population, in order to estimate or infer information about the population as a whole.

4
Q

2 Types of Sampling

A

Simple Random Sampling
Stratified Sampling
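
A minimal sketch of both types, assuming a made-up pandas DataFrame with a class column named "label"; the stratified split reuses scikit-learn's train_test_split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy population: 7 rows of class 0, 3 rows of class 1
df = pd.DataFrame({"x": range(10), "label": [0] * 7 + [1] * 3})

# simple random sampling: every row has the same chance of selection
simple = df.sample(frac=0.5, random_state=0)

# stratified sampling: class proportions are preserved in the sample
strat, _ = train_test_split(df, train_size=0.5,
                            stratify=df["label"], random_state=0)
```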

5
Q

4 Motivations for Aggregation

A

A change of scope or scale
Data reduction
Noise reduction
Cheaper computation (aggregated data is faster to process)
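
A small pandas sketch (the per-day sales records are hypothetical) showing how aggregation changes the scope, shrinks the data, and smooths noise:

```python
import pandas as pd

# hypothetical daily sales records
daily = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02",
                            "2024-01-01", "2024-01-02"]),
    "sales": [100, 120, 80, 90],
})

# one row per store: fewer rows (data reduction), day-to-day noise is
# averaged out, and the scope changes from daily sales to store-level sales
per_store = daily.groupby("store")["sales"].mean()
```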

6
Q

1 Disadvantage of Aggregation

A

Potential loss of interesting details.

7
Q

3 Types of Unsupervised Discretization

A

Equal Interval Width
Equal Frequency
K-means - Divides the values into discrete groups (clusters)
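
All three can be illustrated with scikit-learn's KBinsDiscretizer, whose strategy parameter corresponds to these types (the data below is made up):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.default_rng(0).normal(size=(100, 1))  # toy continuous feature

# "uniform"  -> equal interval width
# "quantile" -> equal frequency
# "kmeans"   -> 1-D k-means clustering of the values
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    X_binned = disc.fit_transform(X)
```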

8
Q

1 Type of Supervised Discretization

A

Decision Tree
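
One way to sketch this, using a shallow scikit-learn decision tree whose learned split thresholds act as bin edges (the feature and labels are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=200).reshape(-1, 1)   # continuous feature to discretize
y = (x.ravel() > 0.3).astype(int)         # class labels guide the cut points

# a shallow tree chooses thresholds that best separate the classes
tree = DecisionTreeClassifier(max_depth=2).fit(x, y)
cut_points = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaves
```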

9
Q

Feature Transformation & Scaling

A

A function that maps the entire set of values of a given attribute to a new set of replacement values.

10
Q

2 Algorithms sensitive to feature scaling

A

Gradient Descent Based Algorithms
Distance-Based Algorithms

11
Q

1 Algorithm insensitive to the scale of the features

A

Tree-Based Algorithms

12
Q

3 Types of Feature Transformation

A

|x| Transformation - For a variable that is symmetric around zero; folds negative and positive values onto the same scale.
1/x Transformation - Maps the upper end of the distribution to the lower end (and vice versa).
Log Transform - Converts a skewed distribution into a normal / less-skewed distribution.
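
A minimal NumPy sketch of the three transformations on made-up values:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 10.0, 100.0])       # right-skewed, positive values

abs_demo = np.abs(np.array([-3.0, -1.0, 2.0]))   # |x|: folds values symmetric around zero
recip = 1.0 / x                                  # 1/x: large values become small, and vice versa
logged = np.log(x)                               # log: pulls in the long right tail
```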

13
Q

4 Types of Feature Scaling

A

Max Abs Scaling (Between -1 and +1)
Min-Max Scaling (Between 0 and +1)
Standardisation (Z-Score Normalisation) (Between -3 and +3)
Normal distribution / Gaussian distribution (Bell-shaped curve)
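
A short scikit-learn sketch of the first three scaling methods on a made-up feature:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [50.0]])  # toy single-feature column

X_maxabs = MaxAbsScaler().fit_transform(X)    # roughly within [-1, +1]
X_minmax = MinMaxScaler().fit_transform(X)    # within [0, 1]
X_std    = StandardScaler().fit_transform(X)  # mean 0, std 1; most values land within about ±3
```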

14
Q

Normal distribution

A

A symmetric, bell-shaped probability distribution of a random variable, centred on its mean.
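
For reference, the probability density function of a normal distribution with mean μ and standard deviation σ:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```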

15
Q

Dimensionality Reduction

A

The process of reducing the number of input features in a dataset while preserving the key information or patterns.

16
Q

Principal Components / Latent Variables

A

Features that are combinations of the original features

17
Q

Principal Component Analysis (PCA)

A

A technique to transform a high-dimensional dataset into a lower dimensional space, while retaining as much of the information in the original data as possible.
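
A minimal scikit-learn sketch, assuming made-up data with 10 features reduced to 2 principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 10))  # toy high-dimensional data

pca = PCA(n_components=2)             # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)      # shape (200, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component retains
```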

18
Q

5 Motivations for Dimensionality Reduction

A

Reduced noise
Enhanced interpretability
To visualise the data and gain insights
To speed up a subsequent training algorithm
To save space (compression)

19
Q

4 Drawbacks of Dimensionality Reduction

A

Some information is lost
Transformed features are often hard to interpret
Can be computationally intensive
Adds some complexity to pipelines

20
Q

3 Approaches to Feature Subset Selection

A

Embedded approaches - Selection occurs naturally as part of the data mining algorithm, e.g. a decision tree

Filter approaches - Attributes are selected before running the data mining algorithm, e.g. keeping attributes with low correlation to one another

Wrapper approaches - Search for the best subset of attributes by repeatedly evaluating the model (see the sketch below)
Forward selection - Adds attributes one by one
Backward selection - Removes attributes one by one
Bi-directional elimination (Stepwise Selection)
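
A minimal sketch of the wrapper approach (forward selection) using scikit-learn's SequentialFeatureSelector; the Iris data and the logistic regression estimator are just illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# forward selection: add attributes one by one, keeping the subset
# that scores best with the chosen model
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=2,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```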