Data Preprocessing Flashcards
8 Steps of Data Preprocessing
Data cleaning
Sampling
Aggregation
Discretization and Binarization
Feature Transformation & Scaling
Dimensionality Reduction
Feature subset selection
Feature Creation
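Several of these steps chain naturally in scikit-learn; below is a minimal sketch, assuming a purely numeric feature matrix (the values and the choice of steps are illustrative, not part of the cards):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical numeric feature matrix with one missing value.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 260.0],
              [4.0, 310.0]])

# Cleaning (imputation) -> feature scaling -> dimensionality reduction.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=1)),
])
print(pipeline.fit_transform(X))
```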
5 Types of Data Errors
Missing data
Structural errors - Typographical errors and other inconsistencies.
Duplicate data
Irrelevant data
Outliers
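A pandas sketch of how each error type might be detected or repaired; the records, column names, and the 1.5 * IQR outlier rule are illustrative assumptions:

```python
import pandas as pd

# Hypothetical records exhibiting all five error types.
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],           # irrelevant to analysis
    "city": ["London", "london ", "Paris", "Paris", "Oslo", None],
    "age":  [34, 34, 29, 29, 420, 31],          # 420 is an implausible outlier
})

print(df.isna().sum())                            # missing data
df["city"] = df["city"].str.strip().str.title()   # structural errors (case, stray spaces)
df = df.drop_duplicates(subset=["city", "age"])   # duplicate data
df = df.drop(columns=["record_id"])               # irrelevant data

# Outliers: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)])
```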
Sampling
The process of selecting a representative subset of individuals, items, or events from a larger population, in order to estimate or infer information about the population as a whole.
2 Types of Sampling
Simple Random Sampling
Stratified Sampling
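Both can be done in a couple of lines of pandas; the population and the 20% sampling fraction below are invented for illustration:

```python
import pandas as pd

# Hypothetical population with an imbalanced class label.
df = pd.DataFrame({
    "value": range(100),
    "label": ["A"] * 90 + ["B"] * 10,
})

# Simple random sampling: every row has the same chance of selection.
simple = df.sample(n=20, random_state=0)

# Stratified sampling: sample within each label group, preserving proportions.
stratified = df.groupby("label", group_keys=False).sample(frac=0.2, random_state=0)
print(stratified["label"].value_counts())  # ~18 A, 2 B, mirroring the population
```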
4 Motivations of Aggregation
A change of scope or scale
Data reduction
Noise reduction
Cheaper computation (aggregated data is smaller and faster to process)
1 Disadvantage of Aggregation
Potential loss of interesting details.
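For instance, aggregating hypothetical daily sales into monthly figures with a pandas groupby (the data is made up):

```python
import pandas as pd

# Hypothetical daily sales records.
daily = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "store": ["X", "Y", "X", "Y"],
    "sales": [100, 150, 120, 130],
})

# Aggregating to months changes the scale, shrinks the data, and smooths
# day-to-day noise -- at the cost of losing per-day detail.
monthly = daily.groupby("month", sort=False)["sales"].agg(["sum", "mean"])
print(monthly)
```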
3 Types of Unsupervised Discretization
Equal Interval Width
Equal Frequency
K-means - Divides values into discrete groups (clusters)
1 Type of Supervised Discretization
Decision Tree
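scikit-learn's KBinsDiscretizer covers the three unsupervised strategies directly: strategy='uniform' gives equal interval width, 'quantile' gives equal frequency, and 'kmeans' clusters the values. The supervised variant would instead take its bin edges from a decision tree's split thresholds. The toy values below are illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.0], [2.0], [6.0], [7.0], [8.0], [20.0]])

for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    print(strategy, disc.fit_transform(X).ravel())
```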
Feature Transformation & Scaling
A function that maps the entire set of values of a given attribute to a new set of replacement values.
2 Algorithms Sensitive to Feature Scaling
Gradient Descent Based Algorithms
Distance-Based Algorithms
1 Algorithm insensitive to the scale of the features
Tree-Based Algorithms
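A small numeric illustration of why: with raw features, Euclidean distance is dominated by whichever feature has the largest scale, which misleads distance-based methods, while a tree is unaffected because its splits depend only on the ordering of values. The income/age values here are invented:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Income (large scale) and age (small scale) for three people.
X = np.array([[30_000.0, 25.0],
              [32_000.0, 60.0],
              [30_500.0, 26.0]])

print(np.linalg.norm(X[0] - X[1]))  # ~2000.3: the 35-year age gap barely registers
print(np.linalg.norm(X[0] - X[2]))  # ~500.0: driven almost entirely by income

X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # age now contributes comparably
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))
```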
3 Types of Feature Transformation
|x| Transformation - For a variable symmetric around zero; folds both tails onto one side.
1/x Transformation - Moves values at the upper end of the distribution to the lower end (and reverses their order).
Log Transform - Converts a skewed distribution into a less-skewed or approximately normal distribution.
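Each transform is one line of NumPy; the sample values are illustrative:

```python
import numpy as np

# |x|: folds a variable that is symmetric around zero onto one side.
y = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(np.abs(y))

# 1/x: moves the upper end of the distribution to the lower end (order reverses).
x = np.array([1.0, 2.0, 10.0, 100.0])
print(1.0 / x)

# log: compresses the long right tail of a skewed distribution.
print(np.log(x))
```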
4 Types of Feature Scaling
Max Abs Scaling (Between -1 and +1)
Min-Max Scaling (Between 0 and +1)
Standardisation (Z-Score Normalisation) (Typically between -3 and +3)
Normal distribution / Gaussian distribution (Bell-shaped curve)
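The first three scalers are available directly in scikit-learn. A minimal sketch on made-up values; note that standardisation is not strictly bounded, values merely tend to fall within about -3 to +3 for roughly normal data:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

X = np.array([[-4.0], [0.0], [2.0], [8.0]])

print(MaxAbsScaler().fit_transform(X).ravel())    # within [-1, +1]
print(MinMaxScaler().fit_transform(X).ravel())    # within [0, +1]
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
```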
Normal distribution
A probability distribution of a random variable that is symmetric about its mean, producing the characteristic bell-shaped curve.
Dimensionality Reduction
The process of reducing the number of input features in a dataset while preserving the key information or patterns.
Principal Components / Latent Variables
Features that are combinations of the original features
Principal Component Analysis (PCA)
A technique to transform a high-dimensional dataset into a lower dimensional space, while retaining as much of the information in the original data as possible.
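A minimal PCA sketch on synthetic data, assuming three features of which the first two are nearly redundant (everything below is invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical 3-D data that mostly varies along a single direction.
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + rng.normal(scale=0.1, size=(100, 1)),
               rng.normal(scale=0.1, size=(100, 1))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance each component retains
```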
5 Motivations of Dimensionality Reduction
Reduced noise
Enhanced interpretability
To visualise the data and gain insights
To speed up a subsequent training algorithm
To save space (compression)
4 Drawbacks of Dimensionality Reduction
Some information is lost
Transformed features are often hard to interpret
Can be computationally intensive
Adds some complexity to pipelines
3 Approaches of Feature Subset Selection
Embedded approaches - Selection occurs naturally as part of the data mining algorithm, e.g. a decision tree
Filter approaches - Features are selected before the data mining algorithm runs, e.g. keeping attributes with low correlation to one another
Wrapper approaches - Search for the best subset of attributes using the model itself (see the sketch after this list)
Forward selection - Adds features one by one
Backward selection - Removes features one by one
Bi-directional elimination (Stepwise Selection)
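scikit-learn's SequentialFeatureSelector implements the wrapper approach: direction='forward' adds features one by one, direction='backward' removes them one by one. A sketch on the built-in iris data; the choice of k-NN as the wrapped estimator and of 2 features is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: greedily add whichever feature most improves
# cross-validated accuracy of the wrapped model.
selector = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=2, direction="forward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```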