2. Data Pre-processing Flashcards

1
Q

Types/Techniques of Data Preprocessing

A
  • Aggregation
  • Sampling (simple random, with replacement, stratified)
  • Dimensionality reduction
  • Discretization + Binarization
  • Attribute Transformation (normalization, standardization)
  • Attribute Subset Selection
  • Attribute Creation

(Data visualization, for exploring relationships between attributes)

2
Q

Aggregation purpose

A
  • change of scale
  • data reduction
  • reduced variability (aggregated values are more stable)
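
As an illustration of change of scale and data reduction, the sketch below rolls daily sales records up into monthly totals. The records, their layout, and the function name `aggregate_monthly` are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical daily records: (date "YYYY-MM-DD", sales amount)
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-01", 200.0),
    ("2024-02-14", 50.0),
    ("2024-02-28", 150.0),
]

def aggregate_monthly(records):
    """Sum daily amounts per month (change of scale: day -> month)."""
    totals = defaultdict(float)
    for date, amount in records:
        month = date[:7]  # keep only "YYYY-MM"
        totals[month] += amount
    return dict(totals)

monthly_sales = aggregate_monthly(daily_sales)
print(monthly_sales)  # {'2024-01': 200.0, '2024-02': 400.0}
```

Five daily records collapse into two monthly ones, which is the data reduction named on the card.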
3
Q

Sampling purpose

A
  • Reduce the time and cost of processing the full data set
    (a form of data reduction)
4
Q

Sampling techniques

A
  • Simple random sampling
  • Sampling with replacement
  • Stratified sampling (split the data into partitions, then sample from each)
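
The three techniques can be sketched with Python's standard `random` module. The helper names, the toy data set, and the choice to treat simple random sampling as sampling without replacement are assumptions for this example.

```python
import random
from collections import defaultdict

def simple_random_sample(data, k, rng):
    """Draw k distinct items (here: without replacement)."""
    return rng.sample(data, k)

def sample_with_replacement(data, k, rng):
    """Draw k items; the same item may be picked more than once."""
    return rng.choices(data, k=k)

def stratified_sample(data, key, per_stratum, rng):
    """Partition data by key(item), then sample from every partition."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical data set: (object id, class label), with a rare class "b"
data = [(i, "a") for i in range(90)] + [(i, "b") for i in range(90, 100)]

srs = simple_random_sample(data, 10, rng)
swr = sample_with_replacement(data, 10, rng)
strat = stratified_sample(data, key=lambda item: item[1], per_stratum=5, rng=rng)
```

Stratified sampling guarantees that the rare class "b" is represented, which a small simple random sample may fail to do.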
5
Q

Discretization purpose

A

Change of scale: maps a continuous attribute to a discrete (categorical or ordinal) one
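
One simple way to discretize is equal-width binning, sketched below. The function name `equal_width_bins` and the sample ages are assumptions; other schemes (equal-frequency, clustering-based) are equally valid.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [3, 17, 25, 38, 41, 59, 63]  # hypothetical continuous attribute
age_bins = equal_width_bins(ages, 3)
print(age_bins)  # [0, 0, 1, 1, 1, 2, 2]
```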

6
Q

Binarization purpose

A

Change of scale: maps an attribute to one or more binary (0/1) attributes
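
A common form of binarization is one-hot encoding, where each category becomes its own 0/1 attribute. The sketch below assumes a small categorical attribute; the function name `one_hot` and the data are illustrative only.

```python
def one_hot(values):
    """Encode a categorical attribute as one binary attribute per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

colors = ["red", "green", "blue", "green"]  # hypothetical categorical attribute
categories, encoded = one_hot(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```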

7
Q

Attribute transformation purpose

A

Change of scale

Maps the entire set of values of an attribute to a new set of values,
e.g. via simple functions: log(x), |x|, e^x, x^k, normalization, standardization
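
Normalization and standardization can be sketched as follows, assuming min-max scaling to [0, 1] for the former and z-scores for the latter. The function names and the sample values are assumptions, and the code assumes the values are not all equal (otherwise it would divide by zero).

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

incomes = [20.0, 30.0, 40.0, 50.0, 60.0]  # hypothetical attribute values
normalized = min_max_normalize(incomes)
standardized = standardize(incomes)
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

In both cases each old value maps to exactly one new value, so the ordering of the objects is preserved.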

8
Q

Dimensionality reduction purpose

A
  • Reduce time and memory costs for data mining algorithms
  • Remove irrelevant features (noise)
  • Make visualization easier

Example: PCA

9
Q

Dimensionality reduction techniques

A
  • Principal Component Analysis (PCA)
  • Attribute Subset Selection (useful when the data set contains irrelevant or redundant attributes)
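
To make PCA concrete, the sketch below finds the first principal component of a small data set with the standard library only. Power iteration is just one simple way to obtain that component and is chosen here for brevity; in practice a numerical library would normally be used. The function name, the iteration count, and the data are assumptions, and the code assumes the data has non-zero variance.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, projections) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center every attribute at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply cov and renormalize, which converges
    # to the eigenvector with the largest eigenvalue (maximum variance).
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered object onto the direction: d attributes -> 1.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections

# Hypothetical 2-D data lying close to the line y = x
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
direction, reduced = first_principal_component(points)
```

Because the points lie near the line y = x, the direction found is close to the diagonal, and `reduced` describes each object with one value instead of two.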
10
Q

Attribute creation purpose

A

To capture important information more effectively than the original attributes do

(Map existing attributes onto newly created ones)
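
A small sketch of attribute creation: deriving density from mass and volume, so that a single new attribute captures what matters for telling materials apart. The objects, the attribute names, and the helper `add_density` are assumptions for this example.

```python
# Hypothetical objects described by mass (g) and volume (cm^3)
objects = [
    {"mass": 193.0, "volume": 10.0},
    {"mass": 27.0, "volume": 10.0},
]

def add_density(obj):
    """Return a copy of obj with a derived 'density' attribute (mass / volume)."""
    return {**obj, "density": obj["mass"] / obj["volume"]}

with_density = [add_density(o) for o in objects]
```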

11
Q

Data visualization techniques

A
  • Box plot: percentiles + outliers
  • Scatter plot: relationship analysis (2D or 3D)
  • Mean graph: relationship analysis (categorical attributes)
  • Matrix plot: relationship analysis (attributes often normalized so that no single attribute dominates)
  • Parallel coordinates (each object is a line): relationship analysis for higher-dimensional data
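
The numbers behind a box plot can be computed without any plotting library. The sketch below assumes quartiles as the percentiles and the common 1.5 × IQR rule for flagging outliers; box plot variants differ on both points, and the function name and data are illustrative.

```python
import statistics

def box_plot_stats(values):
    """Return quartiles and the points a box plot would flag as outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1  # interquartile range: the height of the box
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

measurements = [12, 14, 14, 15, 16, 17, 18, 19, 20, 95]  # hypothetical data
stats = box_plot_stats(measurements)
print(stats["outliers"])  # [95]
```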