Data Preparation Flashcards

1
Q

What general insights into the data does the Data Understanding phase provide?

A

-The existence and types of attributes, missing values, and outliers
-Dependencies between attributes

2
Q

The Data Preparation phase uses collected information to:

A

-select attributes
-reduce the dimensionality of the data set
-select records
-treat missing values
-treat outliers
-transform data
-improve data quality

3
Q

What does transforming data via non-linear transformations mean?

A

Transforming attributes with non-linear functions such as 1/x or log(x) so that they can be used in linear methods.
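
A minimal sketch of this idea, assuming made-up data in which y depends on log(x): the attribute is transformed first, and an ordinary least-squares line is then fitted to the transformed values. The helper name `fit_line` and the data are illustrative assumptions.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical data: y grows with log(x), not with x itself.
xs = [1, 2, 4, 8, 16, 32]
ys = [3.0 + 2.0 * math.log(x) for x in xs]

# Transform the attribute, then apply the linear method to the transformed values.
log_xs = [math.log(x) for x in xs]
a, b = fit_line(log_xs, ys)
```

Fitting the line to the raw `xs` instead would leave systematic error, because the relationship is only linear after the transformation.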

4
Q

How can good transformations be found?

A

1. Prior knowledge (how does x depend on y)
2. Visualization (scatter plot x vs y)
3. Trial and error (see how different transformations do)
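
The trial-and-error option can be sketched as follows, assuming that "how well a transformation does" is measured by the absolute Pearson correlation between the transformed x and y. That scoring choice, the candidate list, and the data are all assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data in which y = 1/x.
xs = [1, 2, 4, 5, 10, 20]
ys = [1.0 / x for x in xs]

# Candidate transformations to try out.
candidates = {
    "x": lambda x: x,
    "log(x)": math.log,
    "1/x": lambda x: 1.0 / x,
}

# Score each candidate by how linear the transformed attribute is in y.
scores = {name: abs(pearson([f(x) for x in xs], ys))
          for name, f in candidates.items()}
best = max(scores, key=scores.get)
```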
5
Q

PCA in feature extraction

A

PCA can be used as a feature extraction method, but the resulting features can be difficult to interpret.
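
A rough sketch of PCA as feature extraction, using power iteration on the covariance matrix to find only the first principal component. The function name, the data, and the choice of power iteration are assumptions for illustration; a library implementation would normally be used instead.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, projected scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The extracted feature is the projection of each record onto that direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical two-attribute data lying roughly along the line y = 2x.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, scores = first_principal_component(rows)
```

Each value in `scores` mixes both original attributes, which is why the extracted feature is harder to interpret than either attribute alone.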

6
Q

Feature extraction for complex data types

A

-Text data analysis: frequency of keywords
-Time series/image data: Fourier or wavelet coefficients
-Graph data: number of vertices, edges
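
The three kinds of extracted features listed above can be sketched with the standard library alone. The keyword list, the signal, and the edge list are made-up examples, and `dft_magnitudes` is a hypothetical helper that computes a plain discrete Fourier transform.

```python
import cmath
from collections import Counter

# Text: frequency of chosen keywords (the keyword list is a hypothetical choice).
text = "data preparation improves data quality before data analysis"
keywords = ["data", "quality", "model"]
counts = Counter(text.lower().split())
keyword_freq = {k: counts[k] for k in keywords}

# Time series: magnitudes of the discrete Fourier coefficients.
def dft_magnitudes(signal):
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))) / n
            for k in range(n)]

series = [0.0, 1.0, 0.0, -1.0] * 2   # hypothetical signal: 2 cycles in 8 samples
magnitudes = dft_magnitudes(series)

# Graph: number of vertices and edges, taken from an edge list.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
vertices = {v for edge in edges for v in edge}
graph_features = {"vertices": len(vertices), "edges": len(edges)}
```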

7
Q

Define feature selection

A

Choosing a subset of features (attributes) that is as small as possible yet still sufficient for the data analysis.

8
Q

Why would features be removed?

A

-not relevant
-bad quality (missing, incorrect data)
-non-informative (has the same value for all instances)
-redundancy (identical or closely related to another feature)
-timeliness (old)
-representativeness
-rare events
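
Two of the reasons above, non-informative and redundant features, are easy to check mechanically. The sketch below assumes a table stored as a dict of columns and treats only exact duplicates as redundant; the function name `select_features` and the data are hypothetical.

```python
def select_features(table):
    """table maps feature name -> list of values; returns the names to keep."""
    kept = []
    for name, values in table.items():
        # Non-informative: the same value for all instances.
        if len(set(values)) <= 1:
            continue
        # Redundant: identical to a feature that is already kept.
        if any(table[k] == values for k in kept):
            continue
        kept.append(name)
    return kept

table = {
    "age": [23, 35, 41, 29],
    "country": ["NL", "NL", "NL", "NL"],
    "age_copy": [23, 35, 41, 29],
    "income": [30, 52, 61, 40],
}
kept = select_features(table)
```

Closely related (but not identical) features would need a similarity measure such as correlation, which this sketch leaves out.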

9
Q

Data cleansing/scrubbing

A

Detecting, correcting, or removing inaccurate, incorrect, or incomplete data.

10
Q

Types of missing values

A

MCAR - missing completely at random
OAR - observed at random
MAR - missing at random

11
Q

Discretization techniques

A

Splitting a numerical range into a finite number of bins. Equi-width, equi-frequency…
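
The two named techniques can be sketched as follows. The function names and data are illustrative assumptions, and the equi-width version assumes the values are not all equal, since the bin width would otherwise be zero.

```python
def equi_width_bins(values, k):
    """Assign each value to one of k bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum is placed in the last bin rather than in a (k+1)-th one.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equi_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

# Hypothetical data with one extreme value.
values = [1, 2, 3, 4, 5, 100]
width_bins = equi_width_bins(values, 3)     # [0, 0, 0, 0, 0, 2]
freq_bins = equi_frequency_bins(values, 3)  # [0, 0, 1, 1, 2, 2]
```

With the extreme value 100, equi-width binning puts almost everything into the first bin, while equi-frequency binning spreads the records evenly. Note that this simple rank-based version may place tied values in different bins.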

12
Q

Z score standardization

A

z = (x - mean(x)) / (standard deviation of x)
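
A minimal sketch of z-score standardization in Python. Using the population standard deviation (`statistics.pstdev`) is an assumption, since the card does not say whether the population or the sample version is meant; the data are made up.

```python
import statistics

def z_scores(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / std for v in values]

values = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population standard deviation 2
standardized = z_scores(values)
```

After standardization the values have mean 0 and standard deviation 1, which makes attributes measured on different scales comparable.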
