Data Preparation Flashcards
What general insights into the data does the Data Understanding phase provide?
-The existence and types of attributes, missing values, and outliers
-Dependencies between attributes
The Data Preparation phase uses collected information to:
-select attributes
-reduce the dimensionality of the data set
-select records
-treat missing values
-treat outliers
-transform data
-improve data quality
What does transforming data via non-linear transformations mean?
Applying non-linear functions such as 1/x or log(x) to attributes so that they can be used in linear methods.
How to find good transformations?
- prior knowledge (how does x depend on y)
- visualization (scatter plot x vs y)
- trial and error (see how different transformations do)
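A minimal sketch of the trial-and-error approach, assuming made-up data where y depends on 1/x. The candidate list, the data, and the helper `pearson` are illustrative assumptions; each transformation is scored by how linear the transformed attribute is with respect to y.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

# Hypothetical data: y depends on x through 1/x, not linearly.
x = [1, 2, 3, 4, 5, 6, 8, 10]
y = [1 + 5 / v for v in x]

# Candidate transformations to try (an illustrative selection).
candidates = {
    "x": lambda v: v,
    "1/x": lambda v: 1 / v,
    "log(x)": math.log,
    "sqrt(x)": math.sqrt,
}

# Score each candidate by the absolute correlation with y after transforming x.
scores = {name: abs(pearson([f(v) for v in x], y)) for name, f in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A scatter plot of each transformed attribute against y would show the same thing visually: only the best-scoring transformation puts the points on a straight line.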
PCA in feature extraction
PCA can be used as a feature extraction method, but the extracted features can be difficult to interpret afterwards.
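A rough sketch of PCA as feature extraction on made-up 2-D data, using power iteration on the covariance matrix to find the first principal component. The data and variable names are assumptions for illustration; a real analysis would normally use a linear algebra library.

```python
import math

# Hypothetical 2-D data whose points lie close to the line y = x.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

n = len(data)
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n
centered = [(p[0] - mx, p[1] - my) for p in data]

# 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(a * a for a, _ in centered) / (n - 1)
syy = sum(b * b for _, b in centered) / (n - 1)
sxy = sum(a * b for a, b in centered) / (n - 1)

# Power iteration converges to the eigenvector with the largest eigenvalue,
# which is the direction of the first principal component.
v = (1.0, 0.0)
for _ in range(100):
    w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
    norm = math.hypot(w[0], w[1])
    v = (w[0] / norm, w[1] / norm)

# The extracted feature: each point's coordinate along the first component.
pc1 = [a * v[0] + b * v[1] for a, b in centered]
print(v, pc1)
```

The new feature `pc1` is a weighted mix of both original attributes, which illustrates why principal components can be hard to interpret.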
Feature extraction for complex data types
-Text data analysis: frequency of keywords
-Time series/image data: Fourier or wavelet coefficients
-Graph data: number of vertices, edges
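A minimal sketch of the three kinds of feature extraction listed above, on tiny made-up inputs. The keyword list, the series, and the edge list are all hypothetical, and the plain discrete Fourier transform here stands in for whatever transform a real pipeline would use.

```python
import cmath
from collections import Counter

# Text: frequency of selected keywords (the keyword list is hypothetical).
text = "data mining finds patterns in data and data quality matters"
keywords = ["data", "patterns", "model"]
counts = Counter(text.split())
text_features = [counts[k] for k in keywords]

# Time series: magnitudes of the discrete Fourier coefficients.
# This series repeats every 4 samples, so it completes 2 cycles in 8 samples.
series = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
n = len(series)

def dft_magnitude(k):
    """Magnitude of the k-th discrete Fourier coefficient of `series`."""
    return abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                   for t, x in enumerate(series)))

fourier_features = [dft_magnitude(k) for k in range(n // 2 + 1)]

# Graph: number of vertices and edges, computed from an edge list.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
vertices = {v for e in edges for v in e}
graph_features = [len(vertices), len(edges)]

print(text_features, fourier_features, graph_features)
```

In each case a complex object is reduced to a short numeric vector that standard analysis methods can work with.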
Define Feature selection
Choosing a subset of features (attributes) as small as possible and sufficient for data analysis.
Why would features be removed?
-not relevant
-bad quality (missing, incorrect data)
-non-informative (has the same value for all instances)
-redundancy (identical or closely related to another feature)
-timeliness (old)
-representativeness
-rare events
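A small sketch of two of these removal criteria, non-informative and redundant features, on a made-up table. The column names and values are hypothetical, and only exact duplicates are detected here; closely related features would need something like a correlation threshold.

```python
# Hypothetical data set: column name -> list of values.
table = {
    "age":        [23, 35, 41, 29],
    "country":    ["NL", "NL", "NL", "NL"],   # same value for all instances
    "height_cm":  [170, 182, 165, 177],
    "height_cm2": [170, 182, 165, 177],       # identical to height_cm
}

selected = {}
for name, values in table.items():
    if len(set(values)) <= 1:
        continue  # non-informative: constant column
    if any(values == kept for kept in selected.values()):
        continue  # redundant: duplicates an already kept column
    selected[name] = values

print(list(selected))
```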
Data cleansing/scrubbing
Detecting and then correcting or removing inaccurate, incorrect, or incomplete data.
Types of missing values
MCAR - missing completely at random
OAR - observed at random
MAR - missing at random
Discretization techniques
Splitting a numerical range into a finite number of bins. Equi-width, equi-frequency…
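A minimal sketch of equi-width and equi-frequency binning, assuming the values are not all identical. The function names and the sample data are illustrative, and ties in the equi-frequency version may be split across neighbouring bins.

```python
def equi_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equi_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equally many values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

# Hypothetical data with one outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = equi_width_bins(values, 3)
freq_bins = equi_frequency_bins(values, 3)
print(width_bins)
print(freq_bins)
```

With the outlier 100 in the data, equi-width binning puts almost every value into the first bin, while equi-frequency binning still spreads the values evenly.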
Z score standardization
z = (x - mean(x)) / (standard deviation of x)
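A short sketch of z-score standardization: subtract the mean from each value and divide by the standard deviation. It assumes the population standard deviation; the sample data and the function name `z_scores` are illustrative.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    # Population standard deviation; statistics.stdev would divide by n - 1.
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical data with mean 5 and population standard deviation 2.
z = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```

After standardization, a value of 9 becomes 2.0, meaning it lies two standard deviations above the mean.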