Data Preparation Flashcards

1
Q

Preprocessing

A

Data Cleaning: missing values, noisy data, outliers, inconsistencies, duplicates.
Data Integration: multiple databases, data cubes, or files.
Data Reduction: dimensionality, numerosity, compression.
Data Transformation and Discretization: normalization, concept hierarchies.
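The cleaning step above can be sketched in plain Python. This is a minimal, illustrative example (field names and the clipping threshold are assumptions, not from the source): drop exact duplicates, fill missing values with the mean, and clip implausible outliers.

```python
# Illustrative records: a duplicate, a missing value, and a noisy outlier.
records = [
    {"id": 1, "age": 29},
    {"id": 1, "age": 29},        # duplicate row
    {"id": 2, "age": None},      # missing value
    {"id": 3, "age": 430},       # noisy outlier
]

# Data cleaning 1: remove duplicates while preserving order.
seen, deduped = set(), []
for r in records:
    key = (r["id"], r["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Data cleaning 2: fill missing ages with the mean of the known ones,
# then clip outliers to a plausible maximum (120 is an assumed bound).
known = [r["age"] for r in deduped if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in deduped:
    if r["age"] is None:
        r["age"] = mean_age
    r["age"] = min(r["age"], 120)
```

Real pipelines would use a smarter imputation than the global mean, but the three sub-steps (dedupe, impute, handle outliers) are the same.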

2
Q

Data reduction steps

A

Aggregation: combining two or more attributes (or objects) into one; aggregated data has less variability.
Sampling: used when processing all the data is too expensive or time-consuming.
Random sampling: without replacement, with replacement, stratified.
Dimensionality Reduction: avoid the curse of dimensionality; reduce the amount of time and memory required.
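The three sampling variants from this card can be sketched with the standard library (the population, labels, and sampling fraction are illustrative assumptions):

```python
import random

random.seed(0)
population = list(range(100))
# Imbalanced classes: 80 items labeled "a", 20 labeled "b".
labels = ["a" if x < 80 else "b" for x in population]

# WITHOUT replacement: each item can be picked at most once.
without = random.sample(population, 10)

# WITH replacement: the same item may be picked more than once.
with_repl = random.choices(population, k=10)

# Stratified: sample each class ("stratum") proportionally,
# so rare classes are not lost in the sample.
def stratified(pop, labs, frac):
    out = []
    for lab in set(labs):
        stratum = [x for x, l in zip(pop, labs) if l == lab]
        out += random.sample(stratum, max(1, int(len(stratum) * frac)))
    return out

strat = stratified(population, labels, 0.1)  # 8 from "a", 2 from "b"
```

Stratified sampling is the usual fix when simple random sampling would under-represent a minority class.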

3
Q

Reduce noise: PCA (numeric attributes), SVD. Other techniques?

A

Feature subset selection: reduce dimensionality by removing redundant features (duplicate information) and irrelevant features.
Feature creation: derive new attributes from existing ones, e.g. age from birthday.
Discretization and Binarization: divide the range of continuous attributes into intervals. Supervised (uses class labels) or unsupervised.
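Unsupervised discretization can be sketched as equal-width binning (one common unsupervised scheme; the attribute values and bin count are illustrative assumptions):

```python
# Equal-width binning: split the attribute's range into n_bins
# intervals of equal width and map each value to a bin index.
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [3, 7, 12, 18, 25, 33, 41, 58, 64]
bins = equal_width_bins(ages, 3)
# bins: [0, 0, 0, 0, 1, 1, 1, 2, 2]
```

A supervised scheme would instead place bin boundaries using the class labels (e.g. entropy-based splitting), which is what "using class" on the card refers to.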

4
Q

PCA

A

Goal: find a projection that captures the largest amount of variation in the data.
Normalize the input data.
Compute k orthonormal vectors (the principal components).
Input data is expressed as a linear combination of the k principal component vectors.
Principal components are sorted in order of decreasing significance (strength).
Weak components (low variance) can be eliminated.
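The steps on this card can be sketched with numpy (the toy data matrix and the choice k=2 are illustrative assumptions):

```python
import numpy as np

# Toy data: 6 samples, 3 correlated features (illustrative values).
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 2.1],
    [2.2, 2.9, 0.3],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.1],
    [2.3, 2.7, 0.4],
])

# Step 1: normalize (center) the input data.
Xc = X - X.mean(axis=0)

# Step 2: compute orthonormal vectors (principal components) via the
# eigendecomposition of the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort components by decreasing significance (variance).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: eliminate weak components -- keep only the top k.
k = 2
X_reduced = Xc @ eigvecs[:, :k]  # rows are linear combinations of the PCs
```

Each row of `X_reduced` is the original (centered) sample expressed in the basis of the k strongest principal components.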

5
Q

SVD

A
M = U E V^T (E is the diagonal matrix of singular values, often written Σ)
	M: original data matrix (n x m)
	U: n examples over r new concepts (n x r)
	E: strength of each concept (r x r, diagonal)
	V: m terms over r concepts (m x r)
For dimensionality reduction:
	Set the smallest singular values to zero (keep only the largest k).
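The reduction on this card corresponds directly to a truncated SVD in numpy (the toy matrix and k=1 are illustrative assumptions):

```python
import numpy as np

# M: original data, n x m (n=4 examples, m=3 terms).
M = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 4.0],
])

# SVD: M = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Dimensionality reduction: zero out the smallest singular values,
# keeping only the largest k, then reconstruct a rank-k approximation.
k = 1
s_k = s.copy()
s_k[k:] = 0.0
M_k = U @ np.diag(s_k) @ Vt
```

`M_k` is the best rank-k approximation of `M` in the least-squares sense (Eckart-Young), which is why truncating the singular values is the standard SVD reduction.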
6
Q

Feature Subset Selection

A

Brute force: try all possible feature subsets as input to the data mining algorithm.
Embedded approaches: feature selection occurs naturally as part of the data mining algorithm.
Filter approaches: features are selected independently of the data mining algorithm.
Wrapper approaches: use the data mining algorithm as a black box to find the best subset of features.
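A filter approach can be sketched in a few lines: score each feature independently of any mining algorithm and drop the low scorers. Here the score is variance (near-constant features treated as irrelevant); the data and the 0.01 threshold are illustrative assumptions.

```python
import numpy as np

# 4 samples, 3 features; the middle feature is nearly constant.
X = np.array([
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.1, 1.0],
    [4.0, 5.0, 1.0],
])

# Filter: score features without running the mining algorithm.
variances = X.var(axis=0)
keep = variances > 0.01          # assumed threshold for this sketch
X_filtered = X[:, keep]          # drop the near-constant feature
```

A wrapper approach would instead train the mining algorithm on each candidate subset and keep the subset with the best evaluation score, which is more accurate but far more expensive.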
