Data Preparation Flashcards
Preprocessing
Data Cleaning: missing values, noisy data, outliers, inconsistencies, duplicates.
Data Integration: multiple databases, data cubes, or files.
Data Reduction: dimensionality, numerosity, compression
Data Transformation and Discretization: normalization, concept hierarchies.
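The normalization mentioned above can be sketched as min-max scaling (a minimal sketch using NumPy; the function name is my own, not from the source):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    # Linearly rescale the values of x into [new_min, new_max].
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

ages = np.array([20, 30, 40, 50])
norm = min_max_normalize(ages)  # smallest value maps to 0, largest to 1
```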
Steps
Aggregation: combining two or more attributes (or objects) into one. Aggregated data has less variability.
Sampling: used when processing all the data is too expensive or time consuming.
Types: simple random, without replacement, with replacement, stratified.
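The sampling types above can be illustrated with NumPy (a sketch; the data, labels, and sampling fraction are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)

# Simple random sample without replacement (no item appears twice)
s1 = rng.choice(data, size=10, replace=False)
# Simple random sample with replacement (items may repeat)
s2 = rng.choice(data, size=10, replace=True)

# Stratified sampling: draw the same fraction from each stratum (class)
labels = np.array([0] * 80 + [1] * 20)
frac = 0.1
strat = np.concatenate([
    rng.choice(data[labels == c], size=int(frac * (labels == c).sum()), replace=False)
    for c in np.unique(labels)
])
# strat keeps the class proportions: 8 items from class 0, 2 from class 1
```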
Dimensionality Reduction: avoids the curse of dimensionality; reduces the amount of time and memory required.
May reduce noise. Techniques: PCA (numeric data), SVD.
Feature subset selection: another way to reduce dimensionality; removes redundant and irrelevant features.
Feature creation: deriving new features from existing ones, e.g., age from birthday.
Discretization and Binarization: divide the range of a continuous attribute into intervals. Supervised (uses class labels) or unsupervised.
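Unsupervised discretization can be sketched as equal-width binning (a minimal example with NumPy; the function name and sample values are mine):

```python
import numpy as np

def equal_width_bins(x, k):
    # Unsupervised discretization: split the range of x into k equal-width
    # intervals and assign each value a bin index 0..k-1.
    edges = np.linspace(x.min(), x.max(), k + 1)
    # Digitize against the internal edges only, so indices stay in 0..k-1.
    return np.digitize(x, edges[1:-1])

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
bins = equal_width_bins(x, 3)  # intervals [1,4), [4,7), [7,10]
```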
PCA
Goal: find a projection that captures the largest amount of variation in the data.
Normalize the input data.
Compute k orthonormal vectors (the principal components).
Express the input data as a linear combination of the k principal component vectors.
Principal components are sorted in order of decreasing significance (variance).
Weak components (low variance) can be eliminated.
SVD
M = U Σ V^T
M: original data (n x m)
U: n examples expressed using r new concepts (n x r)
Σ: strength of each concept (r x r, diagonal)
V: m terms, r concepts (m x r)
For dimensionality reduction: set the smallest singular values to zero, keeping only the strongest k.
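The truncation described above can be sketched with NumPy's SVD (a sketch; the matrix and k are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))           # original data, n x m

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                  # keep the k strongest concepts
s_trunc = s.copy()
s_trunc[k:] = 0.0                      # zero out the smallest singular values
M_approx = U @ np.diag(s_trunc) @ Vt   # rank-k approximation of M
```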
Feature Subset Selection
Brute Force: try all possible feature subsets as input to the data mining algorithm.
Embedded approaches: selection occurs naturally as part of the data mining algorithm.
Filter approaches: features are selected before, and independently of, the data mining algorithm.
Wrapper approaches: use the data mining algorithm as a black box to find the best subset.
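A filter approach can be sketched by ranking features on a score computed independently of any learner, e.g. absolute correlation with the target (a sketch; the function name and synthetic data are mine):

```python
import numpy as np

def filter_select(X, y, k):
    # Filter approach: score each feature by |correlation with the target|,
    # with no data mining algorithm in the loop, then keep the top k.
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 2] + rng.normal(scale=0.1, size=200)  # only feature 2 is relevant
selected = filter_select(X, y, 2)
```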