Data Preprocessing Flashcards

1
Q

7 DATA PREPROCESSING TASKS / METHODS

A
  1. Aggregation
  2. Sampling
  3. Dimensionality Reduction
  4. Feature Subset Selection
  5. Feature Creation
  6. Discretization and Binarization
  7. Attribute Transformation
2
Q

combining two or more attributes into a single attribute.

A

Aggregation

3
Q

3 PURPOSES OF AGGREGATION

A
  • Data Reduction
  • Change of Scale
  • More Stable Data
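
A minimal sketch of aggregation in pandas (the column names and values below are hypothetical): daily sales records are rolled up to monthly totals, which reduces the data, changes its scale from days to months, and yields more stable values than individual days.

```python
import pandas as pd

# Hypothetical daily sales records.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]),
    "sales": [120.0, 80.0, 95.0, 110.0],
})

# Aggregate daily records into monthly totals: data reduction + change of scale.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)
```
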
4
Q

is the main technique employed for data selection.

A

Sampling

5
Q

4 TYPES OF SAMPLING

A
  • Sampling without replacement
  • Sampling with replacement
  • Simple Random Sampling
  • Stratified Sampling
6
Q

a type of sampling where, as each item is selected, it is removed from the population.

A

Sampling without replacement

7
Q

a type of sampling where objects are not removed from the population as they are selected.

A

Sampling with replacement

8
Q

a type of sampling where there is an equal probability of selecting any particular item.

A

Simple Random Sampling

9
Q

a type of sampling that splits the data into several partitions, then draws random samples from each partition.

A

Stratified Sampling
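
A sketch of the four types with NumPy (the population and sample sizes are arbitrary): the `replace` flag switches between sampling with and without replacement, a uniform `choice` gives simple random sampling, and sampling per partition gives stratified sampling.

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(100)   # hypothetical population of 100 items

# Without replacement: each selected item leaves the pool.
no_repl = rng.choice(population, size=10, replace=False)

# With replacement: items stay in the pool and may be drawn again.
with_repl = rng.choice(population, size=10, replace=True)

# Simple random sampling: every item has an equal selection probability.
simple = rng.choice(population, size=10)

# Stratified: partition the data, then draw a random sample from each partition.
strata = [population[:50], population[50:]]
stratified = np.concatenate([rng.choice(s, size=5, replace=False) for s in strata])
```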

10
Q

is the number of data objects in a sample.

A

SAMPLE SIZE

11
Q

2 APPROACHES TO SAMPLE SIZE DETERMINATION

A
  • Statistics
  • Machine Learning
12
Q

an approach to sample size determination based on the desired confidence interval for a parameter estimate or the desired statistical power of a test.

A

Statistics

13
Q

an approach to sample size determination where more is often better, judged by cross-validated accuracy.

A

Machine Learning

14
Q

when dimensionality increases, the size of the data space grows exponentially.

A

Curse of Dimensionality
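
A small experiment (a sketch; the point counts and dimensions are arbitrary) that makes the curse visible: as dimensionality grows, the nearest and farthest neighbors of a point end up at nearly the same distance, so distance-based analysis loses its discriminating power.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                       # 500 random points in the unit cube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
    # Relative contrast (max - min) / min shrinks as d grows.
    print(d, round((dists.max() - dists.min()) / dists.min(), 3))
```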

15
Q

its purpose is to avoid the curse of dimensionality.

A

Dimensionality Reduction

16
Q

Reduces the amount of time and memory required by data mining algorithms.

A

Dimensionality Reduction

17
Q

3 TECHNIQUES FOR DIMENSION REDUCTION

A
  1. Principal Component Analysis
  2. ISOMAP
  3. Low Dimensional Embedding
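
A minimal PCA sketch with scikit-learn (assuming scikit-learn is available; the data is random filler): 10-dimensional data is projected onto the two directions of greatest variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.random((200, 10))          # hypothetical data: 200 objects, 10 attributes

pca = PCA(n_components=2)          # keep the 2 leading principal components
X_2d = pca.fit_transform(X)        # shape (200, 2)
print(pca.explained_variance_ratio_)
```
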
18
Q

is another way to reduce the dimensionality of the data.

A

Feature Subset Selection

19
Q

2 TYPES OF FEATURES

A
  1. Redundant Features
  2. Irrelevant Features
20
Q

is a type of feature that duplicates much or all of the information contained in one or more other attributes.

A

Redundant Features

21
Q

is a type of feature that contains no information useful for the data mining task at hand.

A

Irrelevant Features

22
Q

4 APPROACHES IN FEATURE SUBSET SELECTION

A
  1. Embedded Approach
  2. Filter Approach
  3. Brute-force Approach
  4. Wrapper Approach
23
Q

feature selection occurs naturally as part of the data mining algorithm.

A

Embedded Approach

24
Q

features are selected before the data mining algorithm is run.

A

Filter Approach

25
Q

try all possible feature subsets as input to the data mining algorithm and choose the best.

A

Brute-force Approach

26
Q

use the data mining algorithm as a black box to find the best subset of attributes.

A

Wrapper Approach
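
A sketch of the filter approach (the variable names and the scoring rule are illustrative, not a prescribed method): features are ranked by absolute correlation with the target before any data mining algorithm runs, and only the top k are kept.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((100, 5))                        # hypothetical feature matrix
y = 2 * X[:, 0] + rng.normal(0, 0.1, 100)       # target driven mostly by feature 0

# Filter: score each feature independently of any mining algorithm.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
top_k = np.argsort(scores)[::-1][:2]            # keep the 2 highest-scoring features
X_selected = X[:, top_k]
```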

27
Q

creates new attributes that can capture the important information in a data set much more efficiently than the original attributes.

A

Feature Creation

28
Q

3 GENERAL METHODOLOGIES FOR FEATURE CREATION

A
  1. Feature Extraction
  2. Feature Construction / Feature Engineering
  3. Mapping Data to New Space
29
Q

is a feature creation methodology that is domain-specific.

A

Feature Extraction

30
Q

is a feature creation methodology that combines existing features.

A

Feature Construction / Feature Engineering
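
A one-line sketch of feature construction (the mass/volume example is a common textbook illustration, not taken from this deck): two existing attributes are combined into a new, more informative one.

```python
import pandas as pd

# Hypothetical objects described by mass and volume.
objects = pd.DataFrame({"mass": [10.0, 4.0, 8.0], "volume": [2.0, 1.0, 4.0]})

# Construct a new feature by combining existing ones.
objects["density"] = objects["mass"] / objects["volume"]
```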

31
Q

2 WAYS OF MAPPING DATA TO NEW SPACE

A
  • Fourier Transform
  • Wavelet Transform
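
A sketch of mapping data to frequency space with NumPy's FFT (the signal is synthetic): periodic structure that is hard to see in the raw time series shows up as clear peaks in the spectrum.

```python
import numpy as np

t = np.linspace(0, 1, 500, endpoint=False)            # 1 second at 500 Hz
signal = np.sin(2 * np.pi * 7 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
noisy = signal + np.random.default_rng(3).normal(0, 0.5, t.size)

# Fourier transform: map the series into frequency space.
spectrum = np.abs(np.fft.rfft(noisy))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
print(sorted(freqs[np.argsort(spectrum)[-2:]]))       # ~[7.0, 40.0]
```
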
32
Q

a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.

A

Attribute Transformation
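
A few common attribute transformations as a sketch (which transformation is appropriate depends on the data; these are standard examples, not ones prescribed by this deck). Each maps every old value to exactly one new value.

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])        # hypothetical attribute values

log_x = np.log10(x)                             # compress a wide range of magnitudes
minmax = (x - x.min()) / (x.max() - x.min())    # rescale to [0, 1]
zscore = (x - x.mean()) / x.std()               # standardize: mean 0, std 1
```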

33
Q

a numerical measure of how alike two data objects are.

A

Similarity

34
Q

a numerical measure of how different two data objects are.

A

Dissimilarity

35
Q

refers to either similarity or dissimilarity.

A

Proximity

36
Q

6 MEASURES OF SIMILARITY OR DISSIMILARITY

A
  • Euclidean Distance
  • Minkowski Distance
  • Mahalanobis Distance
  • Cosine Similarity
  • Correlation
  • Rank Correlation
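
A sketch of several of these measures using SciPy (assuming SciPy is available; the vectors are arbitrary). Note that SciPy exposes cosine *distance*, so the similarity is one minus that value, and Mahalanobis distance needs the inverse covariance matrix of the data.

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 1.0])

print(distance.euclidean(u, v))            # straight-line distance
print(distance.minkowski(u, v, p=1))       # Minkowski, r=1 (Manhattan); r=2 is Euclidean
print(1 - distance.cosine(u, v))           # cosine similarity

# Mahalanobis needs the inverse covariance of the data the points come from.
data = np.random.default_rng(4).random((50, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(u, v, VI))
```
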
37
Q

is a generalization of Euclidean distance.

A

Minkowski Distance

38
Q

measures the linear relationship between two variables.

A

Correlation

39
Q

measures the degree of similarity between two rankings.

A

Rank Correlation
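
A sketch contrasting the two with SciPy (the data is contrived so the relationship is monotone but not linear): Pearson correlation measures linearity, while Spearman rank correlation compares the two rankings.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 2                                  # monotone in x, but not linear

print(stats.pearsonr(x, y)[0])              # < 1: the relationship is not linear
print(stats.spearmanr(x, y)[0])             # exactly 1: the rankings agree
```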

40
Q

describes the likelihood of a random variable taking a given value.

A

Probability Density (Function)

41
Q

is a non-parametric way to estimate the probability density function of a random variable.

A

Kernel Density Estimation
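
A minimal KDE sketch with SciPy (the sample is synthetic): `gaussian_kde` estimates the density without assuming any parametric form.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
sample = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])

kde = gaussian_kde(sample)          # non-parametric density estimate
print(kde([-2.0, 0.0, 3.0]))        # estimated density at three points
```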

42
Q

the simplest approach: divide the region into a number of rectangular cells of equal volume and define the density as the number of points each cell contains.

A

Euclidean Density - Cell-Based

43
Q

defines the Euclidean density as the number of points within a specified radius of the point.

A

Euclidean Density - Center-Based
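
A sketch of both density notions in 2-D with NumPy (the data, grid, and radius choices are arbitrary): the cell-based version counts points per equal-size cell, while the center-based version counts points within a radius of a chosen point.

```python
import numpy as np

rng = np.random.default_rng(6)
points = rng.random((1000, 2))       # hypothetical 2-D data in the unit square

# Cell-based: divide the region into equal-volume cells and count per cell.
cell_counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=10)

# Center-based: density of a point = number of points within a given radius.
center, radius = np.array([0.5, 0.5]), 0.1
center_density = np.sum(np.linalg.norm(points - center, axis=1) <= radius)
print(cell_counts[5, 5], center_density)
```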