Data Transformation Flashcards
What are the 6 ways of dealing with missing values?
1) ignore the tuple
2) fill in manually
3) use global constant
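4) fill in with a measure of central tendency of the attribute (mean or median)
5) use the mean or median of all samples belonging to the same class as the tuple
6) use the most probable value to fill in the missing value (e.g., one predicted by regression or a decision tree)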
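As a rough illustration of strategies 3 to 5, here is a minimal Python sketch. The function names `fill_missing` and `fill_by_class`, the use of `None` to mark a missing value, and the sample data are assumptions made for this example, not part of the flashcards.

```python
from statistics import mean, median


def fill_missing(values, strategy="mean", constant=0):
    """Replace None entries with a global constant, the mean, or the median."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def fill_by_class(values, labels):
    """Replace each None entry with the mean of the observed values in its class."""
    class_means = {}
    for label in set(labels):
        members = [v for v, l in zip(values, labels) if l == label and v is not None]
        # Raises statistics.StatisticsError if a class has no observed values.
        class_means[label] = mean(members)
    return [class_means[l] if v is None else v for v, l in zip(values, labels)]


# Hypothetical data: two missing incomes, one in each class.
incomes = [30, None, 50, 100, None]
labels = ["a", "a", "b", "b", "b"]

filled_mean = fill_missing(incomes, "mean")        # missing entries become 60
filled_class = fill_by_class(incomes, labels)      # class "a" gets 30, class "b" gets 75
```

The class-conditional fill usually tracks the data more closely than a single global mean, at the cost of needing a class label for every tuple.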
What is a nominal feature?
Categorical variable relating to names where each value represents some kind of category in no meaningful order
(T/F) Nominal data can be represented with numbers
True
What is a binary feature?
A feature with only two states, 0 and 1. It is symmetric if both states are equally valuable and carry the same weight, and asymmetric if the outcomes are not equally important
What is an ordinal feature?
A qualitative variable whose possible values have a meaningful order or ranking, but the magnitude between successive values is not known
What are examples of ordinal features?
drink size, grade, professional rank
How can the central tendency of an ordinal feature be described?
by the mode or median
How can numeric data be represented for a classifier?
as-is or normalized
How can binary data be represented for a classifier?
0s and 1s
How can ordinal data be represented for a classifier?
ordered numeric
How can nominal data be represented for a classifier?
one-hot, numeric proxy
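The encodings for nominal and ordinal data can be sketched in a few lines of Python. The helper names `one_hot` and `ordinal_encode` and the example categories are hypothetical, chosen only to illustrate the idea.

```python
def one_hot(values):
    """Encode nominal values as 0/1 indicator vectors, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def ordinal_encode(values, order):
    """Encode ordinal values as integers that preserve the given ranking."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


# Nominal: no order, so each category gets its own indicator column.
colors = ["red", "green", "red", "blue"]
color_vectors, color_columns = one_hot(colors)

# Ordinal: the caller supplies the order, since it cannot be inferred from the names.
sizes = ["small", "large", "medium"]
size_codes = ordinal_encode(sizes, order=["small", "medium", "large"])
```

One-hot encoding avoids implying an order among categories, which a plain numeric proxy (red = 0, green = 1, ...) would do.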
What are the 6 strategies for data transformation?
1) Smoothing
2) Attribute construction (feature construction)
3) Aggregation
4) Normalization
5) Discretization
6) Concept hierarchy generation
What are the four different types of normalization techniques?
1) min-max
2) z-score
3) mean absolute deviation
4) decimal-scaling
What is min-max normalization?
Performs a linear transformation on the original data, mapping each value into a new range such as [0, 1] while preserving the relationships among the original values
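A minimal sketch of min-max normalization, using the usual formula v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min. The function name and the handling of a constant attribute are assumptions for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Assumption: a constant attribute maps to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


scaled = min_max_normalize([10, 20, 30])             # maps to the range [0, 1]
rescaled = min_max_normalize([10, 20, 30], -1, 1)    # maps to the range [-1, 1]
```

Because the mapping is linear, the ordering and relative spacing of the original values are unchanged.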
What is z-score normalization?
Also called zero-mean normalization: the values of attribute A are normalized using the mean and standard deviation of A
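A short sketch of z-score normalization, v' = (v - mean_A) / std_A. It assumes the population standard deviation (`statistics.pstdev`); the function name and sample values are illustrative.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Sample data with mean 5 and population standard deviation 2.
z = z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])
```

After normalization the values have mean 0 and standard deviation 1, so attributes measured in different units become comparable.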
What is the mean absolute deviation?
The average of the absolute deviations from the mean; it can replace the standard deviation in z-score normalization because it is more robust to outliers
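To make the robustness claim concrete, the sketch below computes the mean absolute deviation, s_A = (1/n) * sum(|v_i - mean_A|), and compares it with the standard deviation on data containing one outlier. The function names and the data are assumptions for illustration.

```python
from statistics import mean, pstdev


def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    mu = mean(values)
    return sum(abs(v - mu) for v in values) / len(values)


def mad_normalize(values):
    """Z-score-style normalization that divides by the mean absolute deviation."""
    mu = mean(values)
    s = mean_absolute_deviation(values)
    if s == 0:
        raise ValueError("mean absolute deviation is zero")
    return [(v - mu) / s for v in values]


data = [1, 2, 3, 4, 100]                 # 100 is an outlier
mad = mean_absolute_deviation(data)      # deviations are not squared
std = pstdev(data)                       # squaring amplifies the outlier
```

Here `mad` comes out smaller than `std`, because the standard deviation squares each deviation and so gives the outlier more weight.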
What is decimal-scaling normalization?
Moves the decimal point of the values of attribute A; the number of places moved depends on the maximum absolute value of A
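A minimal sketch of decimal scaling, v' = v / 10^j, where j is the smallest non-negative integer such that max(|v'|) < 1. The function name and the sample values are assumptions for this example.

```python
def decimal_scale(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# The largest absolute value is 986, so the decimal point moves three places.
scaled_values, places = decimal_scale([-986, 917, 45])
```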
What are three methods of discretization?
1) Binning
2) Histogram analysis
3) Cluster, decision tree, and correlation analyses
What is discretization?
The replacement of raw values of numeric attributes with interval labels or conceptual labels. It can also be a form of data reduction.
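As one possible illustration of discretization by binning, the sketch below uses equal-width bins and then maps bin indices to conceptual labels. Equal-width binning is only one variant, and the function name, the label names, and the data are hypothetical.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins, returning bin indices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]   # a constant attribute falls in a single bin
    width = (hi - lo) / k
    # The maximum value sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


ages = [0, 10, 20, 30, 40, 50, 60]
bin_ids = equal_width_bins(ages, k=3)

# Replace the interval index with a conceptual label.
concept = ["young", "middle", "senior"]
age_labels = [concept[i] for i in bin_ids]
```

Seven distinct raw values collapse to three labels, which is why discretization also acts as a form of data reduction.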
What are three forms of feature engineering?
1) Summarization: replacing multiple instances with column value averages
2) Kernelization: expand existing features into a high-dimensional space, or pass a set of features through a function that produces higher-order features
3) Representation learning: let a neural network learn the feature space in the form of a vector
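The first two forms can be sketched briefly in Python. `summarize` averages each column over a group of instances, and `quadratic_features` is an explicit degree-2 feature map for two inputs whose dot product equals the squared dot product of the originals. Both names, and the choice of this particular map, are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def summarize(rows):
    """Summarization: replace several instances with their column-wise averages."""
    return [mean(column) for column in zip(*rows)]


def quadratic_features(x):
    """Kernelization-style expansion of (x1, x2) into higher-order features."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Three instances of the same entity collapse to one averaged row.
summary = summarize([[1, 10], [2, 20], [3, 30]])

# The dot product in the expanded space matches (x . y) ** 2 in the original space.
x, y = (1, 2), (3, 4)
expanded = dot(quadratic_features(x), quadratic_features(y))
original = dot(x, y) ** 2
```

This is the idea behind a polynomial kernel: the higher-order space never has to be built explicitly if only dot products are needed.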
Where does the target variable come from?
1) an existing feature that may be missing from some instances
2) a hand labeled feature
3) a feature value that will be known in the future