Lecture 4 - Data Understanding II Flashcards
What does data preparation do with the information provided by data understanding?
- Selects attributes
- Reduces the dimension of the data set
- Selects records
- Treats missing values
- Treats outliers
- Improves data quality
- Unifies and transforms data
What is feature extraction?
The construction of new features from the given attributes
For example, instead of keeping “tasks finished”, “hours worked”, and “hours usually needed per task” as separate attributes, construct a new attribute “efficiency” from them (see the sketch below)
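A minimal pandas sketch of this example; the column names and the exact formula for “efficiency” are assumptions, since the card only names the ingredients:

```python
import pandas as pd

# Hypothetical project data; column names are illustrative.
df = pd.DataFrame({
    "tasks_finished": [12, 8, 15],
    "hours_worked": [40, 35, 50],
    "usual_hours_per_task": [3.0, 4.0, 3.5],
})

# New feature: tasks actually finished relative to the number of tasks
# one would expect given the hours worked.
df["efficiency"] = df["tasks_finished"] / (df["hours_worked"] / df["usual_hours_per_task"])
```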
What can be used for feature extraction for simple models?
Non-linear functions such as x^p, 1/x, log(x), sin(x), etc.
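A NumPy sketch of such derived features (the exponent p = 2 and the input values are arbitrary choices):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])

x_pow = x ** 2      # x^p with p = 2
x_inv = 1.0 / x     # 1/x
x_log = np.log(x)   # log(x), requires x > 0
x_sin = np.sin(x)   # sin(x)
```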
How to predict y from x?
Prior knowledge (is y dependent on x?), visualization, and trial and error
What’s the disadvantage of methods like PCA for feature extraction?
Dimensionality reduction techniques like PCA produce features that can no longer be interpreted in a meaningful way: how do you understand a feature that is a linear combination of 10 attributes?
What are some complex data type feature extractions?
Text data analysis -> frequency of keywords
Time series data analysis -> Fourier or wavelet coefficients
Graph data analysis -> number of vertices, edges
What does feature selection refer to?
Techniques used to choose a subset of the features that is as small as possible and sufficient for the data analysis
What are the reasons for feature selection?
- Prior knowledge: we know something is irrelevant
- Quality control: majority of values missing or bad
- Non-informative: e.g. all values are the same
- Redundancy: Identical or correlated values
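A sketch of how the last two rules could be automated with pandas; the correlation cutoff of 0.95 is an assumed threshold, not from the lecture:

```python
import numpy as np
import pandas as pd

def select_features(df: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    # Non-informative: drop columns where all values are identical.
    df = df.loc[:, df.nunique() > 1]
    # Redundancy: for each pair of highly correlated columns, keep only one.
    corr = df.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return df.drop(columns=to_drop)
```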
What does record selection refer to? Why is it done?
Selecting only some rows of the data.
- Timeliness: older data might be outdated
- Representativeness: The sample in the database might not be representative of the whole population
- Rare events: Useful for something like stock market crashes
How to choose records for rare events?
- Artificially increase the proportion of the rare events by adding copies of them (oversampling; see the sketch below)
- Choose only a subset of the rest of the data (undersampling the frequent cases)
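A sketch of the first option, oversampling by duplication; the function and its parameters are hypothetical:

```python
import pandas as pd

def oversample_rare(df: pd.DataFrame, label_col: str, rare_value, copies: int) -> pd.DataFrame:
    # Append `copies` extra copies of the rare records to the data set.
    rare = df[df[label_col] == rare_value]
    return pd.concat([df] + [rare] * copies, ignore_index=True)
```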
What does data cleansing refer to?
Detecting and correcting/removing inaccurate, incorrect or incomplete records from the data set
How to improve data quality?
- Convert all characters to the same case
- Remove superfluous spaces etc.
- Fix the format of numbers
- Split combined fields: “Chocolate, 100g” -> “chocolate”, “100.0” (see the sketch below)
- Normalize the writing (e.g. spelling variants)
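A sketch of these steps for the card's “Chocolate, 100g” example; the helper name and the exact parsing rules are assumptions:

```python
import re

def clean_field(raw: str) -> tuple[str, float]:
    name, amount = raw.split(",")
    name = name.strip().lower()                   # same case, no extra spaces
    grams = float(re.sub(r"[^\d.]", "", amount))  # fix the number format
    return name, grams

clean_field("Chocolate, 100g")  # -> ("chocolate", 100.0)
```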
What are the four discretizations?
- Equi-width discretization: Splits range into same length intervals [0-20, 20-40, 40-60]
- Equi-frequency discretization: Splits range into intervals with roughly the same number of records [4,4,4,4]
- V-optimal discretization: Minimizes the sum of n_i * V_i over the intervals, where n_i is the number of data objects in interval i and V_i is the sample variance within that interval
- Minimal entropy discretization: minimizes the uncertainty
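The first two can be sketched directly with pandas (the sample values and bin counts are illustrative); v-optimal and minimal-entropy discretization need dedicated algorithms:

```python
import pandas as pd

values = pd.Series([1, 3, 7, 12, 18, 25, 33, 41, 47, 55, 58, 60])

# Equi-width: intervals of equal length over the range 0-60.
equi_width = pd.cut(values, bins=[0, 20, 40, 60])

# Equi-frequency: intervals holding roughly the same number of records.
equi_freq = pd.qcut(values, q=3)
```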
Why should data sometimes be normalized?
To guarantee impartiality for models that use distances: without it, attributes with larger ranges would dominate the distance computation
What is min-max normalization?
All the values are scaled to the range between 0 and 1; the result is strongly affected by outliers.
x' = (x - min_x) / (max_x - min_x)
What is z-score standardization?
Scales the data to have a mean of 0 and a standard deviation of 1
x' = (x - mean(x)) / std(x)
What is robust z-score standardization?
x' = (x - median(x)) / IQR(x)
What is decimal scaling?
For attribute X, take the smallest integer s larger than log_10(max(|x|)), then
x' = x / 10^s
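The four scalings side by side as a NumPy sketch (the input vector is illustrative):

```python
import numpy as np

x = np.array([2.0, 15.0, 30.0, 480.0])

min_max = (x - x.min()) / (x.max() - x.min())

z_score = (x - x.mean()) / x.std()

robust_z = (x - np.median(x)) / (np.percentile(x, 75) - np.percentile(x, 25))

s = int(np.floor(np.log10(np.abs(x).max()))) + 1  # smallest integer > log10(max|x|)
decimal_scaled = x / 10 ** s
```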
What does centering the data matrix mean?
Subtracting each attribute's mean from all the rows of the matrix X; it moves the data so that it is centered at the origin.
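In code, assuming rows are records and columns are attributes:

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
X_centered = X - X.mean(axis=0)  # subtract each attribute's mean from every row
```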
What is the number of possible 2D scatter plots for m attributes?
m(m-1); for m = 50: 50*49 = 2450
Why do we want to change data to lower dimensional?
There could be hundreds of thousands of attributes. To include them all in a plot, we need to define a measure that evaluates lower-dimensional plots of the data in terms of how well the plot preserves the original structure.
What are parallel coordinates?
They draw the coordinate axes parallel to each other, so that there is no limitation for the number of axes to be displayed.
I.e., each record is drawn as a polyline across the parallel axes, giving profiles like \/_/_
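pandas ships a parallel-coordinates plot; a minimal sketch with made-up data (the class column is only used for coloring):

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 1.0, 2.0],
    "c": [2.0, 5.0, 1.0],
    "label": ["x", "y", "x"],
})
parallel_coordinates(df, class_column="label")
plt.show()
```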
What is the basic idea for dimensionality reduction?
Change the data from the n-dimensional space to a q-dimensional space (q = 2 or 3)
R^n -> R^q
What is a linear map?
New attributes are linear combinations of old ones.
new_feature = 0.5*feature_1 + 0.3*feature_2
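As a sketch, applying this linear map to a whole data matrix is one matrix product (the weights match the card's example; the data is random and only illustrative):

```python
import numpy as np

X = np.random.rand(100, 2)    # 100 records, 2 original features
w = np.array([0.5, 0.3])      # the linear map R^2 -> R^1
new_feature = X @ w           # 0.5*feature_1 + 0.3*feature_2 per record
```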
How does PCA work?
PCA uses the variance in the data as the structure preservation criterion: it tries to preserve as much of the original variance as possible when the data is projected to a lower-dimensional space.
It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
How do the principal components get determined?
The first PC has the largest possible variance, and each succeeding component in turn has the highest variance under the constraint that it is orthogonal to the preceding components
Is PCA sensitive to the relative scaling of the original variables?
Yes; the data is therefore usually z-score standardized first
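A minimal scikit-learn sketch combining both cards: standardize first, then project to two principal components (the data is random and only illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)

X_std = StandardScaler().fit_transform(X)   # z-score standardization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)        # variance preserved per component
```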
What is an eigenvector?
In PCA, the principal components are the eigenvectors of the data's covariance matrix; the corresponding eigenvalue gives the variance along each eigenvector.
What is t-SNE?
t-distributed stochastic neighbor embedding, a non-linear dimensionality reduction method
How does t-SNE work?
Similar objects are mapped to nearby points and dissimilar objects to distant points.
Caution: it can generate apparent clusters even when the data does not support them.
What are the two stages of t-SNE?
- A probability distribution over pairs of high-dimensional objects is constructed so that similar objects receive a high probability while dissimilar objects receive a low probability
- A similar probability distribution is defined for the points in the low-dimensional map, and the map is adjusted to minimize the Kullback-Leibler divergence between the two distributions
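A minimal scikit-learn sketch (the perplexity of 30 is an assumed, tunable choice; the data is random and only illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)   # high-dimensional data
X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)
```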
What are some dimensionality reduction methods?
- PCA
- t-SNE
- Kernel PCA (non-linear)
- Linear discriminant analysis (used in classification; finds a low-dimensional representation of the data that separates the classes well; see the sketch below)
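For the supervised case, a minimal LDA sketch (random data and labels, only illustrative; n_components must be smaller than the number of classes):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(150, 4)
y = np.random.randint(0, 3, size=150)   # 3 classes
X_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
```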