Lecture 4 - Data Understanding II Flashcards
What does data preparation do with the information provided by data understanding?
- Selects attributes
- Reduces the dimension of the data set
- Selects records
- Treats missing values
- Treats outliers
- Improves data quality
- Unifies and transforms data
What is feature extraction?
The construction of new features from the given attributes
For example, instead of keeping "tasks finished", "hours worked", and "usual number of hours needed per task" as separate attributes, a new attribute "efficiency" can be constructed from them.
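A minimal pandas sketch of such a construction; the column names and the efficiency formula (tasks finished per hour worked) are illustrative assumptions, not from the lecture:

```python
import pandas as pd

# Hypothetical project data; column names are illustrative assumptions.
df = pd.DataFrame({
    "tasks_finished": [12, 8, 20],
    "hours_worked": [40, 32, 45],
})

# Feature extraction: construct a new attribute from the given ones,
# here tasks finished per hour worked as a simple "efficiency" measure.
df["efficiency"] = df["tasks_finished"] / df["hours_worked"]
print(df)
```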
What can be used for feature extraction with simple models?
Non-linear functions such as x^p, 1/x, log(x), sin(x), etc.
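A short numpy sketch applying such non-linear transformations to an attribute x; the particular functions and values chosen are illustrative:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 4.0])

# Candidate non-linear features derived from x.
features = {
    "x^2":    x ** 2,
    "1/x":    1.0 / x,
    "log(x)": np.log(x),   # only valid for x > 0
    "sin(x)": np.sin(x),
}
for name, values in features.items():
    print(name, values)
```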
How to predict y from x?
Prior knowledge, checking whether y actually depends on x, visualization, and trial and error.
What’s the disadvantage of methods like PCA for feature extraction?
Dimensionality reduction techniques like PCA produce features that can no longer be interpreted in a meaningful way. For example, how should one understand a feature that is a linear combination of 10 attributes?
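A minimal scikit-learn sketch of the interpretability problem; the random data and the attribute count are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 records, 10 attributes

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Each new feature is a weighted combination of all 10 original
# attributes, which is hard to interpret as a domain concept.
print(pca.components_.shape)  # (2, 10): one weight per original attribute
```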
What are some complex data type feature extractions?
Text data analysis -> frequency of keywords
Time series data analysis -> Fourier or wavelet coefficients
Graph data analysis -> number of vertices, edges
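A small sketch of the text case, counting keyword frequencies with the standard library; the text and keyword list are illustrative assumptions:

```python
from collections import Counter

text = "Data preparation cleans data and transforms data"
keywords = {"data", "preparation", "transforms"}

# Frequency of keywords as simple text features.
counts = Counter(w for w in text.lower().split() if w in keywords)
print(counts)  # Counter({'data': 3, 'preparation': 1, 'transforms': 1})
```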
What does feature selection refer to?
Techniques used to choose a subset of the features that is as small as possible and sufficient for the data analysis
What are the reasons for feature selection?
- Prior knowledge: we know something is irrelevant
- Quality control: majority of values missing or bad
- Non-informative: e.g., all values are the same
- Redundancy: identical or strongly correlated attributes
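A pandas sketch of the last two checks (non-informative and redundant attributes); the correlation threshold of 0.95 is an illustrative assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],    # perfectly correlated with "a": redundant
    "c": [7, 7, 7, 7],    # constant: non-informative
})

# Drop non-informative attributes (only one distinct value).
df = df.loc[:, df.nunique() > 1]

# Drop one attribute of each highly correlated pair.
corr = df.corr().abs()
to_drop = [col for i, col in enumerate(corr.columns)
           if (corr.iloc[:i][col] > 0.95).any()]
df = df.drop(columns=to_drop)
print(df.columns.tolist())  # ['a']
```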
What does record selection refer to? Why is it done?
Selecting only some rows of the data.
- Timeliness: older data might be outdated
- Representativeness: the sample in the database might not be representative of the whole population
- Rare events: useful when the analysis targets rare events such as stock market crashes
How to choose records for rare events?
- Artificially increase the proportion of the rare events by adding copies of the rare records (oversampling)
- Choose only a subset of the frequent records (undersampling)
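A pandas sketch of both strategies; the data, the label column, and the sample sizes are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "feature": range(10),
    "crash":   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # rare event label
})
rare = df[df["crash"] == 1]
common = df[df["crash"] == 0]

# Oversampling: artificially increase the rare proportion with copies.
oversampled = pd.concat([common, rare, rare, rare])

# Undersampling: keep only a subset of the common records.
undersampled = pd.concat([common.sample(n=4, random_state=0), rare])
print(len(oversampled), len(undersampled))  # 14 6
```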
What does data cleansing refer to?
Detecting and correcting/removing inaccurate, incorrect or incomplete records from the data set
How to improve data quality?
- Convert all characters to the same case
- Remove leading/trailing spaces and other stray whitespace
- Fix the format of numbers
- Split combined fields, e.g. "Chocolate, 100g" -> "chocolate", "100.0"
- Normalize the writing (e.g., spelling variants and abbreviations)
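A pandas sketch of several of these steps on a messy field; the column content and the splitting pattern are illustrative assumptions:

```python
import pandas as pd

s = pd.Series(["  Chocolate, 100g", "CHOCOLATE , 100 g "])

# Same case everywhere, strip surrounding whitespace.
s = s.str.lower().str.strip()

# Split the combined field into product name and amount.
parts = s.str.extract(r"(?P<product>[a-z]+)\s*,\s*(?P<grams>\d+)\s*g")
parts["grams"] = parts["grams"].astype(float)  # fix the number format
print(parts)
```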
What are the four discretizations?
- Equi-width discretization: Splits range into same length intervals [0-20, 20-40, 40-60]
- Equi-frequency discretization: Splits range into intervals with roughly the same number of records [4,4,4,4]
- V-optimal discretization: minimizes the sum of n_i * V_i over all intervals, where n_i is the number of data objects in interval i and V_i is the sample variance of the values in interval i
- Minimal entropy discretization: minimizes the uncertainty (entropy), typically of the class labels within the intervals
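A pandas sketch of the first two schemes (equi-width via pd.cut, equi-frequency via pd.qcut); the data and the number of intervals are illustrative:

```python
import pandas as pd

values = pd.Series([1, 3, 5, 7, 20, 22, 40, 55])

# Equi-width: 3 intervals of the same length.
equi_width = pd.cut(values, bins=3)

# Equi-frequency: 4 intervals with roughly the same number of records.
equi_freq = pd.qcut(values, q=4)

print(equi_width.value_counts().sort_index())
print(equi_freq.value_counts().sort_index())
```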
Why should data sometimes be normalized?
To ensure that all attributes contribute on a comparable scale, so that models that use distances are not dominated by attributes with large value ranges
What is min-max normalization?
All values are scaled to the interval [0, 1]; the method is strongly affected by outliers.
x' = (x - min_x) / (max_x - min_x)
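A numpy sketch of min-max normalization on made-up data; note how a single outlier compresses the remaining values:

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 100.0])  # 100 is an outlier

# Scale all values to [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # the regular values end up squeezed near 0
```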