Data preparation Flashcards
How to clean noisy entries
- Kernel smoothing: slide a fixed-size window over the data and replace each value with the window average; kernel averaging gives a smoother result but loses a lot of detail
- Binning: sort the data, partition it into bins, then smooth each bin by replacing its values with the bin mean or median
- Regression: smooth the data by fitting a regression function and using the fitted values
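The first two techniques above can be sketched with the standard library; the sample data, window size, and bin count here are illustrative assumptions.

```python
# Two smoothing sketches: a fixed-window kernel average and equal-frequency binning.
from statistics import mean

def moving_average(values, window=3):
    """Kernel-style smoother: replace each point by the mean of a
    fixed-size window centred on it (edges use a truncated window)."""
    half = window // 2
    return [mean(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]

def bin_smooth(values, n_bins=3):
    """Binning: sort the data, partition it into equal-frequency bins,
    and replace every member of a bin by the bin mean."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    smoothed = []
    for b in range(n_bins):
        # the last bin absorbs any leftover values
        end = (b + 1) * size if b < n_bins - 1 else None
        bin_vals = ordered[b * size:end]
        smoothed.extend([mean(bin_vals)] * len(bin_vals))
    return smoothed

noisy = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(moving_average(noisy))
print(bin_smooth(noisy))  # → [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Note the trade-off the card mentions: both outputs are smoother than the input, but individual values are gone.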
what are the main components of data preparation
● Feature extraction: derive meaningful features from data.
● Type portability: cast data into a unified structure that fits the learning algorithm.
● Data cleaning.
● Data Reduction.
● Feature Selection.
● Transformation
explain how the image feature extraction is done
Build a colour histogram from the distribution of the red, green and blue pixel values.
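A minimal sketch of that histogram, assuming the pixels are already decoded into (r, g, b) tuples; real code would load them with an imaging library.

```python
# Colour-histogram feature extraction: count each channel's 0-255 values
# into a few equal-width bins and concatenate the three channel histograms.

def colour_histogram(pixels, n_bins=4):
    """Return one feature vector: r bins, then g bins, then b bins."""
    hist = [0] * (3 * n_bins)
    width = 256 // n_bins
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * n_bins + min(value // width, n_bins - 1)] += 1
    return hist

pixels = [(10, 200, 255), (70, 130, 5)]
print(colour_histogram(pixels))  # → [1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
```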
explain how the document feature extraction is done
● named-entity extraction
● stop-word removal
● word-count histogram
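The three steps above can be sketched as follows; the toy stop-word list and the "capitalised token = entity" heuristic are simplifying assumptions, not real NLP.

```python
# Toy document feature extraction: entities, stop-word removal, word counts.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and"}  # illustrative toy list

def document_features(text):
    tokens = text.split()
    # crude named-entity guess: any capitalised token
    entities = [t for t in tokens if t[0].isupper()]
    # stop-word removal
    kept = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
    # word-count histogram
    return entities, Counter(kept)

entities, counts = document_features("Paris is the capital of France and Paris is big")
print(entities)          # → ['Paris', 'France', 'Paris']
print(counts["paris"])   # → 2
```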
explain type portability
Casting one format into another so it fits the learning model. This includes:
● Text to numeric
● Time series to discrete sequence
● Time series to numeric
● Discrete sequence to numeric
● Spatial to numeric
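One of those casts, text to numeric, can be sketched as a bag-of-words vector over a fixed vocabulary; the vocabulary here is an assumption of the example.

```python
# Text-to-numeric cast: one count per vocabulary word.

def text_to_numeric(text, vocabulary):
    """Cast a document to a numeric vector the model can consume."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ["data", "noise", "model"]
print(text_to_numeric("Data cleaning removes noise from data", vocab))  # → [2, 1, 0]
```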
why data cleaning is important
Because real-world data is often noisy, incomplete or inconsistent, e.g. Age=“42” but Birthday=“03/07/1997”.
what is central tendency
Summarizing the data with a single representative number; this includes the mean, median and mode.
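All three summaries are in Python's standard library; the sample data is illustrative.

```python
# The three central-tendency measures on a small sample.
from statistics import mean, median, mode

data = [2, 3, 3, 5, 12]
print(mean(data))    # sum 25 over 5 values
print(median(data))  # → 3
print(mode(data))    # → 3
```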
explain the relation between distribution skew and the mean, median and mode
mean = median = mode: symmetric (no skew)
mean < median < mode: left (negatively) skewed
mode < median < mean: right (positively) skewed
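A small made-up right-skewed sample makes the ordering concrete: the long right tail pulls the mean above the median, which sits above the mode.

```python
# In a right-skewed sample, mode < median < mean.
from statistics import mean, median, mode

right_skewed = [1, 1, 1, 2, 2, 3, 4, 10]  # long tail to the right
print(mode(right_skewed), median(right_skewed), mean(right_skewed))  # → 1 2.0 3
assert mode(right_skewed) < median(right_skewed) < mean(right_skewed)
```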
how to clean inconsistent and missing records
delete the entire record: not safe, loses information
estimate the missing value: can introduce bias
fill with a global constant, e.g. the attribute mean
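The "global constant" strategy can be sketched as mean imputation; using None to mark a missing entry is an assumption of this example.

```python
# Mean imputation: replace each missing entry with the mean of the
# observed values for that attribute.
from statistics import mean

def impute_mean(values):
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([18, None, 30, 24]))  # → [18, 24, 30, 24]
```

Note the bias the card warns about: every imputed record is pulled toward the centre of the distribution.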
why using the distance measure
Distance reflects similarity: the smaller the distance between two records, the more similar they are.
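Euclidean distance is one concrete such measure, available directly in the standard library.

```python
# Euclidean distance between two numeric records: smaller means more similar.
from math import dist

a = (0, 0)
b = (3, 4)
print(dist(a, b))  # → 5.0, the classic 3-4-5 triangle
```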
what is scaling
Scaling changes the range of the data; we generally use it when we want to compare how far apart data points are.
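A common instance is min-max scaling, which squeezes the data into the range [0, 1]; shown here as a sketch on illustrative data.

```python
# Min-max scaling: map the smallest value to 0 and the largest to 1.

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))  # → [0.0, 0.5, 1.0]
```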
what is data normalization
A more radical transformation of the data whose goal is to obtain values that are (approximately) normally distributed, e.g. z-score standardization, which yields mean 0 and standard deviation 1.
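Z-score standardization, one transformation of this kind, can be sketched as follows; the sample data is illustrative.

```python
# Z-score standardization: shift by the mean and divide by the standard
# deviation, so the result has mean 0 and standard deviation 1.
from statistics import mean, pstdev

def z_score(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled = z_score([2, 4, 4, 4, 5, 5, 7, 9])  # sample mean 5, pstdev 2
print(scaled)  # → [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```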