Data preparation Flashcards
How to clean noisy entries
- Kernel smoothing: slide a fixed-size window over the data and replace each value with the window average; kernel averaging gives a smoother result but loses a lot of detail
- Binning: sort the data, partition it into bins, then smooth each bin by replacing its values with the bin mean or median
- Regression: smooth the data by fitting a regression function and using the fitted values
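The first two techniques above can be sketched with the standard library; the sample data, window size, and bin count here are illustrative assumptions.

```python
# Two smoothing sketches: a fixed-window kernel average and equal-frequency binning.
from statistics import mean

def moving_average(values, window=3):
    """Kernel-style smoother: replace each point by the mean of a
    fixed-size window centred on it (edges use a truncated window)."""
    half = window // 2
    return [mean(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]

def bin_smooth(values, n_bins=3):
    """Binning: sort the data, partition it into equal-frequency bins,
    and replace every member of a bin by the bin mean."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    smoothed = []
    for b in range(n_bins):
        # the last bin absorbs any leftover values
        end = (b + 1) * size if b < n_bins - 1 else None
        bin_vals = ordered[b * size:end]
        smoothed.extend([mean(bin_vals)] * len(bin_vals))
    return smoothed

noisy = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(moving_average(noisy))
print(bin_smooth(noisy))  # → [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Note the trade-off the card mentions: both outputs are smoother than the input, but individual values are gone.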
what are the main components of data preparation
● Feature extraction: derive meaningful features from data.
● Type portability: cast data into a unified structure that fits the learning algorithm.
● Data cleaning.
● Data Reduction.
● Feature Selection.
● Transformation
explain how the image feature extraction is done
Build a colour histogram from the distribution of the red, green and blue pixel values.
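A minimal sketch of that histogram, assuming the pixels are already decoded into (r, g, b) tuples; real code would load them with an imaging library.

```python
# Colour-histogram feature extraction: count each channel's 0-255 values
# into a few equal-width bins and concatenate the three channel histograms.

def colour_histogram(pixels, n_bins=4):
    """Return one feature vector: r bins, then g bins, then b bins."""
    hist = [0] * (3 * n_bins)
    width = 256 // n_bins
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * n_bins + min(value // width, n_bins - 1)] += 1
    return hist

pixels = [(10, 200, 255), (70, 130, 5)]
print(colour_histogram(pixels))  # → [1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
```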
explain how the document feature extraction is done
● named-entity extraction
● stop-word removal
● word-count histogram
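The three steps above can be sketched as follows; the toy stop-word list and the "capitalised token = entity" heuristic are simplifying assumptions, not real NLP.

```python
# Toy document feature extraction: entities, stop-word removal, word counts.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and"}  # illustrative toy list

def document_features(text):
    tokens = text.split()
    # crude named-entity guess: any capitalised token
    entities = [t for t in tokens if t[0].isupper()]
    # stop-word removal
    kept = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
    # word-count histogram
    return entities, Counter(kept)

entities, counts = document_features("Paris is the capital of France and Paris is big")
print(entities)          # → ['Paris', 'France', 'Paris']
print(counts["paris"])   # → 2
```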
explain type portability
Casting one format into another so it fits the learning model. This includes:
● Text to numeric
● Time series to discrete sequence
● Time series to numeric
● Discrete sequence to numeric
● Spatial to numeric
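One of those casts, text to numeric, can be sketched as a bag-of-words vector over a fixed vocabulary; the vocabulary here is an assumption of the example.

```python
# Text-to-numeric cast: one count per vocabulary word.

def text_to_numeric(text, vocabulary):
    """Cast a document to a numeric vector the model can consume."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ["data", "noise", "model"]
print(text_to_numeric("Data cleaning removes noise from data", vocab))  # → [2, 1, 0]
```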
why data cleaning is important
Because real-world data is often noisy, incomplete or inconsistent, e.g. Age=“42” but Birthday=“03/07/1997”.
what is central tendency
Summarizing the data with a single representative number; this includes the mean, median and mode.
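All three summaries are in Python's standard library; the sample data is illustrative.

```python
# The three central-tendency measures on a small sample.
from statistics import mean, median, mode

data = [2, 3, 3, 5, 12]
print(mean(data))    # sum 25 over 5 values
print(median(data))  # → 3
print(mode(data))    # → 3
```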
explain the relation between distribution skew and the mean, median and mode
mean = median = mode: symmetric (no skew)
mean < median < mode: left (negatively) skewed
mode < median < mean: right (positively) skewed
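A small made-up right-skewed sample makes the ordering concrete: the long right tail pulls the mean above the median, which sits above the mode.

```python
# In a right-skewed sample, mode < median < mean.
from statistics import mean, median, mode

right_skewed = [1, 1, 1, 2, 2, 3, 4, 10]  # long tail to the right
print(mode(right_skewed), median(right_skewed), mean(right_skewed))  # → 1 2.0 3
assert mode(right_skewed) < median(right_skewed) < mean(right_skewed)
```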
how to clean inconsistent and missing records
delete the entire record: not safe, loses information
estimate the missing value: can introduce bias
fill with a global constant, e.g. the attribute mean
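The "global constant" strategy can be sketched as mean imputation; using None to mark a missing entry is an assumption of this example.

```python
# Mean imputation: replace each missing entry with the mean of the
# observed values for that attribute.
from statistics import mean

def impute_mean(values):
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([18, None, 30, 24]))  # → [18, 24, 30, 24]
```

Note the bias the card warns about: every imputed record is pulled toward the centre of the distribution.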
why using the distance measure
Distance reflects similarity: the smaller the distance between two records, the more similar they are.
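Euclidean distance is one concrete such measure, available directly in the standard library.

```python
# Euclidean distance between two numeric records: smaller means more similar.
from math import dist

a = (0, 0)
b = (3, 4)
print(dist(a, b))  # → 5.0, the classic 3-4-5 triangle
```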
what is scaling
Scaling changes the range of the data; we generally use it when we want to compare how far apart data points are.
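A common instance is min-max scaling, which squeezes the data into the range [0, 1]; shown here as a sketch on illustrative data.

```python
# Min-max scaling: map the smallest value to 0 and the largest to 1.

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))  # → [0.0, 0.5, 1.0]
```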
what is data normalization
A more radical transformation of the data whose goal is to obtain values that are (approximately) normally distributed, e.g. z-score standardization, which yields mean 0 and standard deviation 1.
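Z-score standardization, one transformation of this kind, can be sketched as follows; the sample data is illustrative.

```python
# Z-score standardization: shift by the mean and divide by the standard
# deviation, so the result has mean 0 and standard deviation 1.
from statistics import mean, pstdev

def z_score(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled = z_score([2, 4, 4, 4, 5, 5, 7, 9])  # sample mean 5, pstdev 2
print(scaled)  # → [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```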