Data preparation Flashcards

1
Q

How do you clean noisy entries?

A
  • Kernel smoothing: slide a fixed-size window over the data and replace each value with the average of the values in its window.
  • Kernel averaging gives a smoother result but loses a lot of detail.
  • Binning: sort the data, partition it into bins, then smooth each bin by its mean or median.
  • Regression: smooth the data by fitting a regression function.
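
A minimal sketch of the first two ideas, assuming a plain moving average as the kernel and equal-size bins smoothed by their means. The function names, window size, and sample values are illustrative, not from the cards.

```python
from statistics import mean

def moving_average(values, window=3):
    """Smooth with a fixed-size sliding window (a simple kernel average)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        # Clip the window at the edges of the sequence.
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed

def smooth_by_bin_means(values, bin_size=3):
    """Sort, split into equal-size bins, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample
print(smooth_by_bin_means(prices))
print(moving_average(prices))
```

Swapping `mean` for `median` inside `smooth_by_bin_means` gives smoothing by bin medians instead.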

2
Q

What are the main components of data preparation?

A

● Feature extraction: derive meaningful features from the data.
● Portability: cast the data to unify its structure and to fit the algorithm.
● Data cleaning.
● Data reduction.
● Feature selection.
● Transformation.

3
Q

Explain how image feature extraction is done.

A

It is based on building a histogram of the red, green, and blue channel distributions.
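
A minimal sketch of such a histogram, assuming the image is already available as a list of `(r, g, b)` tuples with values in 0..255. The function names, the bin count, and the sample pixels are illustrative assumptions.

```python
def rgb_histogram(pixels, bins=4):
    """Count pixel intensities per channel in `bins` equal-width buckets over 0..255."""
    width = 256 // bins
    hist = {"red": [0] * bins, "green": [0] * bins, "blue": [0] * bins}
    for r, g, b in pixels:
        for channel, value in (("red", r), ("green", g), ("blue", b)):
            # min() guards the last bucket when 256 is not divisible by bins.
            hist[channel][min(value // width, bins - 1)] += 1
    return hist

def feature_vector(pixels, bins=4):
    """Concatenate the three channel histograms into one numeric vector."""
    hist = rgb_histogram(pixels, bins)
    return hist["red"] + hist["green"] + hist["blue"]

image = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (30, 200, 40)]  # hypothetical pixels
print(feature_vector(image))
```

The concatenated counts form a fixed-length numeric vector, regardless of how many pixels the image has.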

4
Q

Explain how document feature extraction is done.

A

● Extracting named entities from the document.
● Removing stop words.
● Building a word count histogram.
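
A minimal sketch of the last two steps, assuming a tiny hand-written stop-word list and a simple regular-expression tokenizer. Named-entity extraction is left out because it usually needs an NLP library. `STOP_WORDS` and `word_count_histogram` are illustrative names.

```python
import re
from collections import Counter

# Illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def word_count_histogram(text):
    """Lowercase, tokenize, drop stop words, and count the remaining words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)

doc = "The cat and the dog: the cat is in the garden."
hist = word_count_histogram(doc)
print(hist.most_common(2))
```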

5
Q

Explain type portability.

A
Casting data from one format to another so that it fits the learning model. This includes:
● Text to numeric
● Time series to discrete sequence
● Time series to numeric
● Discrete sequence to numeric
● Spatial to numeric
6
Q

Why is data cleaning important?

A

Because real-world data may be noisy, incomplete, or inconsistent, e.g., Age=“42” with Birthday=“03/07/1997”.

7
Q

What is central tendency?

A

It summarizes the data with a single number. Common measures include the mean, median, and mode.

8
Q

Explain the relation between the skew of a distribution and its mean, median, and mode.

A

mean = median = mode: symmetric distribution
mode < median < mean: right (positively) skewed
mean < median < mode: left (negatively) skewed
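
A small check of this rule of thumb on a made-up sample with a long right tail. The sample and the helper `skew_direction` are illustrative, and the mean-versus-median comparison is only a heuristic for unimodal data.

```python
from statistics import mean, median, mode

def skew_direction(values):
    """Guess the skew direction by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "symmetric"

sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]   # hypothetical, long right tail
mirrored = [-v for v in sample]            # same shape, long left tail

print(mode(sample), median(sample), mean(sample))
print(skew_direction(sample), skew_direction(mirrored))
```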

9
Q

How do you clean inconsistent and missing records?

A

Delete the entire record: not safe, because it loses information.
Estimate the missing value: this can introduce bias.
Fill in a global constant, the mean for instance.
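
A minimal sketch of two of these options, assuming records are dictionaries and a missing value is stored as `None`. The helper names and the sample records are illustrative.

```python
from statistics import mean

def drop_incomplete(records):
    """Delete every record that has at least one missing (None) value."""
    return [r for r in records if None not in r.values()]

def fill_with_mean(records, field):
    """Replace missing values of `field` with the mean of the known values."""
    known = [r[field] for r in records if r[field] is not None]
    fill = mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

people = [  # hypothetical records
    {"name": "A", "age": 30},
    {"name": "B", "age": None},
    {"name": "C", "age": 40},
]
print(drop_incomplete(people))
print(fill_with_mean(people, "age"))
```

Dropping keeps only two of the three records, while mean filling keeps all three but pulls the filled value toward the center of the known ages.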

10
Q

Why use a distance measure?

A

Distance reflects similarity: the smaller the distance between two records, the more similar they are.
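
As an illustration, assuming Euclidean distance as the measure (the card does not name one) and made-up points:

```python
import math

def most_similar(query, candidates):
    """Return the candidate with the smallest Euclidean distance to `query`."""
    return min(candidates, key=lambda point: math.dist(query, point))

query = (0.0, 0.0)
candidates = [(3.0, 4.0), (1.0, 1.0), (6.0, 8.0)]  # hypothetical records

print(math.dist(query, candidates[0]))
print(most_similar(query, candidates))
```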

11
Q

What is scaling?

A

Scaling changes the range of the data. It is generally used when we want to measure how far apart data points are.
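
A minimal sketch, assuming min-max scaling to a target range, which is one common way to scale. The function name and the sample incomes are illustrative.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

incomes = [20000, 50000, 80000]  # hypothetical feature values
print(min_max_scale(incomes))
```

The relative spacing of the values is preserved; only the range changes, so a large-valued feature no longer dominates a distance computation.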

12
Q

What is data normalization?

A

It is a more radical transformation of the data, applied so that the result is approximately normally distributed.
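
A minimal sketch, assuming a log transform as the normalizing step. This is one common choice for right-skewed positive data, not something the card specifies, and it does not guarantee a normal result. The sample values are made up.

```python
import math
from statistics import mean, median

def log_transform(values):
    """Apply log10 to strictly positive values to compress a long right tail."""
    return [math.log10(v) for v in values]

raw = [1, 10, 100, 1000, 10000]  # hypothetical, strongly right-skewed
logged = log_transform(raw)

print(mean(raw), median(raw))        # mean far above median: skewed
print(mean(logged), median(logged))  # mean and median agree after the transform
```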
