Pre-Processing Flashcards
Why is pre-processing needed?
It is needed to ensure that data is accurate, complete, consistent, timely, believable and interpretable
What are the major preprocessing activities?
- Data cleaning
- Data integration
- Data reduction
- Data transformation
What are examples of noisy data?
- Truncated field
- Text incorrectly split across cells
- Incorrect data types
- Data that doesn't make logical sense
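Two of the examples above (incorrect data types and values that make no logical sense) can be checked mechanically. The sketch below is only an illustration: the "age" field, the 0 to 120 range and the find_noisy_rows helper are assumptions made up for this card, not part of any particular tool.

```python
def find_noisy_rows(rows):
    """Return (row_index, reason) pairs for rows that look noisy."""
    problems = []
    for i, row in enumerate(rows):
        age = row.get("age")
        # Incorrect data type: age is expected to be an integer, not text.
        if not isinstance(age, int):
            problems.append((i, "age is not an integer"))
            continue
        # Value that makes no logical sense: outside an assumed plausible range.
        if age < 0 or age > 120:
            problems.append((i, "age out of plausible range"))
    return problems


# Hypothetical records used only for illustration.
rows = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": "thirty"},
    {"name": "Cy", "age": -5},
]
noisy = find_noisy_rows(rows)
```

Here the second and third rows are flagged. What counts as a plausible range depends on the data set.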
What is inconsistent data?
Data in which the same information is represented in different ways, or in which values do not make sense alongside the rest of the data
What is noisy data?
Data that contains additional, needless information called noise
What are examples of inconsistent data?
- Different naming representations
- Different date formats
- Inconsistency between cells
- Records sharing a value that should be unique
- Outliers
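As a rough illustration of fixing one item from this list, different date formats, the sketch below rewrites dates into a single ISO format using Python's standard datetime module. The list of known formats, the day-first reading of slash dates and the normalise_date helper are assumptions for this example; a real data set would need its own format list.

```python
from datetime import datetime

# Formats assumed to occur in the data; slash dates are read day-first.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"]


def normalise_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


raw = ["2021-03-04", "04/03/2021", "2021/03/04", "next week"]
cleaned = [normalise_date(d) for d in raw]
```

The first three entries all become "2021-03-04". The last one cannot be parsed and comes back as None so that it can be reviewed by hand.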
What are disguised missing values?
Missing values that take a default value predetermined by the program. To determine whether this has occurred, look for suspicious occurrences in the data set
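One possible way to look for suspicious occurrences is to count how often each value appears and flag any value that takes up an unusually large share of a column. This is a minimal sketch: the 30% threshold, the sample birth dates and the suspicious_values helper are illustrative assumptions, and a flagged value still needs a human to decide whether it really is a disguised missing value.

```python
from collections import Counter


def suspicious_values(values, threshold=0.3):
    """Return values whose share of the column is at least `threshold`."""
    counts = Counter(values)
    total = len(values)
    return [v for v, n in counts.most_common() if n / total >= threshold]


# Hypothetical column where a program may have filled in a default date.
birth_dates = [
    "1990-05-17", "1900-01-01", "1985-11-02",
    "1900-01-01", "1900-01-01", "1978-07-30",
]
flagged = suspicious_values(birth_dates)
```

In this sample "1900-01-01" makes up half of the column, so it is flagged as a likely default.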
What is missing or incomplete data?
Data that is missing values in cells
What does MCAR stand for?
Missing completely at random. The probability of missing data on a variable is unrelated to any other variable or to the variable itself
What does MNAR stand for?
Missing not at random. The missing values are related to the values of that variable itself, even after controlling for other variables.
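The difference between MCAR and MNAR can be seen in a small simulation. The sketch below is a simplified illustration, not a standard procedure: the income figures, the 30% drop rate, the cutoff of 80 and the helper names are all made-up assumptions, and MNAR is modelled in the crudest way (every value above the cutoff goes missing).

```python
import random


def make_mcar(values, p, rng):
    # Every value is dropped with the same probability p, whatever its size.
    return [None if rng.random() < p else v for v in values]


def make_mnar(values, cutoff):
    # A value is dropped because of its own size: high values go unreported.
    return [None if v > cutoff else v for v in values]


def mean_observed(values):
    # Mean of the values that are still present, or None if none remain.
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else None


rng = random.Random(0)  # fixed seed so the run is repeatable
incomes = [rng.randint(20, 100) for _ in range(1000)]
true_mean = sum(incomes) / len(incomes)

mcar = make_mcar(incomes, 0.3, rng)
mnar = make_mnar(incomes, 80)
print(true_mean, mean_observed(mcar), mean_observed(mnar))
```

Under MCAR the mean of the remaining values would be expected to stay close to the true mean. Under MNAR it is pulled downward, because the missing values are exactly the large ones.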
What are some examples of causes of missing data?
- Equipment malfunction
- Not recorded due to misunderstanding
- May not be considered important at time of entry
- Deliberate