Data Cleaning and Processing Flashcards
Data types
- numeric, categorical
- static, dynamic (temporal)
Other kinds of data
- distributed data
- text, Web, meta data
- images, audio/video
missing attribute values, lack of certain attributes of interest, or containing only aggregate data
e.g., occupation=“”
incomplete data
containing errors or outliers
e.g., Salary=“-10”
noisy data
containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., was rating “1,2,3”, now rating “A, B, C”
inconsistent data
Why is Data Processing Important?
- No quality data, no quality mining results
- Quality decisions must be based on quality data
- Duplicate or missing data may cause incorrect or even misleading statistics
Multi-Dimensional Measure of Data Quality
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
Major Tasks in Data Processing
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction
- Data Discretization
Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies
number one problem in data warehousing
Data Cleaning
Data Cleaning Tasks:
- Fill in Missing Values
- Identify outliers and smooth out noisy data
- Correct Inconsistent Data
- Resolve Redundancy caused by data integration
Data is not always available (many tuples have no recorded values for several attributes, such as customer income in sales data)
Missing Data
Missing data causes:
- Equipment malfunction
- Inconsistent with other recorded data and thus deleted
- Data not entered due to misunderstanding
- Certain data may not be considered important at the time of entry
- History or changes of the data were not registered
Handling missing data
- Ignore the tuple
- Fill in missing values manually: tedious and infeasible
- Fill it automatically
Fill missing data with
- A global constant, e.g., “unknown”
- The attribute mean
- The most probable value: inference-based, such as a Bayesian formula, decision tree, or EM algorithm
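The first two fill strategies on this card can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes missing entries are represented as None, and the function names and sample income values are hypothetical. The inference-based option (Bayesian formula, decision tree, EM) needs a fitted model and is not shown.

```python
from statistics import mean


def fill_constant(values, constant="unknown"):
    """Replace every missing entry (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_mean(values):
    """Replace every missing entry (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


# Hypothetical customer-income column with two missing entries.
incomes = [30000, None, 50000, 40000, None]
print(fill_constant(incomes))
print(fill_mean(incomes))
```

Filling with the mean keeps the attribute numeric, while a constant such as “unknown” makes the gap explicit but mixes types in the column.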
Random error or variance in a measured variable
Noisy Data
Incorrect attribute values may be due to:
- Faulty data collection instruments
- Data entry problems
- Data transmission problems
Handling Noisy Data
- Binning Method
- Clustering
- Combined computer and human inspection
When noise reduction and trend analysis are needed
Smoothing by bin means
When keeping real-world constraints and preserving limits is important
Smoothing by bin boundaries
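A minimal sketch of binning-based smoothing, assuming equal-frequency (equal-depth) bins over sorted values. Smoothing by bin means replaces each value with the mean of its bin. Smoothing by bin boundaries, the usual companion technique, replaces each value with the nearer of its bin's minimum and maximum, so no smoothed value falls outside the observed limits. The function names and the price values are illustrative assumptions.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    if not 1 <= n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and len(values)")
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative price values.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Bin means flatten local variation, which suits trend analysis. Bin boundaries only ever output values that actually occur in the data, which is why they suit cases where real-world limits must be preserved.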
Detect and remove outliers (data points inconsistent with the majority of the data)
Clustering
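One very simple way to illustrate clustering-based outlier detection is a one-dimensional, gap-based grouping: sorted values that lie within max_gap of their neighbor share a cluster, and values in clusters smaller than min_size are flagged as outliers. This is a deliberately small stand-in for a real clustering algorithm; the function names, thresholds, and salary values are all assumptions chosen for the sketch.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size):
    """Return values that belong to clusters with fewer than min_size members."""
    return [
        v
        for cluster in gap_clusters(values, max_gap)
        if len(cluster) < min_size
        for v in cluster
    ]


# Hypothetical salaries (in thousands); 400 sits far from both groups.
salaries = [41, 43, 44, 45, 47, 90, 92, 93, 95, 400]
print(gap_clusters(salaries, max_gap=5))
print(find_outliers(salaries, max_gap=5, min_size=2))
```

Both thresholds are data-dependent, so in practice they would be tuned or replaced by a proper clustering method with human review of the flagged points.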
Integration of multiple databases or files
Data Integration
Integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources
Schema Integration
Removing noise from data
Smoothing
scaled to fall within a small, specified range
Normalization
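Min-max scaling is one common way to make values "fall within a small, specified range". The sketch below assumes that technique (the card itself does not name a method) and maps values linearly onto [new_min, new_max]; the function and parameter names are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; the range is zero")
    return [
        (v - lo) / (hi - lo) * (new_max - new_min) + new_min
        for v in values
    ]


# Hypothetical attribute values.
ages = [10, 20, 30]
print(min_max_normalize(ages))
print(min_max_normalize(ages, new_min=-1.0, new_max=1.0))
```

Because the result depends on the observed minimum and maximum, a single extreme outlier compresses all other values, so it usually makes sense to handle outliers before normalizing.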