Data Cleaning and Processing Flashcards
Data types
- numeric, categorical
- static, dynamic (temporal)
Other kinds of data
- distributed data
- text, Web, metadata
- images, audio/video
Missing attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=“”
Incomplete data
containing errors or outliers
e.g., Salary=“-10”
Noisy data
containing discrepancies in codes or names
Age=“42” Birthday=“03/07/1997”
Was rating “1,2,3”, now rating “A, B, C”
Inconsistent data
Why is Data Processing Important?
- No quality data, no quality mining results
- Quality decisions must be based on quality data
- Duplicate or missing data may cause incorrect or even misleading statistics
Multi-Dimensional Measure of Data Quality
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
Major Tasks in Data Processing
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction
- Data discretization
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
The number one problem in data warehousing
Data Cleaning
Data Cleaning Tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
Data is not always available (many tuples have no recorded values for several attributes, such as customer income in sales data)
Missing Data
Missing data causes:
- Equipment malfunction
- Inconsistent with other recorded data and thus deleted
- Data not entered due to misunderstanding
- Certain data may not be considered important at the time of entry
- History or changes of the data were not registered
Handling missing data
- Ignore the tuple
- Fill in missing values manually: tedious and infeasible
- Fill them in automatically
Fill missing data with
- A global constant, e.g., “unknown”
- The attribute mean
- The most probable value: inference-based methods such as a Bayesian formula, decision tree, or the EM algorithm
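A minimal sketch of the automatic fill-in strategies above, using pandas on a hypothetical customer table (the column names and values are made up):

```python
import pandas as pd

# Hypothetical customer table with missing values
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "income": [52000, None, 48000, 61000],
})

# Fill with a global constant
df["occupation"] = df["occupation"].fillna("unknown")

# Fill with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())

# Filling with the most probable value would require a learned model
# (e.g., a decision tree trained on the other attributes) and is omitted here.
print(df)
```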
Random error or variance in a measured variable
Noisy Data
Incorrect attribute values may be due to:
- Faulty data collection instruments
- Data entry problems
- Data transmission problems
Handling Noisy Data
- Binning Method
- Clustering
- Combined computer and human inspection
When noise reduction and trend analysis are needed
Smoothing by bin means
When keeping real-world constraints and preserving limits is important
Smoothing by bin boundaries
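A minimal sketch of both smoothing rules, assuming the sorted data is first partitioned into equal-frequency bins of three values each (the toy data is made up):

```python
# Sorted toy data, partitioned into equal-frequency bins of size 3
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest bin boundary,
# so the min/max limits of each bin are preserved
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[7.0, 7.0, 7.0], [19.0, 19.0, 19.0], [27.67, 27.67, 27.67]] (rounded)
print(by_bounds)  # [[4, 9, 9], [15, 21, 21], [24, 24, 34]]
```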
Detect and remove outliers: data points inconsistent with the majority of the data
Clustering
Integration of multiple databases or files
Data Integration
Integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources
Schema Integration
Removing noise from data
Smoothing
Scaled to fall within a small, specified range
Normalization
Summarization
Aggregation
Concept hierarchy climbing
Generalization
Normalization
- Min-max normalization
- Z-score normalization
- Normalization by decimal scaling
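A minimal sketch of the three techniques on a plain list of values (the data and the [0, 1] target range for min-max are assumptions):

```python
values = [200, 300, 400, 600, 1000]

# Min-max normalization: rescale linearly into a new range, here [0, 1]
v_min, v_max = min(values), max(values)
new_min, new_max = 0.0, 1.0
min_max = [(v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min for v in values]

# Z-score normalization: (value - mean) / standard deviation
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_score = [(v - mean) / std for v in values]

# Decimal scaling: divide by 10^j for the smallest j that makes every |value| < 1
# (this digit-count shortcut assumes integer-valued data like the example above)
j = len(str(int(max(abs(v) for v in values))))
decimal = [v / 10 ** j for v in values]
```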
Obtains reduced representation in volume but produces the same or similar analytical results.
Data is too big to work with.
Data Reduction
Data Reduction Strategies
- Dimension reduction—remove unimportant attributes
- Aggregation and clustering
- Sampling
Feature selection (i.e., attribute subset selection):
Select a minimum set of attributes (features) that is sufficient for the data mining task.
Dimension Reduction
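A minimal sketch of one simple attribute subset selection heuristic: score each attribute by its absolute correlation with the target and keep the top k. The toy data and k = 2 are hypothetical, and this is just one filter-style heuristic, not the only approach:

```python
# Pearson correlation between two equal-length lists
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical attributes and a binary target
attributes = {
    "age":    [23, 45, 31, 52, 40],
    "income": [30, 80, 55, 90, 70],
    "zip":    [10, 20, 30, 40, 50],
}
target = [0, 1, 0, 1, 1]

k = 2
scores = {name: abs(pearson(vals, target)) for name, vals in attributes.items()}
selected = sorted(scores, key=scores.get, reverse=True)[:k]
print(selected)  # the k attributes most correlated with the target
```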
A popular reduction technique: divide data into buckets and store the average (or sum) for each bucket
Histograms
Choose a representative subset of data
Sampling
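A minimal sketch of simple random sampling without replacement, using Python's standard library (the dataset and sample size are made up):

```python
import random

data = list(range(1000))          # hypothetical full dataset
n = 50                            # desired sample size
sample = random.sample(data, n)   # simple random sample without replacement
```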
Data discretization: three types of attributes
- Nominal
- Ordinal
- Continuous
Data discretization: three types of attributes
values from an unordered set
Nominal
Data discretization: three types of attributes
values from an ordered set
Ordinal
Data discretization: three types of attributes
real numbers
Continuous
Data discretization techniques
- Binning method: equal-width, equal-frequency
- Entropy-based (1)
- Entropy-based (2)
For a bin width of e.g. 10:
Equi-width binning
For a bin depth (values per bin) of e.g. 3:
Equi-frequency binning
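A minimal sketch contrasting the two binning rules on a small list of values; the width of 10 and depth of 3 follow the examples above, while the data itself is made up:

```python
data = sorted([5, 7, 12, 15, 18, 22, 31, 35, 46])

# Equi-width binning: fixed interval width (here 10); bin index = (v - min) // width
width = 10
lo = min(data)
equi_width = {}
for v in data:
    idx = (v - lo) // width
    equi_width.setdefault(idx, []).append(v)

# Equi-frequency binning: fixed number of values per bin (here 3)
depth = 3
equi_freq = [data[i:i + depth] for i in range(0, len(data), depth)]

print(equi_width)  # {0: [5, 7, 12], 1: [15, 18, 22], 2: [31], 3: [35], 4: [46]}
print(equi_freq)   # [[5, 7, 12], [15, 18, 22], [31, 35, 46]]
```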
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values
Discretization
Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
Concept Hierarchies
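A minimal sketch of climbing one level of a concept hierarchy for age; the cut-off points are hypothetical:

```python
def age_concept(age):
    # Hypothetical cut-offs for the age concept hierarchy
    if age < 40:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

ages = [23, 45, 31, 67, 52]
print([age_concept(a) for a in ages])  # ['young', 'middle-aged', 'young', 'senior', 'middle-aged']
```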