lec 5(done) Flashcards
Measures for data quality:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Interpretability
Major Tasks in Data Preprocessing:
- Data cleaning
- Data integration
- Data transformation
- Data reduction
Data cleaning includes:
1-Fill in missing values
2-Smooth noisy data
3-Identify or remove outliers
4-Resolve inconsistencies
Data integration is:
Integration of multiple
1-Databases
2-Data cubes
3-Files
Data reduction includes:
1-Dimensionality reduction
2-Numerosity reduction
Data transformation includes:
1-Normalization
2-Data discretization
3-Concept hierarchy generation
Real-world data tend to be:
1-Incomplete (missing): missing attribute values
2-Inaccurate (noisy): containing errors, or outliers
3-Inconsistent: containing discrepancies
Reasons for missing data
1-Information is not collected
2-Attributes may not be applicable to all cases
3-Inconsistent with other recorded data and thus deleted
4-Human/Hardware/Software problems
How to Handle Missing Data?
1-Ignore the tuple
2-Fill in the missing value manually: tedious + infeasible
3-Fill in the missing value automatically with:
o-A global constant : e.g., “unknown” or ∞
o-The attribute mean or median
o-The attribute mean for all samples belonging to the same class
o-The most probable value
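The automatic fill-in options above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the use of `None` to mark a missing value, and the sample `incomes`/`classes` data are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median


def fill_constant(values, constant="unknown"):
    """Replace every missing value (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_central(values, how="mean"):
    """Replace missing values with the attribute mean or median."""
    known = [v for v in values if v is not None]
    fill = mean(known) if how == "mean" else median(known)
    return [fill if v is None else v for v in values]


def fill_class_mean(values, labels):
    """Replace missing values with the mean of the samples in the same class.

    Assumes every class has at least one known value; otherwise the
    lookup below raises KeyError.
    """
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not None:
            by_class[c].append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]


# Hypothetical sample data: None marks a missing attribute value.
incomes = [30, None, 50, 70, None, 90]
classes = ["low", "low", "low", "high", "high", "high"]

print(fill_constant(incomes))
print(fill_central(incomes, how="mean"))
print(fill_class_mean(incomes, classes))
```

The class-based fill usually gives a closer estimate than the global mean because it only averages samples that share the same label.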
Reasons for noisy data:
1-Faulty data collection instruments
2-Human errors at data entry
3-Data transmission problems
4-Technology limitation
5-Inconsistency in naming convention
Data Smoothing techniques
1-Binning:
o-Smooth by bin means
o-Smooth by bin medians
o-Smooth by bin boundaries
2-Regression:
o-Smooth by fitting the data to regression functions
3-Clustering:
o-Detect and remove outliers
Binning Methods steps:
1-Sort the data values
2-Partition the data into either:
o-equal-depth (equal-frequency) bins, where each bin holds the same number of values, or
o-equal-width bins, where the range of values covered by each bin is constant.
3-Smooth the data:
o-Smooth by bin means: replace each bin value by the bin mean.
o-Smooth by bin medians: replace each bin value by the bin median.
o-Smooth by bin boundaries: replace each bin value with the closest boundary value (min, max).
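The binning steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the sample `prices` list are made up for the example, equal-width binning assumes the values are not all identical, and ties in boundary smoothing go to the lower boundary.

```python
from statistics import mean, median


def equal_depth_bins(values, depth):
    """Sort the values, then cut them into bins of `depth` values each."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def equal_width_bins(values, k):
    """Sort the values, then split the range [min, max] into k equal intervals.

    Assumes max > min; some bins may end up empty.
    """
    ordered = sorted(values)
    lo, hi = ordered[0], ordered[-1]
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in ordered:
        idx = min(int((v - lo) / width), k - 1)  # the max value goes in the last bin
        bins[idx].append(v)
    return bins


def smooth_by_means(bins):
    """Replace each value by the mean of its bin."""
    return [[mean(b)] * len(b) if b else [] for b in bins]


def smooth_by_medians(bins):
    """Replace each value by the median of its bin."""
    return [[median(b)] * len(b) if b else [] for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        if not b:
            smoothed.append([])
            continue
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Hypothetical sample data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, depth=3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```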
How to apply regression:
1-Linear Regression: finding the best line to fit two variables, so that one variable can be used to predict the other.
2-Multiple Linear Regression: more than two attributes are involved and the data are fit into a multidimensional surface.
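As a rough sketch of the linear regression case, the code below fits a least-squares line to two variables and then smooths the second variable by replacing it with the fitted values. The function names and the sample points are assumptions made for the example, and it covers only the single-predictor case. Multiple linear regression would need a linear-algebra solver and is not shown here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Assumes xs and ys have the same length and xs are not all equal.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each noisy y by the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(smooth_with_line(xs, ys))
```

Once the line is fitted, the same `slope` and `intercept` can predict the second variable for an unseen value of the first.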