lec 7(done) Flashcards
Data Reduction:
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Why data reduction?
1-A database/data warehouse may store terabytes of data.
2-Complex data analysis/mining can take a very long time to run on the complete data set.
The computational time spent on data reduction should not outweigh or “erase” the time saved by mining on a reduced data set size.
True or false?
True.
Dimensionality Reduction
The process of reducing the number of random variables or attributes under consideration.
Numerosity Reduction
Replace the original data volume by alternative smaller forms of data representation
1-Parametric methods (store only the model parameters)
-Regression
2-Non-parametric methods
-Histograms, Clustering, Sampling
Data Compression
Transformations are applied to obtain a “compressed” representation of the original data.
Data Reduction Strategies
1-Dimensionality Reduction
2-Numerosity Reduction
3-Data Compression
Attribute Subset Selection
Reduces the data set size by detecting and removing attributes that are redundant or irrelevant to the mining task.
How many possible subsets are there for n attributes?
There are 2^n possible subsets.
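A quick way to see the 2^n count, sketched in Python with a few made-up attribute names:

```python
from itertools import combinations

def all_subsets(attributes):
    """Enumerate every subset of the given attribute list."""
    subsets = []
    for r in range(len(attributes) + 1):
        subsets.extend(combinations(attributes, r))
    return subsets

attrs = ["age", "income", "zip"]   # n = 3 hypothetical attributes
subs = all_subsets(attrs)
print(len(subs))  # 2**3 = 8 subsets, from () to ("age", "income", "zip")
```

This exponential blow-up is exactly why the heuristic (greedy) methods below are used instead of exhaustive search.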
Basic heuristic methods of attribute subset selection:
1-Stepwise forward selection
2-Stepwise backward elimination
3-Combination of forward selection and backward elimination
4-Decision tree induction
Stepwise forward selection:
1-Starts with an empty set of attributes as the reduced set.
2-The best of the original attributes is added to the reduced set.
3-At each step, the best of the remaining original attributes is added to the set.
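The three steps above can be sketched as a greedy loop. Here score() is a hypothetical stand-in for whatever evaluation measure the mining task uses (e.g. information gain or a validation score); the relevance table is invented purely for illustration:

```python
def forward_selection(attributes, score, k):
    """Stepwise forward selection: start with an empty reduced set and
    repeatedly add the remaining attribute that best improves score()."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: a hypothetical per-attribute relevance table (an assumption
# for illustration; real scores come from the mining task itself).
relevance = {"age": 0.9, "income": 0.7, "zip": 0.1}
score = lambda subset: sum(relevance[a] for a in subset)
print(forward_selection(relevance, score, 2))  # ['age', 'income']
```

Backward elimination is the mirror image: start with all attributes and repeatedly drop the one whose removal hurts score() the least.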
Stepwise backward elimination:
1-Starts with the full set of attributes.
2-At each step, it removes the worst attribute remaining in the set.
Combination of forward selection and backward elimination:
At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
Decision tree induction:
1-Decision tree algorithms were originally intended for classification.
2-When decision tree induction is used for attribute subset selection:
-A tree is constructed from the given data.
- All attributes that do not appear in the tree are assumed to be irrelevant.
-The set of attributes appearing in the tree forms the reduced subset of attributes.
also, check slide 8 :p
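A small sketch of the last step: once a tree has been constructed (here a hypothetical hand-written tree over attributes age/income/student/credit, in the spirit of the classic customer example), the reduced subset is just the attributes that appear as split tests:

```python
def attributes_in_tree(node):
    """Collect the attributes that appear as split tests in a decision
    tree; any attribute not returned is treated as irrelevant."""
    if not isinstance(node, dict):   # leaf: a class label
        return set()
    attrs = {node["attribute"]}
    for child in node["children"].values():
        attrs |= attributes_in_tree(child)
    return attrs

# Hypothetical constructed tree: 'income' never appears as a split,
# so it would be dropped from the reduced attribute set.
tree = {"attribute": "age",
        "children": {"youth": {"attribute": "student",
                               "children": {"yes": "buys", "no": "no"}},
                     "middle": "buys",
                     "senior": {"attribute": "credit",
                                "children": {"fair": "buys",
                                             "excellent": "no"}}}}
print(sorted(attributes_in_tree(tree)))  # ['age', 'credit', 'student']
```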
Regression
Regression can be used to approximate the given data.
check slide 9
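As a parametric numerosity-reduction sketch: fit a least-squares line and store only the two parameters (w, b) instead of all the (x, y) pairs. The data here is made up for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = w*x + b; the two parameters
    replace the full list of (x, y) pairs (parametric reduction)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1
w, b = fit_line(xs, ys)
print(w, b)                # 2.0 1.0
```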
Histograms
Divide data into buckets and store the average (or sum) for each bucket.
Histogram partitioning rules:
1-Equal-width: The width of each bucket range is uniform.
2-Equal-depth (equal-frequency): Each bucket contains roughly the same number of data samples.
3-If each bucket represents only a single attribute-value/ frequency pair, the buckets are called singleton buckets.
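An equal-width sketch: split the value range into uniform buckets and keep only a (count, sum) pair per bucket instead of the raw values. The price list is made up for illustration:

```python
def equal_width_histogram(data, n_buckets):
    """Divide the value range into n_buckets uniform-width ranges and
    store only [count, sum] per bucket instead of the raw values."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_buckets
    buckets = [[0, 0.0] for _ in range(n_buckets)]
    for v in data:
        i = min(int((v - lo) / width), n_buckets - 1)  # clamp the max value
        buckets[i][0] += 1
        buckets[i][1] += v
    return buckets

prices = [1, 1, 5, 5, 5, 8, 8, 10, 14, 15, 20, 21]
print(equal_width_histogram(prices, 2))  # [[8, 43.0], [4, 70.0]]
```

An equal-depth variant would instead sort the data and cut it into slices containing equal numbers of samples.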
Clustering
Partition data set into clusters based on similarity and only store cluster representation (e.g., centroid and diameter)
More effective for data that can be organized into distinct clusters than for smeared data.
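A minimal sketch of the "store cluster representation" step, assuming the clusters themselves have already been found (finding them is the job of a clustering algorithm such as k-means, not shown here):

```python
def cluster_summary(cluster):
    """Replace a cluster of 1-D points by (centroid, diameter):
    the mean, and the maximum distance between any two members."""
    centroid = sum(cluster) / len(cluster)
    diameter = max(cluster) - min(cluster)
    return centroid, diameter

# Hypothetical pre-computed clusters: 6 raw values are reduced to
# two (centroid, diameter) pairs.
clusters = [[2, 3, 4], [20, 22, 24]]
print([cluster_summary(c) for c in clusters])  # [(3.0, 2), (22.0, 4)]
```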
Sampling:
Obtaining a small data sample to represent the whole data set.
Choose a representative subset of the data
(Simple random sampling may have very poor performance in the presence of skew.)
Types of Sampling:
Simple random sample without replacement
Simple random sample with replacement
Cluster sample
Stratified sample
Simple random sampling without replacement (SRSWOR)
Once an object is selected, it is removed from the population
Simple random sampling with replacement (SRSWR)
A selected object is not removed from the population
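The two random-sampling schemes map directly onto Python's standard library (the population here is just an illustrative range):

```python
import random

random.seed(0)  # fixed seed, only so this sketch is repeatable
population = list(range(100))

# SRSWOR: each object can be picked at most once.
without = random.sample(population, 10)

# SRSWR: a picked object stays in the population, so duplicates may appear.
with_repl = random.choices(population, k=10)

print(len(set(without)))    # always 10: no duplicates possible
print(len(set(with_repl)))  # may be < 10 if some object was picked twice
```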
Cluster sampling
Suitable if the tuples are grouped into mutually disjoint “clusters”
For example,
- Tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster.
- A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
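The page example above can be sketched as follows; page_size and the row list are made-up illustrative values:

```python
import random

def cluster_sample(tuples, page_size, n_pages):
    """Group tuples into 'pages' (the clusters), then apply SRSWOR to
    the pages; every tuple on a chosen page enters the sample."""
    pages = [tuples[i:i + page_size]
             for i in range(0, len(tuples), page_size)]
    chosen = random.sample(pages, n_pages)          # SRSWOR over pages
    return [t for page in chosen for t in page]

rows = list(range(40))              # 40 tuples -> 10 pages of 4
sample = cluster_sample(rows, 4, 3)
print(len(sample))                  # 12 tuples: 3 whole pages of 4
```

Note that whole pages are kept or skipped together, which is what makes this cheap: one page read yields many sampled tuples.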
Stratified sampling:
1-Partition the data set into disjoint groups called “strata”.
2-Obtain an SRS from each partition “stratum” (approximately the same percentage of the data).
3-Ensure a representative sample, especially when the data is skewed.
For example,
-A stratified sample may be obtained from customer data, where a partition is created for each customer age group.
-In this way, even the age group with the smallest number of customers is sure to be represented.
check slide 16
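The customer-age example can be sketched like this; the data, the 10% fraction, and the "at least one per stratum" rule are illustrative assumptions:

```python
import random

def stratified_sample(records, key, fraction):
    """Partition records into strata by `key`, then take the same
    fraction (but at least one record) from each stratum via SRSWOR."""
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, n))
    return sample

# Hypothetical customer data: the 'senior' stratum is tiny, but it is
# still guaranteed a representative in the sample.
customers = ([("youth", i) for i in range(30)]
             + [("middle", i) for i in range(60)]
             + [("senior", i) for i in range(4)])
s = stratified_sample(customers, key=lambda c: c[0], fraction=0.1)
print(len(s))  # 3 youth + 6 middle + 1 senior = 10
```

A plain SRS of the same size could easily miss the senior group entirely, which is the skew problem noted above.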
Lossy vs. lossless data compression
check slide 17