Lecture 7 Flashcards
Data Reduction:
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Why data reduction?
1-A database/data warehouse may store terabytes of data.
2-Complex data analysis/mining can take a very long time to run on the complete data set.
The computational time spent on data reduction should not outweigh or “erase” the time saved by mining on a reduced data set size.
True or false?
True
Dimensionality Reduction
The process of reducing the number of random variables or attributes under consideration.
Numerosity Reduction
Replaces the original data volume with alternative, smaller forms of data representation.
1-Parametric methods (store only the model parameters)
-Regression
2-Non-parametric methods
-Histograms, Clustering, Sampling
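As an illustration of a non-parametric method, here is a minimal sketch of an equal-width histogram. The function name, the bucket count, and the sample values are made up for this example, and it assumes the values are not all identical.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize `values` as counts over equal-width buckets.

    Assumes max(values) > min(values), so the bucket width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp the index so the maximum value lands in the last bucket.
        index = min(int((v - lo) / width), num_buckets - 1)
        counts[index] += 1
    return lo, width, counts


# Made-up sample data: ten values are replaced by three bucket counts.
values = [1, 2, 2, 3, 5, 6, 8, 9, 10, 10]
lo, width, counts = equal_width_histogram(values, 3)
print(lo, width, counts)  # 1 3.0 [4, 2, 4]
```

Only the starting value, the bucket width, and the counts need to be stored instead of the raw values.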
Data Compression
Transformations are applied to obtain a “compressed” representation of the original data.
Data Reduction Strategies
1-Dimensionality Reduction
2-Numerosity Reduction
3-Data Compression
Attribute Subset Selection
Reduces the data set size by detecting and removing attributes that are redundant or irrelevant to the mining task.
How many possible subsets are there for n attributes?
There are 2^n possible subsets.
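To make the count concrete, a small sketch that enumerates every subset of an attribute list. The attribute names are hypothetical.

```python
from itertools import combinations


def all_subsets(attributes):
    """Return every subset of `attributes`, from the empty set to the full set."""
    subsets = []
    for size in range(len(attributes) + 1):
        subsets.extend(combinations(attributes, size))
    return subsets


# Hypothetical attribute names, used only for illustration.
attributes = ["age", "income", "city"]
subsets = all_subsets(attributes)
print(len(subsets))  # 2 ** 3 = 8
```

Because the count doubles with every extra attribute, checking every subset quickly becomes impractical as n grows.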
Basic heuristic methods of attribute subset selection:
1-Stepwise forward selection
2-Stepwise backward elimination
3-Combination of forward selection and backward elimination
4-Decision tree induction
Stepwise forward selection:
1-Starts with an empty set of attributes as the reduced set.
2-The best of the original attributes is added to the reduced set.
3-At each step, the best of the remaining original attributes is added to the set.
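The steps above can be sketched as a greedy loop. This is only a sketch: the notes do not say how "best" is measured or when to stop, so the `score` callable, the stopping rule (stop when the score no longer improves), and the toy data are all assumptions.

```python
def forward_selection(attributes, score):
    """Greedy stepwise forward selection.

    `score` is an assumed callable that rates a list of attributes
    (higher is better).
    """
    selected = []                      # step 1: start with an empty reduced set
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        # steps 2-3: pick the remaining attribute that helps the most
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break                      # assumed stopping rule: no improvement
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Toy scoring function (hypothetical): each attribute has a fixed usefulness,
# and every selected attribute costs a small penalty.
usefulness = {"age": 0.5, "income": 0.3, "city": 0.05, "id": 0.0}


def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.1 * len(subset)


print(forward_selection(list(usefulness), toy_score))  # ['age', 'income']
```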
Stepwise backward elimination:
1-Starts with the full set of attributes.
2-At each step, it removes the worst attribute remaining in the set.
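A matching sketch for backward elimination. Here "worst" is taken to mean the attribute whose removal leaves the highest score, and the loop stops once any removal would lower the score. Both choices, along with the `score` callable and the toy data, are assumptions made for illustration.

```python
def backward_elimination(attributes, score):
    """Greedy stepwise backward elimination.

    `score` is an assumed callable that rates a list of attributes
    (higher is better).
    """
    selected = list(attributes)        # step 1: start with the full set
    best_score = score(selected)
    while selected:
        # step 2: the worst attribute is the one we lose the least by dropping
        worst = max(selected, key=lambda a: score([x for x in selected if x != a]))
        reduced = [x for x in selected if x != worst]
        reduced_score = score(reduced)
        if reduced_score < best_score:
            break                      # assumed stopping rule: removal would hurt
        selected = reduced
        best_score = reduced_score
    return selected


# Toy scoring function (hypothetical): fixed usefulness per attribute,
# minus a small penalty for every attribute kept.
usefulness = {"age": 0.5, "income": 0.3, "city": 0.05, "id": 0.0}


def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.1 * len(subset)


print(backward_elimination(list(usefulness), toy_score))  # ['age', 'income']
```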
Combination of forward selection and backward elimination:
At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
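One possible reading of this card, sketched under assumptions: at each step the best remaining attribute moves into the reduced set and the worst remaining attribute is discarded from the candidate pool. The `score` callable, the stopping rule, and the toy data are hypothetical.

```python
def combined_selection(attributes, score):
    """Add the best remaining attribute and drop the worst one at each step.

    `score` is an assumed callable that rates a list of attributes
    (higher is better).
    """
    selected = []
    remaining = list(attributes)
    while remaining:
        # Rank candidates by how good the reduced set would be with each added.
        ranked = sorted(remaining, key=lambda a: score(selected + [a]))
        best, worst = ranked[-1], ranked[0]
        if score(selected + [best]) <= score(selected):
            break                      # assumed stopping rule: no improvement
        selected.append(best)
        remaining.remove(best)
        if worst in remaining:         # skipped when best and worst coincide
            remaining.remove(worst)
    return selected


# Toy scoring function (hypothetical): fixed usefulness per attribute,
# minus a small penalty for every attribute kept.
usefulness = {"age": 0.5, "income": 0.3, "city": 0.05, "id": 0.0}


def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.1 * len(subset)


print(combined_selection(list(usefulness), toy_score))  # ['age', 'income']
```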
Decision tree induction:
1-Decision tree algorithms were originally intended for classification.
2-When decision tree induction is used for attribute subset selection:
-A tree is constructed from the given data.
-All attributes that do not appear in the tree are assumed to be irrelevant.
-The set of attributes appearing in the tree forms the reduced subset of attributes.
Also, check slide 8.
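A minimal sketch of the idea: grow a tree and keep only the attributes it actually splits on. The notes do not name a tree algorithm, so the ID3-style information gain used here is an assumption, and the tiny data set is made up (the class depends on `outlook` and `windy`, while `code` is noise).

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting `rows` on `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def attributes_used_by_tree(rows, labels, attrs):
    """Grow an ID3-style tree and return the set of attributes it splits on."""
    if len(set(labels)) <= 1 or not attrs:
        return set()                   # pure node, or nothing left to split on
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    if information_gain(rows, labels, best) <= 0:
        return set()                   # no attribute separates the classes
    used = {best}
    rest = [a for a in attrs if a != best]
    for value in set(row[best] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        used |= attributes_used_by_tree(sub_rows, sub_labels, rest)
    return used


# Made-up data: play is "yes" only when outlook is sunny and windy is "no".
attrs = ["outlook", "windy", "code"]
data = [
    ("sunny", "no", "a", "yes"), ("sunny", "yes", "b", "no"),
    ("rainy", "no", "a", "no"), ("rainy", "yes", "b", "no"),
    ("sunny", "no", "b", "yes"), ("sunny", "yes", "a", "no"),
    ("rainy", "no", "b", "no"), ("rainy", "yes", "a", "no"),
]
rows = [dict(zip(attrs, d[:3])) for d in data]
labels = [d[3] for d in data]

reduced = attributes_used_by_tree(rows, labels, attrs)
print(sorted(reduced))  # ['outlook', 'windy']
```

The `code` attribute never appears in the tree, so it is treated as irrelevant and dropped.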
Regression
Regression can be used to approximate the given data.
Check slide 9.
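A minimal sketch of regression as a parametric reduction: fit a straight line by ordinary least squares and keep only its two parameters. The function name and the sample points are made up for illustration, and a simple linear model is assumed.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0

# The five (x, y) pairs can now be approximated from two stored numbers.
approx = [slope * x + intercept for x in xs]
```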