lec 7(done) Flashcards
Data Reduction:
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Why data reduction?
1-A database/data warehouse may store terabytes of data.
2-Complex data analysis/mining can take a very long time to run on the complete data set.
The computational time spent on data reduction should not outweigh or “erase” the time saved by mining on a reduced data set size.
True or false?
True.
Dimensionality Reduction
The process of reducing the number of random variables or attributes under consideration.
Numerosity Reduction
Replace the original data volume by alternative smaller forms of data representation
1-Parametric methods (store only the model parameters)
-Regression
2-Non-parametric methods
-Histograms, Clustering, Sampling
Data Compression
Transformations are applied to obtain a “compressed” representation of the original data.
Data Reduction Strategies
1-Dimensionality Reduction
2-Numerosity Reduction
3-Data Compression
Attribute Subset Selection
Reduces the data set size by detecting and removing attributes that are redundant or irrelevant to the mining task.
How many possible subsets are there for n attributes?
There are 2^n possible subsets.
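A quick way to see the 2^n count, sketched in Python with a few made-up attribute names:

```python
from itertools import combinations

def all_subsets(attributes):
    """Enumerate every subset of the given attribute list."""
    subsets = []
    for r in range(len(attributes) + 1):
        subsets.extend(combinations(attributes, r))
    return subsets

attrs = ["age", "income", "zip"]   # n = 3 hypothetical attributes
subs = all_subsets(attrs)
print(len(subs))  # 2**3 = 8 subsets, from () to ("age", "income", "zip")
```

This exponential blow-up is exactly why the heuristic (greedy) methods below are used instead of exhaustive search.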
Basic heuristic methods of attribute subset selection:
1-Stepwise forward selection
2-Stepwise backward elimination
3-Combination of forward selection and backward elimination
4-Decision tree induction
Stepwise forward selection:
1-Starts with an empty set of attributes as the reduced set.
2-The best of the original attributes is added to the reduced set.
3-At each step, the best of the remaining original attributes is added to the set.
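The three steps above can be sketched as a greedy loop. Here score() is a hypothetical stand-in for whatever evaluation measure the mining task uses (e.g. information gain or a validation score); the relevance table is invented purely for illustration:

```python
def forward_selection(attributes, score, k):
    """Stepwise forward selection: start with an empty reduced set and
    repeatedly add the remaining attribute that best improves score()."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: a hypothetical per-attribute relevance table (an assumption
# for illustration; real scores come from the mining task itself).
relevance = {"age": 0.9, "income": 0.7, "zip": 0.1}
score = lambda subset: sum(relevance[a] for a in subset)
print(forward_selection(relevance, score, 2))  # ['age', 'income']
```

Backward elimination is the mirror image: start with all attributes and repeatedly drop the one whose removal hurts score() the least.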
Stepwise backward elimination:
1-Starts with the full set of attributes.
2-At each step, it removes the worst attribute remaining in the set.
Combination of forward selection and backward elimination:
At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
Decision tree induction:
1-Decision tree algorithms were originally intended for classification.
2-When decision tree induction is used for attribute subset selection:
-A tree is constructed from the given data.
- All attributes that do not appear in the tree are assumed to be irrelevant.
-The set of attributes appearing in the tree forms the reduced subset of attributes.
also, check slide 8 :p
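A small sketch of the last step: once a tree has been constructed (here a hypothetical hand-written tree over attributes age/income/student/credit, in the spirit of the classic customer example), the reduced subset is just the attributes that appear as split tests:

```python
def attributes_in_tree(node):
    """Collect the attributes that appear as split tests in a decision
    tree; any attribute not returned is treated as irrelevant."""
    if not isinstance(node, dict):   # leaf: a class label
        return set()
    attrs = {node["attribute"]}
    for child in node["children"].values():
        attrs |= attributes_in_tree(child)
    return attrs

# Hypothetical constructed tree: 'income' never appears as a split,
# so it would be dropped from the reduced attribute set.
tree = {"attribute": "age",
        "children": {"youth": {"attribute": "student",
                               "children": {"yes": "buys", "no": "no"}},
                     "middle": "buys",
                     "senior": {"attribute": "credit",
                                "children": {"fair": "buys",
                                             "excellent": "no"}}}}
print(sorted(attributes_in_tree(tree)))  # ['age', 'credit', 'student']
```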
Regression
Regression can be used to approximate the given data.
check slide 9
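As a parametric numerosity-reduction sketch: fit a least-squares line and store only the two parameters (w, b) instead of all the (x, y) pairs. The data here is made up for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = w*x + b; the two parameters
    replace the full list of (x, y) pairs (parametric reduction)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1
w, b = fit_line(xs, ys)
print(w, b)                # 2.0 1.0
```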
Histograms
Divide data into buckets and store the average (or sum) for each bucket.
Histogram partitioning rules:
1-Equal-width: The width of each bucket range is uniform.
2-Equal-depth (equal-frequency): Each bucket contains roughly the same number of data samples.
3-If each bucket represents only a single attribute-value/ frequency pair, the buckets are called singleton buckets.
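An equal-width sketch: split the value range into uniform buckets and keep only a (count, sum) pair per bucket instead of the raw values. The price list is made up for illustration:

```python
def equal_width_histogram(data, n_buckets):
    """Divide the value range into n_buckets uniform-width ranges and
    store only [count, sum] per bucket instead of the raw values."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_buckets
    buckets = [[0, 0.0] for _ in range(n_buckets)]
    for v in data:
        i = min(int((v - lo) / width), n_buckets - 1)  # clamp the max value
        buckets[i][0] += 1
        buckets[i][1] += v
    return buckets

prices = [1, 1, 5, 5, 5, 8, 8, 10, 14, 15, 20, 21]
print(equal_width_histogram(prices, 2))  # [[8, 43.0], [4, 70.0]]
```

An equal-depth variant would instead sort the data and cut it into slices containing equal numbers of samples.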
Clustering
Partition data set into clusters based on similarity and only store cluster representation (e.g., centroid and diameter)
More effective for data that can be organized into distinct clusters than for smeared data.
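A minimal sketch of the "store cluster representation" step, assuming the clusters themselves have already been found (finding them is the job of a clustering algorithm such as k-means, not shown here):

```python
def cluster_summary(cluster):
    """Replace a cluster of 1-D points by (centroid, diameter):
    the mean, and the maximum distance between any two members."""
    centroid = sum(cluster) / len(cluster)
    diameter = max(cluster) - min(cluster)
    return centroid, diameter

# Hypothetical pre-computed clusters: 6 raw values are reduced to
# two (centroid, diameter) pairs.
clusters = [[2, 3, 4], [20, 22, 24]]
print([cluster_summary(c) for c in clusters])  # [(3.0, 2), (22.0, 4)]
```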
Sampling:
Obtaining a small data sample to represent the whole data set.
Choose a representative subset of the data
(Simple random sampling may have very poor performance in the presence of skew.)
Types of Sampling:
Simple random sample without replacement
Simple random sample with replacement
Cluster sample
Stratified sample
Simple random sampling without replacement (SRSWOR)
Once an object is selected, it is removed from the population
Simple random sampling with replacement (SRSWR)
A selected object is not removed from the population
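The two random-sampling schemes map directly onto Python's standard library (the population here is just an illustrative range):

```python
import random

random.seed(0)  # fixed seed, only so this sketch is repeatable
population = list(range(100))

# SRSWOR: each object can be picked at most once.
without = random.sample(population, 10)

# SRSWR: a picked object stays in the population, so duplicates may appear.
with_repl = random.choices(population, k=10)

print(len(set(without)))    # always 10: no duplicates possible
print(len(set(with_repl)))  # may be < 10 if some object was picked twice
```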
Cluster sampling
Suitable if the tuples are grouped into mutually disjoint “clusters”
For example,
- Tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster.
- A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
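The page example above can be sketched as follows; page_size and the row list are made-up illustrative values:

```python
import random

def cluster_sample(tuples, page_size, n_pages):
    """Group tuples into 'pages' (the clusters), then apply SRSWOR to
    the pages; every tuple on a chosen page enters the sample."""
    pages = [tuples[i:i + page_size]
             for i in range(0, len(tuples), page_size)]
    chosen = random.sample(pages, n_pages)          # SRSWOR over pages
    return [t for page in chosen for t in page]

rows = list(range(40))              # 40 tuples -> 10 pages of 4
sample = cluster_sample(rows, 4, 3)
print(len(sample))                  # 12 tuples: 3 whole pages of 4
```

Note that whole pages are kept or skipped together, which is what makes this cheap: one page read yields many sampled tuples.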
Stratified sampling:
1-Partition the data set into disjoint groups called “strata”.
2-Obtain an SRS from each partition “stratum” (approximately the same percentage of the data).
3-Ensure a representative sample, especially when the data is skewed.
For example,
-A stratified sample may be obtained from customer data, where a partition is created for each customer age group.
-In this way, even the age group with the smallest number of customers is sure to be represented.
check slide 16
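The customer-age example can be sketched like this; the data, the 10% fraction, and the "at least one per stratum" rule are illustrative assumptions:

```python
import random

def stratified_sample(records, key, fraction):
    """Partition records into strata by `key`, then take the same
    fraction (but at least one record) from each stratum via SRSWOR."""
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, n))
    return sample

# Hypothetical customer data: the 'senior' stratum is tiny, but it is
# still guaranteed a representative in the sample.
customers = ([("youth", i) for i in range(30)]
             + [("middle", i) for i in range(60)]
             + [("senior", i) for i in range(4)])
s = stratified_sample(customers, key=lambda c: c[0], fraction=0.1)
print(len(s))  # 3 youth + 6 middle + 1 senior = 10
```

A plain SRS of the same size could easily miss the senior group entirely, which is the skew problem noted above.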
Lossy vs. lossless data compression
check slide 17