Lecture 7 Flashcards: Data Reduction

1
Q

Data Reduction:

A

Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

2
Q

Why data reduction?

A

1-A database/data warehouse may store terabytes of data.

2-Complex data analysis/mining can take a very long time to run on the complete data set.

3
Q

The computational time spent on data reduction should not outweigh or “erase” the time saved by mining on a reduced data set size.

True or false?

A

True.

4
Q

Dimensionality Reduction

A

The process of reducing the number of random variables or attributes under consideration.

5
Q

Numerosity Reduction

A

Replace the original data volume with alternative, smaller forms of data representation:

1-Parametric methods (store only the model parameters)
-Regression
2-Non-parametric methods
-Histograms, Clustering, Sampling

6
Q

Data Compression

A

Transformations are applied to obtain a “compressed” representation of the original data.

7
Q

Data Reduction Strategies

A

1-Dimensionality Reduction
2-Numerosity Reduction
3-Data Compression

8
Q

Attribute Subset Selection

A

Reduces the data set size by detecting and removing attributes that are redundant or irrelevant to the mining task.

9
Q

How many possible subsets are there for n attributes?

A

There are 2^n possible subsets.
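
As a quick sanity check, a few lines of Python (the attribute names are invented for illustration) enumerate every subset and confirm the 2^n count:

```python
from itertools import combinations

attributes = ["age", "income", "zip_code"]  # hypothetical attributes, n = 3

# Enumerate every subset (including the empty set) by size.
subsets = [combo
           for size in range(len(attributes) + 1)
           for combo in combinations(attributes, size)]

print(len(subsets))  # 8 == 2 ** 3
```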

10
Q

Basic heuristic methods of attribute subset selection:

A

1-Stepwise forward selection
2-Stepwise backward elimination
3-Combination of forward selection and backward elimination
4-Decision tree induction

11
Q

Stepwise forward selection:

A

1-Starts with an empty set of attributes as the reduced set.
2-The best of the original attributes is added to the reduced set.
3-At each step, the best of the remaining original attributes is added to the set.
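
A minimal sketch of this greedy loop, assuming scikit-learn is available; the iris data, the logistic-regression scorer, and the stopping rule of 2 attributes are arbitrary stand-ins for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def score(cols):
    """Cross-validated accuracy using only the attributes in `cols`."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

# Start with an empty reduced set; at each step add the best remaining attribute.
selected, remaining = [], list(range(X.shape[1]))
for _ in range(2):  # stopping after 2 attributes is an arbitrary choice
    best = max(remaining, key=lambda c: score(selected + [c]))
    selected.append(best)
    remaining.remove(best)

print(selected)  # indices of the chosen attributes
```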

12
Q

Stepwise backward elimination:

A

1-Starts with the full set of attributes.

2-At each step, it removes the worst attribute remaining in the set.
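
The same heuristic in reverse. One off-the-shelf realization (not necessarily the lecture's) is scikit-learn's SequentialFeatureSelector with direction="backward"; the dataset and target subset size are again arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from all 4 attributes and greedily drop the one whose removal
# hurts cross-validated accuracy the least, until 2 remain.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="backward",
)
selector.fit(X, y)
print(selector.get_support())  # True marks the attributes that survived
```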

13
Q

Combination of forward selection and backward elimination:

A

At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.

14
Q

Decision tree induction:

A

1-Decision tree algorithms were originally intended for classification.

2-When decision tree induction is used for attribute subset selection:
-A tree is constructed from the given data.
-All attributes that do not appear in the tree are assumed to be irrelevant.
-The set of attributes appearing in the tree forms the reduced subset of attributes.

also, check slide 8 :p
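
A sketch of this idea with scikit-learn's DecisionTreeClassifier (the iris data is just a placeholder): fit a tree, then keep only the attributes that actually appear as split tests.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# tree_.feature holds the attribute index tested at each internal node,
# and -2 for leaf nodes, so filter those out.
used = {data.feature_names[i] for i in tree.tree_.feature if i >= 0}
print(used)  # the reduced attribute subset
```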

15
Q

Regression

A

Regression can be used to approximate the given data.

check slide 9
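
A minimal parametric sketch, assuming roughly linear made-up data: fit y = wx + b and store only the two model parameters instead of all 100 points.

```python
import numpy as np

# Hypothetical 1-D data: y is roughly linear in x plus noise.
rng = np.random.default_rng(0)
x = np.arange(100.0)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=100)

# Fit y = w*x + b; the pair (w, b) replaces the raw data.
w, b = np.polyfit(x, y, deg=1)
print(w, b)        # the stored model parameters
print(w * 10 + b)  # approximate reconstruction of y at x = 10
```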

16
Q

Histograms

A

Divide data into buckets and store average (sum) for each bucket.

17
Q

Histogram partitioning rules:

A

1-Equal-width: The width of each bucket range is uniform.
2-Equal-depth (equal-frequency): Each bucket contains roughly the same number of data samples.
3-If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
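
A small NumPy sketch of both rules on made-up data: np.histogram gives equal-width buckets (with one stored summary value per bucket), while quantiles give equal-depth boundaries.

```python
import numpy as np

data = np.array([1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20], dtype=float)

# Equal-width: np.histogram splits the value range into buckets of uniform width.
counts, edges = np.histogram(data, bins=4)
print(edges)   # uniform-width bucket boundaries
print(counts)  # number of samples per bucket

# Store one summary value (here the mean) per bucket instead of the raw data.
bucket_ids = np.digitize(data, edges[1:-1])
print([data[bucket_ids == b].mean() for b in range(len(counts))])

# Equal-depth (equal-frequency): boundaries at quantiles, so each bucket
# holds roughly the same number of samples.
print(np.quantile(data, np.linspace(0, 1, 5)))
```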

18
Q

Clustering

A

Partition data set into clusters based on similarity and only store cluster representation (e.g., centroid and diameter)

More effective for data that can be organized into distinct clusters than for smeared data.
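
A sketch with scikit-learn's KMeans on synthetic blobs, storing only the cluster representation: one centroid per cluster plus a diameter, here approximated as twice the largest member-to-centroid distance.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hypothetical, well-separated blobs of 2-D points.
data = np.vstack([rng.normal(0, 0.5, (100, 2)),
                  rng.normal(5, 0.5, (100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Keep only (centroid, diameter) per cluster instead of all 200 points.
for k, center in enumerate(km.cluster_centers_):
    members = data[km.labels_ == k]
    radius = np.linalg.norm(members - center, axis=1).max()
    print(k, center, 2 * radius)
```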

19
Q

Sampling:

A

Obtaining a small data sample to represent the whole data set.

Choose a representative subset of the data
(Simple random sampling may have very poor performance in the presence of skew.)

20
Q

Types of Sampling:

A

Simple random sample without replacement
Simple random sample with replacement
Cluster sample
Stratified sample

21
Q

Simple random sampling without replacement (SRSWOR)

A

Once an object is selected, it is removed from the population

22
Q

Simple random sampling with replacement (SRSWR)

A

A selected object is not removed from the population
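
Both variants map directly onto NumPy's choice with its replace flag (the population of tuple IDs is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)  # hypothetical tuple IDs

# SRSWOR: each object can be drawn at most once.
srswor = rng.choice(population, size=10, replace=False)

# SRSWR: a drawn object stays in the population and may repeat.
srswr = rng.choice(population, size=10, replace=True)

print(srswor)
print(srswr)
```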

23
Q

Cluster sampling

A

Suitable if the tuples are grouped into mutually disjoint “clusters”

For example,

  • Tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster.
  • A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
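
A toy sketch of the page example, with hypothetical tuple IDs reshaped into equal-sized "pages": SRSWOR is applied to whole pages, and every tuple on a chosen page is kept.

```python
import numpy as np

rng = np.random.default_rng(0)
tuples = np.arange(10_000)        # hypothetical tuple IDs
pages = tuples.reshape(100, 100)  # 100 "pages" of 100 tuples each

# SRSWOR over pages, not individual tuples.
chosen = rng.choice(len(pages), size=5, replace=False)
cluster_sample = pages[chosen].ravel()
print(cluster_sample.shape)  # (500,) tuples from 5 whole pages
```
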
24
Q

Stratified sampling:

A

1-Partition the data set into disjoint groups called “strata”.
2-Obtain an SRS from each partition “stratum” (approximately the same percentage of the data).
3-This ensures a representative sample, especially when the data is skewed.
For example,
-A stratified sample may be obtained from customer data, where a partition is created for each customer age group.
-In this way, the age group having the smallest number of customers is sure to be represented.

check slide 16
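
A sketch of the age-group example with pandas' GroupBy.sample, taking the same fraction from each stratum so even the smallest group appears (the customer data is fabricated for illustration):

```python
import pandas as pd

# Hypothetical customer data, heavily skewed toward one age group.
customers = pd.DataFrame({
    "customer_id": range(1000),
    "age_group": ["young"] * 850 + ["middle"] * 120 + ["senior"] * 30,
})

# Draw the same 10% fraction from each stratum.
sample = customers.groupby("age_group", group_keys=False).sample(
    frac=0.1, random_state=0
)
print(sample["age_group"].value_counts())
```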

25
Q

Lossy vs. lossless data compression

A

check slide 17