Midterm Flashcards
5 V’s of Big Data
Value - Turning big data into value
Velocity - The speed at which data is generated and at which changes occur across the diverse data sets
Volume - The amount of data being generated
Variety - Data can be structured as well as unstructured
Veracity - Data reliability and trust
Data Mining
Extraction of interesting patterns or knowledge from huge amounts of data
Web Mining Framework
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
AKA
Data pre-processing
Data Mining
Post-processing
Patterns, Info, Knowledge
Data Mining on what data?
- Text files
- Database-oriented data sets and applications
- Advanced data sets and advanced applications
Supervised learning (classification)
Supervision: The training data are accompanied by labels indicating the class of the observations
- New data are classified based on the training set
Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc. - try to establish the existence of classes or clusters in the data
Classification and label prediction
- construct models based on some training examples
- describe and distinguish classes or concepts for future prediction
- predict the class, classify the new example
Regression
- Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
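A minimal regression sketch (assuming NumPy; the data values are made up): fit a linear model y ≈ a·x + b by least squares, then predict a new value.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # predictor variable
y = np.array([2.1, 3.9, 6.2, 8.1])   # continuous target variable

a, b = np.polyfit(x, y, deg=1)       # slope and intercept of the fitted line
print(a * 5.0 + b)                   # predicted value for x = 5
```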
Attribute
A property or characteristic of an object (columns)
Object
A collection of attributes describes an object (rows)
Types of Data sets
Record (Data matrix, documents, transactions)
Graph ( World Wide Web, molecular structures)
Ordered (spatial data, temporal data, sequential data, genetic sequence data)
Structured vs unstructured data
Important characteristics of structured data
Dimensionality - Many attributes per object
Sparsity - only presence counts
Resolution - Patterns depend on the scale
Distribution
Types of Attributes
Nominal - ID numbers, gender, zip codes
Ordinal - rankings, grades, height in {tall, medium, short}
Numeric Attribute Types:
Interval - measured on a scale of equal-sized units
Ratio - Inherent zero-point
Properties of Attribute Values
The type of an attribute depends on which of the following properties/operations it possesses:
Distinctness
Order
Differences are meaningful
Ratios are meaningful
(Nominal has distinctness; ordinal adds order; interval adds meaningful differences; ratio adds meaningful ratios)
Discrete vs Continuous Attributes
Discrete Attribute - Has only a finite or countably infinite set of values
- Sometimes represented as integer variables
- countable
- number of students, shoe size
Continuous attribute - measurable
- height, weight, length
- represented as floating-point variables
Similarity and Dissimilarity Measures
Similarity - numerical measure of how alike two data objects are
- Value is higher when objects are more alike
- Often falls in the range [0,1]
Dissimilarity - numerical measure of how different two data objects are
- Value is lower when objects are more alike
- minimum dissimilarity is often 0
Proximity refers to a similarity or dissimilarity
Cosine Similarity
The cosine measure can be used to measure the similarity between two document vectors: cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||)
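A small sketch of the measure (assuming NumPy; the term-count vectors are made up):

```python
import numpy as np

d1 = np.array([3, 0, 2, 1])   # term counts for document 1
d2 = np.array([1, 1, 2, 0])   # term counts for document 2

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos_sim)                # closer to 1 means more alike
```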
What is frequent pattern analysis?
Frequent pattern: a pattern that occurs frequently in a data set
Motivation: Finding inherent regularities in data
(absolute) support, or support count of X is
The frequency or occurrence count of an itemset X
(relative) support
Is the fraction of transactions that contains X
An itemset X is frequent IF
X’s support is no less than a minimum support threshold
support s is the probability that
a transaction contains X ∪ Y (i.e., both X and Y)
confidence c is the conditional probability that a transaction
having X also contains Y: c = P(Y|X) = support(X ∪ Y) / support(X)
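A worked example for the rule {milk} → {bread} over five made-up transactions:

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"milk", "bread"},
]

n = len(transactions)
count_milk = sum("milk" in t for t in transactions)             # 4
count_both = sum({"milk", "bread"} <= t for t in transactions)  # 3

support = count_both / n              # P(milk and bread) = 3/5 = 0.6
confidence = count_both / count_milk  # P(bread | milk)  = 3/4 = 0.75
print(support, confidence)
```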
Frequent itemsets
- An itemset that contains k items is a k-itemset
- rules that satisfy the minimum support and minimum confidence thresholds are considered strong rules
Basic association rule process
- Find all frequent itemsets - each of these itemsets must occur at least as frequently as the predetermined minimum support count
- Generate strong association rules from the frequent itemsets: These rules must satisfy the minimum support and minimum confidence
Apriori: A candidate generation and test approach
If there is any itemset which is infrequent, its superset should not be generated/tested
- in other words, all subsets of a frequent itemset must be frequent
General apriori method:
- scan dataset to get frequent 1-itemsets
- generate length (k+1) candidate itemsets from length k frequent itemsets
- test the candidates against dataset to obtain support counts
- terminate when no frequent or candidate set can be generated
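A minimal Apriori sketch following these steps (plain Python over small in-memory transaction sets; min_sup is an absolute support count):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    # Scan the dataset once to get the frequent 1-itemsets
    items = {item for t in transactions for item in t}
    freq = [{frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_sup}]
    k = 1
    while freq[-1]:
        # Generate length-(k+1) candidates from length-k frequent itemsets
        cands = {a | b for a in freq[-1] for b in freq[-1] if len(a | b) == k + 1}
        # Apriori property: prune candidates with any infrequent k-subset
        cands = {c for c in cands
                 if all(frozenset(s) in freq[-1] for s in combinations(c, k))}
        # Test surviving candidates against the dataset for support
        freq.append({c for c in cands
                     if sum(c <= t for t in transactions) >= min_sup})
        k += 1   # terminates when no frequent or candidate set remains
    return [s for level in freq for s in level]

txns = [frozenset(t) for t in ({"a","b"}, {"a","b","c"}, {"a","c"}, {"b","c"})]
print(apriori(txns, min_sup=2))
```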
Major Tasks in Data Preprocessing
Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data Integration
- Integration of multiple databases, data streams or files
Data reduction
- Dimensionality reduction
- Numerosity reduction
Data transformation and data discretization
- Normalization
Data Cleaning
incomplete, noisy, inconsistent
How to handle missing data?
Ignore the record, or fill it in automatically with a constant like NA, the attribute mean, or the attribute mean for all samples belonging to the same class (the smartest approach)
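A sketch of the class-conditional mean fill (assuming pandas; the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"class": ["a", "a", "b", "b"],
                   "income": [50.0, None, 70.0, None]})

# Fill each missing value with the attribute mean of its own class
df["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(df)
```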
How to handle noisy data?
- Binning - first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, medians, or boundaries
- Regression - smooth by fitting the data to regression functions
- Clustering - detect and remove outliers
- Combined computer and human inspection - detect suspicious values and check them manually
What is data integration?
Combining data from multiple sources into a coherent dataset
Schema integration - integrate metadata from different sources
Handling Redundancy in Data Integration
Object identification - the same real-world entity may appear under different names in different sources
Derivable data - one attribute may be derived from another attribute or table
Redundant attributes may be detected by correlation analysis and covariance analysis
Correlation Analysis (Nominal Data)
chi-squared test
χ² = Σ (Observed − Expected)² / Expected
The larger the χ² value, the more likely the variables are related
CORRELATION DOESN'T IMPLY CAUSALITY
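A sketch of the test (assuming SciPy; the 2x2 contingency counts are made up):

```python
from scipy.stats import chi2_contingency

observed = [[250, 200],   # e.g., likes sci-fi: plays chess / doesn't
            [50, 1000]]   # doesn't like sci-fi
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)  # a large chi2 (small p) suggests the attributes are related
```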
Covariance
How much do attributes change together
Positive covariance - if Cov(A,B) > 0, then A and B both tend to be larger than their expected values
Negative covariance - if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value
Independence - if A and B are independent, then Cov(A,B) = 0 (but Cov(A,B) = 0 does not imply independence)
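A quick sign check (assuming NumPy; the two series are made up), using Cov(A,B) = E[A·B] − E[A]·E[B]:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

cov_ab = np.mean(a * b) - np.mean(a) * np.mean(b)  # Cov(A,B) = E[AB] - E[A]E[B]
print(cov_ab)   # positive here, so A and B tend to rise together
```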
Data Reduction
Obtain a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results
Normalization is
Scaling data so it falls within a smaller, specified range, such as [0.0, 1.0]
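For example, min-max normalization maps values into [0, 1] via v' = (v − min) / (max − min). A sketch (assuming NumPy; the income values are made up):

```python
import numpy as np

v = np.array([12000.0, 73600.0, 98000.0])
v_norm = (v - v.min()) / (v.max() - v.min())  # v' = (v - min) / (max - min)
print(v_norm)
```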
Sampling
Main technique for data reduction
- Used because obtaining or processing the entire set of data of interest is too expensive or time consuming
Types of Sampling
- Simple random sampling - There is an equal probability of selecting any particular item
- Sampling without replacement - Once an object is selected, it is removed from the population
- Sampling with replacement
- Stratified sampling
- Partition data set and draw samples from each partition
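A sketch of the three schemes (assuming pandas; the DataFrame and strata column are made up):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a"] * 90 + ["b"] * 10, "x": range(100)})

simple = df.sample(n=10, replace=False)            # simple random, without replacement
with_rep = df.sample(n=10, replace=True)           # with replacement
stratified = df.groupby("group").sample(frac=0.1)  # ~10% drawn from each partition
print(len(simple), len(with_rep), len(stratified))
```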
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies
Discretization
- The process of converting a continuous attribute into an ordinal attribute
- A potentially infinite number of values are mapped into a small number of categories
- Discretization is commonly used in classification
Binning
Equal-width binning partitions based on a fixed bin width; equal-frequency binning partitions so each bin holds roughly the same number of values
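A sketch of both schemes (assuming pandas; the values are a small made-up series):

```python
import pandas as pd

v = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

equal_width = pd.cut(v, bins=3)   # partition by fixed bin width
equal_freq = pd.qcut(v, q=3)      # partition so each bin holds ~equal counts
print(equal_width.value_counts(), equal_freq.value_counts(), sep="\n")
```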
Unsupervised discretization
Finds breaks in the data values
Supervised discretization
Uses class labels to find breaks
Binarization
- Maps a continuous or categorical attribute into one or more binary values
- Typically used for association analysis
- continuous to categorical then categorical to binary
- Association analysis needs asymmetric binary attributes
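A sketch of the categorical-to-binary step via one-hot encoding (assuming pandas; the column is made up):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "medium", "large", "small"]})
binary = pd.get_dummies(df["size"])  # one binary attribute per category
print(binary)
```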