Association Flashcards
Define support and confidence formulas
Support(ABC) = σ(ABC) / N. Confidence(AB->C) = σ(ABC) / σ(AB). Here σ(X) is the number of transactions containing X and N is the total number of transactions.
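A minimal sketch of these two formulas in Python, on a made-up toy transaction list (the data and names here are illustrative, not from the cards):

```python
# Toy transactions; sigma(X) = number of transactions containing every item of X.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]
N = len(transactions)

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    return sigma(itemset) / N                                  # sigma(ABC) / N

def confidence(antecedent, consequent):
    return sigma(antecedent | consequent) / sigma(antecedent)  # sigma(ABC) / sigma(AB)

print(support({"A", "B", "C"}))       # 2/4 = 0.5
print(confidence({"A", "B"}, {"C"}))  # 2/3 ≈ 0.67
```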
What do you know about (AB->C), (A->BC), (AC->B) and (ABC -> C)
The first three all share the same numerator, σ(ABC), so they have the same support; only the confidence denominators differ. The last one (ABC->C) doesn't make sense: C appears on both sides.
Apriori Principle
If a superset has large support, all of its subsets must have at least that support. Equivalently, if a subset is infrequent, all of its supersets must be infrequent.
Anti-monotone principle
This phenomenon is called what?
Use the book's word, and describe monotone and anti-monotone.
Similar, if not identical, to Apriori: once we find a subset that is "interesting" (satisfies the constraint), we don't have to check its supersets, since they are interesting too.
Upward closed.
Monotone: if an itemset satisfies a constraint, so do all of its supersets.
Anti-monotone: if an itemset violates a constraint, so do all of its supersets.
How many database scans does association mining with the Apriori method require?
Without Apriori: as many scans as the size of your largest k-itemset. With Apriori: as many as the size of the largest 'interesting' (frequent) k-itemset.
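A rough sketch of the level-wise loop, assuming a hypothetical `generate_candidates` helper (one version is sketched under the candidate-generation card below); the point is that each pass over the transactions handles one candidate size k:

```python
def apriori(transactions, minsup):
    # Scan 1: frequent 1-itemsets.
    items = {frozenset({i}) for t in transactions for i in t}
    frequent = {1: count_and_filter(items, transactions, minsup)}
    k = 1
    while frequent[k]:
        candidates = generate_candidates(frequent[k])          # F(k-1) x F(k-1) merge + prune
        frequent[k + 1] = count_and_filter(candidates, transactions, minsup)  # one more scan
        k += 1
    return frequent

def count_and_filter(candidates, transactions, minsup):
    counts = {c: 0 for c in candidates}
    for t in transactions:                    # the database scan for this level
        for c in counts:
            if c <= t:                        # candidate contained in the transaction
                counts[c] += 1
    return {c for c, n in counts.items() if n / len(transactions) >= minsup}
```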
4 ways to generate candidate k-itemsets.
Which require pruning?
- Brute force: generate every possible k-itemset. Prune.
- SQL join: join the DB table of (k-1)-itemsets to itself. Join logic: first k-2 items identical, last item different. Sounds familiar.
- Fk-1 x F1: combine each frequent (k-1)-itemset with each frequent 1-item. Avoid duplicates by keeping every itemset in lexicographic order. Prune.
- Fk-1 x Fk-1: merge two frequent (k-1)-itemsets only if their first k-2 items are identical and their last items differ.
All methods require pruning because all of them run the risk of generating too many candidates (including candidates with infrequent subsets).
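A sketch of the Fk-1 x Fk-1 merge plus Apriori subset pruning, with itemsets kept as sorted tuples (the example F2 is made up):

```python
from itertools import combinations

def generate_candidates(frequent_k_minus_1):
    """frequent_k_minus_1: set of sorted (k-1)-tuples that are frequent."""
    prev = sorted(frequent_k_minus_1)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            # Merge only if the first k-2 items match and the last items differ.
            if a[:-1] == b[:-1] and a[-1] != b[-1]:
                candidates.add(tuple(sorted(set(a) | set(b))))
    # Prune: every (k-1)-subset of a surviving candidate must itself be frequent.
    return {c for c in candidates
            if all(sub in frequent_k_minus_1 for sub in combinations(c, len(c) - 1))}

f2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
print(generate_candidates(f2))
# {('A', 'B', 'C')} -- ('B', 'C', 'D') is pruned because ('C', 'D') is not frequent
```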
An itemset is closed if…
An itemset is maximal if…
Closed: X is closed if every proper superset of X has strictly lower support.
Closed sets are saved; non-closed sets can be discarded, since their support can be recovered from the closed sets.
Maximal: X is maximal if X is frequent and none of its supersets are frequent.
Given this list of closed frequent itemsets:
{C} = 0.3 , {A,C} = 0.2
{B,E} = 0.3 , {B,C,E} = 0.2
What do you know about:
E? EC? D? A? B?
E = 0.3 (its closed superset {B,E} has support 0.3).
EC = 0.2 (its only closed superset is {B,C,E}).
A = 0.2 (from {A,C}); B = 0.3 (from {B,E}).
D appears in no closed frequent itemset, so D is not frequent; its exact support is unknown.
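A small sketch of the lookup rule used above (the support of a frequent itemset is the maximum support among its closed supersets), using the numbers from this card:

```python
closed = {
    frozenset({"C"}): 0.3,
    frozenset({"A", "C"}): 0.2,
    frozenset({"B", "E"}): 0.3,
    frozenset({"B", "C", "E"}): 0.2,
}

def support_from_closed(itemset):
    sups = [s for c, s in closed.items() if itemset <= c]
    return max(sups) if sups else None    # None: not covered, so not frequent

for q in [{"E"}, {"E", "C"}, {"A"}, {"B"}, {"D"}]:
    print(sorted(q), support_from_closed(frozenset(q)))
# ['E'] 0.3 / ['C', 'E'] 0.2 / ['A'] 0.2 / ['B'] 0.3 / ['D'] None
```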
What benefit comes from knowing maximal itemsets?
From them, you can derive all frequent itemsets: every non-empty subset of a maximal itemset is frequent.
You don't know their support, but you know they are frequent.
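A sketch of that derivation, enumerating the subsets of some hypothetical maximal itemsets:

```python
from itertools import combinations

maximal = [{"A", "C"}, {"B", "C", "E"}]   # hypothetical maximal frequent itemsets

def all_frequent(maximal_sets):
    freq = set()
    for m in maximal_sets:
        items = sorted(m)
        for r in range(1, len(items) + 1):
            freq.update(map(frozenset, combinations(items, r)))
    return freq

print(sorted(map(sorted, all_frequent(maximal))))
# [['A'], ['A','C'], ['B'], ['B','C'], ['B','C','E'], ['B','E'], ['C'], ['C','E'], ['E']]
```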
Confidence-based pruning:
Which of these do we care about?
ABC->D > threshold (pass)
ABC->D < threshold
C->ABD > threshold
C->ABD < threshold
The one we care about is ABC->D under threshold:
we can then prune every rule whose antecedent is a subset of ABC,
so AB->CD, AC->BD, BC->AD (and A->BCD, etc.) are out.
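A rough sketch of rule generation with this pruning for a single frequent itemset, assuming a σ(·) count function is available from the counting phase (the names here are illustrative):

```python
from itertools import combinations

def rules_with_pruning(itemset, sigma, minconf):
    """Rules from one frequent itemset. If a rule fails the confidence threshold,
    every rule with a larger consequent (smaller antecedent, so a larger
    denominator sigma(antecedent)) must also fail, and the branch is dropped."""
    itemset = frozenset(itemset)
    kept = []
    consequents = [frozenset({i}) for i in itemset]    # start with 1-item consequents
    while consequents:
        passing = []
        for cons in consequents:
            ante = itemset - cons
            if not ante:
                continue
            conf = sigma(itemset) / sigma(ante)
            if conf >= minconf:
                kept.append((ante, cons, conf))
                passing.append(cons)                   # only passing branches grow
        # Merge passing consequents that differ by one item to get the next size.
        consequents = list({a | b for a, b in combinations(passing, 2)
                            if len(a | b) == len(a) + 1})
    return kept
```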
How to implement Multiple Minimum Supports?
What does this do to your monotonicity?
Minimum support based on 1-itemsets.
MST(Coke) = 3%
MST(Diamonds) = 0.1%
MST(Coke, Diamonds) = min(MST(Coke), MST(Diamonds)) = 0.1%
Result is no longer anti-monotone.
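A tiny sketch of the min rule, with hypothetical per-item thresholds matching this card:

```python
MIS = {"coke": 0.03, "diamonds": 0.001}   # per-item minimum supports

def mis(itemset):
    return min(MIS[i] for i in itemset)

print(mis({"coke"}))                # 0.03
print(mis({"coke", "diamonds"}))    # 0.001
# {coke} can fail its 3% threshold while {coke, diamonds} clears its 0.1% one,
# which is why support pruning is no longer anti-monotone here.
```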
Give examples of constraints that are monotone and anti-monotone.
Sum with LessThan, i.e. sum(S) <= v over non-negative values, is anti-monotone; sum(S) >= v is monotone.
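A quick worked sketch of both behaviors on hypothetical non-negative prices:

```python
price = {"A": 2, "B": 5, "C": 9}   # made-up item prices

def total(itemset):
    return sum(price[i] for i in itemset)

# sum <= v can be lost by growing the itemset (anti-monotone constraint):
print(total({"A"}) <= 8, total({"A", "C"}) <= 8)   # True, False

# sum >= v, once satisfied, stays satisfied for supersets (monotone constraint):
print(total({"C"}) >= 8, total({"B", "C"}) >= 8)   # True, True
```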
What datatypes are common for attributes?
Boolean (binomial)
Discrete (categorical)
Integer (e.g., age)
Ordinal (ordered categories)
What's an eager classification method?
What is the other type? Give examples of both.
Eager methods build the model (e.g., the decision tree) in advance; lazy methods defer the work until a new instance needs classifying.
Eager:
Hunt's recursive decision-tree algorithm.
Lazy:
Naive Bayes; k-nearest neighbors (instance-based).
Name some preprocessing steps
Fusing data from multiple sources
Cleaning: removing duplicates and noise
Selecting relevant records
What's post-processing?
Usually data visualization.
Verifying the results with statistical measures and hypothesis testing.
Ensures only valid and useful results enter the Decision Support System.
Classification Vocab.
Give descriptive terms for attributes in classification.
The target variable is sometimes called the dependent variable.
The other attributes are called explanatory or independent variables.
What are the major methods used in cluster analysis?
K-Means (partitional; sounds similar to genetic programming)
Hierarchical: builds something like a family tree (a dendrogram)
Density-based methods
What is usually larger? The support ratio or the confidence ratio?
Confidence, always.
Support = σ(AB) / N
Confidence = σ(AB) / σ(A); since σ(A) <= N, confidence >= support.
Name some places where anti-monotonicity is handy
Obviously, Apriori support counting (support-based pruning).
Rule generation (if ABC->D doesn't satisfy the confidence threshold, then AB->CD won't).
The candidate-generation phase of constraint-based mining.
Discuss why infrequent itemsets can be important.
If {X, Y} is infrequent even when minsup is very low,
it is likely that the 'negative' itemsets {X, ¬Y}, {¬X, Y}, or {¬X, ¬Y} are interesting (a negative association).
Define lift, and how you screwed up the homework
Lift(X->Y) = sup(X∪Y) / (sup(X) * sup(Y)) = N * σ(X∪Y) / (σ(X) * σ(Y)).
There's an extra N in there: the N's in sup(XY) = σ(XY)/N and sup(X) = σ(X)/N cancel (that ratio is just the confidence), but sup(Y) = σ(Y)/N still carries an N, so written in counts an N reappears in the numerator.
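A tiny numeric sketch of that extra N, with made-up counts:

```python
N = 1000                                   # total transactions
sigma_X, sigma_Y, sigma_XY = 200, 250, 80  # hypothetical counts

lift = (sigma_XY / N) / ((sigma_X / N) * (sigma_Y / N))
print(lift)                                # 1.6 == N * sigma_XY / (sigma_X * sigma_Y)

print(sigma_XY / (sigma_X * sigma_Y))      # 0.0016 -- the "forgot the N" answer, off by N
```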
How are support and confidence affected as you go down a hierarchy,
say from Computer to Laptop to Dell?
For Computer -> Printer vs. Dell -> Printer:
Support decreases: fewer transactions contain Dell than contain Computer, so σ(Dell, Printer) <= σ(Computer, Printer).
Confidence tends to increase because the denominator σ(Dell) shrinks, but the numerator shrinks too, so it isn't guaranteed.
How does decreasing the support threshold increase computational complexity?
Two ways:
You get more frequent candidates, so more counting.
You get larger itemsets, so more passes over the data.
What 3 factors affect association computational complexity?
Number of items
Number of transactions
Average Transaction Width
3 factors we use to describe itemset-generation algorithms (and more!)
Complete: finds all the frequent itemsets.
Sound: finds only frequent (interesting) itemsets.
Efficient: avoids generating the same itemset twice.