Terms Flashcards

1
Q

Girvan-Newman

A

Calculate betweenness in a network (using the leaf / parent credit-scoring system): run a BFS from every node, give each node one credit plus the credits of the edges below it, and split that credit over the edges to its parents in proportion to the parents' shortest-path counts. Summing the edge credits over all BFS roots and halving gives the edge betweenness.
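A sketch of the two BFS passes, assuming the graph is a plain dict of node to set of neighbours (the function name edge_betweenness is mine, not from the course):

```python
from collections import defaultdict, deque

def edge_betweenness(graph):
    # graph: dict mapping each node to the set of its neighbours (undirected).
    betweenness = defaultdict(float)
    for root in graph:
        # Pass 1 (top-down BFS): count the shortest paths from root to each node.
        level = {root: 0}
        paths = {root: 1}
        parents = defaultdict(list)   # DAG edges pointing back towards the root
        order = []                    # nodes in BFS visit order
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in graph[node]:
                if nb not in level:                  # first visit: one level deeper
                    level[nb] = level[node] + 1
                    paths[nb] = 0
                    queue.append(nb)
                if level[nb] == level[node] + 1:     # node is a parent of nb
                    paths[nb] += paths[node]
                    parents[nb].append(node)
        # Pass 2 (bottom-up): every non-root node starts with credit 1 (leaves
        # keep exactly that) and splits its credit over the edges to its
        # parents in proportion to the parents' path counts.
        credit = {node: 1.0 for node in order if node != root}
        for node in reversed(order):
            if node == root:
                continue
            for par in parents[node]:
                share = credit[node] * paths[par] / paths[node]
                betweenness[frozenset((node, par))] += share
                if par != root:
                    credit[par] += share
    # Every path was counted from both of its endpoints, so halve the totals.
    return {edge: b / 2 for edge, b in betweenness.items()}

print(edge_betweenness({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}))
# both edges get betweenness 2.0: each lies on two shortest paths
```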

2
Q

Betweenness

A

How often an edge is used on paths between nodes: the number of shortest paths between pairs of nodes that run over the edge. The more shortest paths use an edge, the higher its betweenness.

3
Q

Support (itemsets)

A

The number of baskets of which the itemset is a subset, i.e. the number of baskets that contain every item in the itemset.

4
Q

Confidence (if I then j)

A

The support for I and j together divided by the support for I. (For "if diapers then beer": of the baskets containing diapers, the fraction that also contain beer.) If it is 1, the rule always holds.

5
Q

Interest (If I then j)

A

How interesting is a rule? "If diapers then beer" is not interesting if beer is in every basket anyway. Interest = confidence of "if I then j" minus the fraction of baskets that contain j.
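A tiny worked example tying support, confidence, and interest together (the basket data is made up for illustration):

```python
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "bread"},
]

def support(itemset):
    # Number of baskets of which the itemset is a subset.
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j):
    # support(I and j together) / support(I)
    return support(I | {j}) / support(I)

def interest(I, j):
    # confidence("if I then j") minus the fraction of baskets containing j
    return confidence(I, j) - support({j}) / len(baskets)

print(support({"diapers", "beer"}))      # 2
print(confidence({"diapers"}, "beer"))   # 2/3
print(interest({"diapers"}, "beer"))     # 2/3 - 3/4 = -1/12: not interesting
```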

6
Q

Support threshold

A

The minimum number of baskets s an itemset must appear in to count as frequent: we look for all itemsets that appear in at least s baskets.

7
Q

What is a triangular matrix used for?

A

A space-efficient way to store the counts of pairs: since the count of pair (2,1) is the same as the count of pair (1,2), only one entry per unordered pair is kept in a one-dimensional array.
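A sketch of the usual one-dimensional layout, assuming items are numbered 0..n-1 (tri_index and this exact formula are one common convention, not the only one):

```python
def tri_index(i, j, n):
    # Index of the unordered pair {i, j} (0-based items, i != j) in a 1-D
    # array of length n*(n-1)//2 that stores one count per unordered pair.
    if i > j:
        i, j = j, i            # the count of (2,1) is the count of (1,2)
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

n = 4
counts = [0] * (n * (n - 1) // 2)        # one slot per unordered pair
counts[tri_index(2, 1, n)] += 1
counts[tri_index(1, 2, n)] += 1
assert counts[tri_index(1, 2, n)] == 2   # both updates land in the same slot
```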

8
Q

What does the A-priori algorithm do?

A

Finds frequent k-itemsets that appear in at least s baskets. First count all single items and filter out the infrequent ones. Then construct candidate (k+1)-itemsets from the frequent k-itemsets, count them, and filter again. Repeat until your k is reached.
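A compact sketch, assuming baskets are Python sets of items (the function name and return shape are my choices):

```python
from collections import Counter
from itertools import combinations

def apriori(baskets, s, max_k):
    # Pass 1: count single items and keep the frequent ones.
    item_counts = Counter(item for b in baskets for item in b)
    frequent = {1: {frozenset([i]): c
                    for i, c in item_counts.items() if c >= s}}
    for k in range(2, max_k + 1):
        prev = frequent[k - 1]
        prev_items = set().union(*prev) if prev else set()
        counts = Counter()
        for b in baskets:
            # Only frequent items can appear in a frequent k-itemset.
            for combo in combinations(sorted(b & prev_items), k):
                # Monotonicity: all (k-1)-subsets must already be frequent.
                if all(frozenset(sub) in prev
                       for sub in combinations(combo, k - 1)):
                    counts[frozenset(combo)] += 1
        frequent[k] = {iset: c for iset, c in counts.items() if c >= s}
    return frequent

print(apriori([{"beer", "diapers"}, {"beer", "diapers", "milk"},
               {"beer", "bread"}], s=2, max_k=2))
# k=2 yields only frozenset({'beer', 'diapers'}) with count 2
```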

9
Q

What does the PCY algorithm do?

A

Like A-priori, but it reduces the number of pairs that need to be counted: during the first pass (k = 1, while counting single items), every pair is hashed to a bucket and the bucket count is incremented. When constructing candidate pairs, a pair only becomes a candidate if it hashes to a 'frequent bucket' (a bucket whose total count is at least s).
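A two-pass sketch, assuming baskets are sets; Python's built-in hash is consistent within a single run, which is all PCY needs here:

```python
from collections import Counter
from itertools import combinations

def pcy_pass1(baskets, s, n_buckets):
    # While counting single items, also hash every pair into a bucket.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for b in baskets:
        item_counts.update(b)
        for pair in combinations(sorted(b), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]   # frequent-bucket bitmap
    return frequent_items, bitmap

def pcy_pass2(baskets, s, n_buckets, frequent_items, bitmap):
    # A pair is only counted if both items are frequent AND it hashed
    # to a frequent bucket in pass 1.
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b & frequent_items), 2):
            if bitmap[hash(pair) % n_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```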

10
Q

What does the multistage algorithm do?

A

Like PCY, but with a second hashing stage. The first hash table produces many false positives (multiple infrequent pairs can hash to the same bucket and together push its count above s). In the extra pass we generate each pair again and, iff it hashed to a frequent bucket in hash table 1, hash it into hash table 2. A candidate pair must then be in frequent buckets of both tables, after which we count again.

11
Q

What does the multihash do?

A

Like multistage, but it doesn't require the extra pass through the baskets. It uses multiple hash tables and hashes every pair into ALL of them during the first pass, incrementing the pair's bucket in each table. A pair only becomes a candidate if it lands in a frequent bucket in ALL hash tables (this reduces false positives).

12
Q

Random Sampling (itemsets)

A

If there is too much data, we can pick a random sample and run A-priori / PCY on it. We then validate the resulting itemsets against the entire dataset.

13
Q

SON algorithm (itemsets)

A

Divide the data into parts. For each part, run A-priori / PCY with a proportionally lowered threshold. Then join the results into one candidate set and validate it with a counting pass over the entire dataset.
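A sketch of the chunk-then-verify structure; for brevity it counts pairs directly per chunk instead of running full A-priori / PCY inside each chunk:

```python
from collections import Counter
from itertools import combinations

def son_pairs(baskets, s, n_chunks):
    chunk_size = (len(baskets) + n_chunks - 1) // n_chunks
    candidates = set()
    # Phase 1: find locally frequent pairs per chunk with a scaled threshold.
    for start in range(0, len(baskets), chunk_size):
        chunk = baskets[start:start + chunk_size]
        local_s = max(1, s * len(chunk) // len(baskets))
        counts = Counter(p for b in chunk
                         for p in combinations(sorted(b), 2))
        candidates |= {p for p, c in counts.items() if c >= local_s}
    # Phase 2: one counting pass over ALL baskets to verify the candidates.
    totals = Counter(p for b in baskets
                     for p in combinations(sorted(b), 2) if p in candidates)
    return {p: c for p, c in totals.items() if c >= s}
```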

14
Q

What is the monotonicity property?

A

An itemset cannot be frequent in the entire dataset if it is not frequent in at least one of the parts. (This is why SON produces no false negatives.)

15
Q

How does hierarchical clustering work?

A

Start with every point as its own cluster, then iteratively join the two clusters with the closest centroids. Stop when you are down to K clusters.
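A naive O(n^3) sketch with centroid linkage, assuming points are tuples (function names are mine):

```python
import math

def centroid(cluster):
    # Mean point of a cluster of equal-length tuples.
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster)
                 for d in range(dims))

def hierarchical(points, k):
    clusters = [[p] for p in points]          # every point starts alone
    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest...
        _, i, j = min((math.dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] += clusters.pop(j)        # ...and merge them.
    return clusters

print(hierarchical([(0, 0), (0, 1), (5, 5), (6, 5)], k=2))
# [[(0, 0), (0, 1)], [(5, 5), (6, 5)]]
```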

16
Q

How does K-means clustering work?

A

Pick a K, then choose K random points as the initial cluster centroids. Place each point in the cluster whose centroid is closest, then update the centroids of all clusters. Reassign all points to their closest cluster and repeat until no more points move.
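A minimal sketch, assuming points are tuples and initial centroids are drawn at random from the data:

```python
import math
import random

def kmeans(points, k, seed=0):
    centroids = random.Random(seed).sample(points, k)  # K random start points
    while True:
        # Assignment step: each point goes to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                  # no point moved: converged
            return clusters, centroids
        centroids = new

print(kmeans([(0, 0), (0, 1), (5, 5), (6, 5)], k=2))
```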

17
Q

How do you pick the right K for clustering?

A

Power law: for each K, plot the average distance of points to their centroid. There is a point (the elbow) after which increasing K barely lowers the average distance any more. That is the sweet spot for K.

18
Q

What is the Mahalanobis distance?

A

We assume the distances to a cluster's centroid are normally distributed in each dimension. The Mahalanobis distance is then the number of standard deviations that a point lies from the centroid.
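A sketch of the simplified form used in BFR, where the dimensions are assumed independent and each is normalized by its own standard deviation:

```python
import math

def mahalanobis(point, centroid, std):
    # Distance in each dimension, measured in that dimension's standard
    # deviations, combined Euclidean-style.
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, std)))

# A point might count as 'close enough' to a cluster if it lies within,
# say, 3 standard deviations of the centroid:
print(mahalanobis((3.0, 4.0), (1.0, 1.0), (1.0, 2.0)))  # sqrt(4 + 2.25) = 2.5
```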

19
Q

What statistics are recorded in the BFR algorithm?

A

The number of points N, the per-dimension sum of all points (SUM), and the per-dimension sum of squares of all points (SUMSQ).
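These three statistics are enough to recover the centroid and the per-dimension variance without keeping the points themselves, as this sketch shows:

```python
def cluster_summary(n, total, sumsq):
    # Derive centroid and per-dimension variance from N, SUM and SUMSQ.
    centroid = [s / n for s in total]
    variance = [sq / n - (s / n) ** 2 for sq, s in zip(sumsq, total)]
    return centroid, variance

# Adding a point only updates the three statistics; the point is discarded:
#   n += 1;  total[d] += x[d];  sumsq[d] += x[d] ** 2   for each dimension d
print(cluster_summary(2, [4.0, 6.0], [10.0, 20.0]))
# points (1,2) and (3,4): centroid [2.0, 3.0], variance [1.0, 1.0]
```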

20
Q

What does the Discard Set in BFR hold?

A

The Discard Set holds all points that are close enough to an existing cluster; they are folded into that cluster's statistics and then discarded.

21
Q

What does the Compression Set in BFR hold?

A

Mini-clusters of points that are close to each other but could not go into the DS. Where possible, these mini-clusters are merged.

22
Q

What does the Retained Set (RS) in BFR hold?

A

All points that could not be assigned to a cluster or to a mini-cluster in the previous round. They are joined with the unassigned points of the current round to try to form new mini-clusters, which go into the CS.

23
Q

What is the limitation of BFR?

A

Since it assumes a Gaussian distribution per dimension, only round or axis-aligned cigar-shaped clusters are recognized. Oddly shaped clusters are not.

24
Q

How does the CURE algorithm work?

A

Pick a large sample from the data and cluster it using hierarchical clustering. For each cluster, pick n representative points and shrink each representative 20% towards the cluster's centroid. Then compare each remaining point to all representative points and assign it to the cluster of the representative point it is closest to.
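A sketch of the representative-point step; the farthest-point heuristic below stands in for CURE's scattered-point selection, and all names are mine:

```python
import math

def representatives(cluster, n_reps=4, alpha=0.2):
    dims = len(cluster[0])
    c = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))
    # Pick scattered points: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from the reps chosen so far.
    reps = [max(cluster, key=lambda p: math.dist(p, c))]
    while len(reps) < min(n_reps, len(cluster)):
        reps.append(max(cluster,
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    # Shrink each representative alpha (e.g. 20%) towards the centroid.
    return [tuple(r[d] + alpha * (c[d] - r[d]) for d in range(dims))
            for r in reps]

def assign(point, reps_per_cluster):
    # reps_per_cluster: dict cluster_id -> list of representative points.
    # Assign the point to the cluster owning the closest representative.
    return min(reps_per_cluster,
               key=lambda cid: min(math.dist(point, r)
                                   for r in reps_per_cluster[cid]))
```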