Terms Flashcards

1
Q

Girvan-Newman

A

Calculate betweenness in a network (using the leaf / parent credit-scoring system): run a BFS from every node, give each node one credit plus the credits of the edges below it, and split that credit over the edges to its parents in proportion to the parents' shortest-path counts. Summing the edge credits over all BFS roots and halving gives the edge betweenness.
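A sketch of the two BFS passes, assuming the graph is a plain dict of node to set of neighbours (the function name edge_betweenness is mine, not from the course):

```python
from collections import defaultdict, deque

def edge_betweenness(graph):
    # graph: dict mapping each node to the set of its neighbours (undirected).
    betweenness = defaultdict(float)
    for root in graph:
        # Pass 1 (top-down BFS): count the shortest paths from root to each node.
        level = {root: 0}
        paths = {root: 1}
        parents = defaultdict(list)   # DAG edges pointing back towards the root
        order = []                    # nodes in BFS visit order
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in graph[node]:
                if nb not in level:                  # first visit: one level deeper
                    level[nb] = level[node] + 1
                    paths[nb] = 0
                    queue.append(nb)
                if level[nb] == level[node] + 1:     # node is a parent of nb
                    paths[nb] += paths[node]
                    parents[nb].append(node)
        # Pass 2 (bottom-up): every non-root node starts with credit 1 (leaves
        # keep exactly that) and splits its credit over the edges to its
        # parents in proportion to the parents' path counts.
        credit = {node: 1.0 for node in order if node != root}
        for node in reversed(order):
            if node == root:
                continue
            for par in parents[node]:
                share = credit[node] * paths[par] / paths[node]
                betweenness[frozenset((node, par))] += share
                if par != root:
                    credit[par] += share
    # Every path was counted from both of its endpoints, so halve the totals.
    return {edge: b / 2 for edge, b in betweenness.items()}

print(edge_betweenness({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}))
# both edges get betweenness 2.0: each lies on two shortest paths
```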

2
Q

Betweenness

A

How often an edge is used on paths between nodes: the number of shortest paths between pairs of nodes that run over the edge. The more shortest paths use an edge, the higher its betweenness.

3
Q

Support (itemsets)

A

The number of baskets of which the itemset is a subset, i.e. the number of baskets that contain every item in the itemset.

4
Q

Confidence (if I then j)

A

The support for I and j together divided by the support for I. (For "if diapers then beer": of the baskets containing diapers, the fraction that also contain beer.) If it is 1, the rule always holds.

5
Q

Interest (If I then j)

A

How interesting is a rule? "If diapers then beer" is not interesting if beer is in every basket anyway. Interest = confidence of "if I then j" minus the fraction of baskets that contain j.
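A tiny worked example tying support, confidence, and interest together (the basket data is made up for illustration):

```python
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "bread"},
]

def support(itemset):
    # Number of baskets of which the itemset is a subset.
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j):
    # support(I and j together) / support(I)
    return support(I | {j}) / support(I)

def interest(I, j):
    # confidence("if I then j") minus the fraction of baskets containing j
    return confidence(I, j) - support({j}) / len(baskets)

print(support({"diapers", "beer"}))      # 2
print(confidence({"diapers"}, "beer"))   # 2/3
print(interest({"diapers"}, "beer"))     # 2/3 - 3/4 = -1/12: not interesting
```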

6
Q

Support threshold

A

The minimum number of baskets s an itemset must appear in to count as frequent: we look for all itemsets that appear in at least s baskets.

7
Q

What is a triangular matrix used for?

A

A space-efficient way to store the counts of pairs: since the count of pair (2,1) is the same as the count of pair (1,2), only one entry per unordered pair is kept in a one-dimensional array.
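A sketch of the usual one-dimensional layout, assuming items are numbered 0..n-1 (tri_index and this exact formula are one common convention, not the only one):

```python
def tri_index(i, j, n):
    # Index of the unordered pair {i, j} (0-based items, i != j) in a 1-D
    # array of length n*(n-1)//2 that stores one count per unordered pair.
    if i > j:
        i, j = j, i            # the count of (2,1) is the count of (1,2)
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

n = 4
counts = [0] * (n * (n - 1) // 2)        # one slot per unordered pair
counts[tri_index(2, 1, n)] += 1
counts[tri_index(1, 2, n)] += 1
assert counts[tri_index(1, 2, n)] == 2   # both updates land in the same slot
```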

8
Q

What does the A-priori algorithm do?

A

Finds frequent k-itemsets that appear in at least s baskets. First count all single items and filter out the infrequent ones. Then construct candidate (k+1)-itemsets from the frequent k-itemsets, count them, and filter again. Repeat until your k is reached.
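A compact sketch, assuming baskets are Python sets of items (the function name and return shape are my choices):

```python
from collections import Counter
from itertools import combinations

def apriori(baskets, s, max_k):
    # Pass 1: count single items and keep the frequent ones.
    item_counts = Counter(item for b in baskets for item in b)
    frequent = {1: {frozenset([i]): c
                    for i, c in item_counts.items() if c >= s}}
    for k in range(2, max_k + 1):
        prev = frequent[k - 1]
        prev_items = set().union(*prev) if prev else set()
        counts = Counter()
        for b in baskets:
            # Only frequent items can appear in a frequent k-itemset.
            for combo in combinations(sorted(b & prev_items), k):
                # Monotonicity: all (k-1)-subsets must already be frequent.
                if all(frozenset(sub) in prev
                       for sub in combinations(combo, k - 1)):
                    counts[frozenset(combo)] += 1
        frequent[k] = {iset: c for iset, c in counts.items() if c >= s}
    return frequent

print(apriori([{"beer", "diapers"}, {"beer", "diapers", "milk"},
               {"beer", "bread"}], s=2, max_k=2))
# k=2 yields only frozenset({'beer', 'diapers'}) with count 2
```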

9
Q

What does the PCY algorithm do?

A

Like A-priori, but it reduces the number of pairs that need to be counted: during the first pass (k = 1, while counting single items), every pair is hashed to a bucket and the bucket count is incremented. When constructing candidate pairs, a pair only becomes a candidate if it hashes to a 'frequent bucket' (a bucket whose total count is at least s).
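A two-pass sketch, assuming baskets are sets; Python's built-in hash is consistent within a single run, which is all PCY needs here:

```python
from collections import Counter
from itertools import combinations

def pcy_pass1(baskets, s, n_buckets):
    # While counting single items, also hash every pair into a bucket.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for b in baskets:
        item_counts.update(b)
        for pair in combinations(sorted(b), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]   # frequent-bucket bitmap
    return frequent_items, bitmap

def pcy_pass2(baskets, s, n_buckets, frequent_items, bitmap):
    # A pair is only counted if both items are frequent AND it hashed
    # to a frequent bucket in pass 1.
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b & frequent_items), 2):
            if bitmap[hash(pair) % n_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```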

10
Q

What does the multistage algorithm do?

A

Like PCY, but with a second hashing stage. The first hash table produces many false positives (multiple infrequent pairs can hash to the same bucket and together push its count above s). In the extra pass we generate each pair again and, iff it hashed to a frequent bucket in hash table 1, hash it into hash table 2. A candidate pair must then be in frequent buckets of both tables, after which we count again.

11
Q

What does the multihash do?

A

Like multistage, but it doesn't require the extra pass through the baskets. It uses multiple hash tables and hashes every pair into ALL of them during the first pass, incrementing the pair's bucket in each table. A pair only becomes a candidate if it lands in a frequent bucket in ALL hash tables (this reduces false positives).

12
Q

Random Sampling (itemsets)

A

If there is too much data, we can pick a random sample and run A-priori / PCY on it. We then validate the resulting itemsets against the entire dataset.

13
Q

SON algorithm (itemsets)

A

Divide the data into parts. For each part, run A-priori / PCY with a proportionally lowered threshold. Then join the results into one candidate set and validate it with a counting pass over the entire dataset.
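A sketch of the chunk-then-verify structure; for brevity it counts pairs directly per chunk instead of running full A-priori / PCY inside each chunk:

```python
from collections import Counter
from itertools import combinations

def son_pairs(baskets, s, n_chunks):
    chunk_size = (len(baskets) + n_chunks - 1) // n_chunks
    candidates = set()
    # Phase 1: find locally frequent pairs per chunk with a scaled threshold.
    for start in range(0, len(baskets), chunk_size):
        chunk = baskets[start:start + chunk_size]
        local_s = max(1, s * len(chunk) // len(baskets))
        counts = Counter(p for b in chunk
                         for p in combinations(sorted(b), 2))
        candidates |= {p for p, c in counts.items() if c >= local_s}
    # Phase 2: one counting pass over ALL baskets to verify the candidates.
    totals = Counter(p for b in baskets
                     for p in combinations(sorted(b), 2) if p in candidates)
    return {p: c for p, c in totals.items() if c >= s}
```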

14
Q

What is the monotonicity property?

A

An itemset cannot be frequent in the entire dataset if it is not frequent in at least one of the parts. (This is why SON produces no false negatives.)

15
Q

How does hierarchical clustering work?

A

Start with every point as its own cluster, then iteratively join the two clusters with the closest centroids. Stop when you are down to K clusters.
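A naive O(n^3) sketch with centroid linkage, assuming points are tuples (function names are mine):

```python
import math

def centroid(cluster):
    # Mean point of a cluster of equal-length tuples.
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster)
                 for d in range(dims))

def hierarchical(points, k):
    clusters = [[p] for p in points]          # every point starts alone
    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest...
        _, i, j = min((math.dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] += clusters.pop(j)        # ...and merge them.
    return clusters

print(hierarchical([(0, 0), (0, 1), (5, 5), (6, 5)], k=2))
# [[(0, 0), (0, 1)], [(5, 5), (6, 5)]]
```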

16
Q

How does K-means clustering work?

A

Pick a K, then choose K random points as the initial cluster centroids. Place each point in the cluster whose centroid is closest, then update the centroids of all clusters. Reassign all points to their closest cluster and repeat until no more points move.
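A minimal sketch, assuming points are tuples and initial centroids are drawn at random from the data:

```python
import math
import random

def kmeans(points, k, seed=0):
    centroids = random.Random(seed).sample(points, k)  # K random start points
    while True:
        # Assignment step: each point goes to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                  # no point moved: converged
            return clusters, centroids
        centroids = new

print(kmeans([(0, 0), (0, 1), (5, 5), (6, 5)], k=2))
```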

17
Q

How do you pick the right K for clustering?

A

Power law: for each K, plot the average distance of points to their centroid. There is a point (the elbow) after which increasing K barely lowers the average distance any more. That is the sweet spot for K.

18
Q

What is the Mahalanobis distance?

A

We assume the distances to a cluster's centroid are normally distributed in each dimension. The Mahalanobis distance is then the number of standard deviations that a point lies from the centroid.
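A sketch of the simplified form used in BFR, where the dimensions are assumed independent and each is normalized by its own standard deviation:

```python
import math

def mahalanobis(point, centroid, std):
    # Distance in each dimension, measured in that dimension's standard
    # deviations, combined Euclidean-style.
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, std)))

# A point might count as 'close enough' to a cluster if it lies within,
# say, 3 standard deviations of the centroid:
print(mahalanobis((3.0, 4.0), (1.0, 1.0), (1.0, 2.0)))  # sqrt(4 + 2.25) = 2.5
```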

19
Q

What statistics are recorded in the BFR algorithm?

A

The number of points N, the per-dimension sum of all points (SUM), and the per-dimension sum of squares of all points (SUMSQ).
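These three statistics are enough to recover the centroid and the per-dimension variance without keeping the points themselves, as this sketch shows:

```python
def cluster_summary(n, total, sumsq):
    # Derive centroid and per-dimension variance from N, SUM and SUMSQ.
    centroid = [s / n for s in total]
    variance = [sq / n - (s / n) ** 2 for sq, s in zip(sumsq, total)]
    return centroid, variance

# Adding a point only updates the three statistics; the point is discarded:
#   n += 1;  total[d] += x[d];  sumsq[d] += x[d] ** 2   for each dimension d
print(cluster_summary(2, [4.0, 6.0], [10.0, 20.0]))
# points (1,2) and (3,4): centroid [2.0, 3.0], variance [1.0, 1.0]
```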

20
Q

What does the Discard Set in BFR hold?

A

The Discard Set holds all points that are close enough to an existing cluster; they are folded into that cluster's statistics and then discarded.

21
Q

What does the Compression Set in BFR hold?

A

Mini-clusters of points that are close to each other but could not go into the DS. Where possible, these mini-clusters are merged.

22
Q

What does the Retained Set (RS) in BFR hold?

A

All points that could not be assigned to a cluster or to a mini-cluster in the previous round. They are joined with the unassigned points of the current round to try to form new mini-clusters, which go into the CS.

23
Q

What is the limitation of BFR?

A

Since it assumes a Gaussian distribution per dimension, only round or axis-aligned cigar-shaped clusters are recognized. Oddly shaped clusters are not.

24
Q

How does the CURE algorithm work?

A

Pick a large sample from the data and cluster it using hierarchical clustering. For each cluster, pick n representative points and shrink each representative 20% towards the cluster's centroid. Then compare each remaining point to all representative points and assign it to the cluster of the representative point it is closest to.
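A sketch of the representative-point step; the farthest-point heuristic below stands in for CURE's scattered-point selection, and all names are mine:

```python
import math

def representatives(cluster, n_reps=4, alpha=0.2):
    dims = len(cluster[0])
    c = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))
    # Pick scattered points: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from the reps chosen so far.
    reps = [max(cluster, key=lambda p: math.dist(p, c))]
    while len(reps) < min(n_reps, len(cluster)):
        reps.append(max(cluster,
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    # Shrink each representative alpha (e.g. 20%) towards the centroid.
    return [tuple(r[d] + alpha * (c[d] - r[d]) for d in range(dims))
            for r in reps]

def assign(point, reps_per_cluster):
    # reps_per_cluster: dict cluster_id -> list of representative points.
    # Assign the point to the cluster owning the closest representative.
    return min(reps_per_cluster,
               key=lambda cid: min(math.dist(point, r)
                                   for r in reps_per_cluster[cid]))
```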