Semantic Analysis - Document Clustering Flashcards

1
Q

Document Clustering

A
  • Unsupervised ML
  • No pre-defined categories and no labels
  • Groups similar documents together (e.g., recipes)
  • Puts documents into a given number of clusters
2
Q

Classification

A
  • Supervised ML
  • Pre-defined set of categories (labels)

3
Q

Methods of clustering

A

All about distance metrics.

  • Centroid based: find the centers (centroids) of the clusters, then assign each document to the centroid it is closest to.
  • Hierarchical based: break the documents into two groups, then break each of those into two groups, and so on. Usually produces better clusters, but slower.
4
Q

K-means clustering

A
  • Most popular centroid-based clustering algorithm in NLP
  • Randomly pick k documents as the initial centroids
  • Measure closeness with a semantic/document similarity metric
  • Assign each document to the cluster of the centroid it is closest to
  • Recalculate each centroid as the exact center of its cluster
  • Reassign documents to clusters based on the new centroids
  • Repeat until no centroid moves and no document is reassigned
  • Simple and easy to understand
  • Fast compared to other clustering algorithms, with broad applications
  • To scale to massive datasets (think Google), index the centroids and, for each center, query the nearest points. Indexing takes more time up front, but access is very fast: roughly a 50x improvement in speed.
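The steps above can be sketched in pure Python (a minimal sketch with an illustrative `kmeans` function; a real NLP pipeline would run this over TF-IDF vectors with a library such as scikit-learn):

```python
import random

def kmeans(docs, k, iters=100, seed=0):
    # docs: list of document vectors (e.g., TF-IDF rows), as lists of floats
    rng = random.Random(seed)
    # Step 1: randomly pick k documents as the initial centroids
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = None
    for _ in range(iters):
        # Step 2: assign each document to its closest centroid
        new_assignment = []
        for doc in docs:
            dists = [sum((a - b) ** 2 for a, b in zip(doc, c)) for c in centroids]
            new_assignment.append(dists.index(min(dists)))
        # Step 4: stop when no document changes cluster
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recalculate each centroid as the mean of its cluster
        for j in range(k):
            members = [doc for doc, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

docs = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]]
labels = kmeans(docs, k=2)
# The two small vectors share one label, the two large vectors the other
```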

Problems

  • Need to choose the value of k, which is typically a guess
  • Sensitive to the random seeds used to pick the initial centroids; unlucky seeding gives poor clusters
  • Final cluster configuration depends on the initial centroid allocation
  • Convergence: can get stuck in a local optimum
5
Q

Hierarchical Clustering

A

AGNES (agglomerative nesting): start with every document as its own cluster, merge the most similar clusters together, and keep merging until all documents are in one super cluster. Ward's minimum variance is a way to decide which clusters to merge: pick the merge that creates the least variance among the documents within the resulting cluster.

DIANA (divisive analysis): start with the super cluster, then split, then split again, etc. Within a cluster, find the farthest outlier (the document least like the others), break it off into a new cluster, then move the documents that are more similar to it into that cluster. Repeat and keep dividing. Fully done when there is one document per cluster, but you can stop sooner: stop when continuing to divide would make it difficult to tell the clusters apart.

  • We can cut the dendrogram wherever we want.
  • No need to decide the number of clusters ahead of time, because you can cut at different points in the tree.
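AGNES can be sketched in a few lines (an illustrative `agnes` function; single-link distance is used here only to keep the sketch short, where the card's Ward's criterion would instead pick the merge minimizing within-cluster variance):

```python
def agnes(docs, target_clusters):
    # Start with every document as its own cluster (clusters hold doc indices)
    clusters = [[i] for i in range(len(docs))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(c1, c2):
        # Single link: distance between the closest pair of members
        return min(dist(docs[i], docs[j]) for i in c1 for j in c2)

    # Keep merging the two closest clusters until the target count is reached
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

docs = [[0.0], [0.2], [5.0], [5.3], [9.0]]
groups = agnes(docs, 3)  # -> [[0, 1], [2, 3], [4]]
```

Running the merge loop all the way to one cluster and recording each merge gives the dendrogram, which can then be cut at any level.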
6
Q

Principal Component Analysis

A
  • Dimensionality reduction
  • Lets you make a 2D plot even when there are 30,000+ dimensions
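The core idea can be sketched by finding the first principal component of 2-D data with power iteration (an illustrative `pca_top_component` function; real PCA libraries handle any dimensionality and all components at once):

```python
import math

def pca_top_component(points, iters=100):
    n = len(points)
    # Center the data
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]
    ys = [p[1] - my for p in points]
    # 2x2 covariance matrix
    cxx = sum(x * x for x in xs) / n
    cyy = sum(y * y for y in ys) / n
    cxy = sum(x * y for x, y in zip(xs, ys)) / n
    # Power iteration: repeatedly multiply a unit vector by the covariance
    # matrix until it settles on the direction of greatest variance
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v

points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1)]
pc1 = pca_top_component(points)
# Data lies near the line y = x, so pc1 is roughly (0.707, 0.707)
```

Projecting documents onto the top two components gives the 2D plot, however many dimensions the original vectors had.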

7
Q

Labelling clusters

A
  • Take the top n features by TF-IDF value from the centroid (the average over all docs in a particular cluster)
  • List the top 3 or 4 features to get a quick label
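A minimal sketch of this labelling idea, with an illustrative `label_cluster` function and a hand-rolled TF-IDF (a real pipeline would reuse the vectorizer that produced the clusters):

```python
import math
from collections import Counter

def label_cluster(cluster_docs, all_docs, top_n=3):
    # Document frequency of each term across the whole collection
    n_docs = len(all_docs)
    df = Counter(term for doc in all_docs for term in set(doc))
    # Build the cluster centroid: average TF-IDF weight of each term
    centroid = Counter()
    for doc in cluster_docs:
        tf = Counter(doc)
        for term, count in tf.items():
            centroid[term] += (count / len(doc)) * math.log(n_docs / df[term])
    for term in centroid:
        centroid[term] /= len(cluster_docs)
    # The top-n features of the centroid serve as a quick label
    return [term for term, _ in centroid.most_common(top_n)]

all_docs = [
    "flour sugar flour butter".split(),
    "flour sugar sugar vanilla".split(),
    "engine oil brakes tires".split(),
    "engine oil oil transmission".split(),
]
label = label_cluster(all_docs[:2], all_docs)  # the "recipe" cluster
# Cluster-specific terms rise to the top; terms absent from the cluster score zero
```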
8
Q

Cluster Cohesion

A

Measures how closely related objects are in a cluster

9
Q

Cluster Separation

A

Measures how distinct or well-separated a cluster is from other clusters. Clusters should be as far apart as possible. If they aren’t far apart, then they should be one cluster.
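Both internal measures (cohesion and separation) can be sketched with simple centroid math; the function names here are illustrative, and real evaluations often use related scores such as the silhouette coefficient:

```python
def centroid(points):
    # Mean of a cluster's vectors, coordinate by coordinate
    return [sum(col) / len(points) for col in zip(*points)]

def cohesion(cluster):
    # Within-cluster sum of squared errors (lower = more cohesive)
    c = centroid(cluster)
    return sum(sum((x, m)[0] ** 0 and (x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

tight = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]]
loose = [[5.0, 5.0], [7.0, 5.0], [5.0, 7.0]]

def separation(cluster_a, cluster_b):
    # Squared distance between centroids (higher = better separated)
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    return sum((x - y) ** 2 for x, y in zip(ca, cb))

# tight is more cohesive than loose, and the two clusters are well separated
```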

10
Q

F measure

A

Harmonic mean of precision and recall
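As a one-liner (an illustrative `f_measure` function; beta=1 gives the familiar balanced F1 score):

```python
def f_measure(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall; beta=1 -> F1
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

score = f_measure(0.5, 1.0)  # 2 * 0.5 * 1.0 / (0.5 + 1.0) = 0.666...
```

The harmonic mean punishes imbalance: perfect recall cannot compensate for poor precision, and vice versa.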

11
Q

Evaluating Cluster Quality

A

Internal measures

  • Cluster Cohesion
  • Cluster Separation

External measures

  • Entropy
  • F-measure
12
Q

Entity Resolution

A

Identifying and linking/grouping different manifestations of the same real-world object, e.g., references to the same person in text (you can't go just by the name), web pages of the same business, etc.

13
Q

Clustering on graphs

A

Querying over a graph (traversing it) is faster than joining many tables, especially at the scale of a trillion records.
