Semantic Analysis - Document Clustering Flashcards
Document Clustering
- Unsupervised ML
- No pre-defined categories; there are no labels
- Put similar documents together (e.g., recipes)
- Put documents into a given # of clusters
Classification
- Supervised ML
- Assign documents to a pre-defined set of categories (labels)
Methods of clustering
All about distance metrics.
- Centroid based: find where the centers (centroids) of the clusters are, then determine which centroid each document belongs to.
- Hierarchical based: repeatedly split the documents (break them into two groups, then break each of those into two groups, and so on). Usually gives better clusters, but slower.
K-means clustering
- Most popular algorithm in NLP for centroid clustering
- randomly pick k documents as the initial centroids
- use semantic similarity / document similarity as the distance measure
- assign each document to the cluster of the centroid it is closest to
- recompute each centroid as the exact center (mean) of its cluster
- documents get reassigned to the nearest new centroid
- keep going until no centroid needs to be moved and no documents are reassigned (the loop is sketched in code below)
- k-means is simple, easy to understand, fast compared to other clustering algorithms, and has broad applications
- to scale to massive datasets (think Google), index the centroids: for every centroid, query the nearest points. Indexing takes more time up front, but access is very fast (~50x improvement in speed)
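A minimal sketch of the loop above using scikit-learn (the toy documents are made up; TF-IDF vectors stand in for whatever semantic representation you use):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "bake the cake at 350 degrees",        # toy documents, two obvious topics
    "whisk eggs and sugar into the batter",
    "the court ruled on the appeal",
    "the judge dismissed the case",
]

X = TfidfVectorizer().fit_transform(docs)   # documents -> TF-IDF vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                           # cluster id per document, e.g. [0 0 1 1]
```

Note that scikit-learn's KMeans uses Euclidean distance; on length-normalized TF-IDF vectors (the TfidfVectorizer default) this behaves much like cosine similarity.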
Problems
- Need to choose the value of k ahead of time (typically just a guess)
- Weakness: starts from random seeds, so unlucky seeding can produce poor clusters (mitigations sketched below)
- Final cluster config depends on the initial centroid allocation
- Convergence: can get stuck in a local optimum
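The seeding and choose-k problems have standard workarounds; a hedged sketch (init="k-means++" and n_init are real KMeans options, the data is a stand-in):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # stand-in for a document-vector matrix

# init="k-means++" spreads the initial centroids apart; n_init=10 runs ten
# random seedings and keeps the best run (lowest inertia).
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)

# Crude "elbow" heuristic for guessing k: print inertia for each k and
# look for where the curve flattens out.
for k in range(2, 8):
    print(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
```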
Hierarchical Clustering
Agnes (agglomerative nesting) - start with every document as its own cluster, merge the most similar clusters together, and keep merging until all documents are in one super cluster. Ward's minimum variance is a way to choose which clusters to merge: pick the merge whose result creates the least variance among the documents within the cluster.
Diana (divisive analysis) - start with the super cluster, then split, then split again, etc. Within one cluster, find the farthest outlier (the doc least like the others) and break it off into a new cluster, then move the documents most similar to it into that cluster. Repeat; keep dividing. Fully done when there is one doc per cluster, but you can stop sooner: stop when continuing to divide would make it difficult to tell the difference between clusters.
- We can cut the dendrogram wherever we want.
- Don't have to decide ahead of time how many clusters, because you can cut at different points in the tree (AGNES with Ward's linkage is sketched below)
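A minimal AGNES-style sketch with SciPy, using Ward's minimum variance linkage (the data is a stand-in for document vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))      # stand-in for 12 document vectors

Z = linkage(X, method="ward")     # merge clusters by Ward's minimum variance
# "Cutting the dendrogram" at different points gives different cluster
# counts from the same tree:
print(fcluster(Z, t=3, criterion="maxclust"))   # 3 clusters
print(fcluster(Z, t=6, criterion="maxclust"))   # 6 clusters
```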
Principal Component Analysis
- Dimensionality reduction
- Gives a 2D plot even if there are 30,000+ dimensions (sketched below)
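A minimal sketch of the reduction (plain PCA on a dense stand-in; for sparse TF-IDF matrices, scikit-learn's TruncatedSVD plays the same role):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30000))   # stand-in: 100 docs in a 30,000-dim space

coords = PCA(n_components=2).fit_transform(X)   # project down to 2 dimensions
print(coords.shape)                             # (100, 2), ready to plot
```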
Labelling clusters
- Take the top-n features by TF-IDF value from the centroid, or from the average over all docs in a particular cluster
- List the top 3 or 4 features to get a quick label (sketched below)
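A minimal sketch of that labelling recipe: cluster toy documents, then read off the top TF-IDF features at each centroid:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["bake the cake batter", "whisk sugar into the batter",
        "the judge ruled on the case", "the court dismissed the appeal"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = np.array(vec.get_feature_names_out())
for i, centroid in enumerate(km.cluster_centers_):
    top = terms[np.argsort(centroid)[::-1][:3]]   # top 3 features at the centroid
    print(f"cluster {i}: {' / '.join(top)}")      # quick human-readable label
```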
Cluster Cohesion
Measures how closely related objects are in a cluster
Cluster Separation
Measures how distinct or well-separated a cluster is from other clusters. Clusters should be as far apart as possible. If they aren’t far apart, then they should be one cluster.
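One standard internal measure that combines cohesion and separation is the silhouette coefficient: near +1 means tight, well-separated clusters; near 0 means overlapping clusters that should probably be merged. A sketch on stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)),    # two well-separated blobs
               rng.normal(8, 1, (50, 10))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))            # close to 1 for this data
```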
F measure
Harmonic mean between precision and recall
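In symbols, for precision P and recall R (the standard definition): F = 2PR / (P + R). E.g., P = 0.5 and R = 1.0 give F ≈ 0.67; the harmonic mean is pulled toward the weaker of the two.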
Evaluating Cluster Quality
Internal measures
- Cluster Cohesion
- Cluster Separation
External measures
- Entropy
- F-measure
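A minimal sketch of the entropy measure (external, so it assumes gold labels are available): size-weighted average entropy of the label distribution inside each cluster, where lower is better:

```python
import numpy as np

def clustering_entropy(clusters, labels):
    """Size-weighted entropy of gold labels within each cluster (0 = pure)."""
    total, h = len(labels), 0.0
    for c in set(clusters):
        member = [l for cl, l in zip(clusters, labels) if cl == c]
        _, counts = np.unique(member, return_counts=True)
        p = counts / counts.sum()
        h += (len(member) / total) * -(p * np.log2(p)).sum()
    return h

print(clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 0.0 (pure)
print(clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"]))  # 1.0 (mixed)
```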
Entity Resolution
Identifying and linking/grouping different manifestations of the same real-world object, e.g., mentions of the same person in text (can't go just by the name), web pages of the same business, etc.
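A toy sketch of the grouping idea using plain string similarity (difflib is in the standard library; the mentions and the 0.6 threshold are made up, and real entity resolution uses far richer features than the name alone):

```python
from difflib import SequenceMatcher

mentions = ["Jon Smith", "John Smith", "J. Smith", "Acme Corp", "ACME Corporation"]

# Greedy grouping: attach each mention to the first group whose representative
# is similar enough, otherwise start a new group.
groups = []
for m in mentions:
    for g in groups:
        if SequenceMatcher(None, m.lower(), g[0].lower()).ratio() > 0.6:
            g.append(m)
            break
    else:
        groups.append([m])

print(groups)   # [['Jon Smith', 'John Smith', 'J. Smith'], ['Acme Corp', 'ACME Corporation']]
```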
Clustering on graphs
Querying over a graph (traversing the graph) is faster than joining many tables, especially when talking about a trillion records (illustrated below).
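A hedged illustration of why traversal wins: with an adjacency dict, friends-of-friends is two hops of dictionary lookups per node, while the relational equivalent self-joins the whole edge table (the data here is made up):

```python
# Adjacency-dict view of a graph: each hop is a constant-time lookup,
# independent of the total number of edges stored.
edges = {"ann": ["bob"], "bob": ["cai", "dee"], "cai": [], "dee": []}

def friends_of_friends(node):
    return {fof for f in edges.get(node, []) for fof in edges.get(f, [])}

print(friends_of_friends("ann"))   # {'cai', 'dee'}

# Relational equivalent: a self-join over the edge table, e.g.
#   SELECT e2.dst FROM edges e1 JOIN edges e2 ON e1.dst = e2.src
#   WHERE e1.src = 'ann';
# At trillion-row scale the join's intermediate results get expensive.
```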