Semantic Analysis - Document Clustering Flashcards
Document Clustering
- Unsupervised ML
- No pre-defined categories; there are no labels
- Put similar documents together (e.g., recipes)
- Put documents into a given # of clusters
Classification
- Supervised ML
- Assign documents to a pre-defined set of categories (labels)
Methods of clustering
All about distance metrics.
- Centroid based: find where the centers (centroids) of the clusters are, then determine which centroid each document belongs to.
- Hierarchical based: repeatedly split the documents (break them into two groups, then break each of those into two groups, and so on). Usually gives better clusters, but slower.
K-means clustering
- Most popular algorithm in NLP for centroid clustering
- randomly pick k documents as the initial centroids
- use semantic similarity / document similarity as the distance measure
- assign each document to the cluster of the centroid it is closest to
- recompute each centroid as the exact center (mean) of its cluster
- documents get reassigned to the nearest new centroid
- keep going until no centroid needs to be moved and no documents are reassigned (the loop is sketched in code below)
- k-means is simple, easy to understand, fast compared to other clustering algorithms, and has broad applications
- to scale to massive datasets (think Google), index the centroids: for every centroid, query the nearest points. Indexing takes more time up front, but access is very fast (~50x improvement in speed)
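A minimal sketch of the loop above using scikit-learn (the toy documents are made up; TF-IDF vectors stand in for whatever semantic representation you use):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "bake the cake at 350 degrees",        # toy documents, two obvious topics
    "whisk eggs and sugar into the batter",
    "the court ruled on the appeal",
    "the judge dismissed the case",
]

X = TfidfVectorizer().fit_transform(docs)   # documents -> TF-IDF vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                           # cluster id per document, e.g. [0 0 1 1]
```

Note that scikit-learn's KMeans uses Euclidean distance; on length-normalized TF-IDF vectors (the TfidfVectorizer default) this behaves much like cosine similarity.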
Problems
- Need to choose the value of k ahead of time (typically just a guess)
- Weakness: starts from random seeds, so unlucky seeding can produce poor clusters (mitigations sketched below)
- Final cluster config depends on the initial centroid allocation
- Convergence: can get stuck in a local optimum
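The seeding and choose-k problems have standard workarounds; a hedged sketch (init="k-means++" and n_init are real KMeans options, the data is a stand-in):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # stand-in for a document-vector matrix

# init="k-means++" spreads the initial centroids apart; n_init=10 runs ten
# random seedings and keeps the best run (lowest inertia).
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)

# Crude "elbow" heuristic for guessing k: print inertia for each k and
# look for where the curve flattens out.
for k in range(2, 8):
    print(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
```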
Hierarchical Clustering
Agnes (agglomerative nesting) - start with every document as its own cluster, merge the most similar clusters together, and keep merging until all documents are in one super cluster. Ward's minimum variance is a way to choose which clusters to merge: pick the merge whose result creates the least variance among the documents within the cluster.
Diana (divisive analysis) - start with the super cluster, then split, then split again, etc. Within one cluster, find the farthest outlier (the doc least like the others) and break it off into a new cluster, then move the documents most similar to it into that cluster. Repeat; keep dividing. Fully done when there is one doc per cluster, but you can stop sooner: stop when continuing to divide would make it difficult to tell the difference between clusters.
- We can cut the dendrogram wherever we want.
- Don't have to decide ahead of time how many clusters, because you can cut at different points in the tree (AGNES with Ward's linkage is sketched below)
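A minimal AGNES-style sketch with SciPy, using Ward's minimum variance linkage (the data is a stand-in for document vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))      # stand-in for 12 document vectors

Z = linkage(X, method="ward")     # merge clusters by Ward's minimum variance
# "Cutting the dendrogram" at different points gives different cluster
# counts from the same tree:
print(fcluster(Z, t=3, criterion="maxclust"))   # 3 clusters
print(fcluster(Z, t=6, criterion="maxclust"))   # 6 clusters
```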
Principal Component Analysis
- Dimensionality reduction
- Gives a 2D plot even if there are 30,000+ dimensions (sketched below)
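A minimal sketch of the reduction (plain PCA on a dense stand-in; for sparse TF-IDF matrices, scikit-learn's TruncatedSVD plays the same role):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30000))   # stand-in: 100 docs in a 30,000-dim space

coords = PCA(n_components=2).fit_transform(X)   # project down to 2 dimensions
print(coords.shape)                             # (100, 2), ready to plot
```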
Labelling clusters
- Take the top-n features by TF-IDF value from the centroid, or from the average over all docs in a particular cluster
- List the top 3 or 4 features to get a quick label (sketched below)
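A minimal sketch of that labelling recipe: cluster toy documents, then read off the top TF-IDF features at each centroid:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["bake the cake batter", "whisk sugar into the batter",
        "the judge ruled on the case", "the court dismissed the appeal"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = np.array(vec.get_feature_names_out())
for i, centroid in enumerate(km.cluster_centers_):
    top = terms[np.argsort(centroid)[::-1][:3]]   # top 3 features at the centroid
    print(f"cluster {i}: {' / '.join(top)}")      # quick human-readable label
```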
Cluster Cohesion
Measures how closely related objects are in a cluster
Cluster Separation
Measures how distinct or well-separated a cluster is from other clusters. Clusters should be as far apart as possible. If they aren’t far apart, then they should be one cluster.
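One standard internal measure that combines cohesion and separation is the silhouette coefficient: near +1 means tight, well-separated clusters; near 0 means overlapping clusters that should probably be merged. A sketch on stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)),    # two well-separated blobs
               rng.normal(8, 1, (50, 10))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))            # close to 1 for this data
```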
F measure
Harmonic mean between precision and recall
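In symbols, for precision P and recall R (the standard definition): F = 2PR / (P + R). E.g., P = 0.5 and R = 1.0 give F ≈ 0.67; the harmonic mean is pulled toward the weaker of the two.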
Evaluating Cluster Quality
Internal measures
- Cluster Cohesion
- Cluster Separation
External measures
- Entropy
- F-measure
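A minimal sketch of the entropy measure (external, so it assumes gold labels are available): size-weighted average entropy of the label distribution inside each cluster, where lower is better:

```python
import numpy as np

def clustering_entropy(clusters, labels):
    """Size-weighted entropy of gold labels within each cluster (0 = pure)."""
    total, h = len(labels), 0.0
    for c in set(clusters):
        member = [l for cl, l in zip(clusters, labels) if cl == c]
        _, counts = np.unique(member, return_counts=True)
        p = counts / counts.sum()
        h += (len(member) / total) * -(p * np.log2(p)).sum()
    return h

print(clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 0.0 (pure)
print(clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"]))  # 1.0 (mixed)
```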
Entity Resolution
Identifying and linking/grouping different manifestations of the same real-world object, e.g., mentions of the same person in text (can't go just by the name), web pages of the same business, etc.
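A toy sketch of the grouping idea using plain string similarity (difflib is in the standard library; the mentions and the 0.6 threshold are made up, and real entity resolution uses far richer features than the name alone):

```python
from difflib import SequenceMatcher

mentions = ["Jon Smith", "John Smith", "J. Smith", "Acme Corp", "ACME Corporation"]

# Greedy grouping: attach each mention to the first group whose representative
# is similar enough, otherwise start a new group.
groups = []
for m in mentions:
    for g in groups:
        if SequenceMatcher(None, m.lower(), g[0].lower()).ratio() > 0.6:
            g.append(m)
            break
    else:
        groups.append([m])

print(groups)   # [['Jon Smith', 'John Smith', 'J. Smith'], ['Acme Corp', 'ACME Corporation']]
```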
Clustering on graphs
Querying over a graph (traversing the graph) is faster than joining many tables, especially when talking about a trillion records (illustrated below).
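A hedged illustration of why traversal wins: with an adjacency dict, friends-of-friends is two hops of dictionary lookups per node, while the relational equivalent self-joins the whole edge table (the data here is made up):

```python
# Adjacency-dict view of a graph: each hop is a constant-time lookup,
# independent of the total number of edges stored.
edges = {"ann": ["bob"], "bob": ["cai", "dee"], "cai": [], "dee": []}

def friends_of_friends(node):
    return {fof for f in edges.get(node, []) for fof in edges.get(f, [])}

print(friends_of_friends("ann"))   # {'cai', 'dee'}

# Relational equivalent: a self-join over the edge table, e.g.
#   SELECT e2.dst FROM edges e1 JOIN edges e2 ON e1.dst = e2.src
#   WHERE e1.src = 'ann';
# At trillion-row scale the join's intermediate results get expensive.
```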