Unsupervised Flashcards
NLP - steps
- Normalize - make sure all the words follow the same standard
- Tokenize - words we extract from the document by splitting it, e.g. using punctuation as separators; we can also consider sentences as tokens
- Lowercase
- Filter stopwords (this may not yield better results; try removing and keeping them to see which gives better results)
- Stemming and lemmatization - finding the roots of words. Stemming uses an algorithm; lemmatization uses dictionaries
- N-grams - look at groups of words together; N = number of words in the group
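A minimal sketch of these steps using NLTK (the sample sentence is made up, and the downloaded resource names assume a typical NLTK install; newer versions may also need "punkt_tab"):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stopword lists
nltk.download("wordnet")    # lemmatizer dictionary

text = "The cats were running faster than the dogs."

tokens = word_tokenize(text)                         # split on words/punctuation
tokens = [t.lower() for t in tokens if t.isalpha()]  # lowercase, drop punctuation

stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]        # filter stopwords (optional)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])             # algorithmic roots, e.g. "run"
print([lemmatizer.lemmatize(t) for t in tokens])     # dictionary-based roots
```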
NLP - indexing bag of words into a vector table
try to see which gives you a better result:
- term frequency - look at a document and count how many times a key word is repeated
- from sklearn.feature_extraction.text import CountVectorizer
- document frequency - in how many documents does the key word appear
- TFIDF - the higher the TFIDF, the more important the word. Looks at the term frequency in each document and how often the term appears across the corpus; key words that appear in every document are less important.
- from sklearn.feature_extraction.text import TfidfVectorizer
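A rough comparison of the two vectorizers on a toy corpus (the documents are invented; `get_feature_names_out` assumes a reasonably recent scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

count_vec = CountVectorizer()               # raw term frequencies
counts = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())
print(counts.toarray())

tfidf_vec = TfidfVectorizer()               # term frequency weighted by inverse document frequency
tfidf = tfidf_vec.fit_transform(corpus)
print(tfidf.toarray().round(2))             # words appearing in every document get lower weight
```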
NLP vocabs
corpus - e.g. the whole of Wikipedia (the whole database)
document - each article (a single sample)
Naive Bayes
Useful when the feature space has more dimensions than we have data points. Instead of calculating frequencies of keywords directly, it calculates the probability of keywords given a class (e.g. keywords in spam emails).
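A hedged sketch of Naive Bayes on bag-of-words features; the tiny spam/ham examples and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win money now claim your prize",
    "meeting rescheduled to friday",
    "cheap prize win now",
    "project report attached for review",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(emails)

clf = MultinomialNB()          # models P(keyword | class) rather than raw keyword counts
clf.fit(X, labels)

test = vec.transform(["claim your cheap prize"])
print(clf.predict(test), clf.predict_proba(test))
```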
Clustering - k means
Calculate the distances from the center of a cluster to all the points in that cluster; the smaller the overall distance, the tighter the cluster. The questions are: where is the center of each cluster, and how many clusters do we use?
- choose the number of clusters
- randomly pick the centers of these clusters
- assign each data point to the center it is closest to
- then move each centroid to the average (mean) position of the points in its cluster
- repeat until the centroids stop changing
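A minimal k-means sketch on synthetic 2-D data (the data, cluster count, and random seed are illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)   # iterates assign-points / move-centroids until convergence

print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # total within-cluster sum of squared distances
```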
Clustering - k means++
Similar to k-means but with an added weighting when seeding: the first centroid is picked at random, and each subsequent centroid is still picked randomly, but points farther away from the centroids already chosen are more likely to be picked.
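In scikit-learn this seeding is selected via the `init` argument (and is the default), as in this short sketch on the same kind of synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km_pp.inertia_)
```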
clustering - how do we pick how many clusters?
The elbow method can be used as a rule of thumb: graph the total within-cluster sum of squares vs. the number of clusters and pick the point where we start to get diminishing returns (the "elbow").
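An elbow-method sketch, assuming synthetic data and matplotlib for the plot; the range of k values is arbitrary:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("total within-cluster sum of squares")
plt.show()   # look for the bend where improvements start to flatten out
```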
silhouette scores
A measure of how confidently a point is assigned to its cluster: compare the point's distance to its current cluster with its distance to the next-nearest cluster. Choose the number of clusters with the highest average silhouette score; the higher the average silhouette score, the tighter and more separated the clusters.
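A silhouette-score sketch (synthetic data; silhouette needs at least two clusters, so k starts at 2):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher = tighter, better-separated clusters
```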
hierarchical clustering
Calculate the distances between all data points and group those that are closest together, then repeat, merging the nearest points/clusters so the number of clusters shrinks step by step. Pro: we can graph the results (as a dendrogram) and control how many clusters we want.
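A hierarchical (agglomerative) clustering sketch using SciPy for the merge tree and scikit-learn for the flat labels; data and linkage method are illustrative assumptions:

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="ward")   # records the sequence of nearest-cluster merges
# dendrogram(Z)                 # plot the merge tree to decide where to cut

labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(labels)
```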
k means pros and cons
cons:
- when clusters have unequal sizes or unequal densities, k-means does not draw good boundaries between them and sometimes splits a cluster arbitrarily
- in the case of non-linear cluster shapes, k-means may split the clusters along linear boundaries (see the sketch below)
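A rough illustration of the non-linear case using the synthetic two-moons dataset; the larger of the two printed numbers is the best-case agreement with the true moons, and it stays well below 1.0 because k-means boundaries are effectively linear:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# cluster ids may be flipped relative to y_true, so print agreement both ways
print((labels == y_true).mean(), (labels != y_true).mean())
```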