Unsupervised Flashcards
NLP - steps
- Normalize - make sure all the words follow the same standard
- Tokenize - words we extract from the document by splitting it, e.g. using punctuation as separators; we can also consider sentences as tokens
- Lowercase
- Filter stopwords (this may not yield better results; try removing and keeping them to see which gives better results)
- Stemming and lemmatization - finding the roots of words. Stemming uses an algorithm; lemmatization uses dictionaries
- N-grams - look at groups of words together; N = number of words in the group
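A minimal sketch of these steps using NLTK (the sample sentence is made up, and the downloaded resource names assume a typical NLTK install; newer versions may also need "punkt_tab"):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stopword lists
nltk.download("wordnet")    # lemmatizer dictionary

text = "The cats were running faster than the dogs."

tokens = word_tokenize(text)                         # split on words/punctuation
tokens = [t.lower() for t in tokens if t.isalpha()]  # lowercase, drop punctuation

stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]        # filter stopwords (optional)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])             # algorithmic roots, e.g. "run"
print([lemmatizer.lemmatize(t) for t in tokens])     # dictionary-based roots
```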
NLP - indexing bag of words into a vector table
try to see which gives you a better result:
- term frequency - look at a document and count how many times a key word is repeated
- from sklearn.feature_extraction.text import CountVectorizer
- document frequency - in how many documents does the key word appear
- TFIDF - the higher the TFIDF, the more important the word. Looks at the term frequency in each document and how often the term appears across the corpus; key words that appear in every document are less important.
- from sklearn.feature_extraction.text import TfidfVectorizer
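A rough comparison of the two vectorizers on a toy corpus (the documents are invented; `get_feature_names_out` assumes a reasonably recent scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

count_vec = CountVectorizer()               # raw term frequencies
counts = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())
print(counts.toarray())

tfidf_vec = TfidfVectorizer()               # term frequency weighted by inverse document frequency
tfidf = tfidf_vec.fit_transform(corpus)
print(tfidf.toarray().round(2))             # words appearing in every document get lower weight
```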
NLP vocabs
corpus - e.g. the whole of Wikipedia (the whole database)
document - each article (a single sample)
Naive Bayes
Useful when the feature space has more dimensions than we have data points. Instead of calculating frequencies of keywords directly, it calculates the probability of keywords given a class (e.g. keywords in spam emails).
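A hedged sketch of Naive Bayes on bag-of-words features; the tiny spam/ham examples and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win money now claim your prize",
    "meeting rescheduled to friday",
    "cheap prize win now",
    "project report attached for review",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(emails)

clf = MultinomialNB()          # models P(keyword | class) rather than raw keyword counts
clf.fit(X, labels)

test = vec.transform(["claim your cheap prize"])
print(clf.predict(test), clf.predict_proba(test))
```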
Clustering - k means
Calculate the distances from the center of a cluster to all the points in that cluster; the smaller the overall distance, the tighter the cluster. The questions are: where is the center of each cluster, and how many clusters do we use?
- choose the number of clusters
- randomly pick the centers of these clusters
- assign each data point to the center it is closest to
- then move each centroid to the average (mean) position of the points in its cluster
- repeat until the centroids stop changing
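A minimal k-means sketch on synthetic 2-D data (the data, cluster count, and random seed are illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)   # iterates assign-points / move-centroids until convergence

print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # total within-cluster sum of squared distances
```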
Clustering - k means++
Similar to k-means but with an added weighting when seeding: the first centroid is picked at random, and each subsequent centroid is still picked randomly, but points farther away from the centroids already chosen are more likely to be picked.
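In scikit-learn this seeding is selected via the `init` argument (and is the default), as in this short sketch on the same kind of synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km_pp.inertia_)
```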
clustering - how do we pick how many clusters?
The elbow method can be used as a rule of thumb: graph the total within-cluster sum of squares vs. the number of clusters and pick the point where we start to get diminishing returns (the "elbow").
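An elbow-method sketch, assuming synthetic data and matplotlib for the plot; the range of k values is arbitrary:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("total within-cluster sum of squares")
plt.show()   # look for the bend where improvements start to flatten out
```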
silhouette scores
A measure of how confidently a point is assigned to its cluster: compare the point's distance to its current cluster with its distance to the next-nearest cluster. Choose the number of clusters with the highest average silhouette score; the higher the average silhouette score, the tighter and more separated the clusters.
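A silhouette-score sketch (synthetic data; silhouette needs at least two clusters, so k starts at 2):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher = tighter, better-separated clusters
```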
hierarchical clustering
Calculate the distances between all data points and group those that are closest together, then repeat, merging the nearest points/clusters so the number of clusters shrinks step by step. Pro: we can graph the results (as a dendrogram) and control how many clusters we want.
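A hierarchical (agglomerative) clustering sketch using SciPy for the merge tree and scikit-learn for the flat labels; data and linkage method are illustrative assumptions:

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="ward")   # records the sequence of nearest-cluster merges
# dendrogram(Z)                 # plot the merge tree to decide where to cut

labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(labels)
```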
k means pros and cons
cons:
- when clusters have unequal sizes or unequal densities, k-means does not draw good boundaries between them and sometimes splits a cluster arbitrarily
- in the case of non-linear cluster shapes, k-means may split the clusters along linear boundaries (see the sketch below)
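A rough illustration of the non-linear case using the synthetic two-moons dataset; the larger of the two printed numbers is the best-case agreement with the true moons, and it stays well below 1.0 because k-means boundaries are effectively linear:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# cluster ids may be flipped relative to y_true, so print agreement both ways
print((labels == y_true).mean(), (labels != y_true).mean())
```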