Unsupervised Flashcards
How to evaluate label predictions vs true labels using pandas?
pd.crosstab(df['preds'], df['true'])
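A minimal sketch of the cross-tabulation, assuming a hypothetical toy DataFrame whose column names 'preds' and 'true' match the card. The two columns are passed as separate arguments:

```python
import pandas as pd

# Hypothetical toy data: predicted cluster labels vs. known true labels.
df = pd.DataFrame({
    "preds": [0, 0, 1, 1, 2, 2],
    "true": ["a", "a", "b", "b", "c", "a"],
})

# Rows are predicted labels, columns are true labels, cells are counts.
ct = pd.crosstab(df["preds"], df["true"])
print(ct)
```

A good clustering puts most of each row's count in a single column.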
What is inertia?
Inertia is the sum of squared distances from each sample to its nearest centroid. Lower inertia means tighter clusters.
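To make the definition concrete, here is a small NumPy sketch that computes inertia by hand. The points and centroids are made-up values chosen so the result is easy to check:

```python
import numpy as np

# Made-up points and centroids (not taken from any real dataset).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared distance from every point to every centroid, shape (4, 2).
sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Inertia: for each point keep the nearest centroid, then sum.
inertia = sq_dists.min(axis=1).sum()
print(inertia)  # 4.0 (each point is exactly 1 unit from its centroid)
```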
How to choose best number of clusters?
Plot inertia against the number of clusters and pick the elbow point, where inertia stops dropping quickly.
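A rough sketch of the elbow search with scikit-learn's KMeans, assuming synthetic data with three well-separated blobs (the data and the range of k are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for several values of k and record the inertia of each fit.
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    inertias.append(model.inertia_)

for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 1))
```

On data like this, inertia falls sharply up to k=3 and only slightly afterwards, so the elbow sits at 3.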
What is the problem with feature variance for kmeans?
In k-means, a feature's variance determines its influence on the clustering, so features on different scales need to be scaled first.
What does StandardScaler do?
It standardizes features by removing the mean and scaling to unit variance.
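A short sketch of the effect, assuming a made-up matrix whose two columns sit on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second feature is 1000x larger than the first.
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])

# Each column is shifted to mean 0 and scaled to variance 1.
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

After scaling, both columns carry the same weight in any distance-based method such as k-means.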
What does Normalizer do?
It rescales each sample (row) independently of the others so that it has unit norm.
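For contrast with StandardScaler, a small sketch on made-up rows. Normalizer works row by row and, with its default L2 norm, gives every row length 1:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up samples; each row is rescaled on its own.
X = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 5.0]])

normalized = Normalizer().fit_transform(X)
print(normalized[0])  # [0.6 0.8], since the first row has length 5
print(np.linalg.norm(normalized, axis=1))
```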
What is the linkage method?
It defines how the distance between clusters is measured.
What is the difference between single and complete linkages?
In complete linkage, the distance between clusters is the distance between the furthest points of the clusters. In single linkage, the distance between clusters is the distance between the closest points of the clusters.
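The two definitions can be checked by hand on a pair of tiny, made-up clusters. This sketch uses SciPy's cdist to get all pairwise distances between them:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters on a line, for easy mental arithmetic.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [6.0, 0.0]])

# All distances between a point of A and a point of B.
pairwise = cdist(cluster_a, cluster_b)

single = pairwise.min()    # closest pair: (1, 0) and (3, 0) -> 2.0
complete = pairwise.max()  # furthest pair: (0, 0) and (6, 0) -> 6.0
print(single, complete)
```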
How to extract hierarchical cluster labels at given height?
Using fcluster() from scipy.cluster.hierarchy, with criterion='distance' and the height as the threshold t.
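A minimal sketch of cutting a dendrogram at a given height with SciPy. The four points and the height of 5 are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up data: two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# Build the hierarchy with complete linkage.
mergings = linkage(X, method="complete")

# Cut at height 5: merges above that distance are undone.
labels = fcluster(mergings, t=5, criterion="distance")
print(labels)
```

The pairs merge at height 1 and the final merge happens above height 10, so cutting at 5 leaves two clusters.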
What is t-SNE?
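t-distributed Stochastic Neighbor Embedding. It maps high-dimensional samples to 2 or 3 dimensions for visualization while approximately preserving which samples are near each other.
What are reasonable learning rate values for t-SNE?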
50 to 200
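A sketch of running t-SNE with scikit-learn using a learning rate inside that range. The random input data and the perplexity value are assumptions made only so the example runs quickly:

```python
import numpy as np
from sklearn.manifold import TSNE

# Random 5-dimensional data, used purely as a placeholder input.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))

# perplexity must be smaller than the number of samples.
model = TSNE(n_components=2, learning_rate=100, perplexity=10, random_state=0)

# TSNE has no separate transform(); fit and embed in one step.
embedded = model.fit_transform(X)
print(embedded.shape)
```

If the resulting scatter plot shows all points bunched together, trying another learning rate is a common first step.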
What does PCA do?
PCA de-correlates the data by shifting it to zero mean and rotating it onto its principal axes. For dimension reduction, it then discards the low-variance PCA features (assumed to be noise) and keeps the high-variance ones (assumed to be informative).
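A small sketch of both effects with scikit-learn's PCA, assuming synthetic data in which two features are strongly correlated and offset from zero:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second feature is roughly 2x the first, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)]) + 5.0

pca = PCA()
transformed = pca.fit_transform(X)

# The PCA features have zero mean and are uncorrelated with each other.
corr = np.corrcoef(transformed, rowvar=False)[0, 1]
print(transformed.mean(axis=0), corr)

# Share of the total variance captured by each PCA feature.
print(pca.explained_variance_ratio_)
```

Here nearly all the variance lands in the first PCA feature, so PCA(n_components=1) would keep most of the information.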
What is NMF?
Non-negative matrix factorization. It can only be applied when all values are >= 0.
How can NMF be used for text classification?
NMF features are topics and documents are combinations of topics.
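A rough sketch of this idea with scikit-learn, assuming four made-up documents and two topics. The documents, the number of topics, and the init setting are all illustrative choices:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents covering two loose themes.
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks and bonds",
]

# Word-frequency matrix: one row per document, all values >= 0.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

model = NMF(n_components=2, init="nndsvd", max_iter=500, random_state=0)

# Rows: documents expressed as non-negative combinations of topics.
doc_topics = model.fit_transform(X)

# model.components_ holds the topics: one row per topic, one column per word.
index_to_word = {i: w for w, i in tfidf.vocabulary_.items()}
for topic in model.components_:
    top = topic.argsort()[::-1][:3]
    print([index_to_word[i] for i in top])
```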
What is the difference between an outer and inner join?
An outer join is the union of indices while an inner join is the intersection of indices.
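A short pandas sketch of the difference, assuming two made-up DataFrames whose indices only partly overlap:

```python
import pandas as pd

# Made-up frames: indices "b" and "c" appear in both.
left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"y": [10, 20, 30]}, index=["b", "c", "d"])

# Outer join: every index from either frame, with NaN where data is missing.
outer = left.join(right, how="outer")

# Inner join: only the indices present in both frames.
inner = left.join(right, how="inner")

print(outer)
print(inner)
```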