L11 - Unsupervised Learning Flashcards
What is the goal of unsupervised learning?
- To identify patterns in unlabelled data.
Give some examples of objectives that can be achieved with unsupervised learning…
- Identifying new animal species, segmenting customers, detecting fraudulent activity.
Unsupervised learning is used for clustering tasks. Explain how this is done…
- Compute a distance (or similarity) between pairs of data points using a chosen metric, then group points that lie close to one another into the same cluster.
Unsupervised learning is used for community detection. Explain what this is and how it’s done…
- Community detection finds communities in a graph, where a community is a group of densely interconnected nodes. Nodes that share more connections have a higher connection strength. E.g. a community of school friends on Facebook will be strong because of the many mutual friendships.
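To make "connection strength" concrete, here is a minimal sketch that counts mutual friends between two people in a tiny friendship graph. The names, the `friends` dictionary and the `mutual_friends` helper are made up for illustration, and counting shared neighbours is only one simple proxy for strength; real community detection algorithms go further than this.

```python
# Hypothetical friendship graph: each person maps to the set of their friends.
friends = {
    "ana": {"ben", "cat", "dan"},
    "ben": {"ana", "cat", "dan"},
    "cat": {"ana", "ben", "dan"},
    "dan": {"ana", "ben", "cat", "eve"},
    "eve": {"dan"},
}

def mutual_friends(graph, a, b):
    """Number of friends that a and b have in common."""
    return len(graph[a] & graph[b])

strong = mutual_friends(friends, "ana", "ben")  # both know cat and dan
weak = mutual_friends(friends, "dan", "eve")    # no friends in common
```

Here ana, ben, cat and dan form a tight group, while eve is attached to it through a single friendship.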
Unsupervised learning is used for topic modelling. Explain what this is and how it’s done…
- Topic modelling identifies topics and common themes in a data set of text. This can be done through methods such as word embeddings, usually after reducing words to their lemmas or stems.
Give some examples of clustering algorithms…
K-means -> Assigns each point to the nearest of K centroids, where K is a hyperparameter given by the user.
DBSCAN -> Density-Based Spatial Clustering of Applications with Noise. Finds high-density regions and creates clusters by expanding outwards from them.
Hierarchical Clustering -> Builds a hierarchy of clusters, either by repeatedly dividing clusters into sub-clusters (divisive) or by repeatedly merging the closest clusters (agglomerative).
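As an illustration of the K-means idea, the sketch below alternates between assigning points to their nearest centroid and moving each centroid to the mean of its points. The `k_means` function and the sample points are illustrative, and starting from the first K points is a simplifying assumption; practical implementations usually pick starting centroids more carefully.

```python
import math

def k_means(points, k, iterations=10):
    """Toy K-means on tuples of numbers; returns (centroids, labels)."""
    # Assumption for this sketch: use the first k points as starting centroids.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
centroids, labels = k_means(points, k=2)
# labels -> [0, 1, 0, 1, 0, 1]: the points near (1, 1) and the points near (8, 8)
```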
What are the 2 types of clustering algorithms? Define each…
Hard Clustering -> Each data point belongs to exactly 1 cluster. Used when we want to make a definite decision on the data, i.e. a data point can’t belong to multiple clusters. E.g. a data point is in either A, B or C.
Soft Clustering -> A data point can be assigned to multiple clusters, typically with a degree of membership or probability for each.
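The difference between the two can be sketched for a single point and a set of fixed centroids. The function names are illustrative, and turning distances into weights with `exp(-distance)` is just one assumed choice for this sketch; methods such as Gaussian mixture models or fuzzy c-means compute memberships in their own ways.

```python
import math

def hard_assign(point, centroids):
    """Hard clustering: return the index of the single nearest centroid."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def soft_assign(point, centroids):
    """Soft clustering: return one membership weight per centroid, summing to 1."""
    # Assumed weighting for this sketch: closer centroids get larger weights.
    weights = [math.exp(-math.dist(point, c)) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]

centroids = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)

hard = hard_assign(point, centroids)  # a single cluster index
soft = soft_assign(point, centroids)  # a weight for every cluster
```

The point sits closer to the first centroid, so the hard assignment picks cluster 0, while the soft assignment gives cluster 0 the larger share and cluster 1 a small but non-zero one.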
What is a common similarity / distance metric used for clustering?
- Euclidean distance ( L2 norm )
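For reference, Euclidean distance is the square root of the sum of squared coordinate differences. A minimal sketch follows; the function name is illustrative, and Python's standard library offers the same calculation as `math.dist`.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d = euclidean_distance((0, 0), (3, 4))  # 5.0, the classic 3-4-5 triangle
```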
When do we use Jaccard Similarity? How is it calculated?
We use Jaccard Similarity when we want to measure the similarity between 2 sets. It’s calculated as the number of elements in the intersection of the sets divided by the number of elements in their union: J(A, B) = |A ∩ B| / |A ∪ B|.
The Jaccard Distance = 1 - Jaccard Similarity.
How do we calculate Jaccard Distance?
- Jaccard Distance = 1 - Jaccard Similarity.
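Both Jaccard quantities can be sketched with Python sets. The function names and the example sets are illustrative, and treating two empty sets as fully similar is a convention assumed here, since the formula is undefined when the union is empty.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Jaccard Distance = 1 - Jaccard Similarity."""
    return 1.0 - jaccard_similarity(a, b)

x = {"apple", "banana", "cherry"}
y = {"banana", "cherry", "date", "elderberry"}

sim = jaccard_similarity(x, y)  # 2 shared elements / 5 distinct elements = 0.4
dist = jaccard_distance(x, y)   # 1 - 0.4, approximately 0.6
```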