Unsupervised Machine Learning Flashcards

1
Q

Unlabeled Data

A

any data that’s not organized in an easily identifiable manner is known as unstructured/unlabeled data

2
Q

Goals of Unsupervised Learning

A

Goal is to learn about data’s underlying structure and find out how different features relate to each other.

3
Q

Name 2 Methodologies of unsupervised learning

A
  1. Recommendation Systems
  2. K-means models
4
Q

Briefly describe a Recommendation system

A

Recommendation systems are a subclass of machine learning algorithms that:
- can be either supervised or unsupervised
- offer relevant suggestions to users

5
Q

What is the goal of a recommendation system?

A

to quantify how similar one thing is to another, and use this information to suggest a closely related option.

6
Q

What is content-based filtering?

A

Content-based filtering is a type of recommendation system where comparisons are made based on the attributes of the content itself.

For example, attributes of a song you played are compared to attributes of other songs to determine similarity.
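
As a minimal sketch of this idea (the song names and attribute values below are made up for illustration), similarity between attribute vectors can be measured with cosine similarity:

```python
import numpy as np

# Hypothetical song attribute vectors: [tempo, energy, acousticness]
# (names and numbers are illustrative, not from a real catalog)
songs = {
    "song_a": np.array([0.8, 0.9, 0.1]),
    "song_b": np.array([0.7, 0.8, 0.2]),
    "song_c": np.array([0.2, 0.1, 0.9]),
}

def cosine_similarity(u, v):
    # cosine of the angle between two attribute vectors (1 = same direction)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

played = "song_a"  # the song the user just played
scores = {name: cosine_similarity(songs[played], vec)
          for name, vec in songs.items() if name != played}
best = max(scores, key=scores.get)
print(best)  # song_b: its attributes are closest to song_a's
```

The song with the highest similarity score is the one recommended.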

7
Q

What are some benefits of content-based filtering?

A
  • The benefits include being easy to understand, recommending more of what a user likes,
  • not needing other users’ information to work, and
  • being able to map users and items in the same space to recommend things that are closest to a user’s typical preferences.
8
Q

What are some drawbacks of content-based filtering?

A
  1. Always recommends more of the same
  2. Requires manual input of attributes
  3. Cannot recommend across content types
  4. Limited use cases
9
Q

What is collaborative filtering?

A

Collaborative filtering is a type of recommendation system that uses the likes and dislikes of users to make recommendations.

It does not need to know anything about the content itself. All that matters is if the user liked it.
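
A toy sketch of the idea, assuming a small made-up likes matrix (1 = liked, 0 = not liked); note that no content attributes appear anywhere:

```python
import numpy as np

# Made-up user-item matrix: rows = users, columns = items (1 = liked)
R = np.array([
    [1, 1, 0, 0],  # user 0: the user we recommend for
    [1, 1, 1, 0],  # user 1: similar taste to user 0
    [0, 0, 1, 1],  # user 2: different taste
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0
others = [u for u in range(len(R)) if u != target]
# Find the user whose likes/dislikes look most like the target's
neighbor = others[int(np.argmax([cosine(R[target], R[u]) for u in others]))]

# Recommend items the neighbor liked that the target hasn't liked yet
recs = np.where((R[neighbor] == 1) & (R[target] == 0))[0]
print(neighbor, recs)  # neighbor is user 1; item 2 gets recommended
```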

10
Q

What are some benefits of collaborative filtering?

A

The benefits include the ability to
- recommend across content types,
- finding hidden correlations in the data, and
- not requiring tedious manual mapping.

11
Q

What are some drawbacks of collaborative filtering?

A

Drawbacks include
- needing lots of data to even start getting useful results,
- requiring every user to give the system lots of data, and
- dealing with sparse data that has a lot of missing values.

12
Q

What type of model is K-means and what does it do?

A
  • unsupervised learning model
  • partitioning algorithm
  • organizes unlabeled data into clusters
13
Q

What is a Centroid?

A
  • the central point of a cluster
  • computed as the mathematical mean of all the points in the cluster
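
A one-line illustration with NumPy (the points are made up): the centroid is simply the mean of the cluster’s points.

```python
import numpy as np

# Made-up points belonging to one cluster
cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroid = cluster_points.mean(axis=0)  # the mathematical mean of the points
print(centroid)  # [3. 4.]
```
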
14
Q

List the 4 steps to build a K-means model

A
  1. Initiate k centroids
  2. Assign all points to nearest centroid
  3. Recalculate the centroid of each cluster.
  4. Repeat Step 2 and 3 until the algorithm converges
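
The four steps can be sketched in a few lines of NumPy. This is a toy implementation on made-up blob data, with deterministic initialization for reproducibility rather than the usual random choice:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated made-up blobs of 20 points each
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

k = 2
# Step 1: initiate k centroids (fixed here for reproducibility;
# normally they would be chosen randomly)
centroids = np.array([X[0], X[20]])

for _ in range(100):
    # Step 2: assign every point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recalculate each centroid as the mean of its cluster
    new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    # Step 4: repeat until the centroids stop moving (convergence)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # one centroid near (0, 0), the other near (5, 5)
```
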
15
Q

What is the difference between Clustering and Partitioning Algorithms

A

Clustering algorithms: outlying points can exist outside of the clusters.

Partitioning algorithms: all points must be assigned to a cluster.

In other words, K-means, as a partitioning algorithm, does not allow unassigned outliers.

16
Q

What is k in Initiate k centroids step?

A

K = the number of centroids in your model, which is how many clusters you’ll have.

17
Q

Who decides the value of k?

A

you do — k is a hyperparameter chosen by the modeler

18
Q

How to choose k value?

A

Sometimes k is known in advance; for instance, if there are 3 species of beetle to cluster, then k = 3. Other times, k is unknown and must be estimated.

19
Q

Name 2 other clustering methodologies

A
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups data points together based on their density.
  • Agglomerative clustering: Creates a hierarchy of clusters by merging data points or clusters iteratively
20
Q

What cluster shape does K-means work best with?

A

round clusters

21
Q

DBSCAN (density-based spatial clustering of applications with noise)

A
  • searches your data space for continuous regions of high density.
  • because it finds clusters based on density, the shape of the cluster isn’t as important as it is for K-means.
22
Q

DBSCAN Hyperparameters

A

eps (epsilon) and min_samples

23
Q

DBSCAN: eps, Epsilon (ε)

A

The radius of your search area from any given point

24
Q

DBSCAN: min_samples

A

the number of samples in an ε-neighborhood for a point to be considered a core point (including itself)
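
Putting eps and min_samples together, here is a sketch using scikit-learn’s DBSCAN on made-up data (two dense blobs plus one far-away point that should come out as noise, label -1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense made-up blobs plus a single far-away outlier
X = np.vstack([
    rng.normal(0, 0.3, (25, 2)),
    rng.normal(4, 0.3, (25, 2)),
    [[10.0, 10.0]],
])

# eps: the search radius around each point;
# min_samples: how many points (including itself) must fall inside
# that radius for a point to count as a core point
db = DBSCAN(eps=1.0, min_samples=5).fit(X)

print(set(db.labels_))  # two clusters plus -1 for the noise point
```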

25
Q

Agglomerative clustering

A

works by first assigning every point to its own cluster, then progressively combining clusters based on intercluster distance.

26
Q

Agglomerative clustering requirement

A

you must specify either a desired number of clusters or a distance threshold (the linkage distance above which clusters will not be merged)

27
Q

Agglomerative clustering: Linkage

A

different ways to measure the distances that determine whether or not to merge the clusters.

28
Q

Common Linkages

A

Single: The minimum pairwise distance between clusters.

Complete: The maximum pairwise distance between clusters.

Average: The average pairwise distance between all points in the two clusters.

Ward: This is not a distance measurement. Instead, it merges the two clusters whose merging will result in the lowest inertia.

29
Q

When does Agglomerative clustering stop?

A
  1. You reach a specified number of clusters.
  2. You reach an intercluster distance threshold (clusters that are separated by more than this distance are too far from each other and will not be merged).
30
Q

Agglomerative clustering: Hyperaparameters

A

n_clusters: the number of clusters you want in your final model.

linkage: the linkage method to use to determine which clusters to merge.

affinity: the metric used to calculate the distance between clusters. Default = Euclidean distance.

distance_threshold: the distance above which clusters will not be merged.
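
A short usage sketch with scikit-learn’s AgglomerativeClustering on made-up blob data, showing both stopping rules (note that when distance_threshold is set, n_clusters must be None; the distance metric is left at its Euclidean default):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# Two well-separated made-up blobs of 15 points each
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(5, 0.3, (15, 2))])

# Stopping rule 1: merge until a fixed number of clusters remains
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)

# Stopping rule 2: merge until clusters are farther apart than a
# distance threshold (n_clusters must be None in this mode)
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0,
                                linkage="single").fit(X)

print(agg.n_clusters_, agg_t.n_clusters_)  # both stop at 2 clusters
```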

31
Q

Agglomerative clustering PROs

A

scales reasonably well, can detect clusters of various shapes.

32
Q

What is considered good clustering model?

A
  1. Clearly identifiable clusters: within each cluster (intracluster), the points are close to each other.
  2. Each cluster well separated from the others: between the clusters (intercluster), you want lots of empty space.
33
Q

K-means: metrics to evaluate good clusters

A
  1. Inertia
  2. Silhouette Score
34
Q

K-means: Inertia

A

Inertia is a metric used in K-Means clustering to measure the quality of the clustering.

It represents the sum of squared distances between each data point and its assigned cluster centroid

Lower Inertia, Better Clustering
The goal of K-Means is to minimize inertia.

A lower inertia indicates that the data points are more tightly clustered around their respective centroids, suggesting a better clustering solution.

35
Q

K-means: Silhouette Score

A
A more precise evaluation metric than inertia because it also takes into account the separation between clusters.

Silhouette score is defined as the mean of the silhouette coefficients of all the observations in the model.

Provides insight as to what the optimal value for K should be, and uses both intracluster and intercluster measurements in its calculation
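
A brief sketch of computing the silhouette score with scikit-learn on made-up, well-separated blobs (so the score should land near 1):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Two tight, well-separated made-up blobs
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Mean silhouette coefficient over all observations; it uses both
# intracluster and intercluster distances
score = silhouette_score(X, labels)
print(round(score, 2))  # close to 1 for tight, well-separated clusters
```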

36
Q

Inertia Score

A
  • lower = better (less distance between each observation and its nearest centroid).
  • 0 = a degenerate result (every point sits exactly on its centroid, e.g., when each point is its own cluster).
37
Q

Inertia Score PROs

A
  • helps us to decide on the optimal k value.
  • We do this by using the elbow method.
38
Q

Elbow Method

A

Plot of inertia vs. k values (1, 2, 3, etc.).

  • A good way of choosing an optimal k value is to find the elbow of the curve.
  • This is the value of k at which the decrease in inertia starts to level off.
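
The elbow method can be sketched as follows, fitting K-means for several k values on made-up data with three blobs (so the drop in inertia should level off around k = 3); plotting is omitted and the inertia values are simply printed:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Three made-up blobs, so the elbow should appear around k = 3
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0, 4, 8)])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to centroids

# Inertia always shrinks as k grows; the elbow is where the
# decrease starts to level off (here, between k = 3 and k = 4)
print([round(i, 1) for i in inertias])
```
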
39
Q

Explain Silhouette Scores (-1,0,1)

A

1 = optimal (an observation sits nicely within its own cluster and is well separated from other clusters).

0 = an observation is on the boundary between clusters.

-1 = an observation is likely in the wrong cluster.