Clustering Flashcards

1
Q

What is generalisation in machine learning?

A

Generalisation in machine learning refers to the ability of a trained model to make accurate predictions on unseen data, i.e. data that the model has not encountered during training

2
Q

Why is evaluating model performance on the training data problematic?

A

Evaluating model performance on the training data is problematic because it can lead to overfitting, where the model becomes too complex and adapts too well to the training data, resulting in poor performance on new, unseen data

3
Q

What is bias in machine learning?

A

Bias refers to the difference between the model’s predictions and the true values or measurements. A model with high bias tends to underfit the data, meaning it is not complex enough to capture patterns in the data

4
Q

What is variance in machine learning?

A

Variance refers to the variability or spread of the model’s predictions in contrast to the true values or measurements. A model with high variance tends to overfit the data, meaning it is too complex and captures noise in the training data

5
Q

How do bias and variance affect model performance?

A

Bias and variance affect model performance by creating a trade-off between underfitting and overfitting. Models with high bias tend to underfit the data and have poor performance on both the training and test data, while models with high variance tend to overfit the data and have excellent performance on the training data but poor performance on the test data. Therefore, it is essential to strike a balance between bias and variance to achieve optimal model performance

6
Q

What is model complexity in machine learning?

A

Model complexity refers to the level of sophistication or intricacy of the model in capturing the patterns or relationships in the data. A more complex model may have more parameters or features and represent more complex functions, while a simpler model has fewer parameters and features and represents simpler functions

7
Q

What are the typical percentages of training data and test data used for models?

A

80% training data and 20% test data

8
Q

What is unsupervised learning in machine learning?

A

Unsupervised learning is a type of machine learning where the data is unlabelled and untagged, and the goal is to find patterns or structure in the data without any guidance or supervision.

9
Q

What is dimensionality reduction in unsupervised learning?

A

Dimensionality reduction is a technique in unsupervised learning that reduces the number of features or variables in the data while preserving most of the relevant information. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are examples of dimensionality reduction techniques.
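
As a sketch of the idea, PCA can be done in a few lines of NumPy via the SVD (the data here is synthetic, invented purely for illustration):

```python
import numpy as np

# Synthetic data: 100 samples, 5 features, two of them near-copies of others
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)
X[:, 4] = X[:, 1] + 0.1 * rng.normal(size=100)

# PCA via SVD: centre the data, decompose, project onto the top-k components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T  # 100 samples, now only 2 features
print(X_reduced.shape)     # (100, 2)
```

Most of the variance survives in the two retained components because the dropped features were largely redundant.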

10
Q

What are autoencoders in unsupervised learning?

A

Autoencoders are neural networks used in unsupervised learning to learn compressed representations of the data by encoding the data into a lower-dimensional space and decoding it back to its original form. Autoencoders can be used for dimensionality reduction, data compression, and image denoising.

11
Q

What is clustering in unsupervised learning?

A

Clustering is a technique in unsupervised learning used to group a set of objects into clusters based on their similarities or dissimilarities. Exclusive or non-overlapping clustering techniques assign each object to only one cluster, while overlapping clustering techniques allow objects to belong to multiple clusters. Hierarchical and probabilistic clustering are examples of clustering techniques.

12
Q

What are association rules in unsupervised learning?

A

Association rules are used in unsupervised learning to discover interesting relationships or patterns in the data. They are used in market basket analysis to find correlations between items purchased together and recommend items to customers based on their purchase history.

13
Q

What is the notion of distance in machine learning?

A

The notion of distance in machine learning is used to measure the dissimilarity or similarity between objects based on their features or characteristics. The distance can be computed with various metrics, such as Euclidean distance, Manhattan distance, or cosine similarity.

14
Q

How do we use distance to differentiate objects?

A

We use distance to differentiate objects by measuring the differences in their features or characteristics. For instance, an orange and a lime are both round, but the lime is smaller, so we can measure the difference in their radii, in units such as centimeters.

15
Q

How can we represent a binary variable using distance?

A

We can represent a binary variable using distance by assigning a value of 0 if the attribute is absent and 1 if it is present; the distance between two objects is then 0 if they agree on the attribute and 1 if they differ. For instance, a pepper is not hollow, so we assign it a 0, while a bell pepper is hollow, so we assign it a 1.

16
Q

How can we represent a continuous variable using distance?

A

We can represent a continuous variable using distance by measuring the differences between the values of the variable for different objects. For instance, we can measure the volume of empty space inside a pepper and assign a distance based on the differences between the volumes for different peppers

17
Q

What are some examples of distance metrics used in machine learning?

A

Examples of distance metrics used in machine learning include Euclidean distance, Manhattan distance, cosine similarity, and Jaccard distance. These metrics measure the distance or dissimilarity between objects based on their features or characteristics.
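
These metrics are straightforward to compute by hand in NumPy (the vectors and sets below are invented for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences: 7.0
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity

# Jaccard distance compares sets, e.g. items in two shopping baskets
A, B = {"milk", "bread"}, {"milk", "eggs"}
jaccard = 1 - len(A & B) / len(A | B)      # 1 - 1/3
```

Note that cosine is a similarity (higher means more alike), while the others are distances (higher means more different).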

18
Q

What are vector operations in machine learning?

A

Vector operations in machine learning involve performing mathematical operations on vectors, which are arrays of numbers or values.

19
Q

What is the basis for conducting vector operations?

A

Vector operations are conducted on a component-by-component basis. This means that each component or element of the vectors is treated independently and the operations are performed on them separately.

20
Q

What is the purpose of performing vector operations?

A

The purpose of performing vector operations is to manipulate the data contained in the vectors to perform various mathematical or statistical analyses, such as calculating the mean, variance, or correlation between vectors.

21
Q

What are some examples of vector operations?

A

Examples of vector operations include vector addition, subtraction, multiplication, and division. Other operations include dot product, cross product, and projection.
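
In NumPy these operations look as follows; note that arithmetic operators act component by component, while `@` gives the dot product (example vectors invented for illustration):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(u + v)  # [5. 7. 9.]    component-wise addition
print(u - v)  # [-3. -3. -3.] component-wise subtraction
print(u * v)  # [ 4. 10. 18.] component-wise multiplication
print(u / v)  #               component-wise division
print(u @ v)  # 32.0          dot product (a single scalar)
print(np.cross(u, v))         # [-3.  6. -3.] cross product (3-D vectors only)
print((u @ v) / (v @ v) * v)  # projection of u onto v
```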

22
Q

How are vector operations used in machine learning?

A

Vector operations are used in machine learning for various purposes, such as data preprocessing, feature engineering, and model training. They are used to manipulate and transform the data to make it suitable for analysis and modeling. For instance, in deep learning, vector operations are used extensively to manipulate the weights and biases of the neural network.

23
Q

What is K-means clustering?

A

K-means clustering is a popular unsupervised learning algorithm used to group data points into clusters based on their similarities. It is a centroid-based algorithm that iteratively assigns each data point to a cluster based on its distance from the centroid of that cluster.
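
As a quick illustration, scikit-learn's KMeans recovers two synthetic, well-separated blobs (the data is made up for demonstration; assumes scikit-learn and NumPy are installed):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs of 50 points each, centred near (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centre near (0, 0), the other near (5, 5)
print(km.labels_[:5])       # cluster assignment of the first five points
```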

24
Q

What is the central concept of clustering?

A

The central concept of clustering is that objects in a cluster must be similar to each other based on some predefined similarity metric or distance measure.

25
Q

How does K-means clustering work?

A

K-means clustering works by randomly assigning each data point to a cluster and then iteratively updating the centroids of the clusters until convergence is reached. The algorithm minimizes the sum of the distances of each data point to the centroid of its assigned cluster. This process is repeated until the centroids no longer change or a predetermined number of iterations is reached.

26
Q

What is the objective of K-means clustering?

A

The objective of K-means clustering is to partition the data into K clusters such that the sum of the distances of each data point to the centroid of its assigned cluster is minimized. The value of K is chosen by the user and represents the number of clusters desired.

27
Q

What are some advantages of K-means clustering?

A

Some advantages of K-means clustering include its simplicity and efficiency, making it easy to implement and scalable to large datasets. It is also widely used in various fields such as marketing, biology, and computer science for data analysis and pattern recognition.

28
Q

What is the purpose of scaling/normalizing the data in this algorithm?

A

Scaling/normalizing the data is important in this algorithm to ensure that all features are on the same scale and that no single feature dominates the clustering process. This helps to prevent bias towards certain features and allows for a more accurate and unbiased clustering.

29
Q

How is the value of K chosen in this algorithm?

A

The value of K is chosen based on the number of clusters desired or the number of distinct groups in the data. In this case, the value of K was chosen as 2 based on visual inspection of the data.

30
Q

How are centroids selected in this algorithm?

A

Centroids are selected at random from the dataset if this is the first iteration of the algorithm. In subsequent iterations, the centroids are updated based on the mean of the data points assigned to each cluster.

31
Q

What is the purpose of associating each point to the nearest centroid?

A

The purpose of associating each point to the nearest centroid is to assign each data point to its closest cluster and to form initial clusters based on the location of the centroids.

32
Q

How are centroids updated in this algorithm?

A

Centroids are updated by taking the mean (vector average) of the data points assigned to each cluster. This generates a new proposition for each centroid, which is used in the next iteration of the algorithm.

33
Q

What are the steps for this algorithm? (Check Notes)

A

Choose the number of clusters, K, to create.
Initialize K cluster centroids (center points) either randomly or based on some heuristics.
Assign each data point to the nearest centroid. The most common distance metric used is Euclidean distance.
Update each centroid by calculating the mean of all data points assigned to it.
Repeat steps 3-4 until convergence (i.e. centroids no longer move or a maximum number of iterations is reached).
Optionally, evaluate the quality of the clustering using a clustering metric such as silhouette score or inertia.
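
The steps above can be sketched directly in NumPy. This is a minimal illustration on invented data, with no empty-cluster handling, not a production implementation:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialise centroids as k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Demo on two synthetic, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```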

34
Q

What is the Confusion matrix based Fowlkes-Mallows score (FMI)?

A

The Confusion matrix based Fowlkes-Mallows score (FMI) is a performance evaluation metric defined as the geometric mean of the pairwise precision and recall.
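
This score is available as `sklearn.metrics.fowlkes_mallows_score` (assumes scikit-learn is installed). One useful property: it is invariant to how the cluster ids are numbered, since only pairwise co-membership matters:

```python
from sklearn.metrics import fowlkes_mallows_score

# Same grouping, opposite numbering: still a perfect score of 1.0
score = fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0])
print(score)  # 1.0
```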

35
Q

In what type of problem would we not know the class labels?

A

In a true clustering problem, we would not know the class labels.

36
Q

What technique should we use if we have the labels in a clustering problem?

A

If we have the labels in a clustering problem, we should deploy a classification technique.

37
Q

What is the Silhouette Coefficient used for?

A

The Silhouette Coefficient is used for evaluating the quality of clusters when the true class labels are unknown.

38
Q

What does the Silhouette Coefficient measure?

A

The Silhouette Coefficient measures how well each data point fits into its assigned cluster based on both the mean intra-cluster distance (a) and the mean distance between a sample and all other points in the next nearest cluster (b).

39
Q

What are the advantages of using the Silhouette Coefficient?

A

The Silhouette Coefficient is advantageous because it can detect if the clustering is incorrect, if the clusters are overlapping, or if the clusters are highly dense and well-separated. The score is also conceptually sound, as a higher score indicates a better clustering.

40
Q

What is the Silhouette Coefficient formula?

A

The formula for Silhouette Coefficient is:

s(i) = (b(i) - a(i)) / max{a(i), b(i)}

where:

s(i) is the Silhouette Coefficient for the i-th data point
a(i) is the mean distance between the i-th data point and all other points in the same cluster
b(i) is the mean distance between the i-th data point and all other points in the next nearest cluster.
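
scikit-learn implements this formula directly: `silhouette_samples` returns s(i) for every point and `silhouette_score` returns their mean (synthetic data below, invented for illustration):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two synthetic, well-separated blobs, so the silhouette is close to 1
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)

s = silhouette_samples(X, labels)   # one s(i) per data point
print(silhouette_score(X, labels))  # the mean of all s(i)
```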

41
Q

What is the inertia metric for determining the best K in clustering?

A

Inertia measures the sum of squared distances of all samples to their closest cluster center, and can be used to evaluate the quality of clustering. A lower inertia value generally indicates better clustering, but it may not be the best metric for all cases.

42
Q

How can the silhouette score be used to determine the best K in clustering?

A

The silhouette score measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, with higher scores indicating better clustering. To determine the best K, one can plot the silhouette scores for different K values and select the K with the highest average score.

43
Q

What are some other metrics that can be used to determine the best K in clustering?

A

Other metrics that can be used include the Calinski-Harabasz index, Davies-Bouldin index, and Gap statistic. These metrics can help evaluate the quality of clustering based on different criteria such as compactness, separation, and cluster size. It’s important to select the appropriate metric based on the data and clustering goals.

44
Q

How can we determine the best value of K using the elbow method?

A

To determine the best value of K using the elbow method, we compute the inertia for different values of K, ranging from 2 up to some maximum (at most as many clusters as we have data points). Then, we plot the inertia against K and look for the point where the inertia starts to decrease at a slower rate, forming an elbow shape. The value of K at the elbow point is considered to be the best value for K.
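
The elbow computation (minus the plotting) can be sketched with scikit-learn on synthetic data; with three invented blobs, the sharp drop in inertia stops at K = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs, so the elbow should appear at K = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 4, 8)])

inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 7)}
for k, v in inertias.items():
    print(k, round(v, 1))  # inertia drops sharply up to K=3, then flattens
```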

45
Q

What is the main idea behind agglomerative/hierarchical clustering?

A

The main idea is to group the nearest points in their clusters, and then recursively merge these clusters until all points are in a single cluster.

46
Q

What is a dendrogram in the context of agglomerative/hierarchical clustering?

A

A dendrogram is a tree-like diagram that shows the order in which clusters are merged. It displays the distance between each pair of clusters and can be used to determine the optimal number of clusters.

47
Q

What are the two types of hierarchical clustering?

A

The two types are divisive and agglomerative. Divisive (top-down) clustering starts with all points in one cluster and recursively divides them into smaller clusters. Agglomerative (bottom-up) clustering starts with each point in its own cluster and then repeatedly merges the nearest clusters.

48
Q

What is the linkage criterion in agglomerative/hierarchical clustering?

A

The linkage criterion is a rule for determining the distance between two clusters. There are several different linkage criteria, including single linkage, complete linkage, and average linkage.
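
In SciPy, the linkage criterion is the `method` argument: `scipy.cluster.hierarchy.linkage` builds the full merge history (the dendrogram data), and `fcluster` cuts the tree into a chosen number of clusters (synthetic data below, invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two synthetic blobs; method can be 'single', 'complete', or 'average'
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

Z = linkage(X, method="average")                 # merge history for the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```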