Week 7 - Machine Learning II Flashcards

1
Q

Define supervised learning

A

Supervised learning takes a set of input features (predictors) and an output (target) and learns a mapping function from input to output

2
Q

How is the training for supervised learning done?

A

Supervised learning is trained using labelled data

3
Q

Is supervised learning always possible? Why or why not?

A

No, supervised learning is not always possible: we often do not have labelled datasets, and we may have little or no idea what the labels should be

4
Q

What is unsupervised learning type 1?

A

Clustering = grouping data with similar characteristics

5
Q

What is unsupervised learning type 2?

A

Association (Data Mining) = given a set of transactions, find rules that predict the occurrence of an item based on the other items in the transaction

Dimensionality reduction = Reducing the dimensions/features of the data (e.g., PCA)

*PCA is principal component analysis

6
Q

What is clustering?

A

Clustering organises data that share similar characteristics into groups.

Every data point (observation) in a group is more similar to other data points in the same group than it is to data points in other groups.

7
Q

What is the concept of grouping data points in clustering?

A

1. Data points in the same group should be as similar as possible
2. Data points in different groups should be as dissimilar as possible

8
Q

How is distance measured in clustering, and why does it matter?

A

Effective clustering requires computing similarity/dissimilarity between data points. The distance is used to assign each data point to its closest cluster.

The most common distance metric in ML is Euclidean, but others are also possible: Manhattan, cosine, Minkowski, Hamming and Chebyshev
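As a concrete illustration, three of these metrics can be sketched in a few lines. The course material uses R; this is a Python sketch, and the points `p` and `q` are made up for the example:

```python
import math

# Two made-up 2-D points, purely for illustration.
p, q = (1.0, 2.0), (4.0, 6.0)

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # "City block" distance: summed absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms

print(euclidean(p, q))  # 5.0 (a 3-4-5 triangle)
print(manhattan(p, q))  # 7.0
```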

9
Q

How can you apply clustering to real-world applications?

A
  1. Customer segmentation - customers can be categorised into distinct groups, each group exhibiting certain characteristics (purchasing behaviour, demographics or preferences)
  2. Anomaly detection - network monitoring for intrusions, bank transactions for fraudulent transactions, insurance fraud detection and other anomalous behaviour
  3. Social Network Analysis - grouping different types of communities
  4. Web search results - returning similar text, images and videos
10
Q

How can you apply clustering in SCS?

A
  1. Automatically categorising potential suicide messages on social media (an unsupervised learning approach)
  2. Grouping social media users who are highly similar over the available features studied
  3. Analysing common characteristics of criminal actors by splitting them into groups with K-means clustering
  4. Classifying malware using clustering
11
Q

How can clustering algorithms be categorised?

A

Clustering algorithms can be partitioning-based, hierarchical-based, density-based, grid-based or model-based

12
Q

What is K-means clustering?

A

A machine learning algorithm that groups unlabeled data points into clusters based on their features.

13
Q

How do you assign data points to two clusters (K = 2) using the K-means algorithm?

A
  1. Randomly initialise k cluster centroids
  2. Assign every data point to its closest cluster centroid
  3. Move each cluster centroid to the mean of the data points assigned to it
  4. Assign every data point to its closest (new) cluster centroid
  5. Move each cluster centroid to the mean of the data points assigned to it
  6. Repeat steps 4 and 5 until there is no change (the centroid positions stop changing or the maximum number of iterations is reached)
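The six steps above can be sketched as a minimal, illustrative K-means implementation. This is a Python sketch, not course code; the `kmeans` helper and the toy data are made up, and fixed initial centroids keep the demo deterministic:

```python
import math
import random

def kmeans(points, k, max_iter=10, seed=0, init=None):
    """Minimal K-means sketch following the six steps above."""
    rng = random.Random(seed)
    # 1. Randomly initialise k cluster centroids (or use the ones supplied).
    centroids = init or rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # 2./4. Assign every data point to its closest cluster centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # 3./5. Move each centroid to the mean of the points assigned to it.
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # 6. Stop once the centroid positions no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two made-up, well-separated blobs; K = 2 recovers them.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, k=2, init=[(0, 0), (10, 10)])
```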
14
Q

What is supervised data?

A

Supervised data is data in which each data point (observation) has features/predictors together with a target/label variable (in the lecture example, two features and a target with two outcomes)

15
Q

What is the difference between a supervised and unsupervised case?

A

An unsupervised case has no labels given unlike supervised.

16
Q

What is cluster assignment?

A

The step where every data point is assigned to its closest cluster centroid. Before assignment, the data are typically scaled to standardise x and y to the same scale

17
Q

Define each argument in the code below:

unsup_model <- kmeans(data, centers = 4, nstart = 10, iter.max = 10)

A

data - the training data (a numeric data frame)

centers - the number of clusters (cluster centres)

nstart - the number of times the k-means algorithm will be run with different starting centres

iter.max - the maximum number of iterations allowed
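For readers working in Python rather than R, a rough analogue of this call, assuming scikit-learn is available, maps centers to `n_clusters`, nstart to `n_init` and iter.max to `max_iter` (the toy data are made up, and `n_clusters` is 2 here only because the made-up data have two blobs):

```python
from sklearn.cluster import KMeans
import numpy as np

# Made-up numeric training data (the analogue of `data` in the R call).
data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# centers -> n_clusters, nstart -> n_init, iter.max -> max_iter.
unsup_model = KMeans(n_clusters=2, n_init=10, max_iter=10, random_state=0).fit(data)

print(unsup_model.cluster_centers_)  # analogue of unsup_model$centers
print(unsup_model.labels_)           # analogue of unsup_model$cluster
print(unsup_model.inertia_)          # total within-cluster sum of squares
```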

18
Q

What is the process of Iteration 1 for K-means?

A
  1. Randomly initialise k=4 cluster centroids
  2. Assign every data point to its closest cluster centroid
  3. Move each cluster centroid to the mean of the data points assigned to it
  4. Assign every data point to its closest (new) cluster centroid
19
Q

How can you move the cluster centroids in K-means?

A

You can move each cluster centroid to the mean of the data points assigned to it

20
Q

How do you assign data points to their closest cluster centroids?

A

By using the Euclidean distance metric to measure the distance between each data point and the cluster centroids

21
Q

How can you position new centroids?

A

New centroids are positioned by moving each cluster centroid to the mean of the data points assigned to it

22
Q

What is clustering iteration 2?

A

Clustering iteration 2 repeats the process: every data point is assigned to its closest cluster centroid using the new centroid positions.

23
Q

Do you need to re-compute for iteration 2?

A

There is no need to re-compute new cluster centroids, since the cluster memberships have not changed. This causes K-means to stop

24
Q

What are the stopping conditions used to stop K-means?

A
  1. Data points remain in their assigned clusters
  2. Centroids of newly formed clusters do not change
  3. The maximum number of iterations (e.g., iter.max = 10) is reached
  4. A distance threshold is reached (data points fall within a minimum distance of their centroids)
25
Q

How can you determine ‘k’ in K-means?

A

K-means requires the user to specify ‘k’. Determining the optimal number of clusters in the data is fundamental, but there is no definitive solution.

A common approach is to run K-means for different values of k (k = 1, k = 2, …) and then assess how good each k is using the within sum of squares (WSS) and an elbow curve
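This approach can be sketched in Python, assuming scikit-learn is available (the toy data are made up): run K-means for k = 1…4 and record the total WSS (which scikit-learn exposes as `inertia_`); the "elbow" is where the drop in WSS flattens out.

```python
from sklearn.cluster import KMeans
import numpy as np

# Two well-separated made-up blobs, so the "true" k is 2.
data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Total within sum of squares (WSS) for k = 1..4.
wss = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
    for k in range(1, 5)
]

# The drop from k=1 to k=2 is huge; after k=2 it flattens: the elbow is at k=2.
for k, w in zip(range(1, 5), wss):
    print(k, w)
```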

26
Q

How do you compute the within (cluster) sum of squares (WCSS or WSS)?

A

The WSS for a single cluster (WSSi) is the summed squared distance between each data point and the cluster’s centroid

Hence, the total within sum of squares is the sum of the WSSi of all clusters (e.g., cluster 1 + cluster 2 + cluster 3 + cluster 4)
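A minimal Python sketch of this definition (the clusters, centroids and numbers are purely illustrative):

```python
def wss(cluster, centroid):
    # WSS_i: summed squared distance between each point and the cluster's centroid.
    cx, cy = centroid
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in cluster)

# Two made-up clusters with their centroids, purely for illustration.
cluster1, centroid1 = [(0, 0), (0, 1), (1, 0)], (1 / 3, 1 / 3)
cluster2, centroid2 = [(10, 10), (10, 11), (11, 10)], (31 / 3, 31 / 3)

# Total WSS = sum of the per-cluster WSS_i.
total_wss = wss(cluster1, centroid1) + wss(cluster2, centroid2)
print(total_wss)  # ≈ 8/3
```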

27
Q

How does the total WSS behave as the number of clusters increases?

A

As the number of clusters increases, there is a decline in total WSS due to each cluster becoming smaller and tighter (i.e., points within each cluster are closer together)

It is expected that the rate at which WSS decreases will slow down once the number of clusters exceeds the optimal k, producing the ‘elbow’ in the curve

28
Q

What are the other methods for establishing ‘k’ apart from using the WSS formula?

A

You can use the elbow method (graph), but you can also use:

  1. Average silhouette width (ASW) method (cluster fit)
  2. Gap statistic
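The ASW method can be sketched in Python, assuming scikit-learn is available (the toy data are made up); the k with the highest average silhouette width is preferred:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# Two well-separated made-up blobs.
data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Average silhouette width for k = 2..4 (the score needs at least 2 clusters).
scores = []
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores.append(silhouette_score(data, labels))
    print(k, scores[-1])
```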
29
Q

What are some clustering methods apart from K-means?

A
  1. Hierarchical-based clustering: groups all data into a tree based on the distance between data points. There are two main types: agglomerative (bottom-up) and divisive (top-down)
  2. Density-based clustering: builds clusters from high-density regions of points. It can identify clusters of any shape in data containing noise and outliers, and works well in low dimensions
30
Q

What are some of the problems with unsupervised learning?

A
  1. There is no ‘ground truth’ to evaluate results against
  2. Interpretation is subjective
  3. Cluster choice is determined by user
  4. We cannot say anything about accuracy, as there are no true labels to compare against
31
Q

What are some key takeaways when interpreting findings?

A
  1. Clustering is more exploratory than supervised learning
  2. Explore various numbers of clusters and try different algorithms
  3. When reporting results, make all choices transparent and, if possible, share code and data