Week 7 - Machine Learning II Flashcards
Define supervised learning
Supervised learning takes a set of input features (predictors) and an output (target) and learns a mapping function from input to output
How is the training for supervised learning done?
Supervised learning is trained by using data with labels
Is supervised learning always possible? Why, or why not?
No, supervised learning is not always possible, because we often do not have labelled datasets and may have little or no idea what the labels should be
What is unsupervised learning type 1?
Clustering = It is when you group data with similar characteristics.
What is unsupervised learning type 2?
Association (data mining) = given a set of transactions, find rules that predict the occurrence of an item based on the occurrence of other items in the transaction
What is unsupervised learning type 3?
Dimensionality reduction = reducing the dimensions/features of the data (e.g., PCA)
*PCA is principal component analysis
What is clustering?
Clustering organises data that share similar characteristics into groups.
Every data point (observation) in a group is more similar to other data points in the same group than it is to data points in other groups.
What is the concept of grouping data points in clustering?
1 - Data points in the same group should be as similar as possible
2 - Data points in different groups should be as dissimilar as possible
How can you measure distance in clustering, and why is it important?
Effective clustering requires computing similarity/dissimilarity between data points; distance is used to assign each data point to the closest cluster.
The most common distance metric in ML is Euclidean, but others are also possible: Manhattan, cosine, Minkowski, Hamming and Chebyshev
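As a rough illustration, the first three of these metrics can be computed by hand; the sketch below is Python/NumPy (the course uses R, so treat this as an equivalent, not the course's code), and the vectors `a` and `b` are made-up examples:

```python
import numpy as np

# Hypothetical feature vectors for two data points.
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the summed squared differences.
euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(9 + 16) = 5.0

# Manhattan distance: sum of absolute differences.
manhattan = np.sum(np.abs(a - b))           # 3 + 4 = 7.0

# Cosine distance: 1 minus the cosine similarity of the two vectors.
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Euclidean distance is what K-means uses by default; the others matter when magnitude (Manhattan) or direction only (cosine) is the more meaningful notion of similarity.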
How can you apply clustering to real-world applications?
- Customer segmentation - customers can be categorised into distinct groups, each group exhibiting certain characteristics (purchasing behaviour, demographics or preferences)
- Anomaly detection - network monitoring for intrusions, bank transactions for fraudulent transactions, insurance fraud detection and other anomalous behaviour
- Social Network Analysis - grouping different types of communities
- Web search results - returning similar text, images and videos
How can you apply clustering in SCS?
- Unsupervised learning approach for automatically categorizing potential suicide messages in social media
- Clustering is used to group social media users who are highly similar over the available features studied
- Analysing common characteristics of criminal actors by splitting them into groups using K-means clustering
- Classifying malware using clustering
How are clustering algorithms categorised?
Clustering algorithms can be partitioning-based, hierarchical-based, density-based, grid-based or model-based
What is K-means clustering?
A machine learning algorithm that groups unlabeled data points into clusters based on their features.
How do you assign data points to two clusters (K=2) using the k-means algorithm?
1. Randomly initialise k cluster centroids
2. Assign every data point to its closest cluster centroid
3. Move each cluster centroid to the mean of the data points assigned to it
4. Assign every data point to its closest (new) cluster centroid
5. Move each cluster centroid to the mean of the data points assigned to it
6. Repeat steps 4 and 5 until there is no change (the centroid positions stop changing) or the maximum number of steps is reached
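The steps above can be sketched as a minimal k-means implementation. This is an illustrative Python/NumPy version (the course itself uses R's `kmeans`), with made-up two-blob data:

```python
import numpy as np

def kmeans(data, k, iter_max=10, seed=0):
    """Minimal k-means sketch: repeat assign/update until centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly initialise k centroids by picking k distinct data points.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iter_max):
        # Assignment step: every point joins its closest centroid (Euclidean).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Stop when centroid positions no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; k=2 should recover them.
data = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                 [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, centroids = kmeans(data, k=2)
```

With well-separated groups like these, the algorithm converges in a couple of iterations regardless of which points are picked as the initial centroids.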
What is supervised data?
Supervised data is data in which each data point (observation) has features/predictors and a target/label variable (e.g., two features and a target with two outcomes)
What is the difference between a supervised and unsupervised case?
An unsupervised case has no labels, unlike a supervised case.
What is cluster assignment?
The step where each data point is allocated to its closest cluster centroid; before this, the data is scaled so that x and y are standardised to the same scale
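A minimal sketch of the scaling step, assuming standardisation to z-scores (the data here is hypothetical, with two features on very different scales):

```python
import numpy as np

# Hypothetical two-feature data on very different scales (e.g. income vs. age).
X = np.array([[30_000.0, 25.0],
              [60_000.0, 40.0],
              [90_000.0, 55.0]])

# Standardise each column to mean 0 and standard deviation 1 (z-scores),
# so no single feature dominates the Euclidean distance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, the large-scale feature would swamp the distance calculation and effectively decide the clusters on its own.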
Define each argument in the code below:
unsup_model <- kmeans(data, centers = 4, nstart = 10, iter.max = 10)
data - training data containing the numeric data frame
centers - number of clusters or cluster centers
nstart - number of times the k-means algorithm will be run with different centres
iter.max - maximum number of iterations allowed
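The effect of `nstart` can be mimicked in a short Python/NumPy sketch (the function names here are made up; R's `kmeans` does this internally): run the algorithm several times from different random centres and keep the run with the lowest total within-cluster sum of squares:

```python
import numpy as np

def kmeans_once(data, k, rng, iter_max=10):
    # One k-means run from a random start; returns labels, centroids, total WSS.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iter_max):
        d = np.linalg.norm(data[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    wss = ((data - centroids[labels]) ** 2).sum()
    return labels, centroids, wss

def kmeans_nstart(data, k, nstart=10, iter_max=10, seed=0):
    # Mirror of R's nstart: run nstart times, keep the run with the lowest WSS.
    rng = np.random.default_rng(seed)
    return min((kmeans_once(data, k, rng, iter_max) for _ in range(nstart)),
               key=lambda run: run[2])

data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
                 [5.1, 4.9], [0.1, 0.2], [4.9, 5.1]])
labels, centroids, wss = kmeans_nstart(data, k=2, nstart=10)
```

Multiple restarts guard against bad random initialisations that would otherwise leave k-means stuck in a poor local optimum.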
What is the process of Iteration 1 for K-means?
- Randomly initialise k=4 cluster centroids
- Assign every data point to its closest cluster centroid
- Move each cluster centroid to the mean of the data points assigned to it
- Assign every data point to its closest (new) cluster centroid
How can you move the cluster centroids in K-means?
You can move each cluster centroid to the mean of the data points assigned to it
How do you assign data points to their closest cluster centroid?
Using the Euclidean distance metric to measure the distance between each data point and each centroid; each point is assigned to the centroid with the smallest distance
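A small worked example of the assignment step (Python/NumPy sketch; the centroids and points are made up):

```python
import numpy as np

# Hypothetical current centroids and data points.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
points = np.array([[1.0, 1.0], [4.0, 5.0], [0.5, -0.5]])

# Euclidean distance from every point to every centroid -> shape (n_points, k).
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point is assigned to the centroid with the smallest distance.
assignment = dists.argmin(axis=1)   # -> [0, 1, 0]
```

Here the first and third points are nearest the centroid at the origin, while the second point is nearest the centroid at (5, 5).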
How can you position new centroids?
The new centroid positions are found by moving each cluster centroid to the mean of the data points assigned to it
What is clustering iteration 2?
Clustering iteration 2 repeats the process: every data point is assigned to its closest cluster centroid using the new centroid positions.
Do you need to re-compute for iteration 2?
There is no need to re-compute new cluster centroids if the cluster memberships have not changed; this causes K-means to stop
What are the stopping conditions used to stop K-means ?
- You stop when the data points remain in the assigned clusters
- Centroids of newly formed clusters do not change
- Maximum number of iterations (iter.max = 10) is reached
- A distance threshold is reached (data points lie within a minimum distance of their centroids)
How can you determine ‘k’ in K-means?
K-means requires the user to specify 'k'. Determining the optimal number of clusters in the data is fundamental, but there is no definitive solution.
A common approach is to run k-means for different values of k (k = 1, k = 2, …) and then assess how good each k is using the within-cluster sum of squares (WSS) and an elbow curve
What is the within-cluster sum of squares (WCSS or WSS)?
WSS for a single cluster (WSSi) is the summed squared distance between each data point and the cluster's centroid
Hence, the total within-cluster sum of squares is the sum of the WSSi of all clusters (e.g., cluster 1, cluster 2, cluster 3 and cluster 4)
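A tiny hand-checkable example of WSSi for a single cluster (Python/NumPy sketch with made-up points):

```python
import numpy as np

# Hypothetical cluster with two points; its centroid is their mean.
cluster = np.array([[0.0, 0.0], [2.0, 0.0]])
centroid = cluster.mean(axis=0)                 # -> [1.0, 0.0]

# WSSi: summed squared Euclidean distance from each point to the centroid.
wss_i = np.sum((cluster - centroid) ** 2)       # 1.0 + 1.0 = 2.0
```

The total WSS is just this quantity computed per cluster and summed over all clusters.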
How can you interpret the total WSS as the number of clusters changes?
As the number of clusters increases, the total WSS declines because each cluster becomes smaller and tighter (i.e., points within each cluster are closer together)
The rate at which WSS decreases is expected to slow down once k exceeds the optimal number of clusters
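This decline can be seen even without running k-means, by comparing the total WSS of a one-cluster partition against an obvious two-cluster partition (Python/NumPy sketch with made-up two-blob data):

```python
import numpy as np

data = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])

def total_wss(data, labels):
    # Sum of each cluster's WSSi (squared distances to its own centroid).
    return sum(((data[labels == j] - data[labels == j].mean(axis=0)) ** 2).sum()
               for j in np.unique(labels))

# k = 1: every point in a single cluster.
wss_k1 = total_wss(data, np.zeros(len(data), dtype=int))

# k = 2: the two visually obvious groups.
wss_k2 = total_wss(data, np.array([0, 0, 0, 1, 1, 1]))
```

Here WSS drops sharply from k=1 to k=2 and would improve only marginally for k=3 or k=4, which is exactly the "elbow" the elbow curve looks for.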
What are the other methods for establishing ‘k’ apart from using the WSS formula?
You can use the elbow method (graph), but you can also use:
- Average silhouette width (ASW) method (cluster fit)
- Gap statistic
What are some clustering methods apart from using unsupervised models?
- Hierarchical-based clustering: groups all data into a tree based on the distance between data points. There are two main types: agglomerative (bottom-up) and divisive (top-down)
- Density-based clustering: finds high-density regions and builds clusters from them. It can identify clusters of any shape in data containing noise and outliers, and works well in low dimensions
What are some of the problems with unsupervised learning?
- There is no ‘ground truth’ to evaluate results against
- Interpretation is subjective
- Cluster choice is determined by the user
- We also cannot say anything about the accuracy
What are some key takeaways when interpreting findings?
- Clustering is more exploratory than supervised learning
- Explore various numbers of clusters and try different algorithms
- When reporting results, make all choices transparent and, if possible, share code and data