Week 7 - Machine Learning II Flashcards
Define supervised learning
Supervised learning takes a set of input features (predictors) and an output (target) and learns a mapping function from input to output
How is the training for supervised learning done?
Supervised learning is trained by using data with labels
Is supervised learning always possible? Why, or why not?
No, supervised learning is not always possible, because we often do not have labelled datasets and may have little or no idea what the labels should be
What is unsupervised learning type 1?
Clustering = It is when you group data with similar characteristics.
What is unsupervised learning type 2?
Association (data mining) = given a set of transactions, find rules that predict the occurrence of an item based on the occurrence of other items in the transaction
What is unsupervised learning type 3?
Dimensionality reduction = reducing the dimensions/features of the data (e.g., PCA)
*PCA is principal component analysis
What is clustering?
Clustering organises data that share similar characteristics into groups.
Every data point (observation) in a group is more similar to other data points in the same group than it is to data points in other groups.
What is the concept of grouping data points in clustering?
1 - Data points in the same group should be as similar as possible
2 - Data points in different groups should be as dissimilar as possible
How can you measure distance in clustering, and why is it important?
Effective clustering requires computing similarity/dissimilarity between data points; distance is used to assign each data point to the closest cluster.
The most common distance metric in ML is Euclidean, but others are also possible: Manhattan, cosine, Minkowski, Hamming and Chebyshev
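As a rough illustration, the first three of these metrics can be computed by hand; the sketch below is Python/NumPy (the course uses R, so treat this as an equivalent, not the course's code), and the vectors `a` and `b` are made-up examples:

```python
import numpy as np

# Hypothetical feature vectors for two data points.
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the summed squared differences.
euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(9 + 16) = 5.0

# Manhattan distance: sum of absolute differences.
manhattan = np.sum(np.abs(a - b))           # 3 + 4 = 7.0

# Cosine distance: 1 minus the cosine similarity of the two vectors.
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Euclidean distance is what K-means uses by default; the others matter when magnitude (Manhattan) or direction only (cosine) is the more meaningful notion of similarity.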
How can you apply clustering to real-world applications?
- Customer segmentation - customers can be categorised into distinct groups, each group exhibiting certain characteristics (purchasing behaviour, demographics or preferences)
- Anomaly detection - network monitoring for intrusions, bank transactions for fraudulent transactions, insurance fraud detection and other anomalous behaviour
- Social Network Analysis - grouping different types of communities
- Web search results - returning similar text, images and videos
How can you apply clustering in SCS?
- Unsupervised learning approach for automatically categorizing potential suicide messages in social media
- Clustering is used to group social media users who are highly similar over the available features studied
- Analysing common characteristics of criminal actors by splitting them into groups using K-means clustering
- Classifying malware using clustering
How are clustering algorithms categorised?
Clustering algorithms can be partitioning-based, hierarchical-based, density-based, grid-based or model-based
What is K-means clustering?
A machine learning algorithm that groups unlabeled data points into clusters based on their features.
How do you assign data points to two clusters (K=2) using the k-means algorithm?
1. Randomly initialise k cluster centroids
2. Assign every data point to its closest cluster centroid
3. Move each cluster centroid to the mean of the data points assigned to it
4. Assign every data point to its closest (new) cluster centroid
5. Move each cluster centroid to the mean of the data points assigned to it
6. Repeat steps 4 and 5 until there is no change (the centroid positions stop changing) or the maximum number of steps is reached
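The steps above can be sketched as a minimal k-means implementation. This is an illustrative Python/NumPy version (the course itself uses R's `kmeans`), with made-up two-blob data:

```python
import numpy as np

def kmeans(data, k, iter_max=10, seed=0):
    """Minimal k-means sketch: repeat assign/update until centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly initialise k centroids by picking k distinct data points.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iter_max):
        # Assignment step: every point joins its closest centroid (Euclidean).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Stop when centroid positions no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; k=2 should recover them.
data = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                 [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, centroids = kmeans(data, k=2)
```

With well-separated groups like these, the algorithm converges in a couple of iterations regardless of which points are picked as the initial centroids.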
What is supervised data?
Supervised data is data in which each data point (observation) has features/predictors and a target/label variable (e.g., two features and a target with two outcomes)
What is the difference between a supervised and unsupervised case?
An unsupervised case has no labels, unlike a supervised case.
What is cluster assignment?
The step where each data point is allocated to its closest cluster centroid; before this, the data is scaled so that x and y are standardised to the same scale
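A minimal sketch of the scaling step, assuming standardisation to z-scores (the data here is hypothetical, with two features on very different scales):

```python
import numpy as np

# Hypothetical two-feature data on very different scales (e.g. income vs. age).
X = np.array([[30_000.0, 25.0],
              [60_000.0, 40.0],
              [90_000.0, 55.0]])

# Standardise each column to mean 0 and standard deviation 1 (z-scores),
# so no single feature dominates the Euclidean distance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, the large-scale feature would swamp the distance calculation and effectively decide the clusters on its own.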
Define each argument in the code below:
unsup_model <- kmeans(data, centers = 4, nstart = 10, iter.max = 10)
data - training data containing the numeric data frame
centers - number of clusters or cluster centers
nstart - number of times the k-means algorithm will be run with different centres
iter.max - maximum number of iterations allowed
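The effect of `nstart` can be mimicked in a short Python/NumPy sketch (the function names here are made up; R's `kmeans` does this internally): run the algorithm several times from different random centres and keep the run with the lowest total within-cluster sum of squares:

```python
import numpy as np

def kmeans_once(data, k, rng, iter_max=10):
    # One k-means run from a random start; returns labels, centroids, total WSS.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iter_max):
        d = np.linalg.norm(data[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    wss = ((data - centroids[labels]) ** 2).sum()
    return labels, centroids, wss

def kmeans_nstart(data, k, nstart=10, iter_max=10, seed=0):
    # Mirror of R's nstart: run nstart times, keep the run with the lowest WSS.
    rng = np.random.default_rng(seed)
    return min((kmeans_once(data, k, rng, iter_max) for _ in range(nstart)),
               key=lambda run: run[2])

data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
                 [5.1, 4.9], [0.1, 0.2], [4.9, 5.1]])
labels, centroids, wss = kmeans_nstart(data, k=2, nstart=10)
```

Multiple restarts guard against bad random initialisations that would otherwise leave k-means stuck in a poor local optimum.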
What is the process of Iteration 1 for K-means?
- Randomly initialise k=4 cluster centroids
- Assign every data point to its closest cluster centroid
- Move each cluster centroid to the mean of the data points assigned to it
- Assign every data point to its closest (new) cluster centroid
How can you move the cluster centroids in K-means?
You can move each cluster centroid to the mean of the data points assigned to it
How do you assign data points to their closest cluster centroid?
Using the Euclidean distance metric to measure the distance between each data point and each centroid; each point is assigned to the centroid with the smallest distance
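A small worked example of the assignment step (Python/NumPy sketch; the centroids and points are made up):

```python
import numpy as np

# Hypothetical current centroids and data points.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
points = np.array([[1.0, 1.0], [4.0, 5.0], [0.5, -0.5]])

# Euclidean distance from every point to every centroid -> shape (n_points, k).
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point is assigned to the centroid with the smallest distance.
assignment = dists.argmin(axis=1)   # -> [0, 1, 0]
```

Here the first and third points are nearest the centroid at the origin, while the second point is nearest the centroid at (5, 5).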
How can you position new centroids?
The new centroid positions are found by moving each cluster centroid to the mean of the data points assigned to it
What is clustering iteration 2?
Clustering iteration 2 repeats the process: every data point is assigned to its closest cluster centroid using the new centroid positions.
Do you need to re-compute for iteration 2?
There is no need to re-compute new cluster centroids if the cluster memberships have not changed; this causes K-means to stop
What are the stopping conditions used to stop K-means ?
- You stop when the data points remain in the assigned clusters
- Centroids of newly formed clusters do not change
- Maximum number of iterations (iter.max = 10) is reached
- A distance threshold is reached (data points lie within a minimum distance of their centroids)
How can you determine ‘k’ in K-means?
K-means requires the user to specify 'k'. Determining the optimal number of clusters in the data is fundamental, but there is no definitive solution.
A common approach is to run k-means for different values of k (k = 1, k = 2, …) and then assess how good each k is using the within-cluster sum of squares (WSS) and an elbow curve
What is the within-cluster sum of squares (WCSS or WSS)?
WSS for a single cluster (WSSi) is the summed squared distance between each data point and the cluster's centroid
Hence, the total within-cluster sum of squares is the sum of the WSSi of all clusters (e.g., cluster 1, cluster 2, cluster 3 and cluster 4)
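A tiny hand-checkable example of WSSi for a single cluster (Python/NumPy sketch with made-up points):

```python
import numpy as np

# Hypothetical cluster with two points; its centroid is their mean.
cluster = np.array([[0.0, 0.0], [2.0, 0.0]])
centroid = cluster.mean(axis=0)                 # -> [1.0, 0.0]

# WSSi: summed squared Euclidean distance from each point to the centroid.
wss_i = np.sum((cluster - centroid) ** 2)       # 1.0 + 1.0 = 2.0
```

The total WSS is just this quantity computed per cluster and summed over all clusters.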
How can you interpret the total WSS as the number of clusters changes?
As the number of clusters increases, the total WSS declines because each cluster becomes smaller and tighter (i.e., points within each cluster are closer together)
The rate at which WSS decreases is expected to slow down once k exceeds the optimal number of clusters
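This decline can be seen even without running k-means, by comparing the total WSS of a one-cluster partition against an obvious two-cluster partition (Python/NumPy sketch with made-up two-blob data):

```python
import numpy as np

data = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])

def total_wss(data, labels):
    # Sum of each cluster's WSSi (squared distances to its own centroid).
    return sum(((data[labels == j] - data[labels == j].mean(axis=0)) ** 2).sum()
               for j in np.unique(labels))

# k = 1: every point in a single cluster.
wss_k1 = total_wss(data, np.zeros(len(data), dtype=int))

# k = 2: the two visually obvious groups.
wss_k2 = total_wss(data, np.array([0, 0, 0, 1, 1, 1]))
```

Here WSS drops sharply from k=1 to k=2 and would improve only marginally for k=3 or k=4, which is exactly the "elbow" the elbow curve looks for.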
What are the other methods for establishing ‘k’ apart from using the WSS formula?
You can use the elbow method (graph), but you can also use:
- Average silhouette width (ASW) method (cluster fit)
- Gap statistic
What are some clustering methods apart from using unsupervised models?
- Hierarchical-based clustering: groups all data into a tree based on the distance between data points. There are two main types: agglomerative (bottom-up) and divisive (top-down)
- Density-based clustering: finds high-density regions and builds clusters from them. It can identify clusters of any shape in data containing noise and outliers, and works well in low dimensions
What are some of the problems with unsupervised learning?
- There is no ‘ground truth’ to evaluate results against
- Interpretation is subjective
- Cluster choice is determined by the user
- We also cannot say anything about the accuracy
What are some key takeaways when interpreting findings?
- Clustering is more exploratory than supervised learning
- Explore various numbers of clusters and try different algorithms
- When reporting results, make all choices transparent and, if possible, share code and data