L10: Clustering Flashcards
clustering
Clustering groups data into subsets (clusters) whose members have similar attributes.
It works by calculating the similarity between objects.
Similarity is often treated as the inverse of distance: the closer two objects are, the more similar they are.
Limitation: similarity can be difficult to define, and different similarity criteria can lead to different clustering results.
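As a rough illustration of "similarity as the inverse of distance", the sketch below turns a Euclidean distance into a similarity score. The formula 1 / (1 + distance) and the function names are assumptions chosen for illustration; other similarity definitions are equally valid, which is exactly the limitation noted above.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Assumed mapping: identical points score 1.0, distant points approach 0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

p, q, r = (1.0, 1.0), (1.0, 2.0), (5.0, 4.0)
print(similarity(p, q))  # distance 1, so similarity 0.5
print(similarity(p, r))  # distance 5, so similarity 1/6
```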
clustering types
- Hierarchical Clustering
- Non-hierarchical clustering
hierarchical clustering (2)
Algorithms:
- Agglomerative clustering (bottom up)
- Divisive clustering (top down)
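A minimal sketch of the bottom-up (agglomerative) idea: every point starts as its own cluster, and the two closest clusters are merged until the requested number remains. The function names are hypothetical, and measuring cluster distance by the closest pair of points is an assumption made for this example.

```python
import math

def closest_pair_distance(c1, c2):
    """Assumed cluster distance: the smallest point-to-point distance."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters (>= 1) remain."""
    clusters = [[p] for p in points]  # start: one cluster per point
    while len(clusters) > n_clusters:
        # Find the indices (i < j) of the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: closest_pair_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerative(pts, 3))
```

Divisive clustering runs in the opposite direction: it starts with one cluster containing everything and splits repeatedly.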
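three ways of measuring distance between clusters
- Single linkage (nearest neighbour): the distance between the closest pair of points, one from each cluster.
- Complete linkage (furthest neighbour): the distance between the furthest pair of points, one from each cluster.
- Average linkage: the average of the distances between every pair of points, one from each cluster.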
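The three linkage measures can be sketched for two small clusters as follows. The function names and sample points are made up for illustration, and Euclidean distance is assumed.

```python
import math
from itertools import product

def pairwise_distances(c1, c2):
    """All distances between a point in c1 and a point in c2."""
    return [math.dist(a, b) for a, b in product(c1, c2)]

def single_linkage(c1, c2):
    return min(pairwise_distances(c1, c2))   # closest pair

def complete_linkage(c1, c2):
    return max(pairwise_distances(c1, c2))   # furthest pair

def average_linkage(c1, c2):
    d = pairwise_distances(c1, c2)
    return sum(d) / len(d)                    # mean over every pair

cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # sqrt(17), about 4.12
print(average_linkage(cluster_a, cluster_b))
```

For any pair of clusters the single-linkage value is never larger than the average, and the average is never larger than the complete-linkage value.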
K-means clustering
K-means is one of the most commonly used clustering algorithms.
K is the number of clusters you want to group your data into.
procedure of k-means clustering
1. Choose a value for K, the number of clusters.
2. Randomly choose K points as the initial centroids.
3. Assign each item to the cluster with the nearest centroid (mean).
4. Recalculate each centroid as the average of all data points in its cluster.
5. Repeat steps 3 and 4 until there are no more reassignments or the maximum number of iterations is reached.
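The five steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name, the `max_iter` and `seed` parameters, the use of Euclidean distance, and the choice to leave an empty cluster's centroid unchanged are all assumptions.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignments) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 2: random initial centroids
    assignments = None
    for _ in range(max_iter):                    # step 5: cap on iterations
        # Step 3: assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:       # step 5: no reassignments, stop
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                          # assumption: skip empty clusters
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)         # step 1: K = 2
print(centroids)
print(labels)
```

The sample data has two well-separated groups, so the first three points end up sharing one label and the last three the other; which group receives label 0 depends on the seed.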
k-means clustering limitations
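- Difficult to choose K: this needs human inspection or additional algorithms.
- Dependent on the seeds (initial centroid positions): different starting points can give different clusters.
- Sensitive to outliers: centroids are means, so a single extreme value can pull a centroid away from the rest of its cluster.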
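A small sketch of the outlier problem, using made-up points: because a centroid is a mean, adding one extreme point moves it well away from the original group. The `centroid` helper is a hypothetical name.

```python
def centroid(points):
    """Mean of each coordinate across the points."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
with_outlier = cluster + [(50.0, 50.0)]

print(centroid(cluster))       # (1.5, 1.5)
print(centroid(with_outlier))  # (11.2, 11.2)
```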
variable reduction
Variable reduction techniques can be used to reduce the dimensions (variables/columns) of a dataset before applying clustering methods.
This allows clusters found in multidimensional data to be visualised in two- or three-dimensional space.
Principal Components Analysis and Exploratory Factor Analysis will be covered in the Forecasting and Advanced Business Analytics module.