Unit 5 - Advanced clustering Flashcards
Problems with conventional clustering methods?
Computational complexity issues
- time efficiency (hierarchical)
- space efficiency
Mixed data types issue (k-means)
What are the two types of advanced clustering methods (also known as data mining methods)?
Self-organising map
two-step clustering
What's the common purpose of the advanced clustering methods?
To summarize and visualize data
To display similarity/dissimilarity among objects
What is a self-organising map?
- Belongs to the field of artificial intelligence
- A kind of artificial neural network
- Simulates the information processing of the brain
What is a neural network?
The design of an artificially intelligent system that mimics the neurophysiology and performance characteristics of the human brain
What are the key components of a neural network?
Architecture: the number of layers of neurons and the pattern of connections between these neurons.
Algorithm: the rules for determining the connection weights and consequently the learning behaviour of the network.
Activation function: the manner in which a neuron's output is generated.
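As a rough illustration of how the three components fit together, here is a minimal sketch of a single neuron. The sigmoid activation, the weights and the bias are illustrative assumptions, not something the flashcards specify.

```python
import math

def sigmoid(z):
    """Activation function (assumed): squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Architecture: one neuron with one connection weight per input.
    The output is the activation of the weighted sum of the inputs."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Hypothetical weights; in a real network the learning algorithm sets them.
out = neuron_output([1.0, 0.5], weights=[0.4, -0.2], bias=0.0)
```

Here the weighted sum is 0.3, so the output lies a little above 0.5, the sigmoid's midpoint.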
How are the input neurons connected in a SOM?
The input neurons are connected to their output counterparts in a feed-forward fashion
How are the output neurons connected in a SOM?
The neurons in the output layer are laterally connected among themselves
Interesting features of SOM?
- No prior assumption of the number of clusters
- Cluster structure emerges as a result of numerous interactions between neurons (self-organization)
- Does not track cluster memberships of items
- Use the idea of competitive learning
- Winning neurons get to learn
- In some sense similar to k-means
- Non-hierarchical clustering method
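The competitive-learning idea above can be sketched in a few lines of plain Python. This is a simplified illustration, not a reference implementation: the 1-D map, the Gaussian neighbourhood (standing in for the lateral connections), the linear decay schedule and all parameter names and defaults are assumptions.

```python
import math
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matching_unit(weights, x):
    """Competition: index of the output neuron whose weights are closest to x."""
    return min(range(len(weights)), key=lambda j: squared_distance(weights[j], x))

def train_som(data, n_neurons=4, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D map and return one weight vector per output neuron."""
    rng = random.Random(seed)
    # Initialise each neuron's weights from a randomly chosen data item.
    weights = [list(rng.choice(data)) for _ in range(n_neurons)]
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius shrink linearly over time.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for x in data:
            winner = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # The winner learns most; its map neighbours learn less.
                influence = math.exp(-((j - winner) ** 2) / (2 * radius ** 2))
                for d in range(len(x)):
                    w[d] += lr * influence * (x[d] - w[d])
    return weights

# Two well-separated toy groups of 2-D items.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
weights = train_som(data)
```

Training itself stores no cluster memberships; to see where an item lands on the map, call `best_matching_unit(weights, item)` afterwards.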
Strengths of SOM?
- Logical and intuitive
- No need to set the number of clusters
- Good for visualizing high-dimensional data
- Performs well when cluster sizes are different
- Not easily affected by noisy data
Weaknesses of SOM?
Quite a number of parameters to set
- Map size
- Learning rate (the rate at which neuron weights are updated)
Performance is sensitive to parameter settings
What are the steps of two-step clustering?
Step 1: Pre-cluster the data into sub-clusters
- Cluster Feature (CF): summarizes the information of a cluster
- CF tree
- Creating sub-clusters using CF tree
Step 2: Auto-Cluster the sub-clusters
- Use BIC to determine the number of clusters
- Refine the number of clusters using the ratio of inter-cluster distances
- Assign data objects to the determined clusters
How is pre-clustering conducted?
Uses a sequential clustering approach
- scans the records one by one based on a distance criterion
- the current record is either merged into an existing sub-cluster or starts a new sub-cluster
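The sequential scan can be sketched as follows. This is a simplified illustration under stated assumptions: a flat list stands in for the CF tree, Euclidean distance to the sub-cluster centroid is the assumed distance criterion, and the function and parameter names are hypothetical.

```python
import math

def sequential_precluster(records, threshold):
    """Scan the records once; merge each into the nearest sub-cluster
    if it is close enough, otherwise start a new sub-cluster."""
    subclusters = []  # each sub-cluster: {"n": count, "centroid": [...]}
    for rec in records:
        nearest, nearest_dist = None, None
        for sc in subclusters:
            d = math.sqrt(sum((x - c) ** 2 for x, c in zip(rec, sc["centroid"])))
            if nearest is None or d < nearest_dist:
                nearest, nearest_dist = sc, d
        if nearest is not None and nearest_dist < threshold:
            # Merge: update the running mean without storing the raw records.
            n = nearest["n"]
            nearest["centroid"] = [(c * n + x) / (n + 1)
                                   for c, x in zip(nearest["centroid"], rec)]
            nearest["n"] = n + 1
        else:
            subclusters.append({"n": 1, "centroid": list(rec)})
    return subclusters

records = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.1), (0.1, 0.1)]
subclusters = sequential_precluster(records, threshold=1.0)
```

Because each record is seen only once, the result can depend on the order in which the records arrive.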
What are the parameters of a CF tree?
1) Branching factor B
• A non-leaf node contains at most B entries
2) Leaf node factor L
• A leaf node contains at most L entries
3) Threshold T
• Determines whether to merge two sub-clusters or start a new sub-cluster
• If dist(CFi, CFj) < T, merge them
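A minimal sketch of the threshold test, assuming the BIRCH-style Cluster Feature triple (count, linear sum, sum of squares) and Euclidean distance between centroids as dist. The flashcards do not spell out either choice, and all function names here are hypothetical.

```python
import math

def make_cf(point):
    """CF of a single record: (count, linear sum, sum of squares)."""
    return (1, list(point), sum(x * x for x in point))

def merge_cf(cf_a, cf_b):
    """CFs are additive, so merging needs no access to the raw records."""
    n_a, ls_a, ss_a = cf_a
    n_b, ls_b, ss_b = cf_b
    return (n_a + n_b, [a + b for a, b in zip(ls_a, ls_b)], ss_a + ss_b)

def centroid(cf):
    n, ls, _ = cf
    return [s / n for s in ls]

def dist(cf_a, cf_b):
    """Assumed distance: Euclidean distance between the two centroids."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(centroid(cf_a), centroid(cf_b))))

def merge_if_close(cf_a, cf_b, threshold):
    """Return the merged CF if dist < T; otherwise None, meaning the
    caller should keep the two sub-clusters separate."""
    if dist(cf_a, cf_b) < threshold:
        return merge_cf(cf_a, cf_b)
    return None

cf1 = make_cf((0.0, 0.0))
cf2 = make_cf((0.6, 0.8))  # centroid distance to cf1 is about 1.0
merged = merge_if_close(cf1, cf2, threshold=1.5)
separate = merge_if_close(cf1, cf2, threshold=0.5)
```

With T = 1.5 the two sub-clusters merge into a CF with count 2; with T = 0.5 they stay separate.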
Strengths of two-step?
- Handles large data sets
- Automatically determines the number of clusters using BIC and the ratio change in cluster distances
- Handles mixed-type attributes