Unit 5 - Advanced clustering Flashcards

Question 1

Q

Problems with conventional clustering methods?

Answer

A

Computational complexity issues
- Time efficiency (hierarchical clustering)
- Space efficiency
Mixed data types issue (k-means)

Question 2

Q

What are the two advanced clustering methods (also known as data mining methods)?

Answer

A

Self-organising map (SOM)

Two-step clustering

Question 3

Q

What are the common purposes of advanced clustering methods?

Answer

A

To summarize and visualize data

To display similarity/dissimilarity among objects

Question 4

Q

What is a self-organising map?

Answer

A

A technique from the field of artificial intelligence
A kind of artificial neural network
Simulates the information processing of the brain

Question 5

Q

What is a neural network?

Answer

A

An artificially intelligent system designed to mimic the neurophysiology and performance characteristics of the human brain

Question 6

Q

What are the key components of a neural network?

Answer

A

Architecture: the number of layers of neurons and the pattern of connections between these neurons.
Algorithm: the rules for determining the connection weights and consequently the learning behaviour of the network.
Activation function: the manner in which a neuron’s output is generated.
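
The three components above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the single-layer layout, the sigmoid activation, the delta-rule update, and the names `forward` and `delta_rule_step` are choices made here, not something the card prescribes.

```python
import math

def sigmoid(z):
    """Activation function: maps any real input to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Architecture: one layer, where each row of `weights` feeds one output neuron."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def delta_rule_step(x, target, weights, biases, lr=0.5):
    """Algorithm: one gradient step on squared error (delta rule, assumed here)."""
    outputs = forward(x, weights, biases)
    for j, (y, t) in enumerate(zip(outputs, target)):
        grad = (y - t) * y * (1.0 - y)  # error times sigmoid derivative
        weights[j] = [w - lr * grad * xi for w, xi in zip(weights[j], x)]
        biases[j] -= lr * grad
    return weights, biases
```

Calling `delta_rule_step` repeatedly on the same input nudges the output toward the target, which is the "learning behaviour" the algorithm component refers to.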

Question 7

Q

How are the input neurons connected in a SOM?

Answer

A

The input neurons are connected with their output counterparts in a forward fashion

Question 8

Q

How are the output neurons connected in a SOM?

Answer

A

The neurons in the output layer are laterally connected among themselves

Question 9

Q

Interesting features of SOM?

Answer

A

No prior assumption of the number of clusters
Cluster structure emerges as a result of numerous interactions between neurons (self-organization)
Does not track cluster memberships of items
Uses the idea of competitive learning
Winning neurons get to learn
In some sense similar to k-means
Non-hierarchical clustering method
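
Competitive learning can be illustrated with a minimal SOM training loop. The sketch below is an assumption-laden toy, not a reference implementation: the grid size, the linear decay of the learning rate and neighbourhood radius, the Gaussian neighbourhood function, and the names `train_som` and `best_matching_unit` are all choices made for this example.

```python
import math
import random

def best_matching_unit(weights, x):
    """The winning neuron is the one whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per output neuron, laid out on a rows x cols grid.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays over time
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
        for x in data:
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                grid_dist = math.dist(pos, bmu)
                if grid_dist <= radius:
                    # The winner and its grid neighbours move towards the input.
                    influence = math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                    for i in range(dim):
                        w[i] += lr * influence * (x[i] - w[i])
    return weights
```

After training, mapping each record to its best matching unit gives a cluster-like grouping, although the map itself never stores memberships. The update rule resembles an online k-means step, with the extra twist that grid neighbours of the winner also learn.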

Question 10

Q

Strengths of SOM?

Answer

A

Logical and intuitive
No need to set the number of clusters
Good for visualizing high-dimensional data
Performs well when cluster sizes are different
Not easily affected by noisy data

Question 11

Q

Weaknesses of SOM?

Answer

A

Quite a number of parameters to set
- Map size
- Learning rate (rate of updating neurons)
Performance is sensitive to parameter settings

Question 12

Q

What are the steps of two-step clustering?

Answer

A

Step 1: Pre-cluster the data into sub-clusters
- Cluster Feature (CF): summarizes the information of a cluster
- CF tree
- Creating sub-clusters using the CF tree
Step 2: Auto-cluster the sub-clusters
- Use BIC to determine the number of clusters
- Refine the number of clusters using the ratio of inter-cluster distances
- Assign data objects to the determined clusters

Question 13

Q

How is pre-clustering conducted?

Answer

A

Uses a sequential clustering approach

Scans the records one by one based on a distance criterion
The current record is either merged with an existing sub-cluster or starts a new sub-cluster
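
A minimal sketch of this sequential scan is shown below. It assumes numeric records, Euclidean distance to each sub-cluster centroid, and a hypothetical `threshold` parameter as the distance criterion; the real two-step procedure uses a CF tree and its own distance measure, so treat this as an illustration of the idea only.

```python
import math

def precluster(records, threshold):
    """Scan records once; merge each into the nearest sub-cluster or start a new one."""
    subclusters = []  # each entry keeps a count and per-dimension sums
    labels = []
    for rec in records:
        best, best_dist = None, float("inf")
        for idx, sc in enumerate(subclusters):
            centroid = [s / sc["n"] for s in sc["sum"]]
            d = math.dist(rec, centroid)
            if d < best_dist:
                best, best_dist = idx, d
        if best is not None and best_dist < threshold:
            # Close enough: merge the record into the existing sub-cluster.
            sc = subclusters[best]
            sc["n"] += 1
            sc["sum"] = [s + v for s, v in zip(sc["sum"], rec)]
            labels.append(best)
        else:
            # Too far from everything seen so far: start a new sub-cluster.
            subclusters.append({"n": 1, "sum": list(rec)})
            labels.append(len(subclusters) - 1)
    return subclusters, labels
```

Because each record is visited once and only summaries are kept, the result depends on the order of the records and on the threshold.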

Question 14

Q

What are the parameters of a CF tree?

Answer

A

1) Branching factor B
• A non-leaf node contains at most B entries
2) Leaf node factor L
• A leaf node contains at most L entries
3) Threshold T
• Decides whether to merge two sub-clusters or start a new sub-cluster
• If dist(CF_i, CF_j) < T, merge them
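
The threshold test can be sketched with a BIRCH-style cluster feature. The example assumes numeric data, a CF stored as the triple (count, linear sum, sum of squares), and centroid Euclidean distance for `dist`; the class `CF` and the helper `merge_or_keep` are hypothetical names, and two-step clustering itself may use a different distance measure.

```python
import math
from dataclasses import dataclass

@dataclass
class CF:
    n: int      # number of points summarized
    ls: list    # linear sum of the points, per dimension
    ss: float   # sum of squared norms of the points

    @classmethod
    def from_point(cls, p):
        return cls(1, list(p), sum(v * v for v in p))

    def merge(self, other):
        # CFs are additive, so merging never needs the original points.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

def dist(cf_i, cf_j):
    """Centroid Euclidean distance (one possible choice, assumed here)."""
    return math.dist(cf_i.centroid(), cf_j.centroid())

def merge_or_keep(cf_i, cf_j, T):
    """Merge the two sub-clusters if they are closer than T, else keep both."""
    if dist(cf_i, cf_j) < T:
        return [cf_i.merge(cf_j)]
    return [cf_i, cf_j]
```

A small T produces many tight sub-clusters and a larger tree; a large T produces fewer, coarser sub-clusters.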

Question 15

Q

Strengths of two-step?

Answer

A

Deals with large data
Auto-determines the number of clusters using BIC and the ratio change in cluster distances
Deals with mixed-type attributes

Question 16

Q

Weaknesses of two-step?

Answer

A

Still need to guess a range for the number of clusters
Parameter settings still required for CF Tree
Example: T requires some “fine-tuning”

Question 17

Q

Other clustering algorithms?

Answer

A

BIRCH
CURE
DBSCAN
PROCLUS
O-Clustering

Question 18

Q

What is density-based spatial clustering of applications with noise (DBSCAN) used for?

Answer

A

Identifies dense regions as clusters
Treats low-density regions as noise

In 2D, with a spatial index, complexity is close to linear: about O(N log N)
In high dimensions, it can degrade to O(N^2)
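
The dense-region idea can be sketched with a small brute-force DBSCAN. The parameter names `eps` and `min_pts` and the label `NOISE = -1` are conventions assumed for this example. The neighbour search below compares every pair of points, so this version is O(N^2) regardless of dimension; faster variants rely on a spatial index.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # Brute-force range query: every point within eps of point i (including i).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # point i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)   # j is also a core point: keep expanding
    return labels
```

Points that end with the label `-1` are the low-density noise; every other label identifies one dense region.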

Question 19

Q

What is Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)?

Answer

A

Uses a CF tree (Zhang et al., 1996)
Choice of CF tree parameters is critical to its performance
Only applicable to numeric data
TwoStep is an extension of BIRCH

Question 20

Q

What is clustering using representatives (CURE)?

Answer

A

Step 1:
Draw a random sample from the dataset and perform hierarchical clustering
Step 2:
Assign each remaining data point to the clusters
Can find clusters of complex shapes and different sizes
Insensitive to outliers
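
The two steps can be sketched as follows. This is a deliberately simplified stand-in, not CURE itself: it uses plain single-linkage clustering on the sample and assigns every point to the cluster of its nearest sampled point, and it omits CURE's selection and shrinking of representative points. The names `single_linkage` and `cure_like` and the `seed` parameter are hypothetical.

```python
import math
import random

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters

def cure_like(points, n_clusters, sample_size, seed=0):
    # Step 1: cluster a random sample hierarchically.
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)
    clusters = single_linkage(sample, n_clusters)
    # Step 2: assign every point to the cluster holding its nearest sampled point.
    labels = []
    for p in points:
        nearest = min(range(len(clusters)),
                      key=lambda c: min(math.dist(p, q) for q in clusters[c]))
        labels.append(nearest)
    return labels
```

Clustering only the sample keeps the expensive hierarchical step small, and the assignment pass then scales linearly with the number of remaining points.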

Question 21

Q

What is projected clustering (PROCLUS)?

Answer

A

Select initial medoids (i.e. representative objects of clusters) that are far from each other
Compute centroids; points near the centroids are selected as medoids
Refinement: reassign points to medoids with outliers removed

Question 22

Q

What is O-Clustering?

Answer

A

Produces an optimal grid partitioning of the data, creating clusters that define dense areas in the attribute space
Overall complexity: O(nd)
where d < n is the number of partitions
Insensitive to noise

(22 cards)