Unit 5 - Advanced clustering Flashcards

1
Q

Problems with conventional clustering methods?

A

  • Computational complexity issues
    - time efficiency (hierarchical clustering)
    - space efficiency
  • Mixed data types issue (k-means)
2
Q

What are the two types of advanced clustering methods, also known as data mining methods?

A

Self-organising map (SOM)

Two-step clustering

3
Q

What is the common purpose of advanced clustering methods?

A

To summarize and visualize data

To display similarity/dissimilarity among objects

4
Q

What is a self-organising map?

A
  • A technique from the field of artificial intelligence
  • A kind of artificial neural network
  • Simulates the information processing of the brain
5
Q

What is a neural network?

A

The design of an artificially intelligent system that mimics the neurophysiology and performance characteristics of the human brain

6
Q

What are the key components of a neural network?

A

  • Architecture: the number of layers of neurons and the pattern of connections between these neurons.
  • Algorithm: the rules for determining the connection weights and, consequently, the learning behaviour of the network.
  • Activation function: the manner in which a neuron’s output is generated.
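
A minimal sketch of how the three components fit together for a single neuron; the weights, input values and the sigmoid activation below are illustrative choices, not part of the unit material:

```python
import numpy as np

def sigmoid(z):
    """Activation function: maps the neuron's weighted input to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Architecture: 3 inputs feeding a single output neuron.
x = np.array([0.5, -1.2, 2.0])   # one input record
w = np.array([0.4, 0.1, -0.3])   # connection weights
b = 0.0                          # bias term

# The algorithm (learning rule) would adjust w based on the error;
# here we only show the forward pass that produces the neuron's output.
output = sigmoid(w @ x + b)
print(output)
```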

7
Q

How are the input neurons connected in a SOM?

A

The input neurons are connected to their output counterparts in a feed-forward fashion

8
Q

How are the output neurons connected in a SOM?

A

The neurons in the output layer are laterally connected among themselves

9
Q

Interesting features of SOM?

A
  • No prior assumption about the number of clusters
  • The cluster structure emerges as a result of numerous interactions between neurons (self-organization)
  • Does not track cluster memberships of items
  • Uses the idea of competitive learning: the winning neurons get to learn (see the sketch after this list)
  • In some sense similar to k-means
  • A non-hierarchical clustering method
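
A compact sketch of the competitive-learning idea; the map size, learning-rate schedule and Gaussian neighbourhood function are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # 200 records, 3 features

rows, cols = 5, 5                         # map size (a parameter to set)
W = rng.normal(size=(rows, cols, 3))      # one weight vector per output neuron

# Grid coordinates of every output neuron, used for the lateral neighbourhood.
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

n_epochs, lr0, sigma0 = 20, 0.5, 2.0
for epoch in range(n_epochs):
    lr = lr0 * (1 - epoch / n_epochs)              # decaying learning rate
    sigma = sigma0 * (1 - epoch / n_epochs) + 0.5  # shrinking neighbourhood
    for x in X:
        # Competitive step: the output neuron whose weights are closest "wins".
        dists = np.linalg.norm(W - x, axis=-1)
        winner = np.unravel_index(np.argmin(dists), dists.shape)
        # Cooperative step: the winner and its grid neighbours get to learn.
        grid_dist = np.linalg.norm(grid - np.array(winner), axis=-1)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
        W += lr * h[..., None] * (x - W)
```

After training, each record maps to its winning neuron, and neighbouring neurons on the map hold similar records, which is what makes the map useful for visualising similarity.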
10
Q

Strengths of SOM?

A
  • Logical and intuitive
  • No need to set the number of clusters
  • Good for visualizing high-dimensional data
  • Performs well when cluster sizes are different
  • Not easily affected by noisy data
11
Q

Weaknesses of SOM?

A

  • Quite a number of parameters to set
    - Map size
    - Learning rate (the rate at which the neurons are updated)
  • Performance is sensitive to the parameter settings

12
Q

What are the steps of two-step clustering?

A

Step 1: Pre-cluster the data into sub-clusters
- Cluster feature (CF): summarizes the information of a cluster
- CF tree
- Create sub-clusters using the CF tree
Step 2: Auto-cluster the sub-clusters (see the sketch below)
- Use BIC (Bayesian information criterion) to determine the number of clusters
- Refine the number of clusters using the ratio of inter-cluster distances
- Assign data objects to the determined clusters
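
A rough approximation of the two steps with scikit-learn; the real TwoStep procedure builds a CF tree and clusters the sub-clusters hierarchically, so using Birch plus a Gaussian-mixture BIC search here is a simplification for illustration:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

# Step 1: pre-cluster the records into sub-clusters via a CF tree.
pre = Birch(threshold=0.5, branching_factor=50, n_clusters=None).fit(X)
sub_centres = pre.subcluster_centers_

# Step 2: cluster the sub-clusters, using BIC to choose the number of clusters.
bics = {}
for k in range(2, 9):                       # guessed range for the cluster count
    gm = GaussianMixture(n_components=k, random_state=0).fit(sub_centres)
    bics[k] = gm.bic(sub_centres)
best_k = min(bics, key=bics.get)

# Assign every original record to one of the chosen clusters.
final = GaussianMixture(n_components=best_k, random_state=0).fit(sub_centres)
labels = final.predict(X)
print(best_k, np.bincount(labels))
```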

13
Q

How is the pre-cluster step conducted?

A

Uses a sequential clustering approach (see the sketch below):

  • scans the records one by one, based on a distance criterion
  • the current record is either merged with an existing sub-cluster or starts a new sub-cluster
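
A simplified sketch of the sequential scan; the distance threshold and the use of running centroids instead of a full CF tree are assumptions made to keep the example short:

```python
import numpy as np

def pre_cluster(records, threshold=1.0):
    """Scan records once; merge each into the nearest sub-cluster or start a new one."""
    centroids, counts, assignments = [], [], []
    for x in records:
        if centroids:
            d = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(d))
            if d[j] < threshold:                               # close enough: merge
                counts[j] += 1
                centroids[j] += (x - centroids[j]) / counts[j]  # update running mean
                assignments.append(j)
                continue
        centroids.append(np.asarray(x, dtype=float).copy())    # too far: new sub-cluster
        counts.append(1)
        assignments.append(len(centroids) - 1)
    return centroids, assignments

rng = np.random.default_rng(0)
subs, labels = pre_cluster(rng.normal(size=(100, 2)), threshold=1.5)
print(len(subs))
```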
14
Q

What are the parameters of a CF tree?

A

1) Branching factor B
• A non-leaf node contains at most B entries
2) Leaf node factor L
• A leaf node contains at most L entries
3) Threshold T
• Determines whether to merge two sub-clusters or start a new sub-cluster
• If dist(CFi, CFj) < T, merge them (see the sketch below)
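
A small sketch of a cluster feature and the threshold test; the (N, LS, SS) summary follows the BIRCH literature, and centroid distance is just one possible choice of dist(CFi, CFj):

```python
import numpy as np

class CF:
    """Cluster feature: count, linear sum and squared sum of the points it summarises."""
    def __init__(self, point):
        self.n = 1
        self.ls = np.asarray(point, dtype=float)
        self.ss = float(np.dot(point, point))

    def centroid(self):
        return self.ls / self.n

    def merge(self, other):
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

def dist(cf_i, cf_j):
    return float(np.linalg.norm(cf_i.centroid() - cf_j.centroid()))

T = 0.8                                  # threshold parameter
a, b = CF([0.0, 0.0]), CF([0.5, 0.3])
if dist(a, b) < T:                       # close sub-clusters are merged ...
    a.merge(b)                           # ... otherwise b stays as its own sub-cluster
print(a.n, a.centroid())
```

The branching factor B and leaf node factor L would then bound how many such CF entries each non-leaf and leaf node of the tree can hold.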

15
Q

Strengths of two-step?

A
  • Deals with large data sets
  • Auto-determines the number of clusters using BIC and the ratio change in cluster distances
  • Deals with mixed-type attributes
16
Q

Weaknesses of two-step?

A
  • Still need to guess a range for the number of clusters
  • Parameter settings are still required for the CF tree
    Example: T requires some “fine-tuning”
17
Q

Other clustering algorithms?

A
  • BIRCH
  • CURE
  • DBSCAN
  • PROCLUS
  • O-Cluster
18
Q

What is density-based spatial clustering of applications with noise (DBSCAN) used for?

A
  • Dense regions: clusters
  • Low-density regions: noise

In 2D the complexity is linear, O(N); in high dimensions it can become O(N^2)
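
A minimal usage sketch with scikit-learn's DBSCAN; the eps and min_samples values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_               # -1 marks points in low-density regions (noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, np.sum(labels == -1))
```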

19
Q

What is Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)?

A
  • Uses a CF tree (Zhang et al., 1996)
  • The choice of CF tree parameters is critical to its performance (see the usage sketch below)
  • Only applicable to numeric data
  • TwoStep is an extension of BIRCH
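
scikit-learn's Birch estimator exposes the CF tree parameters directly (threshold and branching_factor); the values below are illustrative:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)   # numeric data only

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)
print(len(birch.subcluster_centers_), labels[:10])
```
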
20
Q

What is clustering using representatives (CURE)?

A

Step 1: Draw a random sample from the dataset and perform hierarchical clustering on it
Step 2: Assign each remaining data point to the clusters (see the sketch below)
- Can find clusters of complex shapes and different sizes
- Insensitive to outliers
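
A rough sketch of the two steps; real CURE keeps several shrunk representative points per cluster, so assigning the remaining points to the nearest sample-cluster centroid below is a simplification:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)
rng = np.random.default_rng(0)

# Step 1: hierarchical clustering on a random sample.
sample_idx = rng.choice(len(X), size=500, replace=False)
sample = X[sample_idx]
sample_labels = AgglomerativeClustering(n_clusters=4).fit_predict(sample)
centres = np.array([sample[sample_labels == k].mean(axis=0) for k in range(4)])

# Step 2: assign every remaining point to the nearest cluster.
all_labels = np.argmin(np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2), axis=1)
print(np.bincount(all_labels))
```
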

21
Q

What is projected clustering (PROCLUS)?

A
  • Select initial medoids (i.e. representative objects of clusters) that are far from each other (see the sketch below)
  • Compute centroids; points near the centroids are selected as medoids
  • Refinement: reassign points to medoids, with outliers removed
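
A sketch of the first step only: a greedy farthest-point heuristic for picking initial medoids that are far from each other (the later centroid and refinement phases are omitted):

```python
import numpy as np

def pick_far_apart_medoids(X, k, rng=None):
    """Greedily pick k points, each as far as possible from those already picked."""
    rng = rng or np.random.default_rng(0)
    medoids = [int(rng.integers(len(X)))]            # start from a random point
    for _ in range(k - 1):
        d = np.min(
            np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2), axis=1
        )
        medoids.append(int(np.argmax(d)))            # farthest from current medoids
    return medoids

X = np.random.default_rng(1).normal(size=(300, 5))
print(pick_far_apart_medoids(X, k=4))
```
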
22
Q

What is O-Cluster?

A
  • Produces an optimal grid partitioning of the data, creating clusters that define dense areas in the attribute space (see the toy sketch below)
  • Overall complexity: O(nd), where d < n is the number of partitions
  • Insensitive to noise