Cluster Analysis Flashcards
is finding groups of objects such that the objects in a group are similar (or related) to one another and different from the objects in other groups.
Cluster Analysis
2 TYPES OF CLUSTERING
- Partitional Clustering
- Hierarchical Clustering
4 Types of Clusters
- Center-Based Clusters
- Contiguity Clusters
- Density-Based Clusters
- Conceptual Clusters
a type of cluster that is a set of objects such that each object in a cluster is closer to the “center” of its own cluster than to the center of any other cluster.
Center-Based Clusters
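the center of a cluster, computed as the average of all points in the cluster.
Centroid
the most representative point of a cluster; unlike the centroid, it is always an actual data point.
Medoid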
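The difference between the two kinds of center can be sketched in a few lines of Python. This is only an illustration: the function names and sample points are made up, and "most representative" is assumed here to mean the data point with the smallest total Euclidean distance to the others, which is one common reading.

```python
import math

def centroid(points):
    """Component-wise mean of the points; it need not be an actual data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The data point with the smallest total distance to all other points (assumed definition)."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical sample: four nearby points and one far-away point.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.0, 1.0), (9.0, 9.0)]
print(centroid(points))  # (2.4, 2.4), pulled toward the far-away point
print(medoid(points))    # (1.0, 1.0), an actual member of the data set
```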
a type of cluster where each point is closer to at least one point in its cluster than to any point in any other cluster.
Contiguity Clusters
a type of cluster where the clusters are regions of high density separated by regions of low density.
Density-Based Clusters
a type of cluster where points in a cluster share some general property that derives from the entire set of points.
Conceptual Clusters
2 Objective Functions
- Global Objective Function
- Local Objective Function
typically used in partitional clustering.
Global Objective Function
3 Clustering Algorithms
- K-Means Clustering
- Hierarchical Clustering
- Density-Based Clustering
is a partitional clustering approach.
K-Means Clustering
is the mean of the points in a cluster.
Centroid
is used to measure “closeness”
Euclidean Distance
typically does most of its converging in the first few iterations.
K-Means
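The K-means cards above (centroid as the mean, Euclidean distance for closeness, repeat until convergence) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the `max_iters` and `seed` parameters, and the choice to keep the old centroid for an empty cluster are all assumptions.

```python
import math
import random

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly chosen initial centroids
    for _ in range(max_iters):
        clusters = assign(points, centroids)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid (an assumption of this sketch).
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Hypothetical data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
print(clusters)
```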
3 Solutions to the Initial Centroid Problem (randomly chosen centroids):
- Multiple runs.
- Sample and use hierarchical clustering to determine the initial centroids.
- Select more than k initial centroids, and then select among those the ones that are farthest away from each other.
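The third option in the list above can be sketched with a greedy "farthest-first" pick. This is one possible way to do it, not a prescribed method: the function name, the `n_candidates` and `seed` parameters, and the greedy selection rule are assumptions.

```python
import math
import random

def pick_spread_centroids(points, k, n_candidates, seed=0):
    """Sample more than k candidates, then greedily keep k that are far apart."""
    rng = random.Random(seed)
    candidates = rng.sample(points, n_candidates)  # n_candidates should exceed k
    chosen = [candidates.pop(0)]
    while len(chosen) < k:
        # Take the candidate whose nearest already-chosen centroid is farthest away.
        best = max(candidates, key=lambda c: min(math.dist(c, s) for s in chosen))
        candidates.remove(best)
        chosen.append(best)
    return chosen

# Hypothetical data: two well-separated groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(pick_spread_centroids(data, k=2, n_candidates=6))
```

With well-separated groups like these, the two selected centroids end up in different groups, which is the situation random selection alone does not guarantee.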
is the most common measure for evaluating K-means clusters.
Sum of Squared Error (SSE)
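SSE is usually computed by taking each point's distance to its cluster's centroid, squaring it, and summing over all points. A minimal sketch, assuming clusters are given as lists of coordinate tuples with one centroid per cluster (the names and sample values are hypothetical):

```python
import math

def sse(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster's centroid."""
    return sum(
        math.dist(p, c) ** 2
        for members, c in zip(clusters, centroids)
        for p in members
    )

# Hypothetical clustering: two points around (1, 0) and one point sitting on its centroid.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
centroids = [(1.0, 0.0), (5.0, 5.0)]
print(sse(clusters, centroids))  # 2.0
```

When comparing two clusterings with the same number of clusters, the one with the lower SSE is usually preferred.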
2 Pre-Processing Methods for K-Means Clusters:
- Normalize the data
- Eliminate outliers.
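Normalizing matters because Euclidean distance is dominated by features with large numeric ranges. The cards do not say which normalization to use; the sketch below assumes min-max scaling of each feature to the range 0 to 1, and the function name is hypothetical.

```python
def min_max_normalize(points):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # A constant feature (hi == lo) is mapped to 0.0 to avoid dividing by zero.
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# Hypothetical data: the second feature has a much larger range than the first.
raw = [(1, 100), (2, 300), (3, 200)]
print(min_max_normalize(raw))  # [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
```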
3 Post-Processing Methods for K-Means Clusters:
- Eliminate small clusters that may represent outliers.
- Split ‘loose’ clusters, clusters with high SSE
- Merge clusters that are ‘close’ and that have relatively low SSE
2 Limitations of K-Means Clusters:
- It has problems when clusters have differing sizes, densities, or non-globular shapes.
- It has problems when the data contains outliers.
5 Different Aspects of Cluster Validation:
- Determining the clustering tendency of a set of data.
- External Validation
- Internal Validation
- Compare Clustering
- Determining the ‘correct’ number of clusters.
compare the result of a cluster analysis to externally known class labels.
External Validation
evaluating how well the results of a cluster analysis fit the data without reference to external information.
Internal Validation
comparing the results of two different cluster analyses to determine which is better.
Compare Clustering
3 Measures of Cluster Validity
- External Index
- Internal Index
- Relative Index
used to measure the extent to which cluster labels match externally supplied class labels.
External Index
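The cards do not name a specific external index. As one illustration, the sketch below computes purity, the fraction of points that belong to the majority class of their cluster. The function name and sample labels are hypothetical, and other external indices exist.

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Fraction of points whose class is the majority class of their cluster."""
    counts_by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        counts_by_cluster.setdefault(cluster, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in counts_by_cluster.values())
    return majority_total / len(class_labels)

# Hypothetical labels: cluster 0 holds two 'a' and one 'b'; cluster 1 holds three 'b'.
cluster_labels = [0, 0, 0, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "b"]
print(purity(cluster_labels, class_labels))  # 5 of 6 points match their cluster's majority class
```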
used to measure the goodness of a clustering structure without respect to external information.
Internal Index
used to compare two different clusterings or clusters.
Relative Index
2 Internal Measures
- Cluster Cohesion
- Cluster Separation
measures how closely related objects in a cluster are.
Cluster Cohesion
measures how distinct or well-separated a cluster is from other clusters.
Cluster Separation
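For centroid-based clusters, cohesion is often measured by the within-cluster sum of squares (WSS) and separation by the between-cluster sum of squares (BSS). The sketch below assumes those two measures; the function names and the one-dimensional sample data are hypothetical.

```python
import math

def mean_point(points):
    """Component-wise mean of a list of coordinate tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def cohesion_wss(clusters):
    """Within-cluster sum of squares: lower means tighter clusters."""
    total = 0.0
    for members in clusters:
        center = mean_point(members)
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total

def separation_bss(clusters):
    """Between-cluster sum of squares: higher means better-separated clusters."""
    overall = mean_point([p for members in clusters for p in members])
    return sum(
        len(members) * math.dist(mean_point(members), overall) ** 2
        for members in clusters
    )

# Hypothetical 1-D data: points 1, 2, 4, 5 split into two clusters.
clusters = [[(1.0,), (2.0,)], [(4.0,), (5.0,)]]
print(cohesion_wss(clusters))    # 1.0
print(separation_bss(clusters))  # 9.0
```

For a fixed data set, WSS plus BSS stays constant (10.0 here), so lowering one raises the other.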