Important 3: Clustering; Classification Flashcards
What are the two methods of distance-based clustering?
Hierarchical clustering
K-means-based clustering
–>minimize the distance between group members while maximizing the distance to members of other groups
What are the two methods of Model-based clustering?
General description of Model-based clustering?
Model-based clustering
Latent class analysis
–>Model the data so that the observed variance can be represented by a small number of groups with specific distributional characteristics
How does hierarchical clustering work?
Observations are grouped according to their similarity (distance matrix); the clustering method used here is complete linkage
Dendrogram: Hierarchical clustering
(distance-based clustering)
At the lowest level, observations are combined into small groups that are relatively similar.
–>These groups are successively combined with less similar groups
Height = dissimilarity
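As a rough illustration of how the merge heights in a dendrogram arise, here is a minimal pure-Python sketch of agglomerative clustering with complete linkage. The function name `complete_linkage` and the toy points are assumptions made for this example; they are not from the cards.

```python
from math import dist

def complete_linkage(points):
    """Agglomerative clustering with complete linkage.

    Returns the merges as (members_a, members_b, height) tuples, where
    height is the largest pairwise distance between the two groups.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                height = max(dist(points[i], points[j])
                             for i in clusters[a] for j in clusters[b])
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical toy data: two tight pairs that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 2.0)]
merges = complete_linkage(points)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 2))
```

The two similar pairs merge first at low heights, and the final merge joins the two dissimilar groups at a much greater height, which is what the dendrogram's vertical axis displays.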
What is k-means-based clustering?
(also k-means clustering) = find groups based on the sum-of-squares deviation from the multivariate center of the assigned group
–>the number of centers (k) needs to be specified
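The sum-of-squares idea can be sketched with the standard Lloyd iteration (assign each point to its nearest center, then move each center to the mean of its members). This is a minimal sketch under assumptions: the function name `kmeans`, the starting centers, and the toy data are all made up for illustration.

```python
from math import dist

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda k: dist(point, centers[k]))

def kmeans(points, centers, iterations=20):
    """Lloyd's algorithm; returns labels, centers, within-cluster sum of squares."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]      # assignment step
        new_centers = []
        for k in range(len(centers)):                        # update step
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:
            break                                # converged
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    # Sum of squared deviations from the center of the assigned group.
    wss = sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, wss

# Hypothetical toy data with two obvious groups; k = 2 starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers, wss = kmeans(points, centers=[points[0], points[3]])
print(labels, centers, round(wss, 3))
```

Rerunning the function with different numbers of starting centers and comparing the resulting `wss` values is one simple way to compare solutions for several values of k.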
What are the steps in (k-)mean-based clustering?
- Choose the number of clusters and the maximum distance
–>requires numeric data
- Find an observation for cluster 1
- Take the second observation if it is far enough from 1 –>Cluster 2
- Take the next observation and compare it with 1 and 2 (if applicable, cluster 3)
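The steps above can be sketched as a simple seed-selection pass. This covers only the procedure as listed on the card, not a full k-means fit. The function name `pick_seeds`, the parameter `threshold` (standing in for the distance chosen in the first step), and the data are assumptions for illustration.

```python
from math import dist

def pick_seeds(points, n_clusters, threshold):
    """Walk through the observations and open a new cluster whenever an
    observation is at least `threshold` away from every existing seed."""
    seeds = [points[0]]                      # first observation seeds cluster 1
    for p in points[1:]:
        if len(seeds) == n_clusters:
            break                            # requested number of clusters reached
        if all(dist(p, s) >= threshold for s in seeds):
            seeds.append(p)                  # far enough from all seeds: new cluster
    return seeds

# Hypothetical numeric observations.
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.0), (0.0, 9.0)]
seeds = pick_seeds(points, n_clusters=3, threshold=3.0)
print(seeds)
```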
What do k-means cluster plots show?
What are the limitations?
Whether it is possible to differentiate groups based on key variables
Limitation:
K-means requires an arbitrary specification of the number of clusters (try different values for k)
–>difficult to determine whether one solution is better than the other
What is the problem with K-means cluster plots?
Difficult to determine whether one solution is better than another
–>Repeat the analysis for several numbers of clusters to compare the results
Key facts about model-based clustering?
(mclust)
- Observations come from groups with different statistical distributions –>the algorithm tries to find the best set of such underlying distributions
- It models clusters as being drawn from a mixture of normal distributions
- Can only be used with numerical data
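To make the "mixture of normal distributions" idea concrete, here is a minimal expectation-maximization sketch for two normal components in one dimension. It is not how mclust is implemented; the function names, starting values, variance floor, and data are assumptions chosen for this example.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def fit_two_normals(data, iterations=50):
    """EM for a two-component, one-dimensional normal mixture."""
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]            # crude starting values (assumption)
    variances = [grand_var, grand_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each observation belongs to each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            rk = [r[k] for r in resp]
            nk = sum(rk)
            weights[k] = nk / n
            means[k] = sum(r * x for r, x in zip(rk, data)) / nk
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(rk, data)) / nk
            variances[k] = max(var, 1e-6)     # floor avoids a collapsed variance
    labels = [max(range(2), key=lambda k: r[k]) for r in resp]
    return weights, means, variances, labels

# Hypothetical numeric data drawn around two different centers.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
weights, means, variances, labels = fit_two_normals(data)
print([round(m, 2) for m in means], labels)
```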
What is Latent Class Analysis (LCA)?
(Model-based Clustering)
Differences are attributable to unobserved groups that one wishes to uncover (nclass is predetermined) –>Bayesian technique
–>Goal: estimate the probabilities of membership in each class and assign individuals to their most likely class
Steps in Latent-class analysis?
(Model-based clustering)
- Variable scores are caused by the hidden groups
- LCA posits a latent variable that maximizes the likelihood of observing the scores on the variables
- It creates a probability of each observation belonging to each segment
- Each observation is placed in the segment for which its probability is highest
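The membership step can be illustrated with Bayes' rule once class sizes and item-response probabilities are known. The sketch below covers only that assignment step, not the estimation of the model. All numbers and the function name `class_posteriors` are hypothetical, and it assumes binary items that are independent within a class.

```python
def class_posteriors(answers, class_sizes, item_probs):
    """Posterior class-membership probabilities for one respondent.

    answers: list of 0/1 item scores
    class_sizes[c]: share of class c in the population
    item_probs[c][j]: P(item j == 1 | class c)
    """
    joint = []
    for size, probs in zip(class_sizes, item_probs):
        likelihood = 1.0
        for a, p in zip(answers, probs):
            likelihood *= p if a == 1 else 1.0 - p   # independence within a class
        joint.append(size * likelihood)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class model with three binary items.
class_sizes = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.1],
              [0.2, 0.3, 0.7]]
post = class_posteriors([1, 1, 0], class_sizes, item_probs)
best = max(range(len(post)), key=post.__getitem__)  # most likely class
print([round(p, 3) for p in post], "-> class", best)
```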
Advantages of LCA?
- Possible for complex data
- Provides optimal number of clusters
- Provides indicators for significant variables
- Segment probability score
- Provides a diagnostic test for the best number of segments
What are diagnostics to test for statistical fit? (Latent class analysis - model-based clustering)
- BIC: Bayesian information criterion (lower values are better)
- Error rate (better if lower)
- Log-likelihood (typically a negative value; better if less negative, i.e. closer to zero)
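As a small worked example of the BIC comparison, the sketch below uses the usual definition BIC = -2 * log-likelihood + (number of parameters) * ln(number of observations). The log-likelihoods, parameter counts, and sample size are invented purely for illustration.

```python
from math import log

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better trade-off
    between fit and model complexity."""
    return -2.0 * log_likelihood + n_params * log(n_obs)

# Hypothetical fits: (number of classes, log-likelihood, number of parameters).
fits = [(2, -1250.0, 7), (3, -1200.0, 11), (4, -1195.0, 15)]
n_obs = 300

scores = {k: bic(ll, p, n_obs) for k, ll, p in fits}
best_k = min(scores, key=scores.get)       # solution with the lowest BIC
for k, value in scores.items():
    print(k, "classes: BIC =", round(value, 1))
print("preferred number of classes:", best_k)
```

In this made-up case the four-class model has the least negative log-likelihood, but the penalty for its extra parameters leaves the three-class model with the lowest BIC.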
What does Naive Bayes Classification do? (Supervised learning)
= Training data is used to learn the probability of class membership as a function of each predictor variable considered independently –>using Bayes' rule
–>starts with the observed probabilities of the variables conditional on the segments found in the training data
–>only uses one model
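A minimal sketch of this idea for categorical predictors follows. It counts how often each value occurs within each class in the training data and combines those conditional probabilities with Bayes' rule. The add-one smoothing, the function names, and the training rows are assumptions for this example and are not from the cards.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies of each predictor."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)    # (predictor index, class) -> value counts
    values_seen = defaultdict(set)         # predictor index -> distinct values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            values_seen[j].add(value)
    return class_counts, value_counts, values_seen

def predict_proba(model, row):
    """Class-membership probabilities, treating predictors as independent."""
    class_counts, value_counts, values_seen = model
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / n                  # prior P(class)
        for j, value in enumerate(row):
            # P(value | class) with add-one smoothing (an assumption here).
            score *= (value_counts[(j, label)][value] + 1) / (count + len(values_seen[j]))
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

# Hypothetical training data: (age group, income) with known segments A and B.
rows = [("young", "low"), ("young", "high"), ("old", "high"),
        ("old", "high"), ("young", "low"), ("old", "low")]
labels = ["A", "A", "B", "B", "A", "B"]

model = train_naive_bayes(rows, labels)
probs = predict_proba(model, ("young", "low"))
print({k: round(v, 3) for k, v in probs.items()})
```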
What is Random forest classification (compared to Naive Bayes classification)? (Supervised learning)
Instead of using a single model, it builds an ensemble of models that jointly classify the data by fitting many classification trees (forest)
–>not providing class membership