Important 3: Clustering; Classification Flashcards
What are the two methods of distance-based clustering?
Hierarchical clustering
K-means-based clustering
–>minimize the distance between group members while maximizing the distance to members of other groups
What are the two methods of Model-based clustering?
General description of Model-based clustering?
Model-based clustering
Latent class analysis
–>Model the data so that the observed variance can be represented by a small number of groups with specific distributional characteristics
How does hierarchical clustering work?
Observations are grouped according to their similarity (distance matrix); the clustering method used is the complete-linkage method
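A minimal R sketch of hierarchical clustering (the data frame name seg.df is illustrative and assumes all-numeric columns):
  seg.dist <- dist(seg.df)                         # pairwise distance matrix
  seg.hc <- hclust(seg.dist, method = "complete")  # complete-linkage clustering
  plot(seg.hc)                                     # dendrogram: height = dissimilarity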
Dendrogram: Hierarchical clustering
(distance-based clustering)
At the lowest level, observations are combined into small groups that are relatively similar.
–>These groups are successively combined with less similar groups
Height = dissimilarity
What is k-means-based clustering?
(also k-means clustering) = find groups based on the sum-of-squares deviation from the multivariate center of the assigned group
–>the number of centers needs to be specified
What are the steps in (k-)means-based clustering?
- Choose the number of clusters and a maximum distance –>requires numeric data
- Find an observation for cluster 1
- Take a second observation; if it is far enough from 1 –>cluster 2
- Take the next observation and compare it with 1 and 2 (if applicable, cluster 3)
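In practice, k-means is usually run with a prespecified number of clusters; a minimal base-R sketch (seg.df and centers = 4 are illustrative):
  set.seed(96743)                        # k-means uses random starting centers
  seg.k <- kmeans(seg.df, centers = 4)   # requires numeric data
  seg.k$cluster                          # cluster assignment for each observation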
What do k-means cluster plots show?
What are the limitations?
whether it is possible to differentiate groups based on key variables
Limitation:
K-means requires an arbitrary specification of the number of clusters (use different values for k)
–>difficult to determine whether one solution is better than another
What is the problem with K-means cluster plots?
Difficult to tell whether one solution is better than another
–>Repeat the analysis for several numbers of clusters to compare the results
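One way to plot a k-means solution in two dimensions (a sketch using the cluster package; seg.df and seg.k as above):
  library(cluster)
  clusplot(seg.df, seg.k$cluster, color = TRUE, shaded = TRUE,
           labels = 4, lines = 0, main = "k-means cluster plot")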
Key facts about model-based clustering?
(mclust)
- observations come from groups with different statistical distributions –>the algorithm tries to find the best set of such underlying distributions
- it treats the clusters as being drawn from a mixture of normal distributions
- Can only be used with numerical data
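A minimal sketch with the mclust package (seg.df is illustrative; if G is not specified, Mclust() tries a range of cluster numbers and picks the best model by BIC):
  library(mclust)
  seg.mc <- Mclust(seg.df)   # numeric data only
  summary(seg.mc)            # reports log-likelihood, BIC, and cluster sizes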
What is Latent Class Analysis (LCA)?
(Model-based Clustering)
differences are attributable to unobserved groups that one wishes to uncover (nclass is predetermined) –>Bayesian technique
–>Goal: estimate the probabilities of membership in each class and assign individuals to their most likely class
Steps in Latent-class analysis?
(Model-based clustering)
- Variable scores are caused by the hidden groups
- LCA posits a latent variable that maximizes the likelihood of observing the scores and the variables
- It creates a probability of each observation belonging to each segment
- Each observation is placed in the segment for which it has the highest probability
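A sketch using the poLCA package (assumptions: the indicator variables are categorical, and the variable names and nclass = 3 are illustrative):
  library(poLCA)
  f <- cbind(var1, var2, var3) ~ 1                    # LCA on three categorical indicators
  seg.lca <- poLCA(f, data = seg.df.cat, nclass = 3)
  seg.lca$predclass                                   # most likely class for each observation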
Advantages of LCA?
- Possible for complex data
- Provides optimal number of clusters
- Provides an indicator for significant variables
- Segment probability score
- Provides a diagnostic test for the best number of segments
What are diagnostics to test for statistical fit? (Latent class analysis - model-based clustering)
- BIC: Bayesian information criterion (lower values are better)
- Error rate (lower is better)
- Negative log-likelihood (better if less negative)
What does Naive Bayes Classification do? (Supervised learning)
= Training data is used to learn the probability of class membership as a function of each predictor variable considered independently –>using Bayes' rule
–>starts with the observed probabilities of the variables conditional on the segments found in the training data
–>only uses one model
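A minimal sketch with naiveBayes() from the e1071 package (assuming a training data frame seg.train with a factor column Segment; names are illustrative):
  library(e1071)
  seg.nb <- naiveBayes(Segment ~ ., data = seg.train)   # learn P(predictor | segment) from training data
  predict(seg.nb, seg.test)                             # predicted segment for the test data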
What is Random forest classification (compared to Naive Bayes classification)? (Supervised learning)
Instead of using a single model, it builds an ensemble of models that jointly classify the data by fitting many classification trees (a forest)
–>does not provide explicit class-membership probabilities; classification follows the vote of the trees
Advantages of the Random forest model?
- many classification trees (a forest) instead of one model
- More accurate because more models are applied
- Useful to estimate the importance of predictor variables
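A minimal sketch with the randomForest package (seg.train and Segment as above; ntree = 3000 is illustrative):
  library(randomForest)
  seg.rf <- randomForest(Segment ~ ., data = seg.train, ntree = 3000)
  importance(seg.rf)    # variable importance estimates
  varImpPlot(seg.rf)    # plot of predictor importance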
Classification trees are used for?
used to predict a categorical (and usually binary) dependent variable
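A single classification tree can be fit with the rpart package (a sketch; names are illustrative):
  library(rpart)
  seg.tree <- rpart(Segment ~ ., data = seg.train, method = "class")
  plot(seg.tree); text(seg.tree)   # draw the tree and label its splits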
What is the output of model-based clustering?
Shows the optimal number of clusters if G was not predetermined, plus:
- BIC
- Log-likelihood
How can the solution of hierarchical clustering be tested?
(Distance-based clustering)
- Zooming in and focusing on certain branches of the dendrogram
- use the cophenetic correlation coefficient (CPCC) = measures the correlation between the original dissimilarities and the cophenetic distances
- CPCC close to 1 –>strong positive correlation
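The CPCC can be computed in base R (seg.dist and seg.hc from the hierarchical-clustering sketch above):
  cor(cophenetic(seg.hc), seg.dist)   # values close to 1 mean the dendrogram preserves the original distances well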
How can the outcome of the k-means-based model be tested?
(Distance-based)
- Check mean values by using aggregate()
- Plot the k-means clusters to check whether it is possible to differentiate groups based on key variables
- Alternatively, plot two continuous variables by segment
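Checking mean values by cluster with aggregate() (a sketch; seg.df and seg.k as above):
  aggregate(seg.df, list(cluster = seg.k$cluster), mean)   # per-cluster mean of each variable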
How can the solution of model-based clustering be tested? (mclust())
How to compare between models?
- Check mean values by using aggregate()
- Plot the model-based clusters
- use other values for G and compare the model outputs:
- Log-likelihood –>less negative
- BIC –>lowest value
–>also used for model comparison
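Comparing mclust models with fixed numbers of clusters (a sketch; the G values are illustrative):
  seg.mc3 <- Mclust(seg.df, G = 3)
  seg.mc4 <- Mclust(seg.df, G = 4)
  summary(seg.mc3); summary(seg.mc4)   # compare log-likelihood and BIC across the models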
How can the solution of Latent class analysis be tested?
- Check mean values using aggregate()
- Plot the LCA clusters
- compare predicted class memberships
How can the outcome of the Naive Bayes Classification be tested?
- use predict() with the trained model on the test data
- Review the segment frequencies and compare them to the initial a-priori frequencies based on the training data
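A sketch of checking the predictions against the training-data frequencies (names are illustrative; seg.nb from the Naive Bayes sketch above):
  seg.nb.pred <- predict(seg.nb, seg.test)
  prop.table(table(seg.nb.pred))         # predicted segment frequencies in the test data
  prop.table(table(seg.train$Segment))   # a-priori frequencies in the training data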
How can the performance of Naive Bayes Classification and Random forest be assessed?
- Consider the raw agreement rate: mean(raw$Segment == prediction) = 0.92 –>92% correct predictions
- Compare performance against random chance using the ARI (adjusted Rand index): 1 = perfect agreement, 0 = random, -1 = complete disagreement
- Assess performance for each class using a confusion matrix
- actual segments on the left (rows)
- predicted segments (columns)
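A sketch of the three checks (assuming a test set seg.test with the true Segment column and predictions seg.pred):
  mean(seg.test$Segment == seg.pred)                       # raw agreement rate
  library(mclust)
  adjustedRandIndex(seg.test$Segment, seg.pred)            # ARI: chance-corrected agreement
  table(Actual = seg.test$Segment, Predicted = seg.pred)   # confusion matrix (rows = actual)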