Important 3: Clustering; Classification Flashcards
What are the two methods of distance-based clustering?
Hierarchical clustering
K-means clustering
–>Minimize the distance between members of a group while maximizing the distance to members of other groups
What are the two methods of Model-based clustering?
General description of Model-based clustering?
Model-based clustering
Latent class analysis
–>Model the data so that the observed variance can be represented by a small number of groups with specific distributional characteristics
How does hierarchical clustering work?
Observations are grouped according to their similarity (distance matrix); the clustering method used here is the complete linkage method
Dendrogram: hierarchical clustering
(distance-based clustering)
At the lowest level, observations are combined into small groups that are relatively similar.
–>These groups are successively combined with less similar groups
Height = dissimilarity
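Illustration (not part of the original cards): a minimal Python sketch of agglomerative clustering with complete linkage, assuming Euclidean distance. The function name `complete_linkage` and the five sample points are made up for this example; it only mirrors the idea described above and is not the R routine used in the course.

```python
from math import dist

def complete_linkage(points):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                h = max(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical sample points (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 1.0)]
merges = complete_linkage(points)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 2))
```

The printed heights correspond to the heights at which branches join in a dendrogram: similar observations merge low, dissimilar groups merge high.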
What is k-means clustering?
(also k-means-based clustering) = find groups based on the sum-of-squares deviation from the multivariate center of the assigned group
–>the number of centers (k) needs to be specified
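Illustration (not part of the original cards): a small Python sketch of the standard k-means iteration (Lloyd's algorithm), assuming Euclidean distance and user-supplied starting centers. The names `kmeans` and `within_ss` and the sample data are hypothetical.

```python
from math import dist

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm with user-supplied starting centers."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

def within_ss(centers, groups):
    """Sum of squared deviations of each point from its group center."""
    return sum(dist(p, c) ** 2 for c, g in zip(centers, groups) for p in g)

# Hypothetical data: two well-separated groups, k = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, groups = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
total_ss = within_ss(centers, groups)
print(centers, total_ss)
```

The within-group sum of squares is the quantity the method tries to keep small; rerunning with a different k gives a different value to compare.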
What are the steps in k-means-based clustering?
- Choose the number of clusters and the maximum distance
–>requires numeric data
- Take an observation as cluster 1
- Take a second observation if it is far enough from 1 –>cluster 2
- Take the next observation and compare it with 1 and 2 (if applicable, cluster 3)
What do k-means cluster plots show?
What are the limitations?
Whether it is possible to differentiate groups based on key variables
Limitation:
K-means requires an arbitrary specification of the number of clusters (try different values for k)
–>difficult to determine whether one solution is better than another
What is the problem with k-means cluster plots?
It is difficult to judge whether one solution is better than another
–>Repeat the analysis for several numbers of clusters and compare the results
Key facts about model-based clustering?
(mclust)
- Observations come from groups with different statistical distributions –>the algorithm tries to find the best set of such underlying distributions
- It models the clusters as being drawn from a mixture of normal distributions
- Can only be used with numeric data
What is latent class analysis (LCA)?
(Model-based clustering)
Differences are attributable to unobserved groups that one wishes to uncover (nclass is predetermined) –>Bayesian technique
–>Goal: estimate the probability of membership in each class and assign individuals to their most likely class
Steps in Latent-class analysis?
(Model-based clustering)
- Variable scores are caused by the hidden groups
- LCA posits a latent variable that maximizes the likelihood of observing the scores on the variables
- It creates a probability of each observation belonging to each segment
- Each observation is placed in the segment for which it has the highest probability
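Illustration (not part of the original cards): a Python sketch of the last two steps only, assuming the class shares and conditional item probabilities have already been estimated and that items are binary and independent within a class. All numbers and the function name `posterior` are hypothetical; the estimation itself is not shown.

```python
def posterior(shares, item_probs, responses):
    """P(class | responses) for binary items, assuming independence within a class."""
    joint = []
    for share, probs in zip(shares, item_probs):
        likelihood = share
        for p_yes, answer in zip(probs, responses):
            likelihood *= p_yes if answer == 1 else 1.0 - p_yes
        joint.append(likelihood)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class model with three yes/no items (illustrative numbers).
shares = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.1],   # P(yes) per item in class 0
              [0.2, 0.3, 0.7]]   # P(yes) per item in class 1
post = posterior(shares, item_probs, responses=[1, 1, 0])
assigned = post.index(max(post))  # modal (most likely) class
print(post, assigned)
```

Each respondent gets one probability per class, and the assignment is simply the class with the largest one.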
Advantages of LCA?
- Possible for complex data
- Provides the optimal number of clusters
- Provides an indicator of significant variables
- Provides a segment probability score
- Provides a diagnostic test for the best number of segments
What are diagnostics to test for statistical fit? (Latent class analysis, model-based clustering)
- BIC: Bayesian information criterion (lower values are better)
- Error rate (lower is better)
- Log-likelihood (a negative value; better if less negative)
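Illustration (not part of the original cards): a Python sketch of the usual BIC definition, BIC = k * ln(n) - 2 * log-likelihood, where lower is better. The log-likelihoods and parameter counts below are invented to show how the penalty term can favour the smaller model; some software reports BIC with a different sign convention, so check the documentation of the package in use.

```python
from math import log

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion in the 'lower is better' convention."""
    return n_params * log(n_obs) - 2.0 * log_likelihood

# Hypothetical fit results for two candidate models on 300 observations:
# name -> (log-likelihood, number of estimated parameters)
candidates = {"3 classes": (-2200.0, 20), "4 classes": (-2190.0, 27)}
scores = {name: bic(ll, k, 300) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Here the 4-class model has the less negative log-likelihood, yet the 3-class model has the lower BIC because it needs fewer parameters.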
What does naive Bayes classification do? (Supervised learning)
= Training data are used to learn the probability of class membership as a function of each predictor variable considered independently –>using Bayes' rule
–>starts with the observed probabilities of the variables conditional on the segments found in the training data
–>only uses one model
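Illustration (not part of the original cards): a minimal Python sketch of naive Bayes for categorical predictors, without smoothing. The function names `train` and `predict_proba` and the eight training rows are hypothetical; a real implementation would also handle unseen values and numeric predictors.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Learn a-priori class probabilities and per-predictor value counts."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(Counter)          # (class, predictor index) -> value counts
    for row, c in zip(rows, labels):
        for j, value in enumerate(row):
            cond[(c, j)][value] += 1
    return priors, cond

def predict_proba(priors, cond, row):
    """Bayes' rule with predictors treated as independent given the class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, value in enumerate(row):
            counts = cond[(c, j)]
            score *= counts[value] / sum(counts.values())
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical training data: (subscribes, residence) -> segment.
rows = [("yes", "urban"), ("yes", "urban"), ("no", "urban"), ("yes", "rural"),
        ("no", "rural"), ("no", "rural"), ("yes", "rural"), ("no", "urban")]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
priors, cond = train(rows, labels)
proba = predict_proba(priors, cond, ("yes", "urban"))
print(proba)
```

The training step stores exactly two things: the a-priori segment probabilities and the conditional probabilities of each predictor value within each segment.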
What is random forest classification (compared to naive Bayes classification)? (Supervised learning)
Instead of using a single model, it builds an ensemble of models that jointly classify the data by fitting many classification trees (forest)
–>does not provide class membership in the model output
Advantage of the random forest model?
- Many classification trees (forest) instead of one model
- Typically more accurate because many models are combined
- Useful for estimating the importance of predictor variables
Classification trees are used for?
Used to predict a categorical (and usually binary) dependent variable
What is the output of model-based clustering?
- The optimal number of clusters, if G was not predetermined
- BIC
- Log-likelihood
How can the solution of hierarchical clustering be tested?
Distance-based clustering
- Zoom in and focus on certain branches of the dendrogram
- Use the cophenetic correlation coefficient (CPCC) = measures the correlation between the original dissimilarities and the cophenetic distances
- CPCC close to 1 –>strong positive correlation
How can the outcome of the k-means-based model be tested?
Distance-based
- Check mean values by using aggregate()
- Plot the k-means clusters to check whether it is possible to differentiate groups based on key variables
- Alternatively, plot two continuous variables by segment
How can the solution of model-based clustering be tested? (Mclust())
How to compare between models?
- Check mean values by using aggregate()
- Plot the model-based clusters
- Use other values for G and compare the model outputs:
- Log-likelihood –>less negative
- BIC –>lowest value
–>also used for model comparison
How can the solution of the latent class analysis be tested?
- Check mean values using aggregate()
- Plot the LCA clusters
- compare predicted class memberships
How can the outcome of the naive Bayes classification be tested?
- Use predict() with the trained model on the test data
- Review the segment frequencies and compare them to the initial a-priori frequencies based on the training data
How can the performance of the naive Bayes classification and random forest be assessed?
- Consider the raw agreement rate
mean(raw$Segment == prediction) = 0.92 –>92% correct predictions
- Compare performance against random chance using the ARI
- 1 = perfect agreement, 0 = random, negative values = worse than random
- Assess performance for each class using a confusion matrix
- actual segment on the left (rows)
- predicted segment in the columns
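Illustration (not part of the original cards): a Python sketch of the three checks above, assuming two equal-length lists of actual and predicted segment labels. The function names and the six labels are hypothetical. The ARI function follows the standard pair-counting formula and would divide by zero in degenerate cases such as a single segment.

```python
from collections import Counter
from math import comb

def raw_agreement(actual, predicted):
    """Share of cases where the predicted segment equals the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_matrix(actual, predicted):
    """Counts keyed by (actual, predicted): actual = rows, predicted = columns."""
    return Counter(zip(actual, predicted))

def adjusted_rand_index(actual, predicted):
    """Pair-counting agreement corrected for chance (1 = perfect, 0 = chance)."""
    pairs = lambda counts: sum(comb(n, 2) for n in counts)
    same_both = pairs(Counter(zip(actual, predicted)).values())
    same_actual = pairs(Counter(actual).values())
    same_pred = pairs(Counter(predicted).values())
    expected = same_actual * same_pred / comb(len(actual), 2)
    maximum = (same_actual + same_pred) / 2
    return (same_both - expected) / (maximum - expected)

# Hypothetical test-set labels.
actual = ["A", "A", "A", "B", "B", "B"]
predicted = ["A", "A", "B", "B", "B", "B"]
agreement = raw_agreement(actual, predicted)
cm = confusion_matrix(actual, predicted)
ari = adjusted_rand_index(actual, predicted)
print(agreement, dict(cm), ari)
```

One misclassified case out of six leaves the raw agreement high, while the ARI drops noticeably because it discounts agreement expected by chance.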
How can the outcome of the Random Forest be tested?
- Use predict() with the trained model on the test data
- Plot the clusters based on the test data
How can the performance of the random forest be assessed?
- Compare performance against random chance using the ARI (adjustedRandIndex)
- 1 = perfect agreement, 0 = random, negative values = worse than random
- Assess performance for each class using a confusion matrix on the test data
- actual segment on the left (rows)
- predicted segment in the columns
What is meant by importance analysis in the Random Forest model?
The model uses many predictor variables, so it is useful to know the importance of the different classification variables
–>randomForest(importance = TRUE)
What is class imbalance, and how can it be resolved in random forest models?
When using randomForest for prediction, the model might assign, for example, 90% of the cases to one group –>imbalance
–>Resolving:
- Look at the frequency table of the training data to see the group allocation –>pick the smallest group size
- Set sampsize = that value in the randomForest model
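Illustration (not part of the original cards): a Python sketch of the idea behind balancing, namely drawing the same number of cases from every class, equal to the smallest class size. The function name `balanced_sample` and the 90/10 label split are hypothetical, and this one-off down-sampling is only an analogy for what the sampsize argument does inside the forest.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(labels, seed=0):
    """Return row indices with every class down-sampled to the smallest class size."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    size = min(len(idx) for idx in by_class.values())
    rng = random.Random(seed)            # fixed seed so the draw is repeatable
    chosen = []
    for idx in by_class.values():
        chosen.extend(rng.sample(idx, size))
    return sorted(chosen)

# Hypothetical imbalanced training labels: 90% in one group.
labels = ["A"] * 90 + ["B"] * 10
keep = balanced_sample(labels)
balanced_counts = Counter(labels[i] for i in keep)
print(balanced_counts)
```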
What are the steps of market basket analysis?
If non-transaction data: can only handle discrete and categorical values
1. If numeric, convert to ordered factors using cut()
cut(data$variable, breaks = cut points of the intervals, labels = names for the resulting intervals, ordered_result = TRUE (return an ordered factor), right = FALSE (intervals are left-closed))
2. Convert into a formal transactions object using as(x, "transactions")
3. Find association rules using apriori()
–>inspect rules using inspect()
–>plot rule confidence against support
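Illustration (not part of the original cards): a Python sketch of the two rule measures mentioned above, support and confidence, computed by hand on four made-up baskets. The function names are hypothetical and no rule mining is performed; the sketch only shows what the numbers attached to a rule mean.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs and rhs together) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical baskets: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

For the rule {bread} => {milk}, support is the share of all baskets containing both items, and confidence is the share of bread baskets that also contain milk.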
What does the output of latent class analysis consist of?
- Top: Conditional item probabilities (for each predictor)
- Bottom:
- estimated class population shares
- predicted class membership (modal posterior probability)
What are the outputs of naive Bayes?
Top: A-priori probabilities per segment (class membership)
Middle: Conditional probabilities for each predictor
What are the outputs of random forest?
- Confusion matrix with class error
- OOB estimate of error rate:
- but no predicted class membership
What is a formal transaction object?
Represents a set of transactions where each transaction consists of a unique identifier and a collection of items
How to get the estimated likelihoods for each respondent belonging to each of the segments? (Naive Bayes)
Set type = "raw" in predict()
Returns the probabilities of all segments for each respondent