Important 3: Clustering; Classification Flashcards by Lukas Fahrländer

What are the two methods of distance-based clustering?

Hierarchical clustering
K-means clustering

–> Minimize the distance between members of the same group while maximizing the distance to members of other groups

What are the two methods of Model-based clustering?

General description of Model-based clustering?

Model-based clustering
Latent class analysis

–> Model the data so that the observed variance can be represented by a small number of groups with specific distributional characteristics

How does hierarchical clustering work?

Observations are grouped according to their similarity (distance matrix); the clustering method used here is the complete linkage method

Dendrogram: Hierarchical clustering
(distance-based clustering)

At the lowest level, observations are combined into small groups that are relatively similar.
–> These groups are successively combined with less similar groups

Height = dissimilarity
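
A minimal sketch of how complete-linkage merging produces the heights shown in a dendrogram. It is written in plain Python for illustration only (the deck itself refers to R functions); the function name `complete_linkage` and the toy points are hypothetical, and Euclidean distance is an assumption.

```python
from itertools import combinations
from math import dist

def complete_linkage(points):
    """Agglomerative clustering; returns the merge history as
    (cluster_a, cluster_b, height) tuples, lowest merges first."""
    # Start with every observation in its own cluster (tuples of indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def linkage(pair):
        # Complete linkage: cluster distance = largest pairwise distance.
        a, b = pair
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Merge the two clusters that are closest under complete linkage.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical toy data: two tight pairs that are far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
merge_history = complete_linkage(toy_points)
for a, b, height in merge_history:
    print(a, b, round(height, 3))
```

Each recorded height is the dissimilarity at which two groups are joined, so similar observations merge low in the tree and dissimilar groups merge high.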

What is k-means clustering?

(also k-mean-based clustering) = finds groups that minimize the sum of squared deviations of each observation from the multivariate center of its assigned group

–> The number of centers (k) needs to be specified

What are the steps in k-means clustering?

Choose the number of clusters and the maximum distance
–> requires numeric data
Find an observation for cluster 1
Take a second observation; if it is far enough from 1 –> cluster 2
Take the next observation and compare it with 1 and 2 (possibly cluster 3)
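
For comparison, a minimal sketch of the textbook k-means iteration (Lloyd's algorithm), which alternates between assigning points to the nearest center and recomputing the centers. This is an illustration in plain Python, not the card's exact procedure; the function names, the starting centers, and the toy data are hypothetical.

```python
from math import dist

def k_means(points, centers, max_iter=100):
    """Assign each point to its nearest center, then move each center to
    the mean of its points; stop when the centers no longer change."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def within_ss(points, labels, centers):
    """Total within-cluster sum of squared deviations (the k-means objective)."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical numeric data with two obvious groups; k = 2 starting centers.
toy = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centers = k_means(toy, centers=[toy[0], toy[2]])
print(labels, centers, within_ss(toy, labels, centers))
```

The result depends on k and on the starting centers, which is why the analysis is usually repeated with several values of k.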

What do k-means cluster plots show?
What are the limitations?

Whether it is possible to differentiate groups based on key variables

Limitation:
K-means requires an arbitrary specification of the number of clusters (try different values for k)
–> It is difficult to determine whether one solution is better than another

What is the problem with K-means cluster plots?

It is difficult to tell whether one solution is better than another
–> Repeat the analysis for several numbers of clusters and compare the results

Key facts about model-based clustering?
(mclust)

Observations come from groups with different statistical distributions –> the algorithm tries to find the best set of such underlying distributions
It models the clusters as being drawn from a mixture of normal distributions
Can only be used with numerical data

What is Latent Class Analysis (LCA)?
(Model-based clustering)

Differences are attributable to unobserved groups that one wishes to uncover (nclass is predetermined) –> Bayesian technique

–> Goal: estimate the probabilities of membership in each class and assign each individual to their most likely class

Steps in Latent-class analysis?
(Model-based clustering)

Variable scores are caused by the hidden groups
LCA posits a latent variable that maximizes the likelihood of observing the scores on the variables
It creates a probability of each observation belonging to each segment
Each observation is placed in the segment for which its probability is highest
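
A minimal sketch of the last two steps, assuming the class shares and conditional item probabilities have already been estimated (the fitting itself is not shown). The numbers, the function name `class_posteriors`, and the restriction to yes/no items are hypothetical; the sketch also assumes items are independent within a class.

```python
def class_posteriors(responses, class_shares, item_probs):
    """Posterior P(class | responses) for one respondent.

    responses   : list of 0/1 answers, one per item
    class_shares: estimated share of each latent class (sums to 1)
    item_probs  : item_probs[c][j] = P(item j == 1 | class c)
    """
    joint = []
    for share, probs in zip(class_shares, item_probs):
        likelihood = 1.0
        for answer, p in zip(responses, probs):
            likelihood *= p if answer == 1 else (1.0 - p)
        joint.append(share * likelihood)  # prior share times likelihood
    total = sum(joint)
    return [j / total for j in joint]     # normalize (Bayes' rule)

# Hypothetical two-class model with three yes/no items.
shares = [0.6, 0.4]
probs = [[0.9, 0.8, 0.1],
         [0.2, 0.3, 0.7]]
posterior = class_posteriors([1, 1, 0], shares, probs)
# Modal assignment: the class with the highest posterior probability.
modal_class = max(range(len(posterior)), key=posterior.__getitem__)
print([round(p, 3) for p in posterior], modal_class)
```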

Advantages of LCA?

Possible for complex data
Provides the optimal number of clusters
Provides an indicator of significant variables
Provides a segment probability score
Provides a diagnostic test for the best number of segments

What are diagnostics to test for statistical fit? (Latent class analysis, model-based clustering)

BIC: Bayesian information criterion (lower values are better)
Error rate (better if lower)
Log-likelihood (usually negative; better if less negative)
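
A minimal sketch of how BIC trades fit against model size, assuming the conventional definition BIC = -2 * log-likelihood + (number of parameters) * ln(number of observations). Some software reports a sign-flipped version, so check which convention applies. The log-likelihoods and parameter counts below are hypothetical.

```python
from math import log

def bic(log_likelihood, n_params, n_obs):
    """Conventional BIC: lower values indicate a better trade-off."""
    return -2.0 * log_likelihood + n_params * log(n_obs)

# Hypothetical fits of a 3-class and a 4-class model to 300 observations.
bic_3 = bic(log_likelihood=-1250.0, n_params=20, n_obs=300)
bic_4 = bic(log_likelihood=-1240.0, n_params=27, n_obs=300)
best = "3 classes" if bic_3 < bic_4 else "4 classes"
print(round(bic_3, 1), round(bic_4, 1), best)
```

In this made-up case the 4-class model has the less negative log-likelihood, but the penalty for its extra parameters outweighs the gain, so the 3-class model has the lower BIC.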

What does Naive Bayes Classification do? (Supervised learning)

= Training data is used to learn the probability of class membership as a function of each predictor variable considered independently –> using Bayes' rule

–> Starts with the observed probabilities of the variables conditional on the segments found in the training data
–> Uses only one model
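
A minimal sketch of the idea for categorical predictors, in plain Python for illustration. The function names and the toy data are hypothetical, no smoothing is applied, and falling back to the a-priori probabilities for unseen combinations is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Learn a-priori class probabilities and, per predictor, the
    probabilities of each value conditional on the class."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for j, value in enumerate(row):
            counts[c][j][value] += 1
    # cond[c][j][value] = P(predictor j == value | class c)
    cond = {c: {j: {v: k / class_counts[c] for v, k in vals.items()}
                for j, vals in per_pred.items()}
            for c, per_pred in counts.items()}
    return priors, cond

def predict_proba(row, priors, cond):
    """Bayes' rule with predictors treated as independent given the class."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for j, value in enumerate(row):
            p *= cond[c][j].get(value, 0.0)
        joint[c] = p
    total = sum(joint.values())
    if total == 0:  # unseen combination: fall back to the priors (assumption)
        return dict(priors)
    return {c: p / total for c, p in joint.items()}

# Hypothetical training data: two categorical predictors, segments A and B.
rows = [("yes", "urban"), ("yes", "urban"), ("no", "rural"),
        ("no", "rural"), ("no", "urban"), ("yes", "rural")]
segments = ["A", "A", "A", "B", "B", "B"]
priors, cond = train_naive_bayes(rows, segments)
post = predict_proba(("yes", "urban"), priors, cond)
print(priors, post)
```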

What is random forest classification (compared to Naive Bayes classification)? (Supervised learning)

Instead of using a single model, it builds an ensemble of models that jointly classify the data by fitting many classification trees (a forest)

–> Its output does not include predicted class membership

Advantages of the random forest model?

Many classification trees (a forest) instead of one model
More accurate because more models are applied
Useful to estimate the importance of predictor variables

What are classification trees used for?

Used to predict a categorical (and usually binary) dependent variable

What is the output of model-based clustering?

Shows the optimal number of clusters if G was not predetermined
- BIC
- Log-likelihood

How can the solution of hierarchical clustering be tested?

(Distance-based clustering)

Zoom in and focus on certain branches of the dendrogram
Use the cophenetic correlation coefficient (CPCC), which measures the correlation between the original dissimilarities and the cophenetic distances
CPCC close to 1 –> strong positive correlation

How can the outcome of the k-means model be tested?

(Distance-based clustering)

Check mean values by using aggregate()
Plot the k-means clusters to check whether it is possible to differentiate groups based on key variables
Alternatively, plot two continuous variables by segment

How can the solution of model-based clustering be tested? (mclust())
How to compare between models?

Check mean values by using aggregate()
Plot the model-based clusters
Use other values for G and compare the model outputs:
- Log-likelihood –> less negative is better
- BIC –> lowest value is better
–> Both are also used for model comparison

How can the solution of the latent class analysis be tested?

Check mean values using aggregate()
Plot the LCA clusters
Compare the predicted class memberships

How can the outcome of the Naive Bayes Classification be tested?

Use predict() with the trained model to predict values for the test data
Review the segment frequencies and compare them to the initial a-priori frequencies based on the training data

How can the performance of Naive Bayes classification and random forest be assessed?

Consider the raw agreement rate
mean(raw$Segment == prediction) = 0.92 –> 92% correct predictions
Compare performance against random chance using ARI (adjusted Rand index)
- 1 = perfect agreement, 0 = random, negative values = worse than chance
Assess performance for each class using a confusion matrix
- actual segment on the left (rows)
- predicted segment on top (columns)
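
A minimal sketch of the three checks in plain Python, for illustration only. The function names and the toy labels are hypothetical; the adjusted Rand index is computed from the pair-counting formula, and returning 1.0 in the degenerate case is an assumption of this sketch.

```python
from collections import Counter
from math import comb

def raw_agreement(actual, predicted):
    """Share of cases where the predicted segment equals the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_matrix(actual, predicted):
    """Counts keyed by (actual, predicted): actual = rows, predicted = columns."""
    return Counter(zip(actual, predicted))

def adjusted_rand_index(actual, predicted):
    """Agreement between two groupings, corrected for chance agreement."""
    table = Counter(zip(actual, predicted))
    pairs_both = sum(comb(n, 2) for n in table.values())
    pairs_rows = sum(comb(n, 2) for n in Counter(actual).values())
    pairs_cols = sum(comb(n, 2) for n in Counter(predicted).values())
    expected = pairs_rows * pairs_cols / comb(len(actual), 2)
    maximum = (pairs_rows + pairs_cols) / 2
    if maximum == expected:  # degenerate case, e.g. one group in both
        return 1.0
    return (pairs_both - expected) / (maximum - expected)

# Hypothetical actual and predicted segments for six test cases.
actual = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "B", "C"]
agreement = raw_agreement(actual, predicted)
confusion = confusion_matrix(actual, predicted)
ari = adjusted_rand_index(actual, predicted)
print(round(agreement, 3), dict(confusion), round(ari, 3))
```

Note how a raw agreement of 5 out of 6 corresponds to a much lower adjusted Rand index in this small example, because part of the agreement is expected by chance.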

How can the outcome of the **Random Forest** be tested?

1. Use predict() with the trained model to predict values for the **test data**
2. Plot the **clusters** based on the test data

How can the performance of the **random forest** be assessed?

1. Compare performance against random chance using **ARI** (adjustedRandIndex): 1 = perfect agreement, 0 = random, negative values = worse than chance
2. Assess performance for **each** class using a **confusion matrix** on the **test data**
   - actual segment on the left (**rows**)
   - predicted segment on top (**columns**)

What is meant by **importance** analysis in the **Random Forest model**?

The model uses many **predictor variables**, so it is useful to know the **importance of the different** classification variables --> randomForest(importance = **TRUE**)

What is **class imbalance**, and how can it be resolved in random forest models?

When using randomForest for **prediction**, the model might place, for example, 90% of the predicted values **in one group** --> **imbalance**
--> **Resolving**:
- Look at the frequency table of the **training data** to see the group **allocation** --> pick the **smallest** group size
- Set **sampsize** = "value" in the randomForest model

What are the steps of Market basket analysis?

If **non-transaction** data: the analysis can only handle **discrete and categorical values**
1. If numeric, **convert** to **ordered factors** using cut(): cut(data$**variable**, **breaks** = cut points of the intervals, **labels** = names for the resulting intervals, **ordered_result** = **TRUE** (ordered factor), **right** = **FALSE** (intervals are left-closed))
2. Convert **into a formal transaction object** using as(x, "**transactions**")
3. Find **association rules** using apriori() --> **inspect rules** using inspect() --> plot rule **confidence** against **support**
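
A minimal sketch of the two rule measures named in step 3, computed by hand in plain Python for illustration. The function names and the toy baskets are hypothetical; this is not how apriori() searches for rules, only what support and confidence mean for a single rule.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """For the rule lhs => rhs: support of both sides together,
    divided by the support of the left-hand side."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)

# Hypothetical transactions, each a set of items.
baskets = [{"bread", "milk"},
           {"bread", "butter"},
           {"bread", "milk", "butter"},
           {"milk"}]
rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, round(rule_confidence, 3))
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets containing bread also contain milk.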

What does the output of latent class analysis consist of?

1. Top: **Conditional item probabilities** (for each **predictor**)
2. Bottom:
   - **Estimated** class population shares
   - **Predicted class membership** (modal posterior probability)

What are the outputs of **Naive Bayes**?

Top: **A-priori probabilities** per segment (class membership)
Middle: **Conditional probabilities** for each **predictor**

What are the outputs of **random forest**?

1. **Confusion matrix** with class error
2. OOB (out-of-bag) estimate of the error rate
3. **But no predicted class membership**

What is a formal transaction object?

Represents a **set of transactions** where **each transaction** consists of a **unique identifier** and a collection of items

How to get the **estimated likelihoods for each respondent** belonging to **all different** segments? (Naive Bayes)

Set type **= "raw"** in predict() --> returns the probabilities for each respondent for all segments

Important 3: Clustering; Classification Flashcards

(34 cards)