Important 3: Clustering; Classification Flashcards

1
Q

What are the two methods of distance-based clustering?

A

Hierarchical clustering
k-means clustering

–>Minimize the distance between members of the same group while maximizing the distance to members of other groups

2
Q

What are the two methods of Model-based clustering?

General description of Model-based clustering?

A

Model-based clustering (mclust)
Latent class analysis

–>Model the data so that the observed variance can be explained by a small number of groups with specific distributional characteristics

3
Q

How does hierarchical clustering work?

A

Observations are grouped according to their similarity (distance matrix); the clustering method used is complete linkage.
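
A minimal sketch of this procedure in base R, assuming made-up toy data (the `toy` matrix and the choice of two groups are illustrative assumptions, not part of the course material):

```r
# Toy numeric data: two well-separated groups (made-up values)
set.seed(1)
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2)
)

# Step 1: pairwise Euclidean distances between observations (distance matrix)
toy_dist <- dist(toy, method = "euclidean")

# Step 2: agglomerative clustering using the complete linkage method
toy_hc <- hclust(toy_dist, method = "complete")

# Step 3: cut the tree into a chosen number of groups (k = 2 is an assumption)
toy_groups <- cutree(toy_hc, k = 2)
table(toy_groups)
```

Calling plot(toy_hc) on the result draws the dendrogram.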

4
Q

Dendrogram: Hierarchical clustering
(distance-based clustering)

A

At the lowest level, observations are combined into small groups that are relatively similar.
–>These groups are successively combined with less similar groups

Height = dissimilarity

5
Q

What is k-means-based clustering?

A

(also k-means clustering) = find groups that minimize the sum of squared deviations of the observations from the multivariate center of their assigned group

–>the number of centers (k) needs to be specified
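
A short sketch with base R's kmeans(), assuming made-up toy data and k = 2 (both are illustrative assumptions):

```r
# Toy numeric data: two groups of 20 observations each (made-up values)
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2)
)

# The number of centers has to be given up front; nstart repeats the
# random initialisation and keeps the best run
toy_km <- kmeans(toy, centers = 2, nstart = 10)

toy_km$centers       # multivariate center of each cluster
toy_km$size          # number of observations per cluster
toy_km$tot.withinss  # total within-cluster sum of squares (the quantity minimized)
```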

6
Q

What are the steps in k-means-based clustering?

A
  1. Choose the number of clusters and a maximum distance
    –>requires numeric data
  2. Pick an observation as the starting point for cluster 1
  3. Take a second observation; if it is far enough from the first –>cluster 2
  4. Take the next observation and compare it with 1 and 2 (if applicable, cluster 3)
7
Q

What do k-means cluster plots show?
What are the limitations?

A

Whether it is possible to differentiate groups based on key variables

Limitation:
K-means requires the number of clusters to be specified in advance, which is an arbitrary choice (try different values for k)
–>difficult to determine whether one solution is better than another

8
Q

What is the problem with K-means cluster plots?

A

It is difficult to determine whether one solution is better than another
–>Repeat the analysis for several numbers of clusters and compare the results
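
One way to repeat the analysis for several values of k, sketched in base R on made-up toy data (the data and the range 2 to 5 are assumptions):

```r
# Toy numeric data with three groups (made-up values)
set.seed(7)
toy <- rbind(
  matrix(rnorm(30, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(30, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(30, mean = 8, sd = 0.5), ncol = 2)
)

# Fit k-means once per candidate k and record the total within-cluster
# sum of squares of each solution
k_values <- 2:5
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

comparison <- data.frame(k = k_values, tot_withinss = wss)
comparison
```

The within-cluster sum of squares tends to shrink as k grows, so it cannot pick a winner on its own; the table only helps to see where adding another cluster stops paying off.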

9
Q

Key facts about model-based clustering?
(mclust)

A
  • Observations come from groups with different statistical distributions –>the algorithm tries to find the best set of such underlying distributions
  • It models the clusters as being drawn from a mixture of normal distributions
  • Can only be used with numeric data
10
Q

What is Latent Class Analysis (LCA)?
(Model-based clustering)

A

Differences between observations are attributable to unobserved (latent) groups that one wishes to uncover (nclass is predetermined) –>Bayesian technique

–>Goal: estimate the probability of membership in each class and assign each individual to its most likely class

11
Q

Steps in Latent-class analysis?
(Model-based clustering)

A
  1. Variable scores are caused by the hidden groups
  2. LCA posits a latent variable that maximizes the likelihood of observing the scores on the variables
  3. It creates a probability of each observation belonging to each segment
  4. Each observation is placed in the segment for which its probability is highest
12
Q

Advantages of LCA?

A
  • Possible for complex data
  • Provides the optimal number of clusters
  • Provides indicators for significant variables
  • Provides segment probability scores
    Provides a diagnostic test for the best number of segments
13
Q

What are diagnostics to test for statistical fit? (Latent class analysis, model-based clustering)

A
  • BIC: Bayesian information criterion (lower values are better)
  • Error rate (lower is better)
  • Log-likelihood (usually negative; less negative is better)
14
Q

What does Naive Bayes Classification do? (Supervised learning)

A

= Training data is used to learn the probability of class membership as a function of each predictor variable considered independently –>using Bayes' rule

–>starts with the observed probabilities of the variables conditional on the segments found in the training data
–>uses only one model
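
As a sketch, the rule can be written out as follows. The symbols are assumptions introduced here for illustration: $C_k$ is a segment and $x_1, \dots, x_p$ are the predictor values of one observation.

```latex
P(C_k \mid x_1, \dots, x_p) \;\propto\; P(C_k) \prod_{j=1}^{p} P(x_j \mid C_k)
```

$P(C_k)$ is the a-priori share of segment $k$ in the training data and each $P(x_j \mid C_k)$ is a conditional probability learned from the training data. The product form is what "considered independently" means; an observation is assigned to the segment with the largest value.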

15
Q

What is random forest classification (compared to Naive Bayes classification)? (Supervised learning)

A

Instead of using a single model, it builds an ensemble of models that jointly classify the data by fitting many classification trees (a forest)

–>does not provide class membership in the model output

16
Q

Advantages of the random forest model?

A
  • Many classification trees (forest) instead of one model
  • Usually more accurate, because many models jointly classify the data
  • Useful for estimating the importance of predictor variables
17
Q

Classification trees are used for?

A

Used to predict a categorical (and usually binary) dependent variable

18
Q

What is the output of model-based clustering?

A

Shows the optimal number of clusters (if G was not predetermined), together with:
- BIC
- Log-likelihood

19
Q

How can the solution of hierarchical clustering be tested?

Distance based clustering

A
  • Zoom in and focus on certain branches of the dendrogram
  • Use the cophenetic correlation coefficient (CPCC) = measures the correlation between the original dissimilarities and the cophenetic distances
  • CPCC close to 1 –>strong positive correlation
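
A small base R sketch of the CPCC check, assuming made-up toy data (the `toy` matrix is an illustrative assumption):

```r
# Toy numeric data: 20 observations on 3 variables (made-up values)
set.seed(3)
toy <- matrix(rnorm(60), ncol = 3)

# Original dissimilarities and a complete-linkage tree built from them
toy_dist <- dist(toy)
toy_hc <- hclust(toy_dist, method = "complete")

# Cophenetic distances are the heights at which two observations are first
# joined in the dendrogram; correlate them with the original distances
cpcc <- cor(cophenetic(toy_hc), toy_dist)
cpcc
```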
20
Q

How can the outcome of the k-means-based model be tested?

Distance based

A
  1. Check mean values by using aggregate()
  2. Plot the k-means clusters to check whether it is possible to differentiate groups based on key variables
  3. Alternatively, plot two continuous variables by segment
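
A sketch of step 1 in base R. The data frame `toy` and its columns `age` and `income` are made-up assumptions for illustration:

```r
# Toy data: two made-up customer groups
set.seed(11)
toy <- data.frame(
  age    = c(rnorm(15, mean = 25, sd = 3), rnorm(15, mean = 55, sd = 3)),
  income = c(rnorm(15, mean = 30000, sd = 4000), rnorm(15, mean = 70000, sd = 4000))
)

# Cluster on standardized variables so income does not dominate the distances
toy_km <- kmeans(scale(toy), centers = 2, nstart = 10)

# Mean of every variable per assigned segment
seg_means <- aggregate(toy, by = list(segment = toy_km$cluster), FUN = mean)
seg_means
```

If the segment means differ clearly on the key variables, the solution is easier to interpret.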
21
Q

How can the solution of model-based clustering be tested? (mclust)
How to compare between models?

A
  1. Check mean values by using aggregate()
  2. Plot the model-based clusters
  3. Use other values for G and compare the model outputs:
    - Log-likelihood –>less negative
    - BIC –>lowest value
    –>also used for model comparison
22
Q

How can the solution of the latent class analysis be tested?

A
  1. Check mean values using aggregate()
  2. Plot the LCA clusters
  3. compare predicted class memberships
23
Q

How can the outcome of the Naive Bayes Classification be tested?

A
  1. Use predict() with the trained model to predict values for the test data
  2. Review the predicted segment frequencies and compare them to the initial a-priori frequencies from the training data
24
Q

How can the performance of the Naive Bayes classification and random forest be assessed?

A
  1. Consider the raw agreement rate
    mean(raw$Segment == prediction) = 0.92 –>92% correct predictions
  2. Compare performance against random chance using ARI
    - 1 = perfect agreement, 0 = random, negative = worse than chance
  3. Assess performance for each class using a confusion matrix
    - actual segment on the left (rows)
    - predicted segment on top (columns)
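
The three checks can be sketched in base R on made-up label vectors. In practice the ARI would normally come from adjustedRandIndex() in the mclust package; the hand-written `adjusted_rand()` below is a hypothetical helper that only illustrates the calculation, and the `actual` and `predicted` vectors are assumptions:

```r
# Made-up actual and predicted segments for 10 test observations
actual    <- c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
predicted <- c("A", "A", "B", "B", "B", "B", "C", "C", "C", "A")

# 1. Raw agreement rate: share of observations predicted correctly
agreement <- mean(actual == predicted)

# 3. Confusion matrix: actual segments in rows, predicted in columns
conf_mat <- table(actual = actual, predicted = predicted)

# 2. Adjusted Rand index, computed from the contingency table of pairs
adjusted_rand <- function(x, y) {
  tab <- table(x, y)
  sum_cells <- sum(choose(tab, 2))            # pairs together in both labelings
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs together in x
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs together in y
  expected  <- sum_rows * sum_cols / choose(sum(tab), 2)  # expected by chance
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

ari <- adjusted_rand(actual, predicted)

agreement
conf_mat
ari
```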
25
How can the outcome of the **Random Forest** be tested?
  1. Use predict() with the trained model to predict values for the **test data**
  2. Plot the **clusters** based on the test data
26
How can the performance of the **random forest** be assessed?
  1. Compare performance against random chance using **ARI** (adjustedRandIndex)
    - 1 = perfect agreement, 0 = random, negative = worse than chance
  2. Assess performance for **each** class using a **confusion matrix** on the **test data**
    - actual segment on the left (**rows**)
    - predicted segment on top (**columns**)
27
What is meant by **importance** analysis in the **Random Forest model**?
The model uses many **predictor variables**, so it is useful to know the **importance of the different** classification variables -->randomForest(importance = **TRUE**)
28
What is **class imbalance**, and how can it be resolved in random forest models?
When randomForest is used for **prediction**, the model might assign e.g. 90% of the values **to one group** -->**imbalance**
-->**Resolving**:
  - Look at the frequency table of the **training data** to see the group **allocation** -->pick the **smallest** group size
  - Set **sampsize** = "value" in the randomForest model
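
A base R sketch of how the sampsize value could be derived. The segment labels and counts are made up, and the randomForest() call is only indicated in a comment because it needs the randomForest package and a real training set:

```r
# Made-up training labels with a clear imbalance
train_segments <- factor(c(rep("A", 90), rep("B", 25), rep("C", 10)))

# Frequency table of the training data shows the group allocation
seg_counts <- table(train_segments)
seg_counts

# Pick the smallest group size and draw that many cases from every group
smallest <- min(seg_counts)
balanced_sampsize <- rep(smallest, length(seg_counts))
balanced_sampsize

# Assumed usage with a hypothetical training set `train`:
# randomForest(Segment ~ ., data = train, sampsize = balanced_sampsize)
```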
29
What are the steps of Market basket analysis?
If **non-transaction** data: only **discrete and categorical values** can be handled
  1. If numeric, **convert** to **ordered factors** using cut()
    cut(data$**variable**, **breaks** = cut points of the intervals, **labels** = names for the resulting intervals, **ordered_result** = **TRUE** (ordered factor!), **right** = **FALSE** (intervals are left-closed!))
  2. Convert **into a formal transaction object** using as(x, "**transactions**")
  3. Find **association rules** using apriori()
    -->**inspect the rules** using inspect()
    -->plot rule **confidence** against **support**
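
Step 1 can be sketched in base R. The `ages` vector, the break points and the labels are made-up assumptions; steps 2 and 3 need the arules package and are not shown:

```r
# Made-up numeric variable
ages <- c(18, 24, 25, 39, 40, 67)

# Convert to an ordered factor with left-closed intervals:
# [0, 25), [25, 40), [40, Inf)
age_group <- cut(
  ages,
  breaks = c(0, 25, 40, Inf),
  labels = c("under25", "25to39", "40plus"),
  ordered_result = TRUE,
  right = FALSE
)

age_group
table(age_group)
```

With right = FALSE, a value that sits exactly on a break point (25 or 40 here) falls into the interval that starts at that point.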
30
What does the output of latent class analysis consist of?
  1. Top: **Conditional item probabilities** (for each **predictor**)
  2. Bottom:
    - **Estimated** class population shares
    - **Predicted class membership** (modal posterior prob.)
31
What are the outputs of **Naive Bayes**?
  • Top: **A-priori probabilities** per segment (class membership)
  • Middle: **Conditional probabilities** for each **predictor**
32
What are the outputs of **random forest**?
  1. **Confusion matrix** with class error
  2. OOB estimate of error rate
  3. **But no predicted class membership**
33
What is a formal transaction object?
Represents a **set of transactions** where **each transaction** consists of a **unique identifier** and a collection of items
34
How to get the **estimated likelihoods for each respondent** belonging to **all different** segments? (Naive Bayes)
Set **type = "raw"** in predict() -->returns the probabilities of all segments for each respondent