Clustering & classification Flashcards

1
Q

What is the concept of distance when it comes to classification, and why is it necessary?

A

What

A distance defines the dissimilarity between two points.

Why

Every classification method needs the concept of distance, because the algorithm operates on a dissimilarity (distance) matrix computed between observations.

Methods

Two of the most common methods are:

  • Pearson correlation distance
  • Eisen cosine correlation distance
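Both methods can be sketched in a few lines; this is a minimal illustration with a made-up pair of profiles, where each distance is one minus the corresponding correlation/similarity:

```python
import numpy as np

def pearson_distance(x, y):
    """Pearson correlation distance: d = 1 - r, where r is the
    Pearson correlation coefficient between the two profiles."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

def eisen_cosine_distance(x, y):
    """Eisen cosine correlation distance: one minus the cosine
    similarity, computed without centring the vectors."""
    sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - sim

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # perfectly correlated with x
print(pearson_distance(x, y))       # ≈ 0 (identical profiles up to scaling)
print(eisen_cosine_distance(x, y))  # ≈ 0 (vectors point the same way)
```

Note the two measures can disagree: Pearson distance centres the profiles first, so it ignores shifts in the mean, while the Eisen cosine variant does not.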
2
Q

What types of hierarchical clustering methods are there? What are the pros and cons of these methods?

A

Types

  1. Agglomerative: each observation starts as its own cluster. Iteratively, the most similar clusters (leaves) are merged until a single cluster (the root) remains.
  2. Divisive: the inverse of the agglomerative approach. It begins with a single root cluster, and the most heterogeneous cluster is repeatedly split until each observation forms its own cluster.
  • Visualisation is based on a tree representation known as a dendrogram
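The agglomerative loop can be sketched directly; this toy single-linkage example on hypothetical 1-D points (library routines such as SciPy's `linkage` are the practical choice) records each merge, which is exactly the information a dendrogram draws:

```python
import numpy as np

# Minimal agglomerative (single-linkage) sketch on a toy 1-D dataset.
# Each point starts as its own cluster; the closest pair of clusters
# is merged until one cluster (the root of the dendrogram) remains.
points = np.array([1.0, 1.2, 5.0, 5.1, 9.0])
clusters = [[i] for i in range(len(points))]
merges = []  # record of merge order, i.e. the dendrogram structure

def single_linkage(a, b):
    # distance between two clusters = smallest pairwise point distance
    return min(abs(points[i] - points[j]) for i in a for j in b)

while len(clusters) > 1:
    # find the most similar pair of clusters
    pairs = [(single_linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    d, i, j = min(pairs)
    merges.append((sorted(clusters[i] + clusters[j]), round(d, 2)))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
    clusters.append(merges[-1][0])

for members, height in merges:
    print(members, "merged at height", height)
```

The O(n²) pair scan inside an O(n) merge loop is where the poor scaling in the cons list comes from.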

Pros

  • No a-priori information about the number of clusters required
  • Easy to implement
  • Very replicable

Cons

  • Not very efficient: O(n² log n)
  • Based on a dissimilarity matrix, which has to be chosen in advance
  • No objective function is directly minimised
  • The dendrogram is not the best tool for choosing the optimum number of clusters
  • Hard to handle non-convex shapes
3
Q

What types of partitioning clustering methods are there? What are their pros and cons?

A

Types

  • k-means: each cluster is represented by the center of the cluster
  • k-medoids or PAM: each cluster is represented by one of the points in that cluster
  • CLARA (Clustering LARge Applications): suitable when large datasets are analysed
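The k-means idea (Lloyd's algorithm) can be sketched in a few lines of NumPy; the data and the fixed initial centres below are illustrative assumptions, and library versions (e.g. `sklearn.cluster.KMeans`) add smarter initialisation such as k-means++:

```python
import numpy as np

def kmeans(X, centers, iters=10):
    """Minimal Lloyd's algorithm: alternate assignment and update steps."""
    for _ in range(iters):
        # assignment step: each point goes to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centre moves to the mean of its cluster
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, centers

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
init = X[[0, 2]]              # deterministic initial centres, so k = 2
labels, centers = kmeans(X, init)
print(labels)                 # -> [0 0 1 1]
```

The update step is why k-means needs a well-defined mean: for k-medoids/PAM, the centre would instead be chosen among the cluster's own points.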

Pros

  • k-means is relatively efficient O(tkn), with k, t << n
  • Simple approach, easy to implement and understand
  • Fully replicable (given the same initial centres)

Cons

  • PAM does not scale well for large datasets
  • k-means is applicable only when a mean is defined (i.e. not for categorical data)
  • Need to specify k in advance
  • k-means is unable to handle noisy data and outliers; PAM does better
  • Not suitable to discover clusters with non-convex shapes
4
Q

What is the Naïve Bayes approach?

A

Naïve Bayes is a probabilistic machine learning algorithm, used for tasks such as spam filtering and document classification.

It relies on two concepts:

  • Conditional probability
  • Bayes rule
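The two concepts combine in the classifier's core computation; this toy spam-filter sketch uses hypothetical word probabilities (all numbers are made up for the example) and applies Bayes' rule with the "naïve" assumption that words are conditionally independent given the class:

```python
# Toy spam filter: all probabilities below are hypothetical numbers
# chosen purely to illustrate Bayes' rule.
p_spam = 0.4                      # prior P(spam)
p_ham = 1 - p_spam                # prior P(ham)
p_word_given_spam = {"free": 0.8, "meeting": 0.1}   # conditional P(word | spam)
p_word_given_ham = {"free": 0.1, "meeting": 0.6}    # conditional P(word | ham)

def posterior_spam(words):
    """Bayes' rule with the naive independence assumption:
    P(spam | words) ∝ P(spam) * Π P(word | spam)."""
    like_spam = p_spam
    like_ham = p_ham
    for w in words:
        like_spam *= p_word_given_spam[w]
        like_ham *= p_word_given_ham[w]
    # normalise over both classes to get a posterior probability
    return like_spam / (like_spam + like_ham)

print(round(posterior_spam(["free"]), 3))     # -> 0.842
print(round(posterior_spam(["meeting"]), 3))  # -> 0.1
```

The independence assumption is what makes the product of per-word conditionals tractable, and it is also why the method is called "naïve".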