Clustering & classification Flashcards
What is the concept of distance when it comes to classification, why is it necessary?
What
A distance defines the dissimilarity between two points.
Why
Classification and clustering methods need a notion of distance because observations are grouped or assigned according to how dissimilar they are; in practice this means computing a distance (dissimilarity) matrix between observations.
Methods
Two of the most common distance measures are (sketched in code after this list):
- Pearson correlation distance
- Eisen cosine correlation distance
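A minimal sketch of these two measures in Python/NumPy, assuming the usual definitions (Pearson distance = 1 − correlation; Eisen cosine distance = 1 − |uncentred cosine similarity|). The profile values are made up for illustration.

```python
import numpy as np

def pearson_distance(x, y):
    # Pearson correlation distance: 1 - corr(x, y).
    # 0 = perfectly correlated profiles, 2 = perfectly anti-correlated.
    return 1.0 - np.corrcoef(x, y)[0, 1]

def eisen_cosine_distance(x, y):
    # Eisen cosine correlation distance: 1 - |uncentred correlation|,
    # i.e. 1 minus the absolute cosine similarity.
    return 1.0 - abs(np.dot(x, y)) / np.sqrt(np.dot(x, x) * np.dot(y, y))

# Toy example: two expression-like profiles (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.5])
print(pearson_distance(x, y))       # close to 0: profiles nearly perfectly correlated
print(eisen_cosine_distance(x, y))  # also close to 0
```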
What types of hierarchical clustering methods are there? What are the pros and cons of these methods?
Types
- Agglomerative: each observation starts as its own cluster (a leaf). Iteratively, the most similar clusters are merged until a single cluster (the root) remains
- Divisive: the inverse of the agglomerative approach. It begins with a single root cluster and repeatedly splits the most heterogeneous cluster until each observation forms its own cluster
- Visualisation is based on a tree representation known as a dendrogram
Pros
- No a priori information about the number of clusters is required
- Easy to implement
- Very replicable
Cons
- Not very efficient: O(n² log n)
- Based on a dissimilarity matrix, which has to be chosen in advance
- No objective function is directly minimised
- The dendrogram is not the best tool to choose the optimum number of clusters
- Hard to handle non-convex cluster shapes
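A minimal agglomerative-clustering sketch using SciPy; the library choice, toy data and average linkage are assumptions for illustration, not part of the card.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Toy data: two loose groups of 2-D points (illustrative only).
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])

# 1. Dissimilarity matrix chosen in advance (here Euclidean, in condensed form).
D = pdist(X, metric="euclidean")

# 2. Agglomerative clustering: each point starts as a leaf, the two closest
#    clusters are merged repeatedly until only the root remains.
Z = linkage(D, method="average")

# 3. Cut the tree to get flat cluster labels (k chosen by inspecting the dendrogram).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# The tree itself; plot it with matplotlib to decide where to cut.
tree = dendrogram(Z, no_plot=True)
```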
What types of partitioning clustering methods are there? What are their pros and cons?
Types
- k-means: each cluster is represented by the center of the cluster
- k-medoids or PAM: each cluster is represented by one of the points in that cluster
- CLARA (Clustering LARge Applications): suitable when large datasets are analysed
Pros
- k-means is relatively efficient: O(tkn), where t is the number of iterations and k the number of clusters, with k, t << n
- Simple approach, easy to implement and understand
- Totally replicable (given the same initialisation)
Cons
- PAM does not scale well for large datasets
- Applicable only when a mean is defined (e.g. not for categorical data)
- Need to specify k in advance
- k-means is unable to handle noisy data and outliers. PAM does better
- Not suitable to discover clusters with non-convex shapes
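A minimal k-means sketch using scikit-learn; the library and the toy blobs are assumptions for illustration. It shows that k must be given up front and that each cluster is summarised by its centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs in 2-D (illustrative only).
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

# k must be specified in advance; each cluster is represented by its centre (mean).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # the cluster centres
print(km.labels_[:10])      # cluster assignment of the first 10 points
print(km.inertia_)          # within-cluster sum of squares that k-means minimises
```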
What is the Naïve Bayes approach?
Naïve Bayes is a probabilistic machine learning algorithm used, for example, for spam filtering and document classification.
It relies on two concepts:
- Conditional probability
- Bayes rule
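Bayes' rule combines them: P(class | data) = P(data | class) · P(class) / P(data), and the "naïve" assumption is that the features (e.g. words) are conditionally independent given the class. A minimal spam-filtering sketch using scikit-learn; the library and the tiny made-up training texts are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set: 1 = spam, 0 = not spam (illustrative only).
texts = ["win money now", "cheap money offer", "meeting at noon",
         "lunch tomorrow?", "win a free offer", "project meeting notes"]
labels = [1, 1, 0, 0, 1, 0]

# Bag-of-words counts; the naive assumption lets P(text | class)
# factorise over the individual word counts.
vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)

new = vec.transform(["free money offer", "notes from the meeting"])
print(clf.predict(new))        # e.g. [1 0]
print(clf.predict_proba(new))  # posterior probabilities from Bayes' rule
```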