Lecture 5 - Distance based models Flashcards

1
Q

What are distance-based algorithms?

A

Distance-based algorithms are machine learning algorithms that classify instances by computing the distances between those instances and a number of internally stored exemplars.

2
Q

____ that are closest to the instance have the largest influence on the classification assigned to the instance.

A

Exemplars

3
Q

What is Hamming distance?

A

The Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols differ.
Example: 110 and 101 have Hamming distance 2.
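The definition above translates directly into code. A minimal sketch (the function name is illustrative, not from the lecture):

```python
def hamming(x, y):
    """Number of positions at which corresponding symbols differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(x, y))

# The example from the card: 110 and 101 differ in the last two positions.
h = hamming("110", "101")
print(h)  # 2
```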

4
Q

What are the 0-norm, 1-norm and 2-norm distances? Give examples.

A
  1. 0-norm: Hamming distance
  2. 1-norm: Manhattan distance
  3. 2-norm: Euclidean distance
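The 1-norm and 2-norm are special cases of the Minkowski p-norm distance. A minimal sketch illustrating all three on the classic 3-4-5 example (function names are illustrative):

```python
def minkowski(x, y, p):
    """p-norm distance between two equal-length numeric vectors (p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0, 0), (3, 4)
manhattan = minkowski(x, y, 1)   # 1-norm: 3 + 4 = 7
euclidean = minkowski(x, y, 2)   # 2-norm: sqrt(9 + 16) = 5
# The "0-norm" (Hamming) simply counts differing coordinates:
ham = sum(a != b for a, b in zip(x, y))  # both coordinates differ: 2
```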
5
Q

What is the Chebyshev distance?

A

The Chebyshev distance, also known as the maximum or L∞ distance, between two points in a space is the greatest of their differences along any coordinate dimension.
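A minimal sketch of the definition (the function name is illustrative):

```python
def chebyshev(x, y):
    """L-infinity distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

d = chebyshev((1, 5), (4, 7))
print(d)  # max(|1 - 4|, |5 - 7|) = 3
```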

6
Q

What are the 4 distance metric conditions?

A
  1. The distance between a point and itself is 0: d(x, x) = 0
  2. All other distances are larger than zero: d(x, y) > 0 if x ≠ y
  3. Distances are symmetric: d(x, y) = d(y, x)
  4. Detours cannot shorten the distance (triangle inequality): d(x, z) ≤ d(x, y) + d(y, z)
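The four conditions can be spot-checked exhaustively on a small point set. A sketch for the Manhattan distance (the points chosen are arbitrary examples):

```python
from itertools import product

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

points = [(0, 0), (1, 2), (3, 1), (2, 2)]
for x, y, z in product(points, repeat=3):
    assert manhattan(x, x) == 0                                   # 1. identity
    assert x == y or manhattan(x, y) > 0                          # 2. positivity
    assert manhattan(x, y) == manhattan(y, x)                     # 3. symmetry
    assert manhattan(x, z) <= manhattan(x, y) + manhattan(y, z)   # 4. triangle inequality
```

This does not prove the conditions hold in general, but it catches a broken "metric" quickly.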
7
Q

What are exemplars?

A

Exemplars are prototypical instances within clusters/classes

8
Q

What are the two types of exemplars?

A
  1. Centroid
  2. Medoid
9
Q

Which type of exemplar occurs in the data and which does not?

A

Centroids do not necessarily occur in the data.
Medoids are actual data points, but are more time-consuming to calculate.

10
Q

Since the number of classes is typically much lower than the number of exemplars, decision rules often take more than one nearest exemplar into account (k-nearest exemplars).

A

true

11
Q

What is the curse of dimensionality?

A

In high-dimensional spaces everything is far away from everything else, so pairwise distances are uninformative.

12
Q

List nearest-neighbour classifier properties.

A
  1. Nearly perfect separation of classes on the training set
  2. Easily adapted to real-valued targets and structured objects
  3. Unbalanced complexity
  4. Perfectly separates the training data
    - increasing the number of neighbours k increases bias and decreases variance
13
Q

What are the 2 types of distance-based clustering?

A

Predictive clustering
Descriptive clustering

14
Q

What is predictive clustering?

A

Predictive clustering uses a distance metric, a way to construct exemplars, and a distance-based decision rule to create clusters that are compact with respect to the distance metric.

15
Q

What is descriptive clustering?

A

Trees called dendrograms, defined purely in terms of a distance measure, are used. They partition the given data rather than the entire instance space.

16
Q

What is the k-means clustering problem?

A

The k-means clustering problem is to find a partition that minimises the total within-cluster scatter
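The standard way to attack this problem is Lloyd's algorithm, which alternates an assignment step and a centroid-update step. A minimal dependency-free sketch (the data and function names are illustrative; real code would use e.g. scikit-learn's `KMeans`):

```python
def kmeans(points, centroids, iters=10):
    """Minimal Lloyd's algorithm sketch; points and centroids are coordinate tuples."""
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centres = kmeans(pts, [(0, 0), (9, 9)])  # two well-separated blobs
print(centres)
```

Each iteration can only decrease the within-cluster scatter, but the result depends on the initial centroids (see the next card).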

17
Q

What are the weak points of k-means?

A

The algorithm is sensitive to the starting points, and you need to know the number of clusters in advance.

18
Q

What is the complexity of finding a medoid?

A

Finding a medoid requires us to calculate, for each data point, the total distance to all other data points, in order to choose the point that minimises it. Regardless of the distance metric used, this is an O(n^2) operation for n points.
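The O(n^2) cost is visible directly in a naive implementation: for each of the n candidate points we sum n distances. A minimal sketch (names and data are illustrative):

```python
def medoid(points, dist):
    """Return the point minimising total distance to all others: n sums of n distances, O(n^2)."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

manhattan = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
pts = [(0, 0), (1, 1), (5, 5)]
m = medoid(pts, manhattan)
print(m)  # (1, 1): its total distance 2 + 0 + 8 = 10 beats 12 and 18
```

Unlike a centroid, the returned point is always one of the original data points.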

19
Q

How can we evaluate centroids?

A

Inertia: the k-means algorithm aims to choose centroids that minimise the inertia, or within-cluster scatter.
Inertia shows how internally coherent and compact the clusters are.
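Inertia is the sum of squared distances from each point to its nearest centroid. A minimal sketch (data and function name are illustrative; scikit-learn exposes the same quantity as `KMeans.inertia_`):

```python
def inertia(points, centroids):
    """Within-cluster scatter: squared distance from each point to its nearest centroid."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        for p in points
    )

pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
val = inertia(pts, [(0, 1), (10, 1)])
print(val)  # each point is squared distance 1 from its centroid, total 4
```

Lower inertia means tighter clusters, but inertia always decreases as k grows, so it cannot by itself choose the number of clusters.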

20
Q

What is a silhouette?

A

For each instance x, let a(x) be the average distance to the other members of its own cluster and b(x) the average distance to the members of the nearest other cluster; then s(x) = (b(x) - a(x)) / max(a(x), b(x)). A silhouette sorts and plots s(x) for each instance, grouped by cluster.
We want a high value for b and a low value for a.
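The s(x) computation can be sketched on a tiny 1-D example (names and data are illustrative; scikit-learn provides `silhouette_score` for real use):

```python
def silhouette(x, own, other, dist):
    """s(x) = (b - a) / max(a, b); own excludes x itself, other is the nearest other cluster."""
    a = sum(dist(x, p) for p in own) / len(own)      # cohesion: within-cluster distance
    b = sum(dist(x, p) for p in other) / len(other)  # separation: distance to nearest cluster
    return (b - a) / max(a, b)

d = lambda x, y: abs(x - y)                  # 1-D points for simplicity
s = silhouette(0, own=[1], other=[10, 12], dist=d)
print(s)  # a = 1, b = 11, so s = 10/11, close to the ideal value 1
```

Values near 1 indicate a well-placed instance; values near 0 or below indicate overlapping or misassigned instances.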

21
Q

What is hierarchical clustering?

A

The k-means algorithm is flat. Hierarchical clustering instead provides a taxonomy of the instances: a tree of nested clusters, in which clusters are successively merged with each other.

22
Q

What is a dendrogram?

A

Given data set D, a dendrogram is a binary tree with the elements of D at its leaves. An internal node of the tree represents the subset of elements in the leaves of the subtree rooted at that node. The level of a node is the distance between the two clusters represented by the children of the node. Leaves have level 0.
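A dendrogram is built bottom-up by repeatedly merging the two closest clusters and recording the merge level. A naive single-linkage sketch on three 1-D points (names and data are illustrative; real code would use `scipy.cluster.hierarchy.linkage`):

```python
def agglomerate(points, dist):
    """Naive single-linkage agglomeration; returns (level, merged-cluster) per merge."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters at minimum single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append((d, sorted(merged)))  # d is the level of the new internal node
    return merges

merges = agglomerate([0, 1, 5], dist=lambda x, y: abs(x - y))
print(merges)  # [(1, [0, 1]), (4, [0, 1, 5])]: 0 and 1 merge first, then 5 joins at level 4
```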

23
Q

What is a linkage function?

A

A linkage function calculates the distance between arbitrary subsets of the instance space, given a distance metric.

24
Q

What are the 4 linkage types?

A

Single linkage
Complete linkage
Average linkage
Centroid linkage
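All four linkage types can be written as one-liners over a pairwise distance. A minimal sketch on two small clusters (names and data are illustrative):

```python
def single(A, B, d):   return min(d(a, b) for a in A for b in B)    # closest pair
def complete(A, B, d): return max(d(a, b) for a in A for b in B)    # farthest pair
def average(A, B, d):  return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))
def centroid_link(A, B, d):
    mean = lambda C: tuple(sum(x) / len(C) for x in zip(*C))        # distance between cluster means
    return d(mean(A), mean(B))

euclid = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
A, B = [(0, 0), (0, 2)], [(4, 0), (4, 2)]
lo, hi = single(A, B, euclid), complete(A, B, euclid)
print(lo, hi)  # 4.0 and sqrt(20): single <= average <= complete always holds
```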

25
Q

What are the best 3 linkage functions, in order?

A
  1. Complete linkage
  2. Centroid linkage
  3. Single linkage