Week 6: Instance-Based Learning Flashcards
Instance-based Classification
To classify a new instance, search the memory for the training instance that most resembles the new instance. A distance metric is used to compare the training instances.
Main components:
- Distance Function, which calculates the distance between any two instances.
- Combination function, which combines results from neighbours to arrive at an answer
Pros:
- Not restricted by data format
- Efficiently adaptable to the additional of new instances
- Requires relatively low training effort
Cons:
- Memory intensive
- Can be more time consuming that neural networks or decision trees
- Making suitable distance and combination functions can be tough
Similarity/Distance
This is the means by which different training instances can be measured for suitability with the new instance.
Absolute Difference
\left\lvert x_{i,j} - x_{i^{‘}, j} \right\rvert
Squared Difference
(x_{i,j} - x_{i^{‘},j})^2
Normalised Absolute Value
\frac{\left\lvert x_{i,j} - x_{i^{‘},j} \right\rvert}{\max_j - \min_j}
Absolute Difference of Standardised Values
\frac{\left\lvert x_{i,j} - \mu_j \right\rvert}{\sigma_j} - \frac{\left\lvert x_{i^{‘},j} - \mu_j \right\rvert}{\sigma_j}
Distance Metric for Nominal Attributes
d(x_{i,j}, x_{i^{‘},j}) = \left{ \begin{array}{cc} 0, & if x_{i,j} = v and x_{i^{‘},j} = v \ 1 & if x_{i,j} = v and x_{i^{‘},j} \ne v \end{array}
Euclidean Distance
d(\boldsymbol{x}i, \boldsymbol{x}{i^{‘}}) = \sqrt{\sum_{j=1}^n(x_{i,j} - x_{i^{‘},j})^2}
Manhattan Distance
d(\boldsymbol{x}i, \boldsymbol{x}{i^{‘}}) = \sum_{j=1}^n \left\lvert x_{i,j} - x_{i^{‘},j} \right\rvert
Weighted Voting
In this setup, not all neighbours are treated equally. The weight of the vote is proportional to the similarity with the new instance. Weighted voting allows for preventing ties.
Spectrogram
Can represent audio data in the frequency domain. Use for identifying songs from audio samples.
In the case of music identification, the vertical axis correspond to music frequencies. The horizontal axis represents time. The data can be transformed into constellation plots.
Constellation Plot
A graph of the frequency peaks for a song in the frequency domain. It only has the spectrogram peaks. Each song is defined by a unique constellation of peaks in the frequency domain, with background noise minimised.
Metric for Finding Similar Songs
Number of peaks that the two songs have in common / Number of all distinct Peaks
Anchor Points
An anchor point is a peak in the constellation chart that is paired with other peaks that follow in the time axis within a specified range of frequencies.
In the context of song identification, each anchor-peak pair is associated with a time difference between the peak and the anchor pair, and the frequency difference between the peak and the anchor.
Song Identification Method
- Convert the new snippet into a constellation plot.
- Turn the constellation plot into anchor points with anchor-peak pairs.
- Identify matching anchor points between the listened song and songs in the database. Determine the longest consecutive sequence of overlap between the two.
- Return the song with the longest overlap, i.e. the nearest neighbour.