14 Vector Space Classification Flashcards
Vector space classification
Goal: To develop a different representation for text classification (TC): the vector space model from chapter 6. This differs from the representation used in the last chapter, where the document representation in Naive Bayes was a sequence of terms or a binary vector.
Contiguity hypothesis
Documents in the same class form a contiguous region, and regions of different classes do not overlap.
Prototype
The centroid of a class in the Rocchio classifier. Rocchio classification assigns a test document to the class with the closest prototype.
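A minimal Rocchio sketch in Python (the toy vectors and class names are illustrative, not from the book): each class prototype is the centroid of its training document vectors, and a test document gets the class of the nearest prototype.

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one prototype (centroid) per class from document vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(prototypes, d):
    """Assign d to the class whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda c: np.linalg.norm(d - prototypes[c]))

# Toy 2-D "document vectors" (e.g., weights for two terms)
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y = np.array(["china", "china", "uk", "uk"])

prototypes = train_rocchio(X, y)
print(classify_rocchio(prototypes, np.array([0.8, 0.3])))  # -> "china"
```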
Decision boundary
Boundaries that separate the classes. To classify a new document, we determine the region it occurs in and assign it the class of that region.
Multimodal class
A class with several logical clusters.
k nearest neighbour/kNN classification
A classifier that determines the decision boundary locally. For 1NN, we assign each doc to the class of its closest neighbour. For kNN, we assign each doc to the majority class of its k closest neighbours, where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test doc d to have the same label as the training docs located in the local region surrounding d.
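A minimal kNN sketch (reusing the hypothetical toy vectors from the Rocchio example above): the decision is made locally, from the k training documents nearest to the test document.

```python
import numpy as np
from collections import Counter

def classify_knn(X_train, y_train, d, k=3):
    """Assign d to the majority class among its k nearest training documents."""
    dists = np.linalg.norm(X_train - d, axis=1)   # distance to every training doc
    nearest = np.argsort(dists)[:k]               # indices of the k closest docs
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y_train = np.array(["china", "china", "uk", "uk"])
print(classify_knn(X_train, y_train, np.array([0.15, 0.95]), k=3))  # -> "uk"
```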
Memory-based learning/instance-based learning
kNN is an example of this. We do not perform any estimation of parameters. kNN simply memorizes all examples in the training set and then compares the test doc to them.
Bayes error rate
A measure of the quality of a learning method: the average error rate of classifiers learned by it for a particular problem.
Bias-variance tradeoff
There is a tradeoff between selecting a learning method that produces good classifiers across training sets (small variance) and one that can learn classification problems with very difficult decision boundaries (small bias).
Bias
Bias is large if the learning method produces classifiers that are consistently wrong. Bias is small if:
- the classifiers are consistently right, or
- different training sets cause errors on different documents, or
- different training sets cause positive and negative errors on the same docs, but these average out to close to 0.

Linear methods like Rocchio and Naive Bayes have high bias for nonlinear problems. Nonlinear methods like kNN have low bias because their decision boundaries are variable: depending on the distribution of docs in the training set, learned decision boundaries can vary greatly.
Variance
The variation of the predictions of learned classifiers. It is large if different training sets give rise to very different classifiers. It is small if the training set has a minor effect on the classification decisions the classifier makes, be they correct or incorrect. Variance measures how inconsistent the decisions are, not whether they are correct or incorrect.
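To make the variance side concrete, here is a small simulation sketch (an entirely synthetic setup of my choosing, not from the book): many training sets are drawn from the same distribution, a 1NN classifier is learned from each, and we record its prediction for one fixed test point near the true boundary. Predictions that flip across training sets indicate high variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n=50):
    """Synthetic 2-class problem with a true linear boundary x1 + x2 = 0."""
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

def classify_1nn(X_train, y_train, x):
    """Assign x the label of its single nearest training point."""
    return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

x_test = np.array([0.05, 0.0])  # fixed test point near the decision boundary
preds = [classify_1nn(*sample_training_set(), x_test) for _ in range(200)]
# High variance shows up as predictions that flip between 0 and 1
print("mean prediction:", np.mean(preds), "variance:", np.var(preds))
```

Since the true boundary here is linear, a linear learner like Rocchio would be much more stable on this problem; 1NN's flexibility is exactly what gives it low bias on nonlinear problems and high variance here.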