14 Vector Space Classification Flashcards

1
Q

Vector space classification

A

Goal: to develop a different representation for text classification (TC): the vector space model from Chapter 6. This differs from the representation used in the last chapter, where the Naive Bayes document representation was a sequence of terms or a binary vector.
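A minimal sketch of the representation, assuming raw term-frequency weights (Chapter 6 develops the weighted tf-idf version); the vocabulary and document are invented for illustration:

    from collections import Counter

    def to_vector(doc, vocabulary):
        # Represent a document as a term-frequency vector over a fixed vocabulary.
        counts = Counter(doc.split())
        return [counts[term] for term in vocabulary]

    vocabulary = ["chinese", "beijing", "shanghai", "tokyo", "japan"]
    print(to_vector("chinese beijing chinese", vocabulary))  # -> [2, 1, 0, 0, 0]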

2
Q

Contiguity hypothesis

A

Documents in the same class form a contiguous region, and regions of different classes do not overlap.

3
Q

Prototype

A

The centroid of a class in the Rocchio classifier: each class is represented by the centroid (vector average) of its members, and a document is assigned to the class of the nearest centroid.
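A minimal sketch of the Rocchio rule this implies, assuming Euclidean distance and an invented toy training set: each class is summarized by the centroid of its training vectors, and a test vector gets the class of the nearest centroid.

    import math

    def centroid(vectors):
        # Component-wise mean of a list of equal-length vectors.
        n = len(vectors)
        return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def rocchio_classify(doc_vec, prototypes):
        # Assign the class whose prototype (centroid) is closest to doc_vec.
        return min(prototypes, key=lambda c: euclidean(doc_vec, prototypes[c]))

    training = {"china": [[2, 1, 0], [1, 2, 0]], "japan": [[0, 0, 2], [0, 1, 3]]}
    prototypes = {cls: centroid(vecs) for cls, vecs in training.items()}
    print(rocchio_classify([1, 1, 0], prototypes))  # -> china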

4
Q

Decision boundary

A

Boundaries that separate the classes. To classify a new document, we determine the region it occurs in and assign it the class of that region.

5
Q

Multimodal class

A

A class with several logical clusters.

6
Q

k nearest neighbour/kNN classification

A

A classifier that determines the decision boundary locally. For 1NN, we assign each document to the class of its closest neighbour. For kNN, we assign each document to the majority class of its k closest neighbours, where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents in the local region surrounding d.
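A minimal sketch of this rule, using Euclidean distance for simplicity (cosine similarity is the more common choice in the vector space model) and an invented toy training set; with k=1 it reduces to 1NN:

    import math
    from collections import Counter

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def knn_classify(doc_vec, training, k):
        # Majority class among the k training vectors closest to doc_vec.
        neighbours = sorted(training, key=lambda ex: euclidean(doc_vec, ex[0]))[:k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

    training = [([2, 1, 0], "china"), ([1, 2, 0], "china"),
                ([0, 0, 2], "japan"), ([0, 1, 3], "japan")]
    print(knn_classify([1, 1, 0], training, k=3))  # -> china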

7
Q

Memory-based learning/instance-based learning

A

kNN is an example of this. No parameter estimation is performed: kNN simply memorizes all examples in the training set and then compares the test document to them.

8
Q

Bayes error rate

A

A measure of the quality of a learning method: the average error rate of the classifiers it learns for a particular problem.

9
Q

Bias-variance tradeoff

A

There is a tradeoff when selecting a learning method between producing classifiers that are consistent across training sets (small variance) and being able to learn classification problems with very difficult decision boundaries (small bias).

Bias
Bias is large if the learning method produces classifiers that are consistently wrong. Bias is small if:

  1. The classifiers are consistently right
  2. Different training sets cause errors on different documents
  3. Different training sets cause positive and negative errors on the same documents, but these average out to close to zero.

Linear methods like Rocchio and NB have high bias for nonlinear problems. Nonlinear methods like kNN have low bias: because the decision boundary depends on the distribution of documents in the training set, learned decision boundaries can vary greatly.

Variance
Variance is the variation in the predictions of learned classifiers. It is large if different training sets give rise to very different classifiers, and small if the training set has only a minor effect on the classification decisions the classifier makes, be they correct or incorrect. Variance measures how inconsistent the decisions are, not whether they are correct or incorrect.
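A rough way to see the tradeoff in code. The sketch below uses an invented 1-D problem and two toy learning methods; it trains each method on many random training sets and reports, at one fixed test point, the average prediction (an off-target average indicates bias) and the spread of predictions (variance):

    import random

    def true_label(x):
        # Ground truth: a nonlinear (two-interval) class boundary on [0, 1].
        return 1 if 0.25 < x < 0.75 else 0

    def train_1nn(data):
        # Low bias, higher variance: predict the label of the nearest training point.
        return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

    def train_fixed_threshold(data):
        # High bias, zero variance: a single linear cut at x = 0.5 that
        # cannot represent the two-interval boundary above.
        return lambda x: 1 if x > 0.5 else 0

    random.seed(0)
    for name, train in [("1NN", train_1nn), ("threshold", train_fixed_threshold)]:
        preds = []
        for _ in range(200):  # 200 random training sets of 20 points each
            data = [(x, true_label(x)) for x in (random.random() for _ in range(20))]
            preds.append(train(data)(0.3))  # fixed test point; true label is 1
        mean = sum(preds) / len(preds)
        var = sum((p - mean) ** 2 for p in preds) / len(preds)
        print(f"{name}: mean prediction {mean:.2f} (truth 1), variance {var:.3f}")

The threshold learner is consistently wrong at x = 0.3 regardless of the training set (large bias, zero variance), while 1NN is usually right but its prediction depends on which points happen to land near 0.3 (small bias, nonzero variance).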
