13 Text Classification and Naive Bayes Flashcards

1
Q

Standing query

A

Is like any other query except that it is periodically executed on a collection to which new docs are incrementally added over time.

2
Q

Classification

A

Given a set of classes, we seek to determine which class(es) a given object belongs to.

3
Q

Routing/filtering

A

Classification using standing queries.

4
Q

Topics

A

A general class is usually referred to as a topic, for instance "China" or "coffee".

5
Q

Applications of classification in IR

A

Preprocessing steps
Finding a document's encoding, truecasing, and identifying the language of a document.

Spam
Automatic detection of spam pages, which are then not included in the search engine index.

Porn mode
Filtering out explicit content.

Sentiment detection
The automatic classification of a movie or product review as positive or negative.

Personal email sorting
Finding the correct folder for a new email.

Topic-specific or vertical search
Vertical search engines restrict searches to a particular topic. For example, the query "computer science" on a vertical search engine for the topic China will return a list of Chinese computer science departments with higher precision and recall than the query "computer science China" on a general-purpose search engine.

Ranking function
The ranking function in ad hoc IR can also be based on a document classifier. More on this later (Section 15.4).

6
Q

Rules in text classification (TC)

A

A rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating and maintaining them over time is labor intensive. In machine learning, these rules are learned automatically from training data.

7
Q

Statistical text classification

A

The approach in which rules are learned automatically from training data with machine learning.

8
Q

Labeling

A

The process of annotating each doc with its class.

9
Q

Document space

A

In TC we are given a description d ∈ X of a document, where X is the document space. Typically, the document space is some type of high-dimensional space.

10
Q

Class space

A

In TC, we are given a fixed set of classes C = {c1, c2, …, cJ}. Typically, the classes are human defined for the needs of an application.

11
Q

Training set

A

In TC, we are usually given a training set D of labeled docs ⟨d, c⟩, where ⟨d, c⟩ ∈ X × C.

12
Q

Learning method classifier

A

Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes: γ : X → C

13
Q

Supervised learning

A

The above type of learning is called supervised learning because a supervisor serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
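
A minimal sketch of the Γ(D) = γ interface in Python (hypothetical names; a trivial majority-class learner stands in for a real method such as Naive Bayes):

from collections import Counter

def learn(D):
    # Gamma: takes a training set D of (document, class) pairs and
    # returns a classification function gamma : X -> C.
    majority_class = Counter(c for _, c in D).most_common(1)[0][0]
    def gamma(d):
        # Stand-in decision rule: always predict the training majority class.
        return majority_class
    return gamma

D = [("first private Chinese airline", "China"),
     ("Beijing joins WTO", "China"),
     ("roasting coffee beans", "coffee")]
gamma = learn(D)
print(gamma("new unseen doc"))  # -> China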

14
Q

Test set data

A

Once we have learned γ, we can apply it to the test set, for example the new document "first private Chinese airline" whose class is unknown. The classification function hopefully assigns the new document to the correct class: γ(d′) = China.

15
Q

Sparseness

A

The training data are never large enough to represent the frequency of rare events adequately. Therefore, estimated probabilities for rare or unseen events will often be zero.
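
A small sketch (hypothetical counts) of how add-one (Laplace) smoothing works around this in multinomial Naive Bayes:

def smoothed_cond_prob(term_count_in_class, total_tokens_in_class, vocab_size):
    # Add-one smoothing: P(t|c) = (T_ct + 1) / (sum_t' T_ct' + |V|).
    # Even a term never observed in class c gets a small nonzero probability.
    return (term_count_in_class + 1) / (total_tokens_in_class + vocab_size)

print(smoothed_cond_prob(0, 1000, 50000))   # unseen term: small but not 0.0
print(smoothed_cond_prob(30, 1000, 50000))  # frequent term: larger estimate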

16
Q

Bernoulli model

A

Equivalent to the binary independence model, which generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the doc or 0 indicating absence.
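
A sketch of Bernoulli-model scoring, assuming precomputed, smoothed estimates prior[c] and cond_prob[(t, c)]; note that absent terms contribute too, unlike in the multinomial model:

import math

def bernoulli_score(doc_terms, c, prior, cond_prob, vocabulary):
    # Every vocabulary term contributes: log P(t|c) if the term is
    # present in the doc, log (1 - P(t|c)) if it is absent.
    score = math.log(prior[c])
    present = set(doc_terms)
    for t in vocabulary:
        p = cond_prob[(t, c)]
        score += math.log(p if t in present else 1.0 - p)
    return score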

17
Q

Concept drift

A

The gradual change over time of the concept underlying a class, like US president from George W. Bush to Barack Obama. The Bernoulli model is particularly robust with respect to this because the most important indicators of a class are less likely to change.

18
Q

Feature selection

A

Is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in TC. It serves two main purposes:

  1. It makes training and applying a classifier more efficient by decreasing the size of the vocabulary. This is important for classifiers that are expensive to train (unlike NB).
  2. It often increases classification accuracy by eliminating noise features.
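
A sketch of the generic selection step, assuming some per-term utility measure (e.g. MI or χ², defined in the cards below) has already been computed:

def select_features(vocabulary, utility, k):
    # Rank every term in the training-set vocabulary by its utility for
    # the class and keep only the k highest-scoring terms as features.
    return sorted(vocabulary, key=utility, reverse=True)[:k]
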
19
Q

Noise feature

A

Is one that, when added to the doc representation, increases the classification error on new data. Example: suppose a rare term, say arachnocentric, has no info about a class, say China, but all instances of arachnocentric happen to occur in China docs in our training set. Then the learning method might produce a classifier that misassigns test docs containing arachnocentric to China.

20
Q

Overfitting

A

Such an incorrect generalization from an accidental property of the training set, as in the example above, is called overfitting.

21
Q

Mutual information (MI)

A

MI measures how much info the presence/absence of a term contributes to making the correct classification decision.
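
A sketch computing MI from the 2×2 contingency counts of term occurrence vs. class occurrence (argument names are illustrative; e.g. n10 counts docs that contain the term but are not in the class):

import math

def mutual_information(n11, n10, n01, n00):
    # Sum N_xy/N * log2(N * N_xy / (N_x. * N_.y)) over the four cells
    # of the term/class contingency table; zero cells contribute 0.
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for cell, row_total, col_total in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if cell > 0:
            mi += (cell / n) * math.log2(n * cell / (row_total * col_total))
    return mi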

22
Q

χ² feature selection

A

In statistics, the χ² test is applied to test the independence of two events. In feature selection, the two events are occurrence of the term and occurrence of the class. A high χ² value indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect.
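
A sketch of the closed-form χ² computation from the same 2×2 contingency counts used for MI:

def chi_squared(n11, n10, n01, n00):
    # X^2 from the 2x2 table of term occurrence vs. class occurrence;
    # a high value is evidence against the independence hypothesis.
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator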

23
Q

Frequency-based feature selection

A

A feature selection method that selects the terms that are most common in the class. Its drawback is that it may select some frequent terms that have no specific info about the class.

24
Q

Greedy feature selection

A

All three feature selection methods described (MI, χ², and frequency-based feature selection) are examples of greedy methods. They may select features that contribute no incremental information over previously selected features.

25
Q

Two-class classifier

A

An approach to an any-of problem. You must learn several two-class classifiers, one for each class, where the two-class classifier for class c is the classifier for the two classes c and its complement c̄.
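
A sketch of the one-vs-rest scheme for an any-of problem, assuming some binary training routine train_binary (hypothetical) is available:

def train_one_vs_rest(D, classes, train_binary):
    # One two-class classifier per class c: documents labeled c are the
    # positive examples, everything else (the complement) is negative.
    return {c: train_binary([(d, label == c) for d, label in D])
            for c in classes}

def classify_any_of(classifiers, d):
    # Any-of: a doc may end up with zero, one, or several class labels.
    return [c for c, clf in classifiers.items() if clf(d)]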

26
Q

Effectiveness

A

Is a generic term for measures that evaluate the quality of classification decisions, including precision, recall, F1, and accuracy.

27
Q

Performance

A

Refers to the computational efficiency of classification and IR systems.

28
Q

Macro/Micro- averaging

A

We often want to compute a single aggregate measure that combines the measures for individual classifiers. There are two methods for doing this.

Macroaveraging
Computes a simple average over classes

Microaveraging
Pools per-doc decisions across classes, and then computes an effectiveness measure on the pooled contingency table.

The differences between the two methods can be large. Macroaveraging gives equal weight to each class, whereas microaveraging gives equal weight to each per-doc classification decision.
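
A small sketch contrasting the two aggregates on hypothetical per-class (TP, FP, FN) counts:

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical per-class contingency counts: (TP, FP, FN).
tables = {"China": (10, 10, 10), "coffee": (90, 10, 10)}

macro_f1 = sum(f1(*t) for t in tables.values()) / len(tables)   # 0.70
micro_f1 = f1(*(sum(col) for col in zip(*tables.values())))     # ~0.83
print(macro_f1, micro_f1)  # the large class dominates the microaverage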

29
Q

Development set

A

A set for testing while you develop your methods.