13 Text Classification and Naive Bayes Flashcards
Standing query
Like any other query, except that it is periodically executed on a collection to which new docs are incrementally added over time.
Classification
Given a set of classes, we seek to determine which class(es) a given object belongs to.
Routing/filtering
Classification using standing queries.
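A minimal sketch of routing with a standing query, assuming a keyword query and an in-memory list standing in for the growing collection. The names `StandingQuery`, `matches`, and `run` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class StandingQuery:
    """A query that is re-run as new docs arrive (hypothetical helper)."""
    terms: Set[str]
    seen: int = 0                       # number of docs already checked
    hits: List[str] = field(default_factory=list)

    def matches(self, doc: str) -> bool:
        # Assumption: a doc matches if it contains every query term.
        return self.terms <= set(doc.lower().split())

    def run(self, collection: List[str]) -> List[str]:
        # Only inspect docs added since the previous run.
        new_hits = [d for d in collection[self.seen:] if self.matches(d)]
        self.seen = len(collection)
        self.hits.extend(new_hits)
        return new_hits


collection = ["china exports rise", "coffee prices fall"]
query = StandingQuery(terms={"china"})
first = query.run(collection)            # first periodic execution

collection.append("first private airline in china")
second = query.run(collection)           # later execution sees only the new doc
```

Each call to `run` routes the newly added docs either into the query's result list or past it, which is the sense in which a standing query acts as a two-class classifier.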
Topics
A general class is usually referred to as a topic, for instance “China” or “coffee”.
Applications of classification in IR
- Preprocessing steps: finding a doc's encoding, truecasing, and identifying the language of a doc.
- Spam: automatic detection of spam pages, which are then not included in the search engine index.
- Porn mode: filtering out sexually explicit content.
- Sentiment detection: the automatic classification of a movie or product review as positive or negative.
- Personal email sorting: finding the correct folder for a new email.
- Topic-specific or vertical search: vertical search engines restrict searches to a particular topic. For example, the query “computer science” on a vertical search engine for the topic China will return a list of Chinese computer science departments with higher precision and recall than the query “computer science China” on a general-purpose search engine.
- Ranking function: the ranking function in ad hoc IR can also be based on a document classifier. More is specified later (sec. 15.4).
Rules in text classification (TC)
A rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating and maintaining them over time is labor intensive. In machine learning, these rules are learned automatically from training data.
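A hand-coded rule of this kind can be sketched as a Boolean test over a doc's keywords. The rule below is a made-up example for the class China; the keywords and the function name `china_rule` are assumptions, not rules from the source.

```python
def china_rule(doc: str) -> bool:
    """Hypothetical hand-coded rule for the class China."""
    tokens = set(doc.lower().split())
    # Fire on an explicit mention of China, or on a pair of keywords
    # that together suggest the class (both choices are assumptions).
    return "china" in tokens or {"beijing", "yuan"} <= tokens


in_class = china_rule("Beijing raises yuan rate")
out_of_class = china_rule("coffee prices fall")
```

Every new keyword combination has to be added and checked by hand, which is where the maintenance cost comes from.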
Statistical text classification
The approach in which rules are learned automatically with machine learning.
Labeling
The process of annotating each doc with its class.
Document space
In TC, we are given a description d ∈ X of a document, where X is the document space. Typically, the document space is some type of high-dimensional space.
Space class
In TC, we are given a fixed set of classes C = {c1, c2, …, cj}. Typically, the classes are human defined for the needs of an application.
Training set
In TC, we are usually given a training set D of labeled docs ⟨d, c⟩, where ⟨d, c⟩ ∈ X × C.
Learning method classifier
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function ɣ that maps documents to classes: ɣ : X → C
Supervised learning
The type of learning described above is called supervised learning because a supervisor serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = ɣ. The learning method Γ takes the training set D as input and returns the learned classification function ɣ.
Test set data
Once we have learned ɣ, we can apply it to the test set, for example, the new doc “first private Chinese airline”, whose class is unknown. The classification function hopefully assigns the new document to class ɣ(d′) = China.
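The pipeline from training set to test doc can be sketched as a function that takes labeled pairs and returns a classifier. Multinomial Naive Bayes with add-one smoothing is swapped in here as one possible learning method; the cards above do not prescribe it. The toy training set, the second class Japan, and the names `learn` and `classify` are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def learn(D):
    """Learning method: takes labeled pairs (d, c), returns a classifier."""
    class_docs = Counter()                # number of training docs per class
    term_counts = defaultdict(Counter)    # term frequencies per class
    vocab = set()
    for doc, c in D:
        tokens = doc.lower().split()
        class_docs[c] += 1
        term_counts[c].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())

    def classify(doc):
        """Classification function: maps a doc to its most probable class."""
        best_class, best_score = None, -math.inf
        for c in class_docs:
            total = sum(term_counts[c].values())
            score = math.log(class_docs[c] / n_docs)      # log prior
            for t in doc.lower().split():
                if t in vocab:                            # skip unseen terms
                    # Add-one smoothed estimate of P(t | c), in log space.
                    score += math.log(
                        (term_counts[c][t] + 1) / (total + len(vocab))
                    )
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    return classify


# Toy training set D of (doc, class) pairs (assumed data).
D = [
    ("Chinese Beijing Chinese", "China"),
    ("Chinese Chinese Shanghai", "China"),
    ("Chinese Macao", "China"),
    ("Tokyo Japan Chinese", "Japan"),
]
gamma = learn(D)                                   # corresponds to Γ(D) = ɣ
predicted = gamma("first private Chinese airline") # corresponds to ɣ(d′)
```

On this toy data the only term of the test doc that occurs in training is "chinese", which is far more frequent in the China docs, so the sketch assigns the doc to China.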
Sparseness
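The training data are never large enough to represent the frequency of rare events adequately. Therefore, the estimated probability of a rare event, such as a term that never occurs with a given class in the training data, will often be zero.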
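A small sketch of how the zero arises, assuming the probability of a term given a class is estimated by relative frequency in the training data. The toy tokens, the class names, and the helper `relative_frequency` are assumptions for illustration.

```python
from collections import Counter

# Toy training data (assumed): the tokens observed for each class.
training = {
    "China": "chinese beijing chinese chinese shanghai".split(),
    "UK": "london british parliament".split(),
}


def relative_frequency(term, cls):
    """Relative-frequency estimate of P(term | class)."""
    counts = Counter(training[cls])
    return counts[term] / sum(counts.values())


p_seen = relative_frequency("chinese", "China")  # 3 of the 5 China tokens
p_rare = relative_frequency("wto", "UK")         # never observed with UK
```

Here `p_rare` is exactly zero, even though the term could plausibly occur in a UK doc; the training data is simply too small to have seen it.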