Semantic Analysis - Text Classification Flashcards
Types of Classification
1) Content-based classification: at least two classes (a binary classifier) or more than two (a multi-class application). Different applications may require different types of classification. The shorter the text, the less there is to go on. This is supervised learning.
2) Descriptor-based classification: you are given a written description of what the classes of content are. Someone makes a request and there are no example documents yet, e.g., legal discovery ("give me all the company's emails that discuss this product") or a FOIA request. You write a description, perhaps 10 or 30 sentences long, and someone decides what matches that description, so it is classification by description rather than a search task. Less has been published on descriptor-based text classification to date.
Subject-based Classification
Two Approaches to Subject-Based Automated Document Classification:
1) Multinomial Naive Bayes classifiers
2) SVM-based classifiers
Naive Bayes Models
The class is the category being predicted; the predictors are the attributes or features used to make the prediction. The model calculates the posterior probability of the class.
Prior Probability
Prior probability of the class: the global distribution of individuals into that class. Look at the past.
Prior probability of the predictor: the global distribution of individuals across that predictor.
Posterior Probability
Posterior probability of the predictor: the probability of having the predictor attributes.
Posterior probability of the class: the probability of falling into the target class after an observation, such as the term frequencies of a document; it is the additional information taken into consideration.
Prior and Posterior
Before and after an observation
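A minimal sketch of prior vs. posterior with scikit-learn's MultinomialNB; the documents and labels below are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy documents and labels, invented for illustration only.
docs = ["the team won the game",
        "stocks fell sharply today",
        "the coach praised the team",
        "the market rallied on earnings"]
labels = ["sports", "business", "sports", "business"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # term frequencies as features
clf = MultinomialNB().fit(X, labels)

# Prior probability of each class: the global label distribution (the past).
print(dict(zip(clf.classes_, np.exp(clf.class_log_prior_))))

# Posterior probability of each class, after observing a new document's terms.
new = vec.transform(["the team played a great game"])
print(dict(zip(clf.classes_, clf.predict_proba(new)[0])))
```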
SVMs
- Separates classes with hyperplanes
- Maximizes the separation between classes
- When the maximum margin is found, you are done
- Which side of the line a new observation's features fall on determines its prediction
- Adds dimensionality when needed
- Example: applying the mod function maps one-dimensional numbers into two dimensions, where a separating line can be drawn
- Has a sense of outlier detection
- Best-fit algorithm (the separation doesn't have to be perfect)
- Kernel tricks transform all the data in the same way so that a dividing line can be found in the training data
- Hyperplane: it is called a hyperplane because a plane is two-dimensional; a hyperplane generalizes this to more than two dimensions
- Powerful, fast, one of our favorites for document classification.
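A minimal sketch of the added-dimension and kernel-trick ideas with scikit-learn's SVC; the data is invented (the class alternates with parity, echoing the mod example above):

```python
import numpy as np
from sklearn.svm import SVC

# 1-D points whose class alternates with parity: not linearly separable
# on the number line. Data invented for illustration.
x = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20) % 2

# Adding a second dimension with the mod function makes a separating line trivial.
X2 = np.hstack([x, x % 2])
print(SVC(kernel="linear").fit(X2, y).score(X2, y))  # separable: 1.0

# Kernel trick: an RBF kernel transforms all the data in the same way
# implicitly, so the SVM can find a dividing boundary without building X2.
print(SVC(kernel="rbf", gamma=1.0).fit(x, y).score(x, y))
```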
Features
Are weighted term frequencies (TF-IDF)
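A minimal TF-IDF sketch with scikit-learn; the toy corpus is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Words that appear across many documents (like "the") get a lower IDF
# weight than distinctive words. Corpus invented for illustration.
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks rose sharply today"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))  # one row of weighted term frequencies per doc
```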
Architecting a Classification System
1) Training docs. We have example documents that are sports news/not sports news
2) Sentence tokenizer
3) Word tokenizer
4) Stemmer
5) Possibly throw out stopwords, expand contractions
6) Feature extraction to pull out features (the word "the" should not be a training feature)
7) Run it through a classifier (NB, SVM, etc.) to build a model
8) Test docs.
9) Normalize (follow steps 2-6 above)
10) Run it through the model and get a prediction
11) Compare predictions to the actual labels (model performance evaluation); see the pipeline sketch below
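A minimal sketch of steps 1-11 with scikit-learn; the corpus is invented, and steps 2-6 are folded into TfidfVectorizer (a stemmer could be supplied via its tokenizer argument):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Step 1: toy training docs (sports news / not sports news), invented.
docs = ["the team won the championship game",
        "the striker scored twice in the final",
        "the senate passed the budget bill",
        "the central bank raised interest rates",
        "the coach announced the starting lineup",
        "lawmakers debated the new tax proposal"]
labels = ["sports", "sports", "not-sports",
          "not-sports", "sports", "not-sports"]

# Steps 2-6: tokenization, stopword removal, and feature extraction are
# handled here by TfidfVectorizer; stemming is omitted for brevity.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

train_docs, test_docs, train_y, test_y = train_test_split(
    docs, labels, test_size=0.33, random_state=0)
model.fit(train_docs, train_y)              # step 7: build the model
pred = model.predict(test_docs)             # steps 8-10: predict on test docs
print(classification_report(test_y, pred))  # step 11: compare to actual
```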
Descriptor-Based Classifiers - Case 1
Two-phase process:
1) Information Retrieval problem
The user enters a robust description of the desired class of documents. Pull out the chunks and use them to do keyword searches based on those chunks. Examine all the docs and identify strong hits. A strong hit is affected by the number of TF-IDF keywords found, phrase matches, the keyword sentence density of the document, the proximity of keywords, and keywords found in the title, subtitle, URL, filenames, image captions or alt tags, metadata keywords, hyperlinks, and the breadcrumb trail.
2) Content-Based Classification
Train, e.g., an SVM on the strong-hit documents (see the sketch below)
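A minimal sketch of the two-phase process, using TF-IDF cosine similarity as a stand-in for the richer strong-hit scoring listed above; the descriptor, corpus, and threshold are all invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical inputs: a written descriptor and an unlabeled corpus.
descriptor = "emails discussing the new product launch, pricing, and marketing plans"
corpus = ["pricing discussion for the product launch",
          "lunch menu for friday",
          "marketing plan draft for the new product",
          "server maintenance window"]

# Phase 1 (information retrieval): score each document against the
# descriptor; documents above a threshold count as strong hits.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform([descriptor] + corpus)
scores = cosine_similarity(X[0], X[1:]).ravel()
labels = ["hit" if s > 0.2 else "miss" for s in scores]  # threshold is a guess

# Phase 2 (content-based classification): train an SVM on the strong hits
# vs. the rest, then apply it to new documents.
clf = LinearSVC().fit(X[1:], labels)
print(clf.predict(tfidf.transform(["pricing questions about the launch"])))
```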
Descriptor-Based Classifiers - Case 2
Instead of a paragraph of description, we could be given an empty taxonomy: "I have news I want classified into sports, business, and politics," but there is no training data with those labels.
- Get queries out of the taxonomy, because that's all we have. Create Boolean queries: sports AND NOT (business OR politics), etc. (see the sketch below)
Works when taxonomy fits well
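A minimal sketch of turning an empty taxonomy into Boolean queries; the taxonomy terms and matcher are toy stand-ins for a real query engine:

```python
# One Boolean query per label: match that label's terms and none of the
# other labels' terms. Taxonomy terms invented for illustration.
taxonomy = {"sports": ["game", "team"],
            "business": ["market", "stocks"],
            "politics": ["senate", "election"]}

def matches(doc, label):
    words = set(doc.lower().split())
    own = any(t in words for t in taxonomy[label])
    others = any(t in words
                 for other, terms in taxonomy.items() if other != label
                 for t in terms)
    return own and not others  # e.g., sports AND NOT (business OR politics)

doc = "the team won the game last night"
print([label for label in taxonomy if matches(doc, label)])  # ['sports']
```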
Multi-class vs. multi-label
Multi-class: multiple classes, mutually exclusive; each observation gets exactly one.
Multi-label: assigning multiple classes to a single observation.
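A minimal illustration of the difference, using scikit-learn's MultiLabelBinarizer; the labels are invented:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Multi-class: each observation carries exactly one label.
multiclass_y = ["sports", "business", "politics"]

# Multi-label: an observation may carry several labels at once, encoded
# as an indicator matrix with one column per class.
multilabel_y = [{"sports"}, {"business", "politics"}, {"sports", "business"}]
mlb = MultiLabelBinarizer()
print(mlb.fit_transform(multilabel_y))  # rows = observations, cols = classes
print(mlb.classes_)
```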
Chimera
Large-scale classification using machine learning, rules, and crowdsourcing
- Classify millions of product descriptions into 5,000+ product types; new classes come up all the time, and classes may change over time (drift, evolution)
- Combines learning and hand-crafted categories
- Must be fast response time (business requirement)
- High precision, low recall (better to not make a call than to make a bad classification)
- Chimera system architecture (see ppt)
- Voting Master, Crowd evaluation
How do you add a new rule when you have millions of customers? You can't take the system offline, so you add a band-aid fix: a line of code or a fuzzy mapping, e.g., if you see a new "Uber drive" product, create a category on the fly. This is the rule-based part of the system. Set a threshold, e.g., after 10 or 15 new rules, update and retrain the model.
- Add new training data that is categorized based on a rule?
- Human evaluation component is here to stay
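A minimal sketch of the precision-over-recall flow described above: hand-crafted rules first, a learned classifier as fallback, and abstention when neither is confident. The data, rules, and threshold are invented; this is not Chimera's actual code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration.
titles = ["mens running shoes size 10", "wireless bluetooth headphones",
          "trail running shoes for women", "noise cancelling headphones"]
types = ["shoes", "electronics", "shoes", "electronics"]

vec = TfidfVectorizer()
model = MultinomialNB().fit(vec.fit_transform(titles), types)

# Band-aid rules patched in at runtime, without taking the system offline.
RULES = {"uber drive": "rideshare accessories"}

def classify(title, threshold=0.8):
    for pattern, label in RULES.items():    # 1) hand-crafted rules first
        if pattern in title.lower():
            return label
    proba = model.predict_proba(vec.transform([title]))[0]
    if proba.max() >= threshold:            # 2) ML, only when confident
        return model.classes_[proba.argmax()]
    return None                             # 3) abstain -> crowd evaluation

print(classify("womens trail running shoes"))  # likely 'shoes'
print(classify("garden hose 50 ft"))           # likely None (no call made)
```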
Product Title Classification versus Text Classification
- Variant of short-text classification
- Should we apply the standard NLP pre-processing pipeline to short-text?
- Ebay example
- Bigrams are more beneficial for short text (trigrams or higher don’t make sense because of the limited number of words)
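A minimal sketch of unigram + bigram features for a short product title; the title is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

# With only a handful of words in a title, bigrams still add useful
# context, but trigrams and higher yield few usable features.
title = ["apple iphone 12 case black"]
print(CountVectorizer(ngram_range=(1, 2)).fit(title).get_feature_names_out())
```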
Symbolic vs. Neural approaches for Product Search
- Neural IR is the future, but right now the best systems are symbolic IR. In the next two years we could transition to neural IR.
- The NLP pipeline is fragile: it has many steps, and a breakdown in any one step causes problems downstream.
- Image encoding can be used for queries like "t-shirt logo on back"; this could be combined with text encoding.
- Get a training set through user interactions such as clicks, add-to-wishlist, and purchases.
-Zalando.com example
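A hypothetical sketch of combining image and text encodings into one product vector; encode_image and encode_text are invented placeholders, not any real system's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(img_id):
    # Placeholder: a real system would run an image encoder (e.g., a CNN).
    return rng.standard_normal(4)

def encode_text(text):
    # Placeholder: a real system would run a text encoder.
    return rng.standard_normal(4)

def product_vector(img_id, title):
    # Concatenate the two signals so "t-shirt logo on back" can match on
    # both the title text and the image content.
    return np.concatenate([encode_image(img_id), encode_text(title)])

print(product_vector("img_123", "t-shirt with logo on back").shape)  # (8,)
```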