Semantic Analysis - Text Classification Flashcards
Types of Classification
1) Content-based classification: at least two classes (a binary classifier) or more than two (a multi-class application). Different applications may require different types of classification. The shorter the text, the less there is to go on. This is supervised learning.
2) Descriptor-based classification: you are given a written description of what the classes of content are. Someone makes a request and there are no example documents yet, e.g., legal discovery ("give me all the company's emails that discuss this product") or a FOIA request. You write a description, perhaps 10 or 30 sentences long, and someone decides what matches that description, so it is classification by description rather than a search task. Less has been published on descriptor-based text classification to date.
Subject-based Classification
Two Approaches to Subject-Based Automated Document Classification:
1) Multinomial Naive Bayes classifiers
2) SVM-based classifiers
Naive Bayes Models
The class is the category being predicted; the predictors are the attributes or features used to make the prediction. The model calculates the posterior probability of the class.
Prior Probability
Prior probability of the class: the global distribution of individuals into that class. Look at the past.
Prior probability of the predictor: the global distribution of individuals across that predictor.
Posterior Probability
Posterior probability of the predictor: the probability of having the predictor attributes.
Posterior probability of the class: the probability of falling into the target class after an observation, such as the term frequencies of a document; it is the additional information taken into consideration.
Prior and Posterior
Before and after an observation
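A minimal sketch of prior vs. posterior with scikit-learn's MultinomialNB; the documents and labels below are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy documents and labels, invented for illustration only.
docs = ["the team won the game",
        "stocks fell sharply today",
        "the coach praised the team",
        "the market rallied on earnings"]
labels = ["sports", "business", "sports", "business"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # term frequencies as features
clf = MultinomialNB().fit(X, labels)

# Prior probability of each class: the global label distribution (the past).
print(dict(zip(clf.classes_, np.exp(clf.class_log_prior_))))

# Posterior probability of each class, after observing a new document's terms.
new = vec.transform(["the team played a great game"])
print(dict(zip(clf.classes_, clf.predict_proba(new)[0])))
```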
SVMs
- Separates classes with hyperplanes
- Maximizes the separation between classes
- When the maximum margin is found, you are done
- Which side of the line a new observation's features fall on determines its prediction
- Adds dimensionality when needed
- Example: applying the mod function maps one-dimensional numbers into two dimensions, where a separating line can be drawn
- Has a sense of outlier detection
- Best-fit algorithm (the separation doesn't have to be perfect)
- Kernel tricks transform all the data in the same way so that a dividing line can be found in the training data
- Hyperplane: it is called a hyperplane because a plane is two-dimensional; a hyperplane generalizes this to more than two dimensions
- Powerful, fast, one of our favorites for document classification.
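A minimal sketch of the added-dimension and kernel-trick ideas with scikit-learn's SVC; the data is invented (the class alternates with parity, echoing the mod example above):

```python
import numpy as np
from sklearn.svm import SVC

# 1-D points whose class alternates with parity: not linearly separable
# on the number line. Data invented for illustration.
x = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20) % 2

# Adding a second dimension with the mod function makes a separating line trivial.
X2 = np.hstack([x, x % 2])
print(SVC(kernel="linear").fit(X2, y).score(X2, y))  # separable: 1.0

# Kernel trick: an RBF kernel transforms all the data in the same way
# implicitly, so the SVM can find a dividing boundary without building X2.
print(SVC(kernel="rbf", gamma=1.0).fit(x, y).score(x, y))
```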
Features
Are weighted term frequencies (TF-IDF)
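A minimal TF-IDF sketch with scikit-learn; the toy corpus is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Words that appear across many documents (like "the") get a lower IDF
# weight than distinctive words. Corpus invented for illustration.
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks rose sharply today"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))  # one row of weighted term frequencies per doc
```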
Architecting a Classification System
1) Training docs. We have example documents that are sports news/not sports news
2) Sentence tokenizer
3) Word tokenizer
4) Stemmer
5) Possibly throw out stopwords, expand contractions
6) Feature extraction to pull out features (the word "the" should not be a training feature)
7) Run it through a classifier (NB, SVM, etc.) to build a model
8) Test docs.
9) Normalize (follow steps 2-6 above)
10) Run it through the model and get a prediction
11) Compare predictions to the actual labels (model performance evaluation); see the pipeline sketch below
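A minimal sketch of steps 1-11 with scikit-learn; the corpus is invented, and steps 2-6 are folded into TfidfVectorizer (a stemmer could be supplied via its tokenizer argument):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Step 1: toy training docs (sports news / not sports news), invented.
docs = ["the team won the championship game",
        "the striker scored twice in the final",
        "the senate passed the budget bill",
        "the central bank raised interest rates",
        "the coach announced the starting lineup",
        "lawmakers debated the new tax proposal"]
labels = ["sports", "sports", "not-sports",
          "not-sports", "sports", "not-sports"]

# Steps 2-6: tokenization, stopword removal, and feature extraction are
# handled here by TfidfVectorizer; stemming is omitted for brevity.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

train_docs, test_docs, train_y, test_y = train_test_split(
    docs, labels, test_size=0.33, random_state=0)
model.fit(train_docs, train_y)              # step 7: build the model
pred = model.predict(test_docs)             # steps 8-10: predict on test docs
print(classification_report(test_y, pred))  # step 11: compare to actual
```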
Descriptor-Based Classifiers - Case 1
Two-phase process:
1) Information Retrieval problem
The user enters a robust description of the desired class of documents. Pull out the chunks and use them to do keyword searches based on those chunks. Examine all the docs and identify strong hits. A strong hit is affected by the number of TF-IDF keywords found, phrase matches, the keyword sentence density of the document, the proximity of keywords, and keywords found in the title, subtitle, URL, filenames, image captions or alt tags, metadata keywords, hyperlinks, and the breadcrumb trail.
2) Content-Based Classification
Train, e.g., an SVM on the strong-hit documents (see the sketch below)
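A minimal sketch of the two-phase process, using TF-IDF cosine similarity as a stand-in for the richer strong-hit scoring listed above; the descriptor, corpus, and threshold are all invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical inputs: a written descriptor and an unlabeled corpus.
descriptor = "emails discussing the new product launch, pricing, and marketing plans"
corpus = ["pricing discussion for the product launch",
          "lunch menu for friday",
          "marketing plan draft for the new product",
          "server maintenance window"]

# Phase 1 (information retrieval): score each document against the
# descriptor; documents above a threshold count as strong hits.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform([descriptor] + corpus)
scores = cosine_similarity(X[0], X[1:]).ravel()
labels = ["hit" if s > 0.2 else "miss" for s in scores]  # threshold is a guess

# Phase 2 (content-based classification): train an SVM on the strong hits
# vs. the rest, then apply it to new documents.
clf = LinearSVC().fit(X[1:], labels)
print(clf.predict(tfidf.transform(["pricing questions about the launch"])))
```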
Descriptor-Based Classifiers - Case 2
Instead of a paragraph of description, we could be given an empty taxonomy: "I have news I want classified into sports, business, and politics," but there is no training data with those labels.
- Get queries out of the taxonomy, because that's all we have. Create Boolean queries: sports AND NOT (business OR politics), etc. (see the sketch below)
Works when taxonomy fits well
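A minimal sketch of turning an empty taxonomy into Boolean queries; the taxonomy terms and matcher are toy stand-ins for a real query engine:

```python
# One Boolean query per label: match that label's terms and none of the
# other labels' terms. Taxonomy terms invented for illustration.
taxonomy = {"sports": ["game", "team"],
            "business": ["market", "stocks"],
            "politics": ["senate", "election"]}

def matches(doc, label):
    words = set(doc.lower().split())
    own = any(t in words for t in taxonomy[label])
    others = any(t in words
                 for other, terms in taxonomy.items() if other != label
                 for t in terms)
    return own and not others  # e.g., sports AND NOT (business OR politics)

doc = "the team won the game last night"
print([label for label in taxonomy if matches(doc, label)])  # ['sports']
```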
Multi-class vs. multi-label
Multi-class: multiple classes, mutually exclusive; each observation gets exactly one.
Multi-label: assigning multiple classes to a single observation.
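A minimal illustration of the difference, using scikit-learn's MultiLabelBinarizer; the labels are invented:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Multi-class: each observation carries exactly one label.
multiclass_y = ["sports", "business", "politics"]

# Multi-label: an observation may carry several labels at once, encoded
# as an indicator matrix with one column per class.
multilabel_y = [{"sports"}, {"business", "politics"}, {"sports", "business"}]
mlb = MultiLabelBinarizer()
print(mlb.fit_transform(multilabel_y))  # rows = observations, cols = classes
print(mlb.classes_)
```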
Chimera
Large-scale classification using machine learning, rules, and crowdsourcing
- Classify millions of product descriptions into 5,000+ product types; new classes come up all the time, and classes may change over time (drift, evolution)
- Combines learning and hand-crafted categories
- Must be fast response time (business requirement)
- High precision, low recall (better to not make a call than to make a bad classification)
- Chimera system architecture (see ppt)
- Voting Master, Crowd evaluation
How do you add a new rule when you have millions of customers? You can't take the system offline, so you add a band-aid fix: a line of code or a fuzzy mapping, e.g., if you see a new "Uber drive" product, create a category on the fly. This is the rule-based part of the system. Set a threshold, e.g., after 10 or 15 new rules, update and retrain the model.
- Add new training data that is categorized based on a rule?
- Human evaluation component is here to stay
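A minimal sketch of the precision-over-recall flow described above: hand-crafted rules first, a learned classifier as fallback, and abstention when neither is confident. The data, rules, and threshold are invented; this is not Chimera's actual code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration.
titles = ["mens running shoes size 10", "wireless bluetooth headphones",
          "trail running shoes for women", "noise cancelling headphones"]
types = ["shoes", "electronics", "shoes", "electronics"]

vec = TfidfVectorizer()
model = MultinomialNB().fit(vec.fit_transform(titles), types)

# Band-aid rules patched in at runtime, without taking the system offline.
RULES = {"uber drive": "rideshare accessories"}

def classify(title, threshold=0.8):
    for pattern, label in RULES.items():    # 1) hand-crafted rules first
        if pattern in title.lower():
            return label
    proba = model.predict_proba(vec.transform([title]))[0]
    if proba.max() >= threshold:            # 2) ML, only when confident
        return model.classes_[proba.argmax()]
    return None                             # 3) abstain -> crowd evaluation

print(classify("womens trail running shoes"))  # likely 'shoes'
print(classify("garden hose 50 ft"))           # likely None (no call made)
```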
Product Title Classification versus Text Classification
- Variant of short-text classification
- Should we apply the standard NLP pre-processing pipeline to short-text?
- Ebay example
- Bigrams are more beneficial for short text (trigrams or higher don’t make sense because of the limited number of words)
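A minimal sketch of unigram + bigram features for a short product title; the title is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

# With only a handful of words in a title, bigrams still add useful
# context, but trigrams and higher yield few usable features.
title = ["apple iphone 12 case black"]
print(CountVectorizer(ngram_range=(1, 2)).fit(title).get_feature_names_out())
```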
Symbolic vs. Neural approaches for Product Search
- Neural IR is the future, but right now the best systems are symbolic IR. In the next two years we could transition to neural IR.
- The NLP pipeline is fragile: it has many steps, and a breakdown in any one step causes problems downstream.
- Image encoding can be used for queries like "t-shirt logo on back"; this could be combined with text encoding.
- Get a training set through user interactions such as clicks, add-to-wishlist, and purchases.
-Zalando.com example
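A hypothetical sketch of combining image and text encodings into one product vector; encode_image and encode_text are invented placeholders, not any real system's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(img_id):
    # Placeholder: a real system would run an image encoder (e.g., a CNN).
    return rng.standard_normal(4)

def encode_text(text):
    # Placeholder: a real system would run a text encoder.
    return rng.standard_normal(4)

def product_vector(img_id, title):
    # Concatenate the two signals so "t-shirt logo on back" can match on
    # both the title text and the image content.
    return np.concatenate([encode_image(img_id), encode_text(title)])

print(product_vector("img_123", "t-shirt with logo on back").shape)  # (8,)
```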