Text Mining Flashcards
NLP
Lexical, Syntactic, Semantic, Pragmatic, Inference
Information Retrieval
Locating relevant documents with respect to an input query.
Two modes of text access: Pull Mode (search engines), Push Mode (recommender systems).
Text Retrieval Methods: Document selection (keyword-based), Document ranking (similarity-based).
Document Similarity (Vector Space Model - Bag of Words)
Document and query are represented as vectors in a high-dimensional space.
Relevance is measured with an appropriate similarity measure.
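A minimal sketch of this idea in Python, using word-count dictionaries as bag-of-words vectors and cosine similarity as the relevance measure (the corpus and helper names are illustrative, not from any particular library):

```python
import math
from collections import Counter

def bow(text):
    # Bag-of-words vector: word -> raw count
    return Counter(text.lower().split())

def cosine(q, d):
    # Cosine similarity between two sparse count vectors
    dot = sum(q[w] * d.get(w, 0) for w in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

query = bow("text mining methods")
docs = ["text mining extracts knowledge from text",
        "image processing methods"]
# Rank documents by similarity to the query
print(sorted(docs, key=lambda d: cosine(query, bow(d)), reverse=True))
```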
Document Ranking
Retrieval models
Similarity: sim(q,d)
Probabilistic: p(R=1|d,q)
Prob inference: p(d->q)
Axiomatic: set of constraints
VSM: number of distinct query words matched in d
TF weighting: raw count of the word in the document, c(w,d)
IDF: log[(M+1)/df(w)], where M = number of documents and df(w) = number of documents containing w
IDF + TF
f(q,d) = sum over query words w of c(w,q) * c(w,d) * log((M+1)/df(w))
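A minimal sketch of this TF-IDF ranking function over a toy in-memory collection (document texts and variable names are illustrative):

```python
import math
from collections import Counter

docs = ["text mining mines text", "data mining", "query languages"]
M = len(docs)
counts = [Counter(d.split()) for d in docs]

def df(w):
    # Document frequency: number of documents containing w
    return sum(1 for c in counts if w in c)

def tfidf_score(query, c_d):
    # f(q,d) = sum over query words of c(w,q) * c(w,d) * log((M+1)/df(w))
    c_q = Counter(query.split())
    return sum(c_q[w] * c_d[w] * math.log((M + 1) / df(w))
               for w in c_q if w in c_d)

for text, c_d in zip(docs, counts):
    print(text, "->", round(tfidf_score("text mining", c_d), 3))
```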
TF Weighting
y=x
y=log(1+x)
y=log(1+log(1+x))
BM25
y = (k+1) c(w,d) / (c(w,d)+k)
Document Length Normalization
Penalize a long document with a document length normalizer.
Pivoted length normalizer: avg doc length as pivot
norm = 1 if the document length equals the average
norm = 1 - b + b * (|d|/avdl)
VSM: ln(1+ln(1+c(w,d)))/(norm)
BM25/Okapi: (k+1) c(w,d) / (c(w,d)+k*(norm))
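A sketch of the BM25/Okapi term weight with the pivoted length normalizer; k and b are free parameters (the values below are common defaults, used only for illustration):

```python
import math

def bm25_weight(c_wd, doc_len, avdl, df_w, M, k=1.2, b=0.75):
    # Pivoted length normalizer: 1 - b + b * (|d| / avdl)
    norm = 1 - b + b * (doc_len / avdl)
    # Saturating TF transform, penalized by the length normalizer
    tf = (k + 1) * c_wd / (c_wd + k * norm)
    # IDF component: log((M+1)/df(w))
    return tf * math.log((M + 1) / df_w)

# The same raw count contributes less in a document twice the average length
print(bm25_weight(c_wd=3, doc_len=100, avdl=100, df_w=10, M=1000))
print(bm25_weight(c_wd=3, doc_len=200, avdl=100, df_w=10, M=1000))
```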
From Text to Numerical
Keyword selection: tokenization; stop-word removal and stemming are used to isolate significant keywords.
Stop words: a, the, always.
Word stemming: reduce words with a common stem to one form (computer, computing, computerize).
Dimensionality reduction: Latent Semantic Indexing (LSI), Locality Preserving Indexing (LPI), Probabilistic Latent Semantic Indexing (PLSI).
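A minimal preprocessing sketch, assuming NLTK is available for Porter stemming; the stop-word list here is a tiny illustrative subset, not a real one:

```python
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "the", "always", "and", "is"}  # illustrative subset only
stemmer = PorterStemmer()

def keywords(text):
    # Tokenize, drop stop words, reduce the remaining words to stems
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

# Related forms (computer, computing, computerize) collapse to shared stems
print(keywords("The computer and computing always computerize"))
```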
Latent Semantic Indexing
x_i: vectors representing the documents
X: the matrix of all document vectors
SVD: X = U Σ V^T
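A minimal sketch with NumPy: take the SVD of a toy document-term count matrix and keep the top r singular dimensions as the latent representation (the matrix values are made up for illustration):

```python
import numpy as np

# Toy matrix X: rows = document vectors x_i, columns = terms
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 1],
              [0, 0, 3, 1],
              [0, 1, 2, 2]], dtype=float)

# SVD: X = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the r largest singular values -> r-dimensional latent space
r = 2
docs_latent = U[:, :r] * S[:r]   # documents as r-dimensional vectors
print(docs_latent)
```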
Text Classification
Labeling text documents by topic, style, or purpose.
Similarity: information retrieval + k-nearest neighbours; the k most similar documents are retrieved (see the sketch after this list).
Dimensionality reduction: represent documents by their keyword distributions, then apply standard classification techniques.
Naive Bayes
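A toy sketch of the similarity-based approach above: retrieve the k most similar training documents (cosine over bag-of-words) and take a majority vote over their labels; corpus, labels, and k are illustrative:

```python
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

train = [("stock markets fell today", "finance"),
         ("the team won the final match", "sports"),
         ("shares and bonds rallied", "finance")]

def knn_classify(text, k=2):
    # Retrieve the k most similar training documents, then vote on the label
    sims = sorted(((cosine(bow(text), bow(t)), label) for t, label in train),
                  reverse=True)[:k]
    return Counter(label for _, label in sims).most_common(1)[0][0]

print(knn_classify("bonds and shares fell"))  # expected: finance
```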
Word Embeddings
Words with similar meanings have similar representations.
Individual words are represented as real-valued vectors; one word is mapped to one vector.
Vector values are learned, "like" a neural Bag-of-Words model.
Word Embedding Applications
Word Similarity
Machine Translation
Relation Extraction
Sentiment Analysis: average the word embeddings using TF-IDF weights, then classify as + or - (see the sketch after this list).
Coreference Resolution: chaining entity mentions across multiple documents.
Clustering: words that appear in similar contexts are grouped into similar classes.
Semantic Analysis: word distributions for various topics.
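A toy sketch of the sentiment-analysis recipe above (TF-IDF-weighted average of word embeddings); the embedding table and IDF weights are invented for illustration, and the resulting vector would then go to a +/- classifier:

```python
import numpy as np
from collections import Counter

# Invented 2-d word embeddings and IDF weights, for illustration only
emb = {"great": np.array([0.9, 0.1]), "movie": np.array([0.2, 0.2]),
       "terrible": np.array([-0.8, 0.3])}
idf = {"great": 2.0, "movie": 0.5, "terrible": 2.2}

def doc_vector(text):
    # TF-IDF-weighted average of the word embeddings
    tf = Counter(text.lower().split())
    weights = {w: tf[w] * idf[w] for w in tf if w in emb}
    total = sum(weights.values())
    return sum(weights[w] * emb[w] for w in weights) / total

print(doc_vector("great great movie"))   # pulled towards "great"
print(doc_vector("terrible movie"))      # pulled towards "terrible"
```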
Bag of Words
One hot encoding
One bit position in huge vector
Context information is not used
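A toy sketch of one-hot encoding over a small vocabulary (in practice the vector has one position per vocabulary word, so it is huge and sparse, and carries no context information):

```python
vocab = ["data", "mining", "text", "retrieval"]

def one_hot(word):
    # A single 1 at the word's position, 0 everywhere else
    return [1 if w == word else 0 for w in vocab]

print(one_hot("text"))   # [0, 0, 1, 0]
```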
Word2Vec
Skip-gram neural network architecture.
Trains a simple neural network with a single hidden layer.
Predicts a probability for every word in the vocabulary.
Output vector is a probability distribution
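A minimal sketch assuming the gensim library is installed (parameter names follow gensim 4.x; sg=1 selects the skip-gram architecture, and the corpus is a toy example):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [["text", "mining", "finds", "patterns", "in", "text"],
             ["word", "embeddings", "map", "words", "to", "vectors"],
             ["similar", "words", "get", "similar", "vectors"]]

# sg=1 -> skip-gram; vector_size is the hidden-layer / embedding dimension
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["text"][:5])             # learned vector for "text"
print(model.wv.most_similar("words"))   # nearest words by cosine similarity
```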
Naive Bayes for Text
Document categories: C1, C2, ..., Cn.
Document to classify: D.
Probabilistic model: P(Ci|D) = P(D|Ci) * P(Ci) / P(D).
We choose: argmax over C of P(D|C) * P(C).
Keyword distributions are assumed independent of each other and of word order.
M-estimate: P(wk|C) = (Nc,k + 1) / (Nc + |Vocab|)
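A minimal multinomial Naive Bayes sketch using the M-estimate above, P(wk|C) = (Nc,k + 1) / (Nc + |Vocab|); the training documents and labels are invented for illustration:

```python
import math
from collections import Counter, defaultdict

train = [("win cash prize now", "spam"),
         ("meeting agenda attached", "ham"),
         ("win a free prize", "spam")]

# Per-class word counts N_{c,k} and per-class document counts for the prior P(C)
word_counts = defaultdict(Counter)
class_docs = Counter()
for text, c in train:
    word_counts[c].update(text.split())
    class_docs[c] += 1
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    best_c, best_score = None, -math.inf
    for c in class_docs:
        n_c = sum(word_counts[c].values())   # N_c: total words in class c
        # log P(C) + sum_k log P(w_k|C), with M-estimate smoothing
        score = math.log(class_docs[c] / len(train))
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (n_c + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(classify("free cash prize"))   # expected: spam
```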