Text Mining Flashcards

1
Q

NLP

A
Lexical
Syntactic
Semantic
Pragmatic
Inference
2
Q

Information Retrieval

A
Locating relevant documents with respect to an input
Two modes of text access
	Pull Mode (search engines)
	Push Mode (recommender systems)
Text Retrieval Methods
	Document selection (keyword)
	Document ranking (similarity)
3
Q

Document Similarity (Vector Space Model - Bag of Words)

A

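Document and query are represented as vectors in a high-dimensional space.
Relevance is measured with an appropriate similarity measure.
Document Ranking
Retrieval models
Similarity: sim(q,d)
Probabilistic: p(R=1|d,q)
Probabilistic inference: p(d->q)
Axiomatic: a set of constraints
Simplest VSM (bit vectors): number of distinct query words matched in d
TF weighting: count of the word in the document, c(w,d)
IDF: log[(M+1)/k], where M is the total number of documents and k is the number of documents containing the word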

4
Q

IDF + TF

A

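f(q,d) = sum over words w in both q and d of c(w,q) * c(w,d) * log((M+1)/df(w))
c(w,q), c(w,d): counts of w in the query and in the document
df(w): number of documents containing w; M: total number of documents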
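
A minimal sketch of this score in Python, assuming documents and queries are already tokenized into lists of words. The function name tfidf_score and the toy collection are hypothetical, chosen only for illustration; M is the collection size and df(w) the number of documents containing w.

```python
import math
from collections import Counter

def tfidf_score(query, doc, collection):
    """Sum c(w,q) * c(w,d) * log((M+1)/df(w)) over words shared by query and doc."""
    M = len(collection)
    q_counts = Counter(query)
    d_counts = Counter(doc)
    score = 0.0
    for w, cq in q_counts.items():
        cd = d_counts.get(w, 0)
        if cd == 0:
            continue  # word not matched in the document
        df = sum(1 for other in collection if w in other)
        if df == 0:
            continue  # doc is not part of the collection; skip to avoid dividing by zero
        score += cq * cd * math.log((M + 1) / df)
    return score

# Toy collection (illustrative data only)
docs = [
    "news about presidential campaign".split(),
    "news about organic food campaign".split(),
    "news of presidential candidate".split(),
]
query = "presidential campaign".split()
scores = [tfidf_score(query, d, docs) for d in docs]
```

Here the first document matches both query words, so it should receive the highest score of the three.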

5
Q

TF Weighting

A

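x = c(w,d), the raw count of the word in the document; y = transformed TF weight
Linear: y = x
Logarithmic: y = log(1+x)
Double logarithmic: y = log(1+log(1+x))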
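
The transformations above can be compared with a short Python sketch. The function names are hypothetical; x stands for the raw count c(w,d). The point is to show how strongly each option dampens repeated occurrences of a word.

```python
import math

def tf_linear(x):
    # y = x: every extra occurrence counts fully
    return float(x)

def tf_log(x):
    # y = log(1 + x): sublinear growth
    return math.log(1 + x)

def tf_double_log(x):
    # y = log(1 + log(1 + x)): even stronger dampening
    return math.log(1 + math.log(1 + x))

# Compare the three weights for a few raw counts
for x in (0, 1, 10, 100):
    print(x, tf_linear(x), round(tf_log(x), 3), round(tf_double_log(x), 3))
```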

6
Q

BM25

A

TF transformation: y = (k+1) c(w,d) / (c(w,d)+k), with k >= 0
y is bounded above by k+1
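
A small Python sketch of this transformation, to show how the weight saturates as the count grows. The function name bm25_tf is hypothetical, and the default k = 1.2 is an assumption (a commonly used value, not one given on this card).

```python
def bm25_tf(count, k=1.2):
    """BM25 TF transformation: (k+1) * c / (c + k)."""
    return (k + 1) * count / (count + k)

# The weight rises quickly at first, then flattens below k + 1
for c in (0, 1, 2, 10, 1000):
    print(c, round(bm25_tf(c), 3))
```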

7
Q

Document Length Normalization

A

Penalize a long document with a document length normalizer.
Pivoted length normalizer: uses the average document length (avdl) as the pivot
norm = 1 if the document length equals the average
norm = 1-b+b(|d|/avdl), with b in [0,1]
VSM: ln(1+ln(1+c(w,d)))/norm
BM25/Okapi: (k+1) c(w,d) / (c(w,d)+k*norm)
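
The normalizer and the two normalized TF formulas can be sketched in Python as follows. The function names are hypothetical, and the defaults b = 0.75 and k = 1.2 are assumptions (commonly used values, not taken from this card).

```python
import math

def pivoted_norm(doc_len, avdl, b=0.75):
    """norm = 1 - b + b * (|d| / avdl); equals 1 when doc_len == avdl."""
    return 1 - b + b * (doc_len / avdl)

def vsm_tf(count, doc_len, avdl, b=0.75):
    """Double-log TF divided by the pivoted length normalizer."""
    return math.log(1 + math.log(1 + count)) / pivoted_norm(doc_len, avdl, b)

def bm25_tf_normalized(count, doc_len, avdl, k=1.2, b=0.75):
    """BM25 TF with the normalizer scaling k in the denominator."""
    return (k + 1) * count / (count + k * pivoted_norm(doc_len, avdl, b))

# Same raw count, different document lengths (average length 100)
short_doc = bm25_tf_normalized(3, doc_len=50, avdl=100)
average_doc = bm25_tf_normalized(3, doc_len=100, avdl=100)
long_doc = bm25_tf_normalized(3, doc_len=200, avdl=100)
```

With the same count, the longer document gets the smaller weight, which is the intended penalty.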

8
Q

From Text to Numerical

A
Keyword selection: tokenization
Stop-word removal and stemming are used to isolate significant keywords.
	Stop words: a, the, always
	Word stemming: reduce words to a common stem (computer, computing, computerize)
Dimensionality Reduction
	Latent Semantic Indexing (LSI)
	Locality Preserving Indexing (LPI)
	Probabilistic Latent Semantic Indexing (PLSI)
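
The keyword selection steps can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the stop-word set reuses the three examples from the card, and toy_stem is a deliberately crude suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re

STOP_WORDS = {"a", "the", "always"}   # examples from the card, not a full list
SUFFIXES = ("erize", "ing", "er")     # toy suffix list, longest first

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def keywords(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [toy_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

result = keywords("A computer, computing; computerize the")
```

All three example words reduce to the same stem, so they would count as one keyword.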
9
Q

Latent Semantic Indexing

A

x_i: vectors representing documents
X: matrix of all document vectors
SVD: X = U Σ V^T

10
Q

Text Classification

A

Labeling text documents by topic, style, or purpose.
Similarity-based: information retrieval + k-nearest neighbors. The k most similar documents are retrieved.
Dimensionality reduction: distribution of keywords. Standard classification techniques can then be applied.
Naive Bayes

11
Q

Word Embeddings

A
Words with similar meanings have similar representations.
Individual words are represented as real-valued vectors.
Each word is mapped to one vector.
Vector values are learned in a way that resembles training a neural network.
Bag of Words Model
12
Q

Word Embedding Applications

A

Word Similarity
Machine Translation
Relation Extraction
Sentiment Analysis: average word embeddings using TF-IDF weights, then classify as positive or negative.
Co-reference Resolution: chaining entity mentions across multiple documents.
Clustering: Similar class, similar contexts.
Semantic Analysis: word distributions for various topics.

13
Q

Bag of Words

A

One hot encoding
One bit position in huge vector
Context information is not used
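
A short Python sketch of this representation, assuming documents are already tokenized. The helper names build_vocab, one_hot, and bag_of_words are hypothetical. The two example documents contain the same words in a different order and end up with the same vector, which illustrates that context is not used.

```python
def build_vocab(docs):
    """Map each distinct word to one position in the vector."""
    vocab = sorted({w for d in docs for w in d})
    return {w: i for i, w in enumerate(vocab)}

def one_hot(word, index):
    """One position set to 1, all others 0."""
    vec = [0] * len(index)
    vec[index[word]] = 1
    return vec

def bag_of_words(doc, index):
    """Sum of the one-hot vectors of the words in the document."""
    vec = [0] * len(index)
    for w in doc:
        vec[index[w]] += 1
    return vec

docs = [["dog", "bites", "man"], ["man", "bites", "dog"]]
index = build_vocab(docs)
dog_vector = one_hot("dog", index)
bow_first = bag_of_words(docs[0], index)
bow_second = bag_of_words(docs[1], index)
```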

14
Q

Word2Vec

A

Skip-gram neural network architecture.
Trains a simple neural network with a single hidden layer.
Predicts a probability for every word in the vocabulary.
The output vector is a probability distribution.
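
A rough Python sketch of the forward pass only, to show why the output is a probability distribution over the vocabulary. This is an untrained toy with random weights, not the Word2Vec training procedure; the names skipgram_forward, W_in, and W_out and the sizes used are assumptions for illustration.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skipgram_forward(word_id, W_in, W_out):
    """One-hot input selects a hidden vector; output is a softmax over the vocabulary."""
    hidden = W_in[word_id]  # single hidden layer: the word's embedding
    scores = [sum(h * w for h, w in zip(hidden, out_vec)) for out_vec in W_out]
    return softmax(scores)

random.seed(0)
vocab_size, dim = 5, 3
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
probs = skipgram_forward(2, W_in, W_out)
```

After training, the rows of W_in would serve as the word vectors.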

15
Q

Naive Bayes for Text

A
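Document categories (C1, C2, ..., Cn)
Document to classify (D)
Probabilistic model: P(Ci|D) = P(D|Ci)*P(Ci)/P(D)
We choose: argmax over Ci of P(D|Ci)P(Ci)
Keyword distributions are assumed to be mutually independent and order-independent.

M-estimate: P(wk|C) = (Nc,k + 1) / (Nc + |Vocab|)
where Nc,k is the count of word wk in documents of category C and Nc is the total word count in C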
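
A minimal Python sketch of this classifier, assuming tokenized documents and the add-one estimate shown above. The function names train and classify and the tiny training set are hypothetical. Log probabilities are used to avoid underflow, and words never seen in training are skipped, which is a simplification chosen here.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def classify(words, model):
    """Return argmax over C of log P(C) + sum of log P(wk|C)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior P(C)
        n_c = sum(word_counts[label].values())            # total words in class
        for w in words:
            if w not in vocab:
                continue  # assumption: ignore words unseen in training
            # add-one estimate: (Nc,k + 1) / (Nc + |Vocab|)
            score += math.log((word_counts[label][w] + 1) / (n_c + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Illustrative training data only
training = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train(training)
```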

16
Q

Clustering Text

A

Tokenize and stem.
Convert each document into a TF-IDF vector.
Calculate the cosine distance between each pair of documents.
Cluster documents using similarity.
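
The vectorization and similarity steps can be sketched in Python as follows, assuming tokenized documents. The function names and the toy documents are hypothetical, and the IDF form log((M+1)/df) is an assumption borrowed from the IDF card. Cosine distance is 1 minus cosine similarity; a clustering algorithm would then group documents with small pairwise distances.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict) for each tokenized document."""
    M = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency per word
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append({w: c * math.log((M + 1) / df[w]) for w, c in counts.items()})
    return vectors

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "cat sat mat".split(),
    "cat sat hat".split(),
    "stock market fell".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])
sim_02 = cosine_similarity(vecs[0], vecs[2])
distance_01 = 1 - sim_01
```

The first two documents share words and come out closer to each other than to the third.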

17
Q

Text clustering preprocessing

A
- Bag of Words
Punctuation
N Chars
Number filter
Case converter
Stop word filter
Porter Stemmer
Term Grouper
- Keygraph keyword extractor