Text Mining Flashcards
NLP
Lexical, Syntactic, Semantic, Pragmatic, Inference
Information Retrieval
Locating relevant documents with respect to an input query.
Two modes of text access: Pull Mode (search engines), Push Mode (recommender systems).
Text Retrieval Methods: Document selection (keyword-based), Document ranking (similarity-based).
Document Similarity (Vector Space Model - Bag of Words)
Document and query are represented as vectors in a high-dimensional space.
Relevance is measured with an appropriate similarity measure.
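A minimal sketch of this idea in Python, using word-count dictionaries as bag-of-words vectors and cosine similarity as the relevance measure (the corpus and helper names are illustrative, not from any particular library):

```python
import math
from collections import Counter

def bow(text):
    # Bag-of-words vector: word -> raw count
    return Counter(text.lower().split())

def cosine(q, d):
    # Cosine similarity between two sparse count vectors
    dot = sum(q[w] * d.get(w, 0) for w in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

query = bow("text mining methods")
docs = ["text mining extracts knowledge from text",
        "image processing methods"]
# Rank documents by similarity to the query
print(sorted(docs, key=lambda d: cosine(query, bow(d)), reverse=True))
```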
Document Ranking
Retrieval models
Similarity: sim(q,d)
Probabilistic: p(R=1|d,q)
Prob inference: p(d->q)
Axiomatic: set of constraints
VSM: number of distinct query words matched in d
TF weighting: raw count of the word in the document, c(w,d)
IDF: log[(M+1)/df(w)], where M = number of documents and df(w) = number of documents containing w
IDF + TF
f(q,d) = sum over query words w of c(w,q) * c(w,d) * log((M+1)/df(w))
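A minimal sketch of this TF-IDF ranking function over a toy in-memory collection (document texts and variable names are illustrative):

```python
import math
from collections import Counter

docs = ["text mining mines text", "data mining", "query languages"]
M = len(docs)
counts = [Counter(d.split()) for d in docs]

def df(w):
    # Document frequency: number of documents containing w
    return sum(1 for c in counts if w in c)

def tfidf_score(query, c_d):
    # f(q,d) = sum over query words of c(w,q) * c(w,d) * log((M+1)/df(w))
    c_q = Counter(query.split())
    return sum(c_q[w] * c_d[w] * math.log((M + 1) / df(w))
               for w in c_q if w in c_d)

for text, c_d in zip(docs, counts):
    print(text, "->", round(tfidf_score("text mining", c_d), 3))
```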
TF Weighting
y=x
y=log(1+x)
y=log(1+log(1+x))
BM25
y = (k+1) c(w,d) / (c(w,d)+k)
Document Length Normalization
Penalize a long document with a document length normalizer.
Pivoted length normalizer: avg doc length as pivot
norm = 1 if the document length equals the average
norm = 1 - b + b * (|d|/avdl)
VSM: ln(1+ln(1+c(w,d)))/(norm)
BM25/Okapi: (k+1) c(w,d) / (c(w,d)+k*(norm))
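A sketch of the BM25/Okapi term weight with the pivoted length normalizer; k and b are free parameters (the values below are common defaults, used only for illustration):

```python
import math

def bm25_weight(c_wd, doc_len, avdl, df_w, M, k=1.2, b=0.75):
    # Pivoted length normalizer: 1 - b + b * (|d| / avdl)
    norm = 1 - b + b * (doc_len / avdl)
    # Saturating TF transform, penalized by the length normalizer
    tf = (k + 1) * c_wd / (c_wd + k * norm)
    # IDF component: log((M+1)/df(w))
    return tf * math.log((M + 1) / df_w)

# The same raw count contributes less in a document twice the average length
print(bm25_weight(c_wd=3, doc_len=100, avdl=100, df_w=10, M=1000))
print(bm25_weight(c_wd=3, doc_len=200, avdl=100, df_w=10, M=1000))
```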
From Text to Numerical
Keyword selection: tokenization; stop-word removal and stemming are used to isolate significant keywords.
Stop words: a, the, always.
Word stemming: reduce words with a common stem to one form (computer, computing, computerize).
Dimensionality reduction: Latent Semantic Indexing (LSI), Locality Preserving Indexing (LPI), Probabilistic Latent Semantic Indexing (PLSI).
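A minimal preprocessing sketch, assuming NLTK is available for Porter stemming; the stop-word list here is a tiny illustrative subset, not a real one:

```python
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "the", "always", "and", "is"}  # illustrative subset only
stemmer = PorterStemmer()

def keywords(text):
    # Tokenize, drop stop words, reduce the remaining words to stems
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

# Related forms (computer, computing, computerize) collapse to shared stems
print(keywords("The computer and computing always computerize"))
```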
Latent Semantic Indexing
x_i: vectors representing the documents
X: the matrix of all document vectors
SVD: X = U Σ V^T
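A minimal sketch with NumPy: take the SVD of a toy document-term count matrix and keep the top r singular dimensions as the latent representation (the matrix values are made up for illustration):

```python
import numpy as np

# Toy matrix X: rows = document vectors x_i, columns = terms
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 1],
              [0, 0, 3, 1],
              [0, 1, 2, 2]], dtype=float)

# SVD: X = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the r largest singular values -> r-dimensional latent space
r = 2
docs_latent = U[:, :r] * S[:r]   # documents as r-dimensional vectors
print(docs_latent)
```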
Text Classification
Labeling text documents by topic, style, or purpose.
Similarity: information retrieval + k-nearest neighbours; the k most similar documents are retrieved (see the sketch after this list).
Dimensionality reduction: represent documents by their keyword distributions, then apply standard classification techniques.
Naive Bayes
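A toy sketch of the similarity-based approach above: retrieve the k most similar training documents (cosine over bag-of-words) and take a majority vote over their labels; corpus, labels, and k are illustrative:

```python
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

train = [("stock markets fell today", "finance"),
         ("the team won the final match", "sports"),
         ("shares and bonds rallied", "finance")]

def knn_classify(text, k=2):
    # Retrieve the k most similar training documents, then vote on the label
    sims = sorted(((cosine(bow(text), bow(t)), label) for t, label in train),
                  reverse=True)[:k]
    return Counter(label for _, label in sims).most_common(1)[0][0]

print(knn_classify("bonds and shares fell"))  # expected: finance
```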
Word Embeddings
Words with similar meanings have similar representations.
Individual words are represented as real-valued vectors; one word is mapped to one vector.
Vector values are learned, "like" a neural Bag-of-Words model.
Word Embedding Applications
Word Similarity
Machine Translation
Relation Extraction
Sentiment Analysis: average the word embeddings using TF-IDF weights, then classify as + or - (see the sketch after this list).
Coreference Resolution: chaining entity mentions across multiple documents.
Clustering: words that appear in similar contexts are grouped into similar classes.
Semantic Analysis: word distributions for various topics.
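A toy sketch of the sentiment-analysis recipe above (TF-IDF-weighted average of word embeddings); the embedding table and IDF weights are invented for illustration, and the resulting vector would then go to a +/- classifier:

```python
import numpy as np
from collections import Counter

# Invented 2-d word embeddings and IDF weights, for illustration only
emb = {"great": np.array([0.9, 0.1]), "movie": np.array([0.2, 0.2]),
       "terrible": np.array([-0.8, 0.3])}
idf = {"great": 2.0, "movie": 0.5, "terrible": 2.2}

def doc_vector(text):
    # TF-IDF-weighted average of the word embeddings
    tf = Counter(text.lower().split())
    weights = {w: tf[w] * idf[w] for w in tf if w in emb}
    total = sum(weights.values())
    return sum(weights[w] * emb[w] for w in weights) / total

print(doc_vector("great great movie"))   # pulled towards "great"
print(doc_vector("terrible movie"))      # pulled towards "terrible"
```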
Bag of Words
One hot encoding
One bit position in huge vector
Context information is not used
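A toy sketch of one-hot encoding over a small vocabulary (in practice the vector has one position per vocabulary word, so it is huge and sparse, and carries no context information):

```python
vocab = ["data", "mining", "text", "retrieval"]

def one_hot(word):
    # A single 1 at the word's position, 0 everywhere else
    return [1 if w == word else 0 for w in vocab]

print(one_hot("text"))   # [0, 0, 1, 0]
```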
Word2Vec
Skip-gram neural network architecture.
Trains a simple neural network with a single hidden layer.
Predicts a probability for every word in the vocabulary.
Output vector is a probability distribution
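A minimal sketch assuming the gensim library is installed (parameter names follow gensim 4.x; sg=1 selects the skip-gram architecture, and the corpus is a toy example):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [["text", "mining", "finds", "patterns", "in", "text"],
             ["word", "embeddings", "map", "words", "to", "vectors"],
             ["similar", "words", "get", "similar", "vectors"]]

# sg=1 -> skip-gram; vector_size is the hidden-layer / embedding dimension
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["text"][:5])             # learned vector for "text"
print(model.wv.most_similar("words"))   # nearest words by cosine similarity
```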
Naive Bayes for Text
Document categories: C1, C2, ..., Cn.
Document to classify: D.
Probabilistic model: P(Ci|D) = P(D|Ci) * P(Ci) / P(D).
We choose: argmax over C of P(D|C) * P(C).
Keyword distributions are assumed independent of each other and of word order.
M-estimate: P(wk|C) = (Nc,k + 1) / (Nc + |Vocab|)
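A minimal multinomial Naive Bayes sketch using the M-estimate above, P(wk|C) = (Nc,k + 1) / (Nc + |Vocab|); the training documents and labels are invented for illustration:

```python
import math
from collections import Counter, defaultdict

train = [("win cash prize now", "spam"),
         ("meeting agenda attached", "ham"),
         ("win a free prize", "spam")]

# Per-class word counts N_{c,k} and per-class document counts for the prior P(C)
word_counts = defaultdict(Counter)
class_docs = Counter()
for text, c in train:
    word_counts[c].update(text.split())
    class_docs[c] += 1
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    best_c, best_score = None, -math.inf
    for c in class_docs:
        n_c = sum(word_counts[c].values())   # N_c: total words in class c
        # log P(C) + sum_k log P(w_k|C), with M-estimate smoothing
        score = math.log(class_docs[c] / len(train))
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (n_c + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(classify("free cash prize"))   # expected: spam
```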