Natural Language Processing Flashcards
1
Q
NLP method summary
A
- find data: instead of spending months on unsupervised machine learning, take a couple of weeks to label data
- clean data (CTWMLL):
- remove irrelevant Characters
- Tokenize by separating into individual words
- remove irrelevant Words (such as Twitter mentions or URLs)
- consider correcting Misspelled words
- Lowercase everything
- consider Lemmatization
- find a good data representation: e.g., bag of words
- classification: stick with the simplest model that meets your needs (e.g., logistic regression)
- inspection: confusion matrix
- leveraging semantics:
- use a method like Word2Vec (perhaps averaging word vectors to get sentence representations)
- we lose explainability, so use tools like LIME
- end-to-end syntax: use convolutions or transformers where word order matters
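A minimal end-to-end sketch of this recipe, assuming scikit-learn and a made-up toy dataset (texts and labels are illustrative only):

```python
# Minimal sketch: bag-of-words features + logistic regression (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

# Toy labeled data (hypothetical).
texts = ["great product, works well", "terrible, broke after a day",
         "love it", "waste of money", "works as advertised", "do not buy"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),  # clean + bag of words
    LogisticRegression(),                                    # simplest classifier first
)
model.fit(texts, labels)

preds = model.predict(texts)
print(confusion_matrix(labels, preds))  # inspection step
```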
2
Q
term-document
A
- bag of words method
- n-terms by m-documents
- raw word count
- binary: 1 if the word appears in document else 0
- word frequency: 100*(raw count)/(total count)
- TF-IDF
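A small numpy sketch of the term-document matrix and the raw-count / binary / frequency weightings (the toy documents are made up):

```python
# Sketch of a term-document matrix and its common weightings (numpy only).
import numpy as np

docs = [["the", "dog", "ran"], ["the", "cat", "sat", "the", "cat"]]
vocab = sorted({w for d in docs for w in d})
word2idx = {w: i for i, w in enumerate(vocab)}

# raw counts: n terms (rows) by m documents (columns)
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for w in doc:
        counts[word2idx[w], j] += 1

binary = (counts > 0).astype(float)            # 1 if the word appears in the document, else 0
frequency = 100 * counts / counts.sum(axis=0)  # 100 * (raw count) / (total count per document)
```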
3
Q
term frequency–inverse document frequency (TF-IDF)
A
- increases proportionally to the number of times a word appears in the document and is offset by the number of OTHER documents in the corpus that contain the word
- idf = log(N / |{d ∈ D : t ∈ d}|), where the denominator is the number of documents containing the term
- shape V x D
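A numpy sketch of TF-IDF using the idf formula above (the counts matrix is a made-up toy example):

```python
# Sketch of TF-IDF from a term-document count matrix `counts` (V terms x D documents).
import numpy as np

counts = np.array([[2., 0., 1.],
                   [0., 3., 1.],
                   [1., 1., 1.]])          # toy V=3, D=3 matrix

tf = counts / counts.sum(axis=0)           # term frequency within each document
df = (counts > 0).sum(axis=1)              # number of documents containing each term
idf = np.log(counts.shape[1] / df)         # idf = log(N / |{d in D : t in d}|)
tfidf = tf * idf[:, None]                  # shape stays V x D
```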
4
Q
problems with raw word counts
A
- stopwords like “the” occur in all documents
- really long documents dominate because their raw counts (and hence vector norms) are much larger
5
Q
lemmatization
A
- determining the lemma of a word based on its intended meaning
- depends on correctly identifying the intended part of speech and meaning of a word in a sentence
- more accurate than stemming
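A quick sketch, assuming NLTK with the WordNet data downloaded, showing how the part of speech changes the lemma:

```python
# Sketch assuming nltk with the WordNet data available (nltk.download("wordnet")).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # "run"    (verb)
print(lemmatizer.lemmatize("better", pos="a"))   # "good"   (adjective)
print(lemmatizer.lemmatize("better"))            # "better" (defaults to noun, so no change)
```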
6
Q
main areas of NLP
A
- sentiment analysis: determine the writer’s positive, negative, or neutral feelings towards a particular topic or product
- named entity recognition: labels spans of text as names of people, organizations, locations, etc.
- part-of-speech tagging: label each word as noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, or interjection (e.g., using the Viterbi algorithm)
- latent semantic analysis: use methods like PCA/SVD to get a representation in a lower-dimensional space (embeddings)
7
Q
word embedding
A
- maps each word to a dense vector
- pretrained GloVe or Word2Vec vectors can be loaded into a new network
- can perform word analogies with these vectors (e.g., king − man + woman ≈ queen)
- idx is the word’s position from word2idx
- start from a matrix with words along the rows and documents along the columns
- reduce to D latent features: We = V x D, with D << N (the original dimensionality)
- autoencoders/PCA/SVD can create the embedding
8
Q
cosine distance
A
- measures the similarity between two vectors
- used instead of Euclidean distance for word vectors
- cos_dist = 1 − aᵀb / (||a|| ||b||)
- aᵀb = ||a|| ||b|| cos(θ), where θ is the angle between a and b
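A numpy sketch of the formula above:

```python
# Cosine distance between two word vectors (numpy sketch).
import numpy as np

def cosine_distance(a, b):
    # cos_dist = 1 - a.T b / (||a|| ||b||)
    return 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_distance(a, b))  # ~0.0: same direction, regardless of length
```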
9
Q
one hot encoding
A
- s = [0, 0, …, 1, …, 0] (1 x V)
- x = sWe
- multiplying by the one-hot vector simply selects the idx-th row of We
- this significantly reduces data size
- so in practice x is stored as an index instead of a matrix
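A numpy sketch showing that the one-hot multiplication is just a row lookup (V, D, and idx are arbitrary toy values):

```python
# Sketch: multiplying a one-hot vector by We is the same as indexing a row of We.
import numpy as np

V, D = 5, 3                      # vocabulary size, embedding dimension
We = np.random.randn(V, D)       # embedding matrix
idx = 2                          # word2idx position of some word

s = np.zeros(V)
s[idx] = 1                       # one-hot vector of length V (the 1 x V row s)

x = s @ We                       # x = s We
assert np.allclose(x, We[idx])   # so in practice we just store the index
```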
10
Q
t-SNE
A
- t-distributed stochastic neighbor embedding
- non-linear; there is no transformation model (it optimizes the output coordinates directly)
- no separate train and test set
- cannot transform new data after fitting
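A sketch assuming scikit-learn's TSNE; note there is only fit_transform, no transform for new points:

```python
# Sketch assuming scikit-learn: t-SNE only offers fit_transform (no transform for new data).
import numpy as np
from sklearn.manifold import TSNE

We = np.random.randn(50, 20)                    # pretend 50 word vectors of dimension 20
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
coords = tsne.fit_transform(We)                 # optimizes the 2-D output coordinates directly
print(coords.shape)                             # (50, 2)
# There is no tsne.transform(new_points): refit on the combined data instead.
```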
11
Q
n-gram models
A
- an n-gram is a sequence of n consecutive words
- bigram: condition on the previous word, p(w_n | w_n−1)
- trigram: condition on the previous and next words, p(w_n | w_n−1, w_n+1)
12
Q
chain rule of probability
A
- apply Bayes’ rule again and again: p(A, B, C, D) = p(A) p(B|A) p(C|A, B) p(D|A, B, C)
- (the Markov assumption later simplifies this to p(A) p(B|A) p(C|B) p(D|C))
- p(A) = word count / corpus length
- p(B|A) is the bigram model: count(A, B) / count(A)
- p(C|A, B) is the trigram model: count(A, B, C) / count(A, B)
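A plain-Python sketch of estimating these probabilities from counts (the toy corpus is made up):

```python
# Sketch: estimating p(A), p(B|A), p(C|A,B) from counts.
from collections import Counter

corpus = "the dog ran the dog sat the cat sat".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

p_the = unigrams["the"] / len(corpus)                                            # p(A)
p_dog_given_the = bigrams[("the", "dog")] / unigrams["the"]                      # p(B|A)
p_ran_given_the_dog = trigrams[("the", "dog", "ran")] / bigrams[("the", "dog")]  # p(C|A,B)

# chain rule: p("the dog ran") = p(the) p(dog|the) p(ran|the, dog)
p_sequence = p_the * p_dog_given_the * p_ran_given_the_dog
```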
13
Q
markov assumption
A
- the current state depends only on the previous state
- 1st-order Markov: condition on 1 previous word (bigram)
- this is what we typically mean by a Markov model
- 2nd-order Markov: condition on 2 previous words (trigram)
14
Q
recursive neural tensor network (RNTN)
A
tree-structured (recursive) neural network with a tensor-based composition function; used in Socher et al.’s Stanford sentiment analyzer, which was state of the art for sentiment analysis
15
Q
training sentence generator
A
- tokenize each sentence
- map word to index
- save each sentence as a list of indices
- INPUT: [START, x0, x1, …, xN]
- TARGET: [x0, x1, …, xN, END]
- accuracy = number of words predicted correctly / total number of words
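A sketch of building the INPUT/TARGET index sequences (toy sentences, hypothetical word2idx):

```python
# Sketch: turning tokenized sentences into index sequences with START/END tokens.
word2idx = {"START": 0, "END": 1}
sentences = [["the", "dog", "ran"], ["the", "cat", "sat"]]

sequences = []
for sentence in sentences:
    for w in sentence:
        if w not in word2idx:
            word2idx[w] = len(word2idx)          # map word to index
    sequences.append([word2idx[w] for w in sentence])

START, END = word2idx["START"], word2idx["END"]
inputs  = [[START] + seq for seq in sequences]   # INPUT:  [START, x0, ..., xN]
targets = [seq + [END] for seq in sequences]     # TARGET: [x0, ..., xN, END]
```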
16
Q
how to calculate probability of word sequence
A
- chain rule
- use the log likelihood to avoid underflow from multiplying many small probabilities
- side note: taking the log undoes the exponentiation in the softmax
- normalize by the sequence length T so sequences of different lengths are comparable
- (1/T) log p(w_1, …, w_T) = (1/T)[log p(w_1) + ∑_t log p(w_t | w_t−1)]
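A sketch of the length-normalized log-likelihood under a bigram model (the probability tables are hypothetical):

```python
# Sketch: length-normalized log-likelihood of a word sequence under a bigram model.
import numpy as np

def normalized_log_likelihood(seq, p_unigram, p_bigram):
    # (1/T)[log p(w_1) + sum_t log p(w_t | w_{t-1})]
    log_p = np.log(p_unigram[seq[0]])
    for prev, curr in zip(seq, seq[1:]):
        log_p += np.log(p_bigram[(prev, curr)])
    return log_p / len(seq)

# hypothetical probability tables
p_unigram = {"the": 0.3, "dog": 0.2, "ran": 0.1}
p_bigram = {("the", "dog"): 0.5, ("dog", "ran"): 0.25}
print(normalized_log_likelihood(["the", "dog", "ran"], p_unigram, p_bigram))
```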
17
Q
Word2Vec
A
- is an extension of the bigram model
- y = softmax(x W1 W2)
- W1 W2 is a V x V matrix, like a Markov (transition) matrix
- each ROW is the probability distribution over the next word
- factoring into two matrices (V x D and D x V) uses far fewer parameters than one V x V matrix
- word drop (subsampling) threshold: p_drop = 1 - np.sqrt(threshold / p_unigram)
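A numpy sketch of the factored bigram view and the word-drop rule (W1, W2, and the unigram probability are toy values):

```python
# Sketch: the factored bigram view of Word2Vec plus the subsampling (word-drop) rule.
import numpy as np

V, D = 6, 3
W1 = np.random.randn(V, D)     # input embeddings (V x D)
W2 = np.random.randn(D, V)     # output weights   (D x V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

idx = 2                        # current word
y = softmax(W1[idx] @ W2)      # idx-th row of W1 W2, softmaxed: p(next word | current word)

# subsampling: drop frequent words with p_drop = 1 - sqrt(threshold / p_unigram)
threshold = 1e-5
p_unigram = 0.05               # e.g. a very frequent word like "the"
p_drop = 1 - np.sqrt(threshold / p_unigram)
```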
18
Q
continuous bag of words (CBOW)
A
- use a list of context words to predict the word in the middle
- context size refers to the number of surrounding words used
- e.g., from “the quick brown fox jumps over the lazy dog”: context “brown fox over the”, target “jumps”
- some authors would call that context size 2 (per side), others 4 (both sides combined)
- in practice the window is usually between 5 and 10 words on either side
- there are better models than this
- the full softmax can be approximated with hierarchical softmax
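A sketch of building CBOW (context → middle word) pairs with a window of 2; in the model the context word vectors would then be averaged:

```python
# Sketch: building CBOW (context -> middle word) training pairs with a window of 2.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, target))

print(pairs[4])  # (['brown', 'fox', 'over', 'the'], 'jumps')
```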
19
Q
hierarchical softmax
A
- deals with the fact that, over a huge vocabulary, the softmax assigns very low probability to any single word (and is expensive to compute)
- form a tree
- decide which branch using sigmoid as probability
- other branch is 1 - sigmoid
- multiply along path using chain rule
- frequent words closer to the top
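A numpy sketch of the path-probability idea (the tree path here is hypothetical):

```python
# Sketch: probability of a word as a product of sigmoid branch decisions along a tree path.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def path_probability(h, path):
    # h: hidden vector; path: list of (node_vector, go_left) pairs from root to the word's leaf
    p = 1.0
    for node_vec, go_left in path:
        p_left = sigmoid(h @ node_vec)            # branch probability from a sigmoid
        p *= p_left if go_left else (1 - p_left)  # the other branch gets 1 - sigmoid
    return p

h = np.random.randn(4)
path = [(np.random.randn(4), True), (np.random.randn(4), False)]  # hypothetical 2-step path
print(path_probability(h, path))
```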
20
Q
negative sampling
A
- the other solution (besides hierarchical softmax) to the correct word getting a very low softmax probability
- use 5-25 negative samples (treat this as a hyperparameter to tune)
- input word: jump
- target (context) words: brown fox over the
- negative samples: apple orange boat tokyo
- J = ∑_c log σ(W(2)_cᵀ W(1)_in) + ∑_n log[1 − σ(W(2)_nᵀ W(1)_in)]
- c = context words
- n = negative samples
- raw form (what we use): J = ∑_n [t_n log p_n + (1 − t_n) log(1 − p_n)]
- ∂J/∂W(2) = Hᵀ(P − T)
- H has size 1 x D (as opposed to the usual N x D)
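A numpy sketch of the objective J for one input word (indices and weights are toy values):

```python
# Sketch: the negative-sampling objective for one input word.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

V, D = 10, 4
W1 = np.random.randn(V, D)      # input embeddings
W2 = np.random.randn(D, V)      # output weights

in_idx = 0                      # input word, e.g. "jump"
context = [1, 2, 3, 4]          # true context words, e.g. "brown fox over the"
negatives = [5, 6, 7, 8]        # sampled negative words

h = W1[in_idx]                                              # 1 x D hidden vector
J = (np.sum(np.log(sigmoid(h @ W2[:, context])))            # sum_c log sigmoid(W2_c . W1_in)
     + np.sum(np.log(1 - sigmoid(h @ W2[:, negatives]))))   # sum_n log(1 - sigmoid(W2_n . W1_in))
print(J)
```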
21
Q
negative sampling updates
A
- only a handful of weights are touched per step: the input word’s row of W(1), and the output-side vectors of the context words and the sampled negatives
- every other row of W(1)/W(2) is left unchanged, which is what makes each update cheap
22
Q
multiclass versus binary classification
A
- the full softmax treats next-word prediction as one multiclass problem over all V words
- negative sampling replaces it with several binary (sigmoid) decisions: is this a real context word or a sampled negative?