Natural Language Processing Flashcards
NLP method summary
- find data: instead of spending months on unsupervised machine learning, take a couple weeks to label data
- clean data (CTWMLL):
- irrelevant Characters
- Tokenize by separating into individual words
- irrelevant Words (such as twitter mentions or urls)
- consider Misspelled
- Lowercase
- consider Lemmatization
- find good data representation: e.g., bag of words
- classification: stick with the simplest model for your needs (e.g., logistic regression)
- inspection: confusion matrix
- leveraging semantics:
- use a method like Word2Vec (maybe average word vectors for a sentence representation)
- lose explainability, thus we should use tools like LIME
- end-to-end syntax: use convolutions or transformers where order matters
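A minimal sketch of the cleaning steps (CTWMLL) above; the regexes, stop-word list, and use of NLTK's WordNetLemmatizer are illustrative assumptions, not part of the original notes:

```python
import re

# Hypothetical minimal CTWMLL pipeline: Characters, Tokenize, Words (irrelevant),
# Misspellings (skipped here), Lowercase, Lemmatization.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # illustrative subset

def clean_text(text):
    text = text.lower()                           # Lowercase
    text = re.sub(r"https?://\S+", " ", text)     # irrelevant Words: urls
    text = re.sub(r"@\w+", " ", text)             # irrelevant Words: twitter mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # irrelevant Characters
    tokens = [t for t in text.split() if t not in STOPWORDS]  # Tokenize + stop words
    try:
        # consider Lemmatization (needs nltk plus the wordnet corpus downloaded)
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    except (ImportError, LookupError):
        pass  # fall back to raw tokens if nltk/wordnet is unavailable
    return tokens

print(clean_text("Check https://example.com @user The foxes were RUNNING!"))
```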
term-document
- bag of words method
- n-terms by m-documents
- raw word count
- binary: 1 if the word appears in document else 0
- word frequency: 100*(raw count)/(total count)
- TF-IDF
term frequency–inverse document frequency (TF-IDF)
- increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word
- idf = log(N / |{d ∈ D : t ∈ d}|), where the denominator is the number of documents containing the term
- shape V x D
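A toy numpy sketch of these term-document variants; the corpus and tokenization are made up, and the idf follows the log(N / df) form on this card (libraries like scikit-learn add smoothing):

```python
import numpy as np

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]  # toy corpus
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}

# raw word counts: V terms x D documents
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        counts[w2i[w], j] += 1

binary = (counts > 0).astype(float)                      # 1 if the word appears in the document
freq = 100 * counts / counts.sum(axis=0, keepdims=True)  # word frequency (%)

df = (counts > 0).sum(axis=1)                            # number of documents containing each term
idf = np.log(len(docs) / df)                             # idf = log(N / |{d : t in d}|)
tfidf = counts * idf[:, None]                            # TF-IDF, shape V x D
```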
problem with wordcount
- stopwords like “the” occur in all documents
- very long documents dominate, producing very large counts (and eigenvalues in any decomposition)
lemmatization
- determining the lemma of a word based on its intended meaning
- depends on correctly identifying the intended part of speech and meaning of a word in a sentence
- more accurate than stemming
main areas of NLP
- sentiment analysis: determine the writer’s positive, negative, or neutral feelings towards a particular topic or product
- named entity recognition: breaks sentences into names, organizations, etc…
- part-of-speech tagging: label each word as noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, or interjection (e.g., using Viterbi)
- latent semantic analysis: use methods like PCA to get a representation in a smaller domain (embeddings)
word embedding
- maps each word to a dense vector
- can load pretrained GloVe or Word2Vec vectors into a new network
- can perform word analogy on these words
- idx is the word2idx position
- word along row, document along column
- latent feature dimension D, We = V x D, D << V
- autoencoders/PCA/SVD create embedding
cosine distance
- measures similarity between two vectors
- used instead of euclidean distance for words
- cos_dist = 1 - a^T b / (||a|| ||b||)
- a^T b = ||a|| ||b|| cos(a, b)
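A small numpy sketch of cosine distance; the "embedding" vectors are made-up toy values:

```python
import numpy as np

def cosine_distance(a, b):
    # cos_dist = 1 - a^T b / (||a|| ||b||)
    return 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

king = np.array([0.9, 0.1, 0.4])    # toy "embeddings"
queen = np.array([0.85, 0.2, 0.45])
print(cosine_distance(king, queen))  # near 0 -> similar direction, regardless of vector length
```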
one hot encoding
- s = [0, 0, …, 1, …, 0] (1 x V)
- x = sWe
- multiplying by the one-hot input just returns one row of the embedding matrix
- significantly reduces data size
- X is an index instead of a matrix
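A quick numpy check of the point above: multiplying a one-hot row by We just selects a row, so in practice we store the index and look up We[idx]. Sizes here are illustrative:

```python
import numpy as np

V, D = 10, 4                      # illustrative vocabulary and embedding sizes
We = np.random.randn(V, D)        # embedding matrix, V x D

idx = 3                           # word2idx position of some word
s = np.zeros(V)
s[idx] = 1.0                      # one-hot row, 1 x V

x_matmul = s @ We                 # x = s We
x_lookup = We[idx]                # the same thing as a plain row lookup
assert np.allclose(x_matmul, x_lookup)
```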
t-SNE
- t-distributed stochastic neighbor embedding
- non-linear; no transformation model (modifies output itself)
- no train and test set
- no transforming after fitting
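A scikit-learn sketch (assuming sklearn is available); note there is only fit_transform, matching the "no transforming after fitting" point, and the embeddings here are random stand-ins:

```python
import numpy as np
from sklearn.manifold import TSNE

We = np.random.randn(200, 50)             # stand-in for trained word embeddings (V x D)
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
Z = tsne.fit_transform(We)                # only fit_transform: there is no model to reuse on new data
print(Z.shape)                            # (200, 2), ready for plotting
```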
n-gram models
- a sequence of consecutive words
- bigram: condition on the previous word, p(w_n | w_{n-1})
- trigram: condition on the two previous words, p(w_n | w_{n-1}, w_{n-2}) (some exercises instead condition on the surrounding words, p(w_n | w_{n-1}, w_{n+1}))
chain rule of probability
- apply the product rule again and again: p(A, B, C, D) = p(D|A, B, C) p(C|A, B) p(B|A) p(A); the simpler p(D|C) p(C|B) p(B|A) p(A) only holds under the Markov assumption
- p(A) = word count / corpus length
- p(B|A) is part of the bigram model
- p(C|A, B) is trigram: count(A, B, C) / count(A, B)
markov assumption
- current state only relies on previous state
- 1st order Markov: condition on 1 word (bigram)
- this is typically what we consider markov model
- 2nd order Markov: condition on 2 words (trigram)
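A small counting sketch of the first-order Markov (bigram) model; the corpus is a made-up toy:

```python
from collections import defaultdict

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]  # toy sentences

unigram = defaultdict(int)
bigram = defaultdict(int)
for sentence in corpus:
    for i, w in enumerate(sentence):
        unigram[w] += 1
        if i > 0:
            bigram[(sentence[i - 1], w)] += 1

total = sum(unigram.values())
p_the = unigram["the"] / total                              # p(A) = word count / corpus length
p_cat_given_the = bigram[("the", "cat")] / unigram["the"]   # p(B|A) = count(A, B) / count(A)
print(p_the, p_cat_given_the)
```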
recursive neural tensor network (RNTN)
- method behind a state-of-the-art sentiment analyzer
training sentence generator
- tokenize each sentence
- map word to index
- save each sentence as a list of indices
- INPUT: [START, x0, x1, …, xN]
- TARGET: [x0, x1, …, xN, END]
- accuracy = number of words predicted correctly / total number of words
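A sketch of the input/target construction above; the START/END ids and the whitespace tokenizer are assumptions for illustration:

```python
word2idx = {"START": 0, "END": 1}   # special tokens assumed to take the first two ids

def encode(sentence):
    """Tokenize, map each word to an index, and build (input, target) pairs."""
    idxs = []
    for w in sentence.lower().split():          # naive whitespace tokenizer
        if w not in word2idx:
            word2idx[w] = len(word2idx)
        idxs.append(word2idx[w])
    inputs = [word2idx["START"]] + idxs          # [START, x0, ..., xN]
    targets = idxs + [word2idx["END"]]           # [x0, ..., xN, END]
    return inputs, targets

print(encode("the quick brown fox"))
```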
how to calculate probability of word sequence
- chain rule
- log likelihood to deal with small numbers
- side note: the log also undoes the exponentiation in the softmax
- normalize by sequence length to make sequences of different lengths comparable
- (1/T) log p(w_1, …, w_T) = (1/T)[log p(w_1) + ∑_{t=2..T} log p(w_t | w_{t-1})]
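A short sketch of the length-normalized log likelihood, assuming bigram probabilities like those counted earlier; the probability values are toy numbers:

```python
import numpy as np

def sequence_score(words, p_unigram, p_bigram):
    # (1/T) [ log p(w1) + sum_t log p(w_t | w_{t-1}) ], normalized by the length T
    logp = np.log(p_unigram[words[0]])
    for prev, cur in zip(words[:-1], words[1:]):
        logp += np.log(p_bigram[(prev, cur)])
    return logp / len(words)

# toy probabilities for illustration
p_uni = {"the": 0.3, "cat": 0.2, "sat": 0.1}
p_bi = {("the", "cat"): 0.5, ("cat", "sat"): 0.4}
print(sequence_score(["the", "cat", "sat"], p_uni, p_bi))
```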
Word2Vec
- is an extension of the bigram model
- y = softmax(W1W2x)
- W1W2 = V x V matrix like markov matrix
- each ROW is a probability distribution over the next word
- two smaller matrices (V x D and D x V) instead of one V x V matrix means far fewer parameters
- word drop (subsampling) threshold: p_drop = 1 - np.sqrt(threshold / p_unigram)
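The word-drop (subsampling) rule as a one-liner; the 1e-5 threshold is a commonly used default, assumed here:

```python
import numpy as np

def p_drop(p_unigram, threshold=1e-5):
    # frequent words are dropped with high probability, rare words almost never
    return max(0.0, 1 - np.sqrt(threshold / p_unigram))

print(p_drop(0.05), p_drop(1e-6))  # a "the"-like word vs. a rare word
```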
continuous bag of words (CBOW)
- list of words to predict word in the middle
- context size refers to the number of words surrounding
- e.g., the context “brown fox ___ over the” is used to predict “jump”
- some authors call this a context size of 2, others 4 (counting both sides)
- this number is usually between 5 and 10 on either side
- there are better models than this
- hierarchical softmax approximates the full softmax
hierarchical softmax
- dealing with low probability words
- form a tree
- decide which branch using sigmoid as probability
- other branch is 1 - sigmoid
- multiply along path using chain rule
- frequent words closer to the top
negative sampling
- the other solution to low probability of being correct
- use 5-25 negative samples (a hyperparameter to tune)
- input words: jump
- target words: brown fox over the
- negative samples: apple orange boat tokyo
- J = ∑_c log σ(W(2)_c^T W(1)_in) + ∑_n log[1 - σ(W(2)_n^T W(1)_in)]
- c = context words
- n = negative samples
- raw form (what we use): J = ∑_n [t_n log p_n + (1 - t_n) log(1 - p_n)]
- ∂J/∂W(2) = H^T(P - T)
- H has size 1 x D (as opposed to the usual N x D)
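A numpy sketch of the negative-sampling objective in its raw form, with the H^T(P - T) gradient for W(2); the vocabulary size, embedding size, and sampled indices are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

V, D = 1000, 50
W1 = np.random.randn(V, D) * 0.01      # input embeddings
W2 = np.random.randn(D, V) * 0.01      # output embeddings

in_word = 7                             # e.g. "jump"
context = [3, 12, 45, 88]               # e.g. "brown fox over the"
negative = [501, 730, 911, 256, 64]     # randomly drawn negative samples

h = W1[in_word]                         # hidden activation, 1 x D (not N x D)
idx = context + negative
t = np.array([1] * len(context) + [0] * len(negative))   # targets t_n
p = sigmoid(h @ W2[:, idx])                               # p_n = sigma(W2_n^T W1_in)

J = np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))       # raw-form objective
dW2 = np.outer(h, p - t)                                  # H^T (P - T), shape D x (|c| + |n|)
print(J, dW2.shape)
```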
negative sampling updates

multiclass versus binary classification

GloVe versus Word2Vec
- Word2Vec is predictor
- GloVe is a word count method
- both perform similarly
- GloVe is much more efficient
matrix factorization update equations

HMMs for POS tagging
- find the probabilities
- for observation: p(walk|verb)
- for state transition: p(noun|verb)
- use Viterbi to find the most likely hidden tag sequence for an observed word sequence
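A compact Viterbi sketch for HMM POS tagging; the two-tag HMM and its probabilities are toy values, not trained parameters:

```python
import numpy as np

tags = ["noun", "verb"]
pi = np.array([0.6, 0.4])                     # initial tag distribution
A = np.array([[0.3, 0.7],                     # p(next tag | current tag), rows = current
              [0.8, 0.2]])
B = {"dogs": np.array([0.9, 0.1]),            # p(word | tag), e.g. p(walk | verb)
     "walk": np.array([0.3, 0.7])}

def viterbi(sentence):
    T, S = len(sentence), len(tags)
    delta = np.zeros((T, S))                  # best log-prob of any path ending in each tag
    psi = np.zeros((T, S), dtype=int)         # backpointers
    delta[0] = np.log(pi) + np.log(B[sentence[0]])
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + np.log(A[:, s]) + np.log(B[sentence[t]][s])
            psi[t, s] = np.argmax(scores)
            delta[t, s] = np.max(scores)
    # backtrack the best path
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return [tags[s] for s in reversed(path)]

print(viterbi(["dogs", "walk"]))   # -> ['noun', 'verb']
```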
HMM POS tag model

named entity recognition (NER)
- similar to POS but with entities
- training is the same as for POS tagging!
- entities are nouns: person, company, location
- classes are very imbalanced (~90% of tokens are not entities)
- using capitalization is cheating!
recursive neural network (RNN)
- with recursion, one network handles trees of arbitrary shape
- uses a linear form: h_j = f(W^T x + b)
- analogous to linear discriminant analysis
- 2 Gaussians w/ same covariance
- x_1 = word embedding for w_1
- h_1 = f(W_left x_left + W_right x_right + b)
- W is R x D x D
- binary tree: R = 2, but can be N-ary
recursive neural tensor network (RNTN)
- h = f(W^T x + b), W (2D x D), where x is the concatenation of the two children
- uses quadratic: h_j' = f(x^T A_j x + W_j^T x + b), A (2D x 2D x D)
- analogous to quadratic discriminant analysis
- 2 Gaussians w/ different covariances
- h_j' = f(x_L^T A_LL x_L + x_L^T A_LR x_R + x_R^T A_RR x_R + W_L^T x_L + W_R^T x_R + b)
- lists: words, left children, right children
parse tree

sentiment analysis trees
- parse sentences into phrases labeled good (positive) or bad (negative)
- the tree structure can capture dependencies between clauses
- e.g., “this is kind of bad, but overall it is good”
recursive nn to RNN
- order nodes so children appear before (to the left of) their parents
- store a relations array
- -1 goes wherever there is no word (internal nodes)
- 3 arrays to store trees
- parents: where to find each node’s parent
- relations: how a child is related to its parent (left or right)
- words: the word index associated with each node (-1 for internal nodes)
- post-order traversal: parents come after their children
CNN for NLP

sequence-to-sequence
- len(output) ≠ len(input)
- encoder
- no output
- encoding: only keep final state (h and c)
- decoder
- encoder h(Tx) = decoder s(0)
- start with ‘START’ tag as x1
- teacher forcing for training
different techniques for language model
- text generation: sample randomly
- machine translation: take argmax
seq2seq architecture

attention vs seq2seq
- attention
- set s(0) = 0
- all of hidden states are stored
- determine which one we care most about
attention
- attention weights
- α_t' = N([s_{t-1}, h_t']), t' = 1, …, Tx
- copy s(t-1) and concatenate it with each h(t')
- t’ is for input sequence t’ = 1, …, Tx
- t is for output sequence t = 1, …, Ty
- context = ∑α(t’)h(t’)
- teacher forcing
- training: concat context and target
- prediction: concat context and previous word
- each relies on previous state
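A numpy sketch of the attention weights and the context vector at one output step; the scoring network N([s_{t-1}, h_t']) is replaced by a simple dot product here, which is an assumption for brevity:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

Tx, M = 5, 8                        # input length and hidden size (illustrative)
h = np.random.randn(Tx, M)          # encoder hidden states h(1..Tx)
s_prev = np.random.randn(M)         # previous decoder state s(t-1)

# score every input position against s(t-1); a dot product stands in for the
# small neural network N([s_{t-1}, h_t']) on the card
scores = h @ s_prev
alpha = softmax(scores)             # attention weights over t' = 1..Tx, they sum to 1
context = alpha @ h                 # context = sum_t' alpha(t') h(t')
print(alpha.sum(), context.shape)   # 1.0, (8,)
```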
attention model

attention update model

visualizing attention
- each output time step (Ty of them) computes its own context vector
- each context vector uses Tx attention weights (one per input position)
- plot the Ty x Tx weight matrix as an image
- should follow a somewhat linear pattern
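A matplotlib sketch of plotting the Ty x Tx attention-weight matrix as an image; the weights here are random rather than learned:

```python
import numpy as np
import matplotlib.pyplot as plt

Ty, Tx = 6, 8                                        # output and input lengths (illustrative)
alphas = np.random.dirichlet(np.ones(Tx), size=Ty)   # each row sums to 1, like attention weights

plt.imshow(alphas, cmap="gray")
plt.xlabel("input position t'")
plt.ylabel("output position t")
plt.title("attention weights (roughly diagonal for well-aligned sequences)")
plt.show()
```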
memory network
- parts 1) story, 2) questions, 3) answer
- single supporting fact, 2 supporting, etc…
- only produces single output
- for 2 supporting facts, add a second “hop”
- the first hop’s output replaces the question embedding
- reuse the embedding when forming the second hop’s weights
- can add a dense layer after each hop
- every time it re-reads the story it learns something new
memory network steps
Find the relevant part of the story, then the relevant part of the sentence
- sum word vectors for sentences
- sum word vectors for question
- dot story with the question = story weights
- softmax
- ≈ which sentences are important
- dot story weights back with stories (softmax)
- output is V (vocab size)
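A numpy sketch of these single-hop retrieval steps; all shapes and vectors are illustrative stand-ins for summed word embeddings:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

D, n_sentences, V = 16, 4, 100
story = np.random.randn(n_sentences, D)   # each row = summed word vectors of one story sentence
question = np.random.randn(D)             # summed word vectors of the question

weights = softmax(story @ question)       # dot story with the question -> sentence importance
memory = weights @ story                  # dot the story weights back with the story sentences

W_out = np.random.randn(D, V)             # final dense layer to the vocabulary
answer_scores = memory @ W_out            # output is V (vocab size)
print(int(np.argmax(answer_scores)))      # index of the predicted answer word
```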
memory network architecture

automating text analysis
- dictionaries: word counts of psychological terms
- co-occurrence: compare distances
- feature extraction:
- expressions
- n-grams
- syntax
- POS
latent semantic analysis (LSA)
- measure the distance to some target document
- measure similarity to pre-graded documents (analogies)
- can lead to similar results as a human grader
latent dirichlet allocation (LDA)
- find the topic that generates the given collection of documents
- relies on statistical dependence among words
- unlike LSA, topics have meaning
- semi-supervised: create topics related to different moral concerns
cohesion
- examines how a writer writes
- structural and lexical properties
- lexical: relating to words or vocabulary
- linguistic features
- lexical diversity
- semantic overlap
- connections b/w propositions
- causal links
- syntactic complexity
knowledge base method

entity linking
- remove POS: verb, adverb, adjective, pronoun, determiner, and preposition
- properties: keep entities like office, role, etc.; discard the rest
- remove those with ρ < 0.1
- merge abstract and properties
- filter out words not related to morality
- cosine similarity b/w feature and word in BK
- above a 0.6 threshold, the word is regarded as an occurrence of the feature
bidirectional encoder representation from transformers (BERT)
- reads entire sequence of words at once
- masked LM: predict masked word
- 15% of tokens are replaced with [MASK]; the loss is calculated only on those tokens
- repredict the masked tokens
- next sentence prediction
- special tokens ([CLS], [SEP]) at the beginning and end of sentences
- sentence embeddings
- position embedding
- loss: predict whether the second sentence actually follows the first
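A quick masked-LM demo using the Hugging Face transformers pipeline (assuming the library and the bert-base-uncased weights are available); it illustrates the "predict the masked word" idea, not BERT's training procedure:

```python
from transformers import pipeline

# downloads bert-base-uncased on first use
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The quick brown fox [MASK] over the lazy dog."):
    print(pred["token_str"], round(pred["score"], 3))
```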
masked LM architecture

next sentence prediction

NLP tests
- Google Sensibleness and Specificity Average (SSA): measures whether text makes sense in the current context (and is specific to it)
- WinoGrande: a 44,000-problem dataset for testing common-sense reasoning against humans
- humans ≈ 94%, machines ~80%
- ROUGE: a proxy for how well the generated summary matches the unigrams and bigrams in a reference summary
- WikiText-103: perplexity
- LAMBADA: next word prediction accuracy
generative model
- finish sentence
- answer questions
- summarize content
biggest networks (Feb 2020)
- Turing-NLG (Microsoft): 17 billion parameters
- MegatronLM (NVidia): 8.3 billion
- GPT-2 (OpenAI): 1.5 billion
- Grover-Mega (U of W): 1.5 billion
- ELMo (AI2): 465 million
- RoBERTa (Facebook): 355 million
- BERT large (Google): 340 million
biggest network training data
- T-NLG
- trained on 100k direct answers
- fine-tuned in a multi-task fashion on all public summarization datasets (~4 million training instances)
- GPT-2: 40GB of data
- Google Meena: 341 GB of social media chatter