Natural Language Processing Flashcards
NLP method summary
- find data: instead of spending months on unsupervised machine learning, take a couple weeks to label data
- clean data (CTWMLL):
- irrelevant Characters
- Tokenize by separating into individual words
- irrelevant Words (such as twitter mentions or urls)
- consider Misspelled
- Lowercase
- consider Lemmatization
- find good data representation: e.g., bag of words
- classification: stick with the simplest model for your needs (e.g., logistic regression)
- inspection: confusion matrix
- leveraging semantics:
- use a method like Word2Vec (maybe average word vectors for a sentence representation)
- lose explainability, thus we should use tools like LIME
- end-to-end syntax: use convolutions or transformers where order matters
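A minimal sketch of the cleaning steps (CTWMLL) above; the regexes, stop-word list, and use of NLTK's WordNetLemmatizer are illustrative assumptions, not part of the original notes:

```python
import re

# Hypothetical minimal CTWMLL pipeline: Characters, Tokenize, Words (irrelevant),
# Misspellings (skipped here), Lowercase, Lemmatization.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # illustrative subset

def clean_text(text):
    text = text.lower()                           # Lowercase
    text = re.sub(r"https?://\S+", " ", text)     # irrelevant Words: urls
    text = re.sub(r"@\w+", " ", text)             # irrelevant Words: twitter mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # irrelevant Characters
    tokens = [t for t in text.split() if t not in STOPWORDS]  # Tokenize + stop words
    try:
        # consider Lemmatization (needs nltk plus the wordnet corpus downloaded)
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    except (ImportError, LookupError):
        pass  # fall back to raw tokens if nltk/wordnet is unavailable
    return tokens

print(clean_text("Check https://example.com @user The foxes were RUNNING!"))
```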
term-document
- bag of words method
- n-terms by m-documents
- raw word count
- binary: 1 if the word appears in document else 0
- word frequency: 100*(raw count)/(total count)
- TF-IDF
term frequency–inverse document frequency (TF-IDF)
- increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word
- idf = log(N / |{d ∈ D : t ∈ d}|), where the denominator is the number of documents containing the term
- shape V x D
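A toy numpy sketch of these term-document variants; the corpus and tokenization are made up, and the idf follows the log(N / df) form on this card (libraries like scikit-learn add smoothing):

```python
import numpy as np

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]  # toy corpus
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}

# raw word counts: V terms x D documents
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        counts[w2i[w], j] += 1

binary = (counts > 0).astype(float)                      # 1 if the word appears in the document
freq = 100 * counts / counts.sum(axis=0, keepdims=True)  # word frequency (%)

df = (counts > 0).sum(axis=1)                            # number of documents containing each term
idf = np.log(len(docs) / df)                             # idf = log(N / |{d : t in d}|)
tfidf = counts * idf[:, None]                            # TF-IDF, shape V x D
```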
problem with wordcount
- stopwords like “the” occur in all documents
- very long documents dominate, producing very large counts (and eigenvalues in any decomposition)
lemmatization
- determining the lemma of a word based on its intended meaning
- depends on correctly identifying the intended part of speech and meaning of a word in a sentence
- more accurate than stemming
main areas of NLP
- sentiment analysis: determine the writer’s positive, negative, or neutral feelings towards a particular topic or product
- named entity recognition: breaks sentences into names, organizations, etc…
- part-of-speech tagging: label each word as noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, or interjection (e.g., using Viterbi)
- latent semantic analysis: use methods like PCA to get a representation in a smaller domain (embeddings)
word embedding
- maps each word to a dense vector
- can load pretrained GloVe or Word2Vec vectors into a new network
- can perform word analogy on these words
- idx is the word2idx position
- word along row, document along column
- latent feature dimension D, We = V x D, D << V
- autoencoders/PCA/SVD create embedding
cosine distance
- measures similarity between two vectors
- used instead of euclidean distance for words
- cos_dist = 1 - a^T b / (||a|| ||b||)
- a^T b = ||a|| ||b|| cos(a, b)
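A small numpy sketch of cosine distance; the "embedding" vectors are made-up toy values:

```python
import numpy as np

def cosine_distance(a, b):
    # cos_dist = 1 - a^T b / (||a|| ||b||)
    return 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

king = np.array([0.9, 0.1, 0.4])    # toy "embeddings"
queen = np.array([0.85, 0.2, 0.45])
print(cosine_distance(king, queen))  # near 0 -> similar direction, regardless of vector length
```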
one hot encoding
- s = [0, 0, …, 1, …, 0] (1 x V)
- x = sWe
- multiplying by the one-hot input just returns one row of the embedding matrix
- significantly reduces data size
- X is an index instead of a matrix
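A quick numpy check of the point above: multiplying a one-hot row by We just selects a row, so in practice we store the index and look up We[idx]. Sizes here are illustrative:

```python
import numpy as np

V, D = 10, 4                      # illustrative vocabulary and embedding sizes
We = np.random.randn(V, D)        # embedding matrix, V x D

idx = 3                           # word2idx position of some word
s = np.zeros(V)
s[idx] = 1.0                      # one-hot row, 1 x V

x_matmul = s @ We                 # x = s We
x_lookup = We[idx]                # the same thing as a plain row lookup
assert np.allclose(x_matmul, x_lookup)
```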
t-SNE
- t-distributed stochastic neighbor embedding
- non-linear; no transformation model (modifies output itself)
- no train and test set
- no transforming after fitting
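A scikit-learn sketch (assuming sklearn is available); note there is only fit_transform, matching the "no transforming after fitting" point, and the embeddings here are random stand-ins:

```python
import numpy as np
from sklearn.manifold import TSNE

We = np.random.randn(200, 50)             # stand-in for trained word embeddings (V x D)
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
Z = tsne.fit_transform(We)                # only fit_transform: there is no model to reuse on new data
print(Z.shape)                            # (200, 2), ready for plotting
```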
n-gram models
- a sequence of consecutive words
- bigram: condition on the previous word, p(w_n | w_{n-1})
- trigram: condition on the two previous words, p(w_n | w_{n-1}, w_{n-2}) (some exercises instead condition on the surrounding words, p(w_n | w_{n-1}, w_{n+1}))
chain rule of probability
- apply the product rule again and again: p(A, B, C, D) = p(D|A, B, C) p(C|A, B) p(B|A) p(A); the simpler p(D|C) p(C|B) p(B|A) p(A) only holds under the Markov assumption
- p(A) = word count / corpus length
- p(B|A) is part of the bigram model
- p(C|A, B) is trigram: count(A, B, C) / count(A, B)
markov assumption
- current state only relies on previous state
- 1st order Markov: condition on 1 word (bigram)
- this is typically what we consider markov model
- 2nd order Markov: condition on 2 words (trigram)
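A small counting sketch of the first-order Markov (bigram) model; the corpus is a made-up toy:

```python
from collections import defaultdict

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]  # toy sentences

unigram = defaultdict(int)
bigram = defaultdict(int)
for sentence in corpus:
    for i, w in enumerate(sentence):
        unigram[w] += 1
        if i > 0:
            bigram[(sentence[i - 1], w)] += 1

total = sum(unigram.values())
p_the = unigram["the"] / total                              # p(A) = word count / corpus length
p_cat_given_the = bigram[("the", "cat")] / unigram["the"]   # p(B|A) = count(A, B) / count(A)
print(p_the, p_cat_given_the)
```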
recursive neural tensor network (RNTN)
- method behind a state-of-the-art sentiment analyzer
training sentence generator
- tokenize each sentence
- map word to index
- save each sentence as a list of indices
- INPUT: [START, x0, x1, …, xN]
- TARGET: [x0, x1, …, xN, END]
- accuracy = number of words predicted correctly / total number of words
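A sketch of the input/target construction above; the START/END ids and the whitespace tokenizer are assumptions for illustration:

```python
word2idx = {"START": 0, "END": 1}   # special tokens assumed to take the first two ids

def encode(sentence):
    """Tokenize, map each word to an index, and build (input, target) pairs."""
    idxs = []
    for w in sentence.lower().split():          # naive whitespace tokenizer
        if w not in word2idx:
            word2idx[w] = len(word2idx)
        idxs.append(word2idx[w])
    inputs = [word2idx["START"]] + idxs          # [START, x0, ..., xN]
    targets = idxs + [word2idx["END"]]           # [x0, ..., xN, END]
    return inputs, targets

print(encode("the quick brown fox"))
```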
how to calculate probability of word sequence
- chain rule
- log likelihood to deal with small numbers
- side note: the log also undoes the exponentiation in the softmax
- normalize by sequence length to make sequences of different lengths comparable
- (1/T) log p(w_1, …, w_T) = (1/T)[log p(w_1) + ∑_{t=2..T} log p(w_t | w_{t-1})]
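A short sketch of the length-normalized log likelihood, assuming bigram probabilities like those counted earlier; the probability values are toy numbers:

```python
import numpy as np

def sequence_score(words, p_unigram, p_bigram):
    # (1/T) [ log p(w1) + sum_t log p(w_t | w_{t-1}) ], normalized by the length T
    logp = np.log(p_unigram[words[0]])
    for prev, cur in zip(words[:-1], words[1:]):
        logp += np.log(p_bigram[(prev, cur)])
    return logp / len(words)

# toy probabilities for illustration
p_uni = {"the": 0.3, "cat": 0.2, "sat": 0.1}
p_bi = {("the", "cat"): 0.5, ("cat", "sat"): 0.4}
print(sequence_score(["the", "cat", "sat"], p_uni, p_bi))
```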
Word2Vec
- is an extension of the bigram model
- y = softmax(W1W2x)
- W1W2 = V x V matrix like markov matrix
- each ROW is a probability distribution over the next word
- two smaller matrices (V x D and D x V) instead of one V x V matrix means far fewer parameters
- word drop (subsampling) threshold: p_drop = 1 - np.sqrt(threshold / p_unigram)
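The word-drop (subsampling) rule as a one-liner; the 1e-5 threshold is a commonly used default, assumed here:

```python
import numpy as np

def p_drop(p_unigram, threshold=1e-5):
    # frequent words are dropped with high probability, rare words almost never
    return max(0.0, 1 - np.sqrt(threshold / p_unigram))

print(p_drop(0.05), p_drop(1e-6))  # a "the"-like word vs. a rare word
```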
continuous bag of words (CBOW)
- list of words to predict word in the middle
- context size refers to the number of words surrounding
- e.g., the context “brown fox ___ over the” is used to predict “jump”
- some authors call this a context size of 2, others 4 (counting both sides)
- this number is usually between 5 and 10 on either side
- there are better models than this
- hierarchical softmax approximates the full softmax
hierarchical softmax
- dealing with low probability words
- form a tree
- decide which branch using sigmoid as probability
- other branch is 1 - sigmoid
- multiply along path using chain rule
- frequent words closer to the top
negative sampling
- the other solution to low probability of being correct
- use 5-25 negative samples (a hyperparameter to tune)
- input words: jump
- target words: brown fox over the
- negative samples: apple orange boat tokyo
- J = ∑_c log σ(W(2)_c^T W(1)_in) + ∑_n log[1 - σ(W(2)_n^T W(1)_in)]
- c = context words
- n = negative samples
- raw form (what we use): J = ∑_n [t_n log p_n + (1 - t_n) log(1 - p_n)]
- ∂J/∂W(2) = H^T(P - T)
- H has size 1 x D (as opposed to the usual N x D)
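A numpy sketch of the negative-sampling objective in its raw form, with the H^T(P - T) gradient for W(2); the vocabulary size, embedding size, and sampled indices are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

V, D = 1000, 50
W1 = np.random.randn(V, D) * 0.01      # input embeddings
W2 = np.random.randn(D, V) * 0.01      # output embeddings

in_word = 7                             # e.g. "jump"
context = [3, 12, 45, 88]               # e.g. "brown fox over the"
negative = [501, 730, 911, 256, 64]     # randomly drawn negative samples

h = W1[in_word]                         # hidden activation, 1 x D (not N x D)
idx = context + negative
t = np.array([1] * len(context) + [0] * len(negative))   # targets t_n
p = sigmoid(h @ W2[:, idx])                               # p_n = sigma(W2_n^T W1_in)

J = np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))       # raw-form objective
dW2 = np.outer(h, p - t)                                  # H^T (P - T), shape D x (|c| + |n|)
print(J, dW2.shape)
```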
negative sampling updates

multiclass versus binary classification

GloVe versus Word2Vec
- Word2Vec is predictor
- GloVe is a word count method
- both perform similarly
- GloVe is much more efficient
matrix factorization update equations

HMMs for POS tagging
- find the probabilities
- for observation: p(walk|verb)
- for state transition: p(noun|verb)
- use Viterbi to find the most likely hidden tag sequence for an observed word sequence
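A compact Viterbi sketch for HMM POS tagging; the two-tag HMM and its probabilities are toy values, not trained parameters:

```python
import numpy as np

tags = ["noun", "verb"]
pi = np.array([0.6, 0.4])                     # initial tag distribution
A = np.array([[0.3, 0.7],                     # p(next tag | current tag), rows = current
              [0.8, 0.2]])
B = {"dogs": np.array([0.9, 0.1]),            # p(word | tag), e.g. p(walk | verb)
     "walk": np.array([0.3, 0.7])}

def viterbi(sentence):
    T, S = len(sentence), len(tags)
    delta = np.zeros((T, S))                  # best log-prob of any path ending in each tag
    psi = np.zeros((T, S), dtype=int)         # backpointers
    delta[0] = np.log(pi) + np.log(B[sentence[0]])
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + np.log(A[:, s]) + np.log(B[sentence[t]][s])
            psi[t, s] = np.argmax(scores)
            delta[t, s] = np.max(scores)
    # backtrack the best path
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return [tags[s] for s in reversed(path)]

print(viterbi(["dogs", "walk"]))   # -> ['noun', 'verb']
```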
HMM POS tag model

named entity recognition (NER)
- similar to POS but with entities
- training is the same as for POS tagging!
- entities are nouns: person, company, location
- classes are very imbalanced (~90% of tokens are not entities)
- using capitalization is cheating!
recursive neural network (RNN)
- with recursion, one network handles trees of arbitrary shape
- uses a linear form: h_j = f(W^T x + b)
- analogous to linear discriminant analysis
- 2 Gaussians w/ same covariance
- x_1 = word embedding for w_1
- h_1 = f(W_left x_left + W_right x_right + b)
- W is R x D x D
- binary tree: R = 2, but can be N-ary
recursive neural tensor network (RNTN)
- h = f(W^T x + b), W (2D x D), where x is the concatenation of the two children
- uses quadratic: h_j' = f(x^T A_j x + W_j^T x + b), A (2D x 2D x D)
- analogous to quadratic discriminant analysis
- 2 Gaussians w/ different covariances
- h_j' = f(x_L^T A_LL x_L + x_L^T A_LR x_R + x_R^T A_RR x_R + W_L^T x_L + W_R^T x_R + b)
- lists: words, left children, right children
parse tree

sentiment analysis trees
- parse sentences into phrases labeled good (positive) or bad (negative)
- the tree structure can capture dependencies between clauses
- e.g., “this is kind of bad, but overall it is good”
recursive nn to RNN
- order nodes so children appear before (to the left of) their parents
- store a relations array
- -1 goes wherever there is no word (internal nodes)
- 3 arrays to store trees
- parents: where to find each node’s parent
- relations: how a child is related to its parent (left or right)
- words: the word index associated with each node (-1 for internal nodes)
- post-order traversal: parents come after their children
CNN for NLP

sequence-to-sequence
- len(output) ≠ len(input)
- encoder
- no output
- encoding: only keep final state (h and c)
- decoder
- encoder h(Tx) = decoder s(0)
- start with ‘START’ tag as x1
- teacher forcing for training
different techniques for language model
- text generation: sample randomly
- machine translation: take argmax
seq2seq architecture

attention vs seq2seq
- attention
- set s(0) = 0
- all of hidden states are stored
- determine which one we care most about
attention
- attention weights
- α_t' = N([s_{t-1}, h_t']), t' = 1, …, Tx
- copy s(t-1) and concatenate it with each h(t')
- t’ is for input sequence t’ = 1, …, Tx
- t is for output sequence t = 1, …, Ty
- context = ∑α(t’)h(t’)
- teacher forcing
- training: concat context and target
- prediction: concat context and previous word
- each relies on previous state
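A numpy sketch of the attention weights and the context vector at one output step; the scoring network N([s_{t-1}, h_t']) is replaced by a simple dot product here, which is an assumption for brevity:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

Tx, M = 5, 8                        # input length and hidden size (illustrative)
h = np.random.randn(Tx, M)          # encoder hidden states h(1..Tx)
s_prev = np.random.randn(M)         # previous decoder state s(t-1)

# score every input position against s(t-1); a dot product stands in for the
# small neural network N([s_{t-1}, h_t']) on the card
scores = h @ s_prev
alpha = softmax(scores)             # attention weights over t' = 1..Tx, they sum to 1
context = alpha @ h                 # context = sum_t' alpha(t') h(t')
print(alpha.sum(), context.shape)   # 1.0, (8,)
```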
attention model

attention update model

visualizing attention
- each output time step (Ty of them) computes its own context vector
- each context vector uses Tx attention weights (one per input position)
- plot the Ty x Tx weight matrix as an image
- should follow a somewhat linear pattern
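A matplotlib sketch of plotting the Ty x Tx attention-weight matrix as an image; the weights here are random rather than learned:

```python
import numpy as np
import matplotlib.pyplot as plt

Ty, Tx = 6, 8                                        # output and input lengths (illustrative)
alphas = np.random.dirichlet(np.ones(Tx), size=Ty)   # each row sums to 1, like attention weights

plt.imshow(alphas, cmap="gray")
plt.xlabel("input position t'")
plt.ylabel("output position t")
plt.title("attention weights (roughly diagonal for well-aligned sequences)")
plt.show()
```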
memory network
- parts 1) story, 2) questions, 3) answer
- single supporting fact, 2 supporting, etc…
- only produces single output
- for 2 supporting facts, add a second “hop”
- the first hop’s output replaces the question embedding
- reuse the embedding when forming the second hop’s weights
- can add a dense layer after each hop
- every time it re-reads the story it learns something new
memory network steps
Find the relevant part of the story, then the relevant part of the sentence
- sum word vectors for sentences
- sum word vectors for question
- dot story with the question = story weights
- softmax
- ≈ which sentences are important
- dot story weights back with stories (softmax)
- output is V (vocab size)
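A numpy sketch of these single-hop retrieval steps; all shapes and vectors are illustrative stand-ins for summed word embeddings:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

D, n_sentences, V = 16, 4, 100
story = np.random.randn(n_sentences, D)   # each row = summed word vectors of one story sentence
question = np.random.randn(D)             # summed word vectors of the question

weights = softmax(story @ question)       # dot story with the question -> sentence importance
memory = weights @ story                  # dot the story weights back with the story sentences

W_out = np.random.randn(D, V)             # final dense layer to the vocabulary
answer_scores = memory @ W_out            # output is V (vocab size)
print(int(np.argmax(answer_scores)))      # index of the predicted answer word
```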
memory network architecture

automating text analysis
- dictionaries: word counts of psychological terms
- co-occurrence: compare distances
- feature extraction:
- expressions
- n-grams
- syntax
- POS
latent semantic analysis (LSA)
- measure the distance to some target document
- measure similarity to pre-graded documents (analogies)
- can lead to similar results as a human grader
latent dirichlet allocation (LDA)
- find the topic that generates the given collection of documents
- relies on statistical dependence among words
- unlike LSA, topics have meaning
- semi-supervised: create topics related to different moral concerns
cohesion
- examines how a writer writes
- structural and lexical properties
- lexical: relating to words or vocabulary
- linguistic features
- lexical diversity
- semantic overlap
- connections b/w propositions
- causal links
- syntactic complexity
knowledge base method

entity linking
- remove POS: verb, adverb, adjective, pronoun, determiner, and preposition
- properties: keep entities like office, role, etc.; discard the rest
- remove those with ρ < 0.1
- merge abstract and properties
- filter out words not related to morality
- cosine similarity b/w feature and word in BK
- above a 0.6 threshold, the word is regarded as an occurrence of the feature
bidirectional encoder representation from transformers (BERT)
- reads entire sequence of words at once
- masked LM: predict masked word
- 15% of tokens are replaced with [MASK]; the loss is calculated only on those tokens
- repredict the masked tokens
- next sentence prediction
- special tokens ([CLS], [SEP]) at the beginning and end of sentences
- sentence embeddings
- position embedding
- loss: predict whether the second sentence actually follows the first
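A quick masked-LM demo using the Hugging Face transformers pipeline (assuming the library and the bert-base-uncased weights are available); it illustrates the "predict the masked word" idea, not BERT's training procedure:

```python
from transformers import pipeline

# downloads bert-base-uncased on first use
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The quick brown fox [MASK] over the lazy dog."):
    print(pred["token_str"], round(pred["score"], 3))
```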
masked LM architecture

next sentence prediction

NLP tests
- Google Sensibleness and Specificity Average (SSA): measures whether text makes sense in the current context (and is specific to it)
- WinoGrande: a 44,000-problem dataset for testing common-sense reasoning against humans
- humans ≈ 94%, machines ~80%
- ROUGE: a proxy for how well the generated summary matches the unigrams and bigrams in a reference summary
- WikiText-103: perplexity
- LAMBADA: next word prediction accuracy
generative model
- finish sentence
- answer questions
- summarize content
biggest networks (Feb 2020)
- Turing-NLG (Microsoft): 17 billion parameters
- MegatronLM (NVidia): 8.3 billion
- GPT-2 (OpenAI): 1.5 billion
- Grover-Mega (U of W): 1.5 billion
- ELMo (AI2): 465 million
- RoBERTa (Facebook): 355 million
- BERT large (Google): 340 million
biggest network training data
- T-NLG
- trained on 100k direct answers
- fine-tuned in a multi-task fashion on all public summarization datasets (~4 million training instances)
- GPT-2: 40GB of data
- Google Meena: 341 GB of social media chatter