Quiz 4 - Module 3 Flashcards
LSTM output gate (ot)
- result of affine transformation of previous hidden state and current input passed through sigmoid
- modulates the value of the hidden state
- decides how much of the cell state we want to surface
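In symbols, a common formulation (the parameter names W_o, U_o, b_o and the cell state c_t are assumed notation, not from the card):

$$
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \qquad h_t = o_t \odot \tanh(c_t)
$$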
RNN Language Model: Inference
- Start with the first word; in practice, use a special symbol to indicate the start of a sentence
- Feed in the words of the history until we run out of history
- Take the hidden state h and transform it
- project h into a high-dimensional space (same dimension as the vocabulary)
- normalize the transformed h
- use softmax
- result: the model's probability distribution over the next word (sketched below)
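A minimal NumPy sketch of the transform-and-normalize step, assuming a hidden state h, an output projection matrix W_out of shape (vocab_size, hidden_dim), and a bias b (all names illustrative):

```python
import numpy as np

def next_word_distribution(h, W_out, b):
    """Project hidden state h into vocabulary space and normalize with softmax."""
    logits = W_out @ h + b              # shape: (vocab_size,)
    logits -= logits.max()              # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()              # probability distribution over the next word

# Example: tiny vocabulary of 5 words, hidden size 3
rng = np.random.default_rng(0)
h = rng.normal(size=3)
W_out = rng.normal(size=(5, 3))
b = np.zeros(5)
print(next_word_distribution(h, W_out, b))  # sums to 1
```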
Why are graph embeddings useful?
- task-agnostic entity representations
- features are useful on downstream tasks without much data
- nearest neighbors are semantically meaningful
Contextualized Word Embedding Algorithms
ELMo, BERT
The most standard form of attention in current neural networks is implemented with the ____
Softmax
Many to many Sequence Modeling examples
- speech recognition
- optical character recognition
Token-level tasks
- ex: named entity recognition
- input a sentence (no masked tokens) plus positions, pass it through the transformer encoder architecture, and output classifications of entities (persons, locations, dates)
Steps of Beam Search Algorithm
- Search exponential space in linear time
- Beam size k determines width of search
- At each step, extend each of k elements by one token
- The top k hypotheses overall then become the hypotheses for the next step
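A minimal sketch of the loop, assuming a hypothetical next_token_scores(prefix) function that returns one log-probability per vocabulary item for a given prefix:

```python
def beam_search(next_token_scores, vocab, k, max_len, bos="<s>", eos="</s>"):
    """Keep the k best partial hypotheses; extend each by one token per step."""
    beams = [([bos], 0.0)]                      # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:               # finished hypotheses carry over unchanged
                candidates.append((tokens, score))
                continue
            log_probs = next_token_scores(tokens)   # one log-prob per vocabulary item
            for tok, lp in zip(vocab, log_probs):
                candidates.append((tokens + [tok], score + lp))
        # keep only the top-k hypotheses overall for the next step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams
```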
Self-Attention improves on the multi-layer softmax attention method by ___
“Multi-query hidden-state propagation”
Having a controller state for every single input.
The size of the controller state grows with the input
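A minimal NumPy sketch of the contrast: every input position gets its own query, so the "controller state" is a matrix whose size grows with the input (the projection matrices Wq, Wk, Wv are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) inputs. Each row gets its own query, key, and value."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # (n, d_k), (n, d_k), (n, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[1])         # (n, n): every query against every key
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over inputs, per query
    return weights @ V                             # one output vector per input position
```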
Data Scarcity Issues
- Language similarity missing
- the language is unlike the source (i.e., not similar to English the way Spanish/French are)
- Domain mismatch
- i.e., medical terminology rather than social language
- Evaluation
- no access to a real test set
Many to One Sequence Modeling examples
- Sentiment Analysis
- Topic Classification
Attention
Weighting or probability distribution over the inputs that depends on the computational state and the inputs
Differentiably Selecting a Vector from a set
- Given vectors {u1, …, un} and query vector q
- The most similar vector to q can be found via softmax(Uq)
Alignment in machine translation
For each word in the target, get a distribution over words in the source
Graph embeddings are a form of ____ learning on graphs
unsupervised learning
What makes Non-Local Neural Networks differ from fully connected Neural Networks?
The output is a weighted summation computed dynamically from the data. In a fully connected layer, the weights are not dynamic (they are learned once and applied regardless of the input).
The similarity function in a non-local neural network is data-dependent. This allows the network to learn the connectivity pattern, determine for each piece of data what is important, and then sum the contributions across those pieces of data to form the output.
Distribution over inputs that depends on computational state and the inputs themselves
Attention
Roll a fair die and guess. Perplexity?
6
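Worked out: a uniform guess assigns probability 1/6 to each outcome, so

$$
\text{PP} = 2^{H} = 2^{-\sum_{i=1}^{6}\frac{1}{6}\log_2 \frac{1}{6}} = 2^{\log_2 6} = 6
$$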
T/F: Softmax is useful for random selection
True
Recurrent Neural Networks are typically designed for ____ data
sequential
Sequence Transduction
Sequence to Sequence (Many to Many Sequence Modeling)
what information is important for graph representations?
- state
- compactly representing all the data we have processed thus far
- neighborhood
- what other elements to incorporate?
- selecting from a set of elements with similarity or attention
- propagation of info
- how to update info given selected elements
What dominates computation cost in machine translation?
Inference
- Expensive
- step-by-step computation (auto-regressive: a different token is predicted at each step)
- output projection (vocabulary size × output length × beam size)
- deeper models
- Strategies
- smaller vocabs
- more efficient computation
- reduce depth/increase parallelism
What allows information to propagate directly between distant computational nodes while making minimal structural assumptions?
The attention algorithm
Current (Standard) Approach to (Soft) Attention
- Take a set of vectors u1,…un
- Inner product each of the vectors with controller q
- unordered set
- Take the softmax of the set of numbers to get weights aj
- The output is the weighted sum of the inputs uj with weights aj (sketched below)
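A minimal NumPy sketch of these steps, with the rows of U holding u1, …, un and q as the controller:

```python
import numpy as np

def soft_attention(U, q):
    """U: (n, d) set of input vectors, q: (d,) controller. Returns the attended output."""
    scores = U @ q                      # inner product of each u_j with q
    scores -= scores.max()              # numerical stability
    a = np.exp(scores)
    a /= a.sum()                        # softmax weights a_j (sum to 1)
    return a @ U                        # weighted sum of the inputs
```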
How to evaluate word embeddings
- intrinsic
- evaluation on a specific/intermediate subtask
- ex - nearest neighbor of a particular word vector
- fast to compute
- helps to understand the system
- not clear if really helpful unless correlation to real task is established
- extrinsic
- evaluation on real task
- can take a long time to compute
- unclear if the subsystem is the problem or its interaction
- if replacing exactly one subsystem with another improves accuracy -> winning
RNNs, when unrolled, are just ____ with ____ transformations and ____
RNNs, when unrolled, are just feed-forward Neural Networks with affine transformations and nonlinearities
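In symbols, a common formulation (with tanh standing in for the nonlinearity):

$$
h_t = \tanh(W_h h_{t-1} + W_x x_t + b)
$$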
How do we define the probability of a context word given a center word?
Use the softmax on the inner product between the context word and the center word; both words are represented by vectors (see the formula below).
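With v_c the center-word vector, u_o the context-word vector, and V the vocabulary, this is:

$$
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
$$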
Graph Embedding
Via gradient descent, optimize an objective under which connected nodes have more similar embeddings than unconnected nodes (one common form is sketched below)
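One common objective of this form (an assumption; s is a similarity score, λ a margin, θ the node embeddings, N(u) the neighborhood of u, E the edge set):

$$
\mathcal{L} = \sum_{(u,v)\in E}\ \sum_{v' \notin \mathcal{N}(u)} \max\big(0,\ \lambda + s(\theta_u, \theta_{v'}) - s(\theta_u, \theta_v)\big)
$$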
Multi-layer Soft Attention
Layers of attention where each layer takes as input the output of the previous attention layer. The controller q is the hidden state.
LSTM input gate (gt)
- result of affine transformation of previous hidden state and current input passed through sigmoid
- decides how much the input should affect the cell state
Bleu score
Precision-based metric that measures n-gram overlap with a human reference
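A common form of the score, with p_n the modified n-gram precisions (typically up to N = 4), w_n their weights, r the reference length, and c the candidate length:

$$
\text{BLEU} = \text{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad \text{BP} = \min\big(1,\ e^{1 - r/c}\big)
$$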
fastText
sub-word embeddings
Adds sub-word information to word2vec, which better handles out-of-vocabulary words
Word2Vec Idea/Context
- Idea - use words to predict their context words
- Context - a fixed window of size 2m
Applications of Language Modeling
- predictive typing
- in search fields
- for keyboards
- for assisted typing, e.g. sentence completion
- automatic speech recognition
- how likely is the user to have said “my hair is wet” vs. “my hairy sweat”
- basic grammar correction
- p(They’re happy together) > p(Their happy together)
Non-Autoregressive Machine Translation
The model generates all the tokens of a sequence in parallel, resulting in faster generation compared to auto-regressive models, but at the cost of lower accuracy
Conditional Language Modeling
Condition the language modeling equation on a new context, c
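In other words, the chain-rule factorization carries the context c at every step:

$$
P(y_1, \ldots, y_T \mid c) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, c)
$$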
Hierarchical Compositionality for NLP
character -> word -> NP/VP/… -> clause -> sentence -> story
Flip a fair coin and guess. Perplexity?
2
Total loss for knowledge distillation
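A common formulation (an assumption, not taken from the card): a weighted sum of the student's cross-entropy on the ground-truth labels and a distillation term against the teacher's softened distribution, with mixing weight α and temperature T:

$$
\mathcal{L}_{\text{total}} = \alpha\, \mathcal{L}_{\text{CE}}\big(y,\ p_{\text{student}}\big) + (1-\alpha)\, T^2\, \mathrm{KL}\big(p_{\text{teacher}}^{T}\ \big\|\ p_{\text{student}}^{T}\big)
$$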