Quiz 4 - Module 3 Flashcards
LSTM output gate (ot)
- result of affine transformation of previous hidden state and current input passed through sigmoid
- modulates the value of the hidden state
- decides how much of the cell state we want to surface
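A minimal numpy sketch of this card, assuming the common formulation h_t = o_t ⊙ tanh(c_t); the names W_o, b_o and the concatenated-input form are illustrative assumptions, not the only parameterization:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of the LSTM output gate (W_o, b_o are illustrative names).
def output_gate_step(h_prev, x_t, c_t, W_o, b_o):
    z = np.concatenate([h_prev, x_t])   # previous hidden state + current input
    o_t = sigmoid(W_o @ z + b_o)        # affine transform passed through sigmoid
    h_t = o_t * np.tanh(c_t)            # gate how much of the cell state surfaces
    return o_t, h_t
```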
RNN Language Model: Inference
- Start with the first word; in practice, use a special symbol to indicate the start of a sentence
- Feed in the words of the history until we run out of history
- Take the hidden state h and transform it:
  - project h into a high-dimensional space (same dimension as the vocabulary)
  - normalize the transformed h
  - use softmax
- Result: the model's probability distribution over the next word (see the sketch below)
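A minimal sketch of that final transform, assuming a hidden state h and an output projection W_out, b_out (names illustrative):
```python
import numpy as np

# Project h to a vocabulary-sized space, then normalize with softmax.
def next_word_distribution(h, W_out, b_out):
    logits = W_out @ h + b_out   # shape: (vocab_size,)
    logits -= logits.max()       # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()   # probability distribution over the next word
```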
Why are graph embeddings useful?
- task-agnostic entity representations
- features are useful on downstream tasks without much data
- nearest neighbors are semantically meaningful
Contextualized Word Embedding Algorithms
ELMo, BERT
The most standard form of attention in current neural networks is implemented with the ____
Softmax
Many to many Sequence Modeling examples
- speech recognition
- optical character recognition
Token-level tasks
- ex: named entity recognition
- input a sentence (without any masked tokens) plus positions, pass it through the transformer encoder architecture, and output entity classifications (persons, locations, dates); see the sketch below
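A minimal sketch of the per-token classification step on top of the encoder, assuming encoder outputs of shape (seq_len, d_model); W, b, and the label set are illustrative:
```python
import numpy as np

# One classification per token: project each encoder state to label logits.
def classify_tokens(encoder_states, W, b):
    logits = encoder_states @ W + b   # (seq_len, num_entity_labels)
    return logits.argmax(axis=-1)     # one label (person, location, ...) per token
```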
Steps of Beam Search Algorithm
- Search exponential space in linear time
- Beam size k determines width of search
- At each step, extend each of k elements by one token
- The top k overall then become the hypotheses for the next step (see the sketch below)
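A minimal sketch of the loop described above; `step_log_probs(prefix)` is a hypothetical stand-in for the model's next-token scorer:
```python
import numpy as np

def beam_search(step_log_probs, bos, eos, k, max_len):
    beams = [([bos], 0.0)]   # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                  # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            log_probs = step_log_probs(seq)
            for tok in np.argsort(log_probs)[-k:]:   # extend by each top-k token
                candidates.append((seq + [int(tok)], score + log_probs[tok]))
        # keep the top k overall as the hypotheses for the next step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]
```
Scores here are raw cumulative log-probabilities; real decoders typically add length normalization so longer hypotheses are not unfairly penalized.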

Self-Attention improves on the multi-layer softmax attention method by ___
“Multi-query hidden-state propagation”:
- having a controller state for every single input
- the size of the controller state grows with the input (see the sketch below)
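A minimal sketch of that idea: every input position acts as its own query over all positions, so the set of "controller states" grows with the sequence. Learned query/key/value projections are omitted for brevity, so this is a simplification rather than a full transformer layer:
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# X has shape (seq_len, d): one row (and one query) per input position.
def self_attention(X):
    scores = X @ X.T                   # similarity of every position to every other
    weights = softmax(scores, axis=-1)
    return weights @ X                 # one data-dependent weighted sum per position
```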
Data Scarcity Issues
- Language similarity missing
  - the language differs from the source (i.e., not similar to English the way Spanish/French are)
- Domain mismatch
  - e.g., medical terminology rather than social language
- Evaluation
  - no access to a real test set
Many to One Sequence Modeling examples
- Sentiment Analysis
- Topic Classification
Attention
A weighting or probability distribution over the inputs that depends on the computational state and the inputs themselves
Differentiably Selecting a Vector from a set
- Given vectors {u1, …, un} and a query vector q
- The most similar vector to q can be selected (softly) via softmax(Uq), where the rows of U are the ui
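A minimal sketch, with the rows of U holding u1…un; the soft weighted sum approaches hard selection as one similarity dominates:
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Differentiable (soft) selection of the vector most similar to q.
def soft_select(U, q):
    weights = softmax(U @ q)   # attention weights: largest where u_i is closest to q
    return weights @ U         # weighted combination, differentiable end to end
```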
Alignment in machine translation
For each word in the target, get a distribution over words in the source
Graph embeddings are a form of ____ learning on graphs
unsupervised learning
What makes Non-Local Neural Networks differ from fully connected Neural Networks?
The output is a weighted sum computed dynamically from the data. In a fully connected layer, the weights are not dynamic: they are learned once and applied regardless of the input.
The similarity function in a non-local neural network is data-dependent. This lets the network learn the connectivity pattern: for each piece of data it learns what is important, then sums the contributions across those pieces to form the output (see the contrast sketch below).
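A contrast sketch, assuming dot-product similarity (one common choice of data-dependent similarity function):
```python
import numpy as np

# Fully connected: W is fixed after training, same for every input.
def fully_connected(X, W):
    return X @ W

# Non-local: the aggregation weights are computed from the data itself.
def non_local(X):
    sim = X @ X.T                            # data-dependent similarity
    sim -= sim.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(sim) / np.exp(sim).sum(axis=-1, keepdims=True)
    return weights @ X                       # dynamically weighted summation
```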
Distribution over inputs that depends on computational state and the inputs themselves
Attention
Roll a fair die and guess. Perplexity?
6
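Worked out: with six equally likely outcomes the entropy is log2(6) bits, so
```latex
\[
\mathrm{PP} = 2^{H} = 2^{-\sum_{i=1}^{6} \frac{1}{6}\log_2\frac{1}{6}} = 2^{\log_2 6} = 6
\]
```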
T/F: Softmax is useful for random selection
True
Recurrent Neural Networks are typically designed for ____ data
sequential
Sequence Transduction
Sequence to Sequence (Many to Many Sequence Modeling)
What information is important for graph representations?
- state
  - compactly representing all the data we have processed thus far
- neighborhood
  - which other elements to incorporate?
  - selecting from a set of elements with similarity or attention
- propagation of info
  - how to update info given the selected elements (see the sketch below)
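A minimal one-node sketch tying the three pieces together; the tanh combination is an illustrative choice, not a specific published update rule:
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# `state` is the node's compact summary; `neighbors` stacks neighbor vectors as rows.
def update_node(state, neighbors):
    weights = softmax(neighbors @ state)   # neighborhood: attention-based selection
    message = weights @ neighbors          # aggregate the selected elements
    return np.tanh(state + message)        # propagation: update the compact state
```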
What dominates computation cost in machine translation?
Inference
- Expensive
  - step-by-step computation (auto-regressive: predict a different token at each step)
  - output projection (vocabulary size × output length × beam size; see the sketch below)
  - deeper models
- Strategies
  - smaller vocabularies
  - more efficient computation
  - reduce depth / increase parallelism
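An illustrative back-of-the-envelope for the output-projection term (all numbers hypothetical):
```python
# Logits the output projection must compute per sentence, hypothetical sizes.
vocab_size, output_len, beam_size = 32_000, 20, 5
print(vocab_size * output_len * beam_size)  # 3,200,000 -> why smaller vocabs help
```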
What allows information to propagate directly between distant computational nodes while making minimal structural assumptions?
The attention algorithm