Lecture 11 Flashcards
Long Short-Term Memory and Gated Recurrent Units for NLP
Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) designed to handle sequential data such as time series, speech, and text. It is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. LSTM networks process data sequentially and keep their hidden state through time. They are used in tasks such as classification, speech recognition, machine translation, and healthcare applications.
Feedforward: simple, unidirectional predictive structures connecting input arrays to output arrays
Convolutional: a sliding window moving across time or multi-dimensional structures to capture features
Recurrent: neurons with feedback loops creating memory structures with limited persistence
Gated: cell units containing multiple neurons and providing long-term memory
Backpropagation in RNNs
A recurrent neural network can be imagined as multiple copies of the same network, each passing a message to its successor. Backpropagation through time then applies the chain rule across this unrolled chain of copies.
Vanishing Gradient Problem
Words from time steps far away no longer influence the current prediction as much as they should, because the gradient shrinks as it is propagated back through many time steps.
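A tiny numeric illustration of why this happens (the per-step factor of 0.9 below is an assumed, illustrative value, not derived from any real network):

```python
# Backpropagation through T time steps multiplies roughly one factor per step;
# when those factors are below 1, the product shrinks toward zero.
per_step_factor = 0.9
for T in (5, 20, 50, 100):
    print(f"T={T:3d}  gradient scale ~ {per_step_factor ** T:.2e}")
# T=100 gives ~2.7e-05: a word 100 steps back barely affects the weight update.
```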
Forget gate: how much information from the previous time step will be kept?
Input gate: which values will be updated, and what the new candidate values are
Sigmoid function: outputs a number between 0 and 1
Tanh function (hyperbolic tangent function): outputs a number between -1 and 1
Cell state: update the old cell state Ct-1 into the new cell state Ct.
* The new cell state Ct combines information kept from the past, ft * Ct-1, with valuable new information from the gated candidate values, it * C̃t: Ct = ft * Ct-1 + it * C̃t
Elementwise multiplication example: [8, 3, 2, 4, 2] ⊙ [0, 1, 0.5, 1, 4] = [0, 3, 1, 4, 8]
Based on the cell state, we will decide what the output will be
- tanh function filters the new cell state to characterize stored information
- Significant information in Ct -> ±1
- Minor details -> 0
- ht serves as the hidden state for the next time step (a forward-pass sketch of these gate computations follows below)
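A minimal NumPy sketch of one LSTM time step as described above. The weight layout (dicts keyed by gate name), sizes, and helper names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM time step. W, U, b are assumed dicts of weights keyed by gate: 'f', 'i', 'c', 'o'."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate: how much of C_prev to keep
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate: which values to update
    C_cand = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])   # new candidate values
    C_t = f_t * C_prev + i_t * C_cand                           # Ct = ft * Ct-1 + it * C̃t
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    h_t = o_t * np.tanh(C_t)                                    # hidden state for the next time step
    return h_t, C_t

# Tiny usage example with random weights (sizes are illustrative).
d_x, d_h = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(d_h, d_x)) for g in "fico"}
U = {g: rng.normal(size=(d_h, d_h)) for g in "fico"}
b = {g: np.zeros(d_h) for g in "fico"}
h, C = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):   # a 5-step input sequence
    h, C = lstm_step(x_t, h, C, W, U, b)
print(h)
```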
Gated Recurrent Units (GRU)
In 2014, Cho and his colleagues posted a paper entitled, “Learning
phrase representations using RNN encoder-decoder for statistical
machine translation.” In this paper, the researchers introduced a
simplified LSTM model, which later became referred to as a GRU.
They evaluated their approach on the English/French translation task
of the WMT’14 workshop. In later papers, the GRU has often
performed as well as LSTM, even though it is simpler.
Gated Recurrent Unit (GRU)
GRU is a variation of LSTM that also adopts the gated design.
* Differences:
* GRU uses an update gate z to substitute for the input and forget gates
* GRU combines the cell state Ct and hidden state ht of the LSTM into a single state ht
* GRU obtains performance similar to LSTM with fewer parameters and faster convergence (Cho et al., 2014)
Update gate: controls the composition of the new state
Reset gate: determines how much old information is needed in the alternative state h̃t
Alternative state: contains new information
New state: replaces selected old information with new information (see the step-by-step sketch below)
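For comparison with the LSTM sketch above, a minimal NumPy sketch of one GRU step under the same assumptions (weights as dicts keyed by gate; names and sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step. W, U, b are assumed dicts of weights keyed by 'z', 'r', 'h'."""
    z_t = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])             # update gate: composition of the new state
    r_t = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])             # reset gate: how much old information to use
    h_cand = np.tanh(W['h'] @ x_t + U['h'] @ (r_t * h_prev) + b['h'])  # alternative (candidate) state
    h_t = (1 - z_t) * h_prev + z_t * h_cand                            # new state: mix of old and new information
    return h_t
```

There is no separate cell state and one gate fewer than in the LSTM, which is where the parameter savings come from.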
Text Summarization Using LSTM-CNN (Song et al., 2018, Multimedia Tools & Apps)
- Abstractive text summarization generates readable summaries without being constrained to phrases from the original text
- Training data: human-generated abstractive summary bullets from CNN and DailyMail stories
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) toolkit was used for evaluation
- LSTM-CNN outperformed four previous models by 1-4%
Extracting Temporal Relations from Korean Text
Lim & Choi, 2018, IEEE Big Data/Smart Computing
- From the article: it is “difficult to correctly recognize the temporal relations from Korean text owing to the inherent linguistic characteristics of the Korean language”
- Dataset: Korean TimeBank - 2,393 annotated documents and 6,190 Korean sentences
- F1 scores ranged from 0.46 to 0.90 across the various temporal relation types
Emotion Recognition in Online Comments (Li & Xiao, 2020)
- The model consists of (see the sketch after this list):
  - an embedding layer
  - a bidirectional LSTM layer
  - a feedforward attention layer
  - a concatenation layer
  - an output layer
- Training data: emotion-labelled Twitter data and blog data
- F1 measure: 62.78%
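A hedged Keras sketch of a model with these five layer types. The hyperparameters, the attention formulation, and what exactly gets concatenated are illustrative assumptions; the paper's actual design details are not given in these notes:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

# Hypothetical hyperparameters (vocabulary, embedding size, sequence length, units, classes).
VOCAB, EMB, MAXLEN, UNITS, CLASSES = 20000, 100, 60, 128, 6

tokens = layers.Input(shape=(MAXLEN,), dtype="int32")
emb = layers.Embedding(VOCAB, EMB)(tokens)                                   # embedding layer
seq = layers.Bidirectional(layers.LSTM(UNITS, return_sequences=True))(emb)   # bidirectional LSTM layer

# Feedforward attention layer: a small dense net scores each time step,
# softmax turns the scores into weights, and the weighted sum is a context vector.
scores = layers.Dense(1, activation="tanh")(seq)                 # (batch, MAXLEN, 1)
weights = layers.Softmax(axis=1)(scores)
context = layers.Flatten()(layers.Dot(axes=1)([weights, seq]))   # weighted sum over time

# Concatenation layer: the notes do not say what is concatenated; this sketch joins
# the attention context with a pooled summary of the BiLSTM outputs as one plausible choice.
merged = layers.Concatenate()([context, layers.GlobalMaxPooling1D()(seq)])

outputs = layers.Dense(CLASSES, activation="softmax")(merged)    # output layer
model = Model(tokens, outputs)
model.summary()
```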
LSTM
Key Features
- Long Short-Term Memory layer - Hochreiter & Schmidhuber, 1997.
- Based on available runtime hardware and constraints, this layer will choose different implementations (cuDNN-based or pure-TensorFlow) to maximize the performance. If a GPU is available and all the arguments to the layer meet the requirements of the cuDNN kernel (see below for details), the layer will use a fast cuDNN implementation.
- When processing very long sequences (possibly infinite), you may want to use the pattern of cross-batch statefulness (see the sketch below).
- Normally, the internal state of an RNN layer is reset every time it sees a new batch (i.e., every sample seen by the layer is assumed to be independent of the past). The layer will only maintain a state while processing a given sample.
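A minimal usage sketch of cross-batch statefulness; the layer size, batch shapes, and random data are illustrative:

```python
import numpy as np
import tensorflow as tf

# With stateful=True the layer keeps its internal state across batches, so consecutive
# batches can be fed as consecutive chunks of the same long sequences.
lstm = tf.keras.layers.LSTM(32, stateful=True)

chunk1 = np.random.random((16, 10, 8)).astype("float32")   # 16 sequences, first 10 steps
chunk2 = np.random.random((16, 10, 8)).astype("float32")   # the next 10 steps of the same sequences

out1 = lstm(chunk1)     # state is kept after this call ...
out2 = lstm(chunk2)     # ... and used as the starting state here
lstm.reset_states()     # reset explicitly when a new set of sequences begins
```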
LSTM
Key Arguments
- units: Positive integer, dimensionality of the output space.
- activation: Activation function to use. Default: hyperbolic tangent (tanh). If you pass None, no activation is applied (i.e., "linear" activation: a(x) = x).
- recurrent_activation: Activation function to use for the recurrent step. Default: sigmoid. If you pass None, no activation is applied (i.e., "linear" activation: a(x) = x).
- kernel_initializer: Initializer for the kernel weights matrix, used for the linear transformation of the inputs. Default: glorot_uniform.
- unit_forget_bias: Boolean (default True). If True, add 1 to the bias of the forget gate at initialization. Setting it to True will also force bias_initializer="zeros". (A construction sketch with these arguments follows below.)
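A sketch constructing tf.keras.layers.LSTM with the arguments listed above; the concrete values and input shapes are illustrative, not recommendations:

```python
import tensorflow as tf

lstm = tf.keras.layers.LSTM(
    units=64,                          # dimensionality of the output space
    activation="tanh",                 # default activation
    recurrent_activation="sigmoid",    # default recurrent activation
    kernel_initializer="glorot_uniform",
    unit_forget_bias=True,             # adds 1 to the forget-gate bias at initialization
)

x = tf.random.normal((4, 12, 8))       # (batch, time steps, features)
print(lstm(x).shape)                   # (4, 64)
```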
GRU
Key Features
- Gated Recurrent Unit based on Cho et al. (2014).
- There are two variants of the GRU implementation. The default one is based on v3 and has the reset gate applied to the hidden state before the matrix multiplication. The other one is based on the original paper and has the order reversed.
- The second variant is compatible with CuDNNGRU (GPU-only) and allows inference on CPU. Thus it has separate biases for kernel and recurrent_kernel. To use this variant, set reset_after=True and recurrent_activation='sigmoid' (see the sketch below).
- In TensorFlow 2.0, the built-in LSTM and GRU layers have been updated to leverage cuDNN kernels by default when a GPU is available. With this change, the prior CuDNNLSTM/CuDNNGRU layers have been deprecated, and you can build your model without worrying about the hardware it will run on.
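A sketch configuring the two implementation variants described above; the layer size and input shapes are illustrative:

```python
import tensorflow as tf

# reset_after=True with recurrent_activation="sigmoid" is the cuDNN-compatible
# configuration; reset_after=False selects the original-paper ordering.
gru_cudnn_ok = tf.keras.layers.GRU(64, reset_after=True, recurrent_activation="sigmoid")
gru_original = tf.keras.layers.GRU(64, reset_after=False)

x = tf.random.normal((4, 12, 8))                       # (batch, time steps, features)
print(gru_cudnn_ok(x).shape, gru_original(x).shape)    # both (4, 64)
```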
GRU
Key Arguments
- units: Positive integer, dimensionality of the output space.
- activation: Activation function to use. Default: hyperbolic tangent (tanh). If you pass None, no activation is applied (i.e., "linear" activation: a(x) = x).
- recurrent_activation: Activation function to use for the recurrent step. Default: sigmoid. If you pass None, no activation is applied (i.e., "linear" activation: a(x) = x).
- kernel_initializer: Initializer for the kernel weights matrix, used for the linear transformation of the inputs. Default: glorot_uniform.
- reset_after: GRU convention - whether to apply the reset gate after or before the matrix multiplication. False = "before", True = "after" (default, and required for the cuDNN-compatible variant described above). (A construction sketch follows below.)
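A sketch constructing tf.keras.layers.GRU with the key arguments listed above; the concrete values are illustrative:

```python
import tensorflow as tf

gru = tf.keras.layers.GRU(
    units=64,                          # dimensionality of the output space
    activation="tanh",                 # default activation
    recurrent_activation="sigmoid",    # default recurrent activation
    kernel_initializer="glorot_uniform",
    reset_after=True,                  # cuDNN-compatible convention (default)
)
print(gru(tf.random.normal((4, 12, 8))).shape)   # (batch, time, features) -> (4, 64)
```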