Final Exam Review Flashcards
back propagation -
Training Process:
1. Forward Pass:
The input data is passed through the neural network layer by layer to produce an output. The output is compared to the actual target values, and the error is calculated.
2. Backward Pass (Backpropagation):
The algorithm then works backward through the network. It calculates the gradient of the error with respect to the weights of the network. This is done using the chain rule of calculus. The gradients indicate how much the error would increase or decrease if the weights were adjusted.
3. Weight Update:
The weights of the network are then updated in the opposite direction of the calculated gradients. This process is repeated iteratively, adjusting the weights to minimize the error. (A minimal code sketch of these steps follows.)
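A minimal sketch of these three steps in Python, using a toy one-layer model with made-up data (the sizes and numbers are assumptions for illustration, not from the course material):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 3))            # 8 toy training examples with 3 features
    y = X @ np.array([1.0, -2.0, 0.5])     # targets produced by known "true" weights
    w = np.zeros(3)                        # weights the model has to learn
    lr = 0.1                               # learning rate

    for epoch in range(200):
        pred = X @ w                       # 1. forward pass: produce an output
        err = pred - y                     # compare output to targets -> error
        grad = 2 * X.T @ err / len(y)      # 2. backward pass: gradient of MSE w.r.t. w (chain rule)
        w -= lr * grad                     # 3. weight update: step opposite to the gradient

    print(w)                               # approaches [1.0, -2.0, 0.5] as the error shrinks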
how do we train a neural network to be intelligent enough to do tasks like prediction and classification?
back propagation.
back propagation function -
it is through back propagation that we update the weights in the matrix so that the network gets a good representation of the information - that's the neural network approach
with back propagation, how do you actually do the weight update?
The weights are updated in the opposite direction of the computed gradients. The learning rate determines the step size of this update.
This process is typically repeated for multiple iterations (epochs) until the model converges to a set of weights that minimizes the loss.
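As a tiny sketch of the update rule itself (the numbers are made up purely to show the rule):

    import numpy as np

    weights = np.array([0.50, -1.20, 0.30])    # current weights (illustrative values)
    gradients = np.array([0.10, -0.40, 0.05])  # gradient of the loss w.r.t. each weight
    learning_rate = 0.01                       # step size

    weights = weights - learning_rate * gradients  # step in the opposite direction of the gradient
    print(weights)                                 # [0.499, -1.196, 0.2995]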
what conceptual things happen during back propagation
Training Process:
1. Forward Pass
2. Backward Pass (Backpropagation)
3. Weight Update
the difference between the predicted value and the actual value
- Backpropagation takes the difference between the predicted value and the actual value and uses that error term to adjust each node’s weights.
- The process works backwards from the final layers to earlier layers, one layer at a time, and computes the contribution that each weight in the given layer had in the loss value.
- The algorithm that uses these gradients to minimize the loss is called "gradient descent": it iteratively moves the weights in the direction of greatest improvement in prediction, i.e. the steepest decrease in the loss (a toy sketch follows below).
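A toy, self-contained sketch of gradient descent on an assumed one-weight loss, loss(w) = (w - 3)^2, just to show the iterative movement toward the minimum:

    w = 0.0                     # arbitrary starting weight
    lr = 0.1                    # learning rate (step size)

    for step in range(50):
        grad = 2 * (w - 3)      # derivative of (w - 3)**2 with respect to w
        w -= lr * grad          # move in the direction that decreases the loss

    print(w)                    # ends up close to 3.0, the value that minimizes the loss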
back propagation speed
the step size (learning rate) controls how fast learning happens: larger steps (big vs little) move toward the optimal point faster, but steps that are too large can overshoot it, while very small steps converge slowly
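An illustrative comparison on an assumed toy loss, loss(w) = w^2, showing the trade-off between big and little steps:

    def run(learning_rate, steps=20):
        w = 5.0                         # starting weight
        for _ in range(steps):
            w -= learning_rate * 2 * w  # gradient of w**2 is 2*w
        return w

    print(run(0.1))   # small steps: converges slowly (still about 0.06 after 20 steps)
    print(run(0.4))   # larger steps: converges much faster (essentially 0)
    print(run(1.1))   # steps that are too large: overshoots and diverges (magnitude keeps growing)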
word embedding -
Word embedding is a technique in natural language processing (NLP) and machine learning that represents words as vectors of real numbers. These vectors capture semantic relationships between words, allowing words with similar meanings to have similar vector representations. In other words, word embedding is a way to map words to dense vectors of real numbers, often in a continuous vector space.
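A tiny illustration with hand-made vectors (invented numbers, not learned embeddings) of what "similar words get similar vectors" means:

    import numpy as np

    embeddings = {                              # made-up 3-dimensional word vectors
        "king":  np.array([0.80, 0.65, 0.10]),
        "queen": np.array([0.78, 0.70, 0.12]),
        "apple": np.array([0.10, 0.05, 0.90]),
    }

    def cosine(a, b):                           # cosine similarity between two vectors
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(embeddings["king"], embeddings["queen"]))  # high: related meanings
    print(cosine(embeddings["king"], embeddings["apple"]))  # much lower: unrelated words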
epoch
one Epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE.
An epoch refers to one complete pass through the entire training dataset during the training of a machine learning model.
batch
one batch contains the training examples used in one weight update (common recommendation: no more than 32)
A batch is a subset of the training dataset that is processed together in one iteration.
Iteration
number of iterations (batches) = total training data / batch size
An iteration, in the context of training, refers to one update of the model’s weights.
batch size calculation based on code
Number of Iterations = Total Dataset Size / Batch Size
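A short calculation with assumed numbers:

    import math

    total_dataset_size = 60000      # assumed number of training examples
    batch_size = 32                 # assumed batch size

    iterations_per_epoch = math.ceil(total_dataset_size / batch_size)
    print(iterations_per_epoch)     # 1875 weight updates in one full pass over the data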
name 3 word embedding techniques.
One-hot Vector
TF-IDF
Word2Vec
GloVe
fastText
ELMo
Attention Mechanism – BERT
XLNet
why do we need to do word embedding in a neural network approach?
word embedding is crucial because it represents words as vectors in a continuous vector space. This helps capture semantic relationships between words, enabling the network to understand context, similarities and differences.
GloVe (Global Vectors for Word Representation):
GloVe is an unsupervised learning algorithm that learns word representations by examining global word co-occurrence statistics. It creates embeddings by factorizing the logarithm of the word co-occurrence matrix.
Word2Vec (Word to Vector):
Developed by Google, Word2Vec represents words as dense vectors in a continuous vector space. It uses neural networks to learn word embeddings based on the context in which words appear.
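A minimal training sketch, assuming the gensim library (version 4.x) is available; the tiny corpus is made up, and real embeddings need far more text:

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "rug"]]

    model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=1, epochs=100)

    print(model.wv["cat"][:5])           # first few dimensions of the dense vector for "cat"
    print(model.wv.most_similar("cat"))  # nearest neighbours in the embedding space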
FastText:
Developed by Facebook, FastText extends Word2Vec by representing each word as a bag of character n-grams. This allows it to generate embeddings for out-of-vocabulary words and capture morphological information.
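A sketch of the out-of-vocabulary behaviour, again assuming gensim 4.x and a made-up corpus:

    from gensim.models import FastText

    sentences = [["natural", "language", "processing"],
                 ["language", "models", "process", "text"]]

    model = FastText(sentences, vector_size=32, window=3, min_count=1, epochs=50)

    print(model.wv["language"][:5])   # vector for a word seen during training
    print(model.wv["languages"][:5])  # still works: built from character n-grams of an unseen word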
TF-IDF (Term Frequency-Inverse Document Frequency):
While not a neural embedding technique, TF-IDF is a traditional method for representing words based on their importance in a document or a corpus. It is commonly used in information retrieval.
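A small sketch of turning documents into TF-IDF vectors, assuming a recent scikit-learn is available:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "dogs and cats are pets"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)          # sparse matrix: documents x vocabulary terms

    print(vectorizer.get_feature_names_out())   # the learned vocabulary
    print(X.toarray().round(2))                 # TF-IDF weight of each term in each document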
ELMo (Embeddings from Language Models):
ELMo generates word embeddings by considering the context in which words appear in a sentence. It uses a deep, context-dependent bidirectional LSTM (Long Short-Term Memory) model.
BERT (Bidirectional Encoder Representations from Transformers):
BERT is a transformer-based model that considers bidirectional context information for word embeddings. It has been highly successful in various NLP tasks and captures complex linguistic patterns.
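A minimal sketch of getting contextual embeddings from BERT, assuming the Hugging Face transformers library (and PyTorch) and the public "bert-base-uncased" checkpoint:

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
    outputs = model(**inputs)

    print(outputs.last_hidden_state.shape)  # (1, number of tokens, 768): one contextual vector per token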
XLNet:
Building on BERT's success, XLNet uses a permutation language modeling objective to capture bidirectional context information. It overcomes some limitations of BERT, particularly in modeling dependencies between the predicted (masked) tokens.
CNN -
CNN stands for Convolutional Neural Network. It is a type of artificial neural network designed for processing structured grid data, such as images. CNNs are particularly effective in computer vision tasks, including image recognition, object detection, and image classification.
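A minimal sketch of a small CNN, assuming PyTorch; the layer sizes are arbitrary choices for 28x28 grayscale images and 10 classes:

    import torch
    from torch import nn

    model = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolution: slide filters over the image
        nn.ReLU(),
        nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
        nn.Conv2d(16, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),                             # 14x14 -> 7x7
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 10),                   # classification head: 10 class scores
    )

    x = torch.randn(4, 1, 28, 28)   # a dummy batch of 4 images
    print(model(x).shape)           # torch.Size([4, 10])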