GENERAL_ML_CULTURE Flashcards
Here we collect info from blogs and posts: where the world of AI is going, hot topics, etc.
What are some things it would be great to do in NLP in 2020?
- Learning from few samples rather than from large datasets
- Compact and efficient rather than huge models
- Evaluate on at least another language (from a different language family)
- New datasets should contain at least one other language
- Characterize wrong utterances: what is the linguistic cause of the error?
NeurIPS 2019
What is GLUE in NLP?
In an April 2018 paper coauthored with collaborators from the University of Washington and DeepMind, the Google-owned artificial intelligence company, Bowman introduced a battery of nine reading-comprehension tasks for computers called GLUE (General Language Understanding Evaluation). The test was designed as “a fairly representative sample of what the research community thought were interesting challenges,” said Bowman, but also “pretty straightforward for humans.” For example, one task asks whether a sentence is true based on information offered in a preceding sentence. If you can tell that “President Trump landed in Iraq for the start of a seven-day visit” implies that “President Trump is on an overseas visit,” you’ve just passed.
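As a toy illustration of that entailment-style judgment (not GLUE itself), here is a minimal sketch assuming the Hugging Face transformers library and the publicly available roberta-large-mnli checkpoint:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical choice of checkpoint: a RoBERTa model trained on MNLI-style entailment.
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = "President Trump landed in Iraq for the start of a seven-day visit."
hypothesis = "President Trump is on an overseas visit."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The model scores contradiction / neutral / entailment for the sentence pair.
print(model.config.id2label[logits.argmax().item()])  # expected: something like "ENTAILMENT"
```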
What is the current status of NLP systems on the GLUE challenge?
State of the art was around 60% before 2018. Then, after BERT and its successors arrived in late 2018, scores rose rapidly and soon surpassed the human baseline.
Who developed GPT and GPT-2?
OpenAI
In non-technical terms: what are the 3 main ingredients of BERT?
A deep pretrained language model, attention, and bidirectionality.
They existed independently before BERT, but until Google released its recipe in late 2018, no one had combined them in such a powerful way.
Tell me more, in non-technical terms, about bidirectionality in BERT.
Finally, the third ingredient in BERT’s recipe takes nonlinear reading one step further.
Unlike other pretrained language models, many of which are created by having neural networks read terabytes of text from left to right, BERT’s model reads left to right and right to left at the same time, and learns to predict words in the middle that have been randomly masked from view.
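A minimal sketch of that masked-word objective, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; the model uses context on both sides of the mask when ranking candidates:

```python
from transformers import pipeline

# Fill-mask pipeline wraps BERT's masked-language-model head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT reads the words BEFORE and AFTER [MASK] at the same time to rank candidates.
for candidate in fill_mask("The doctor told the [MASK] to take the medicine twice a day."):
    print(candidate["token_str"], round(candidate["score"], 3))
```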
In non-technical terms, what is attention?
Attention is the ability to figure out which features of a sentence are most important.
Before attention, state-of-the-art neural networks suffered from a built-in constraint: they all read through the sequence of words one by one.
Attention is a mechanism that lets each layer of the network assign more weight to some specific features of the input than to others. This new attention-focused architecture, called a transformer, could take a sentence like “a dog bites the man” as input and encode each word in many different ways in parallel. For example, a transformer might connect “bites” and “man” together as verb and object, while ignoring “a”.
This treelike representation of sentences gave transformers a powerful way to model contextual meaning, and also to efficiently learn associations between words that might be far away from each other in complex sentences.
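For intuition, here is a toy sketch (plain NumPy, made-up random vectors, not the actual transformer code) of the scaled dot-product self-attention that produces those per-word weightings:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # project every token three ways
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # how much each token "looks at" each other token
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                   # weighted mix of values + the attention map

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens ("a dog bites the man"), 8-dim vectors
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(attn.shape)                                 # (5, 5): one weight per (token, token) pair
```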
What were the 2 main word embeddings used by the old NLP models?
Word2Vec and GloVe
What is the main limitation of using pre-trained NLP word embedding?
Though these pretrained word embeddings have been immensely influential, they have a major limitation: they only incorporate previous knowledge in the first layer of the model—the rest of the network still needs to be trained from scratch.
Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words.
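A small hypothetical PyTorch classifier makes the limitation concrete: only the first (embedding) layer receives pretrained knowledge, while everything above it still trains from scratch. The shapes and stand-in vectors below are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, num_classes = 10_000, 300, 2
pretrained_vectors = torch.randn(vocab_size, emb_dim)   # stand-in for Word2Vec/GloVe vectors

model = nn.Sequential(
    nn.EmbeddingBag(vocab_size, emb_dim),  # layer 0: the only place pretrained knowledge enters
    nn.Linear(emb_dim, 128),               # everything from here up still learns from scratch
    nn.ReLU(),
    nn.Linear(128, num_classes),
)
model[0].weight.data.copy_(pretrained_vectors)  # transfer learning stops at the first layer
```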
In what sense can ULMFiT, ELMo, the OpenAI GPT, and BERT be considered the ImageNet of language?
one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.
“ImageNet for language”—that is, a task that enables models to learn higher-level nuances of language, similarly to how ImageNet has enabled training of CV models that learn general-purpose features of images.
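By contrast, here is a hedged sketch of the newer paradigm, assuming the Hugging Face transformers library: all pretrained transformer layers are reused, and only a small task head is new:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Every transformer layer arrives pretrained; only the small classification head on top
# is newly initialized, and the whole stack is then fine-tuned on the downstream task.
```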
What are the main aspects that are true both for ImageNet-based deep CV and for BERT & co. in NLP?
* training data are as important as the algorithm
* transfer learning from networks trained on huge datasets is essential
What are some open problems in face recognition?
Open problems include multi-camera tracking, re-identification (when someone exits the frame and then re-enters), robustness to occasional camera outages, and automatic multi-camera calibration. Such capabilities will advance significantly in the next few years.
What is self-supervised learning? Give an example of a field where it is used a lot.
self-supervised learning. It’s similar to supervised learning, but instead of training the system to map data examples to a classification, we mask some examples and ask the machine to predict the missing pieces. For instance, we might mask some frames of a video and train the machine to fill in the blanks based on the remaining frames.
It is the key to NLP: models such as BERT, RoBERTa, XLNet, and XLM are trained in a self-supervised manner to predict words missing from a text.
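A toy sketch of the idea: hide part of the data and ask the model to reconstruct it. The helper and masking rate below are illustrative only, loosely following BERT's ~15% masking:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide ~mask_rate of the tokens; the hidden originals become the training targets."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(mask_token)   # what the model sees
            targets.append(tok)         # what it must reconstruct
        else:
            inputs.append(tok)
            targets.append(None)        # no loss on unmasked positions
    return inputs, targets

print(mask_tokens("the cat sat on the mat".split()))
```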
What is one of the most fascinating things about reinforcement learning?
But there’s a problem here: to be able to collect rewards, some “non-special” actions need to be taken: you have to walk towards the coins before you can collect them. So an agent must learn how to handle postponed rewards by learning to link them to the actions that really caused them. In my opinion, this is the most fascinating thing in Reinforcement Learning.
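One standard way this credit assignment is formalized is the discounted return, sketched below with made-up rewards: a coin collected late still credits the earlier "walking" steps through the discount factor gamma:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# No reward while walking, then a coin at the end: the earlier steps still receive credit.
print(discounted_returns([0, 0, 0, 0, 1]))   # [~0.96, ~0.97, ~0.98, 0.99, 1.0]
```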
What was the main highlight for NLP in 2019?
Definitely Transformers
What does BERT stand for?
BERT stands for Bidirectional Encoder Representations from Transformers.
This model is basically a multi-layer bidirectional Transformer encoder (Devlin, Chang, Lee, & Toutanova, 2019), and there are multiple excellent guides about how it works generally, including the Illustrated Transformer. What we focus on is one specific component of Transformer architecture known as self-attention. In a nutshell, it is a way to weigh the components of the input and output sequences so as to model relations between them, even long-distance dependencies.
Is it certain that BERT's success is due to self-attention?
It is still under debate; see the blog post here:
https://text-machine-lab.github.io/blog/2020/bert-secrets/?utm_campaign=NLP%20News&utm_medium=email&utm_source=Revue%20newsletter
Not just in practice (looking at which heads are actually activated): even in principle, it might not be the case.
What about the number of parameters in BERT: is it too few, enough, or too many for the usual tasks?
BERT is heavily overparametrized.
In our experiments we disabled only one head at a time, and the fact that in most cases the model performance did not suffer suggests that many heads have functional duplicates, i.e. disabling one head would not harm the model because the same information is available elsewhere.
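A hedged sketch of that kind of head-ablation experiment, assuming the Hugging Face transformers library (which exposes a prune_heads helper); the layer and head indices below are arbitrary:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# Remove head 0 of layer 2 and heads 1 and 3 of layer 5 (indices chosen arbitrarily here),
# then re-evaluate the model on the task to see whether accuracy actually drops.
model.prune_heads({2: [0], 5: [1, 3]})
```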
In general, do the weights change that much during fine-tuning?
While accuracy increases a lot, during fine-tuning the weights do not change that much.
We see that most attention weights do not change all that much, and for most tasks, the last two layers show the most change. These changes do not appear to favor any specific types of meaningful attention patterns.
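A sketch of how one could measure this, assuming two checkpoints are available: the original pretrained BERT and a fine-tuned copy (the local path below is hypothetical):

```python
from transformers import BertModel

pretrained = BertModel.from_pretrained("bert-base-uncased")
finetuned = BertModel.from_pretrained("path/to/finetuned-bert")  # hypothetical local checkpoint

# Relative change per parameter tensor: ||w_finetuned - w_pretrained|| / ||w_pretrained||
for (name, p0), (_, p1) in zip(pretrained.named_parameters(), finetuned.named_parameters()):
    rel_change = ((p1 - p0).norm() / p0.norm()).item()
    print(f"{name}: {rel_change:.4f}")   # typically small, with the last layers changing most
```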

In the transformer context (e.g., BERT and its attention heads), what is a self-attention map?
As a brief example, let’s say we need to create a representation of the sentence “Tom is a black cat”. BERT may choose to pay more attention to “Tom” while encoding the word “cat”, and less attention to the words “is”, “a”, “black”. This could be represented as a vector of weights (for each word in the sentence). Such vectors are computed when the model encodes each word in the sequence, yielding a square matrix which we refer to as the self-attention map.
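A hedged sketch (Hugging Face transformers, bert-base-uncased) of extracting those square self-attention maps for the example sentence, one (seq_len x seq_len) matrix per head per layer:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Tom is a black cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

attentions = outputs.attentions   # tuple with one tensor per layer
print(attentions[0].shape)        # (batch, num_heads, seq_len, seq_len): the self-attention maps
```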
What are the types of self-attention patterns learned by BERT?
The vertical pattern indicates attention to a single token, which usually is either the [SEP] token (special token representing the end of a sentence), or [CLS] (special BERT token that is used as full sequence representation fed to the classifiers).
The diagonal pattern indicates the attention to previous/next words;
The block pattern indicates more-or-less uniform attention to all tokens in a sequence;
The heterogeneous pattern is the only pattern that theoretically could correspond to anything like meaningful relations between parts of the input sequence (although not necessarily so).

How many heads/layers does BERT actually use at inference time? A lot? Do they differ for different tasks?
Again, BERT is probably overparametrized!
It is clear that while the overall pattern varies between tasks, on average we are better off removing a random head - including those that we identified as encoding meaningful information that should be relevant for most tasks.
Many of the heads can also be switched off without any effect on performance, again pointing at the fact that even the base BERT is severely overparametrized.
In actual fact, does BERT need a lot of pre-training for the usual tasks it is used for, like the GLUE ones?
In other words, does it need a lot of linguistic knowledge?
BERT does not need to be all that smart for these tasks. The fact that BERT can do so well on most GLUE tasks without pre-training suggests that, to a large degree, they can be solved without much language knowledge. Instead of verbal reasoning, it may learn to rely on various shortcuts, biases and artifacts in the datasets to arrive at the correct prediction. In that case, its self-attention maps do not necessarily have to be meaningful to us.