GENERAL_ML_CULTURE Flashcards

Notes collected from blogs and posts: where the world of AI is heading, hot topics, etc.

1
Q

What are some things it would be great to do in NLP in 2020?

A
  • Learning from few samples rather than from large datasets
  • Compact and efficient models rather than huge ones
  • Evaluate on at least one other language (from a different language family)
  • New datasets should contain at least one other language
  • Characterize wrong utterances: what is the linguistic cause of the error?
2
Q

NeurIPS 2019

A
3
Q

What is GLUE in NLP?

A

In an April 2018 paper coauthored with collaborators from the University of Washington and DeepMind, the Google-owned artificial intelligence company, Bowman introduced a battery of nine reading-comprehension tasks for computers called GLUE (General Language Understanding Evaluation). The test was designed as “a fairly representative sample of what the research community thought were interesting challenges,” said Bowman, but also “pretty straightforward for humans.” For example, one task asks whether a sentence is true based on information offered in a preceding sentence. If you can tell that “President Trump landed in Iraq for the start of a seven-day visit” implies that “President Trump is on an overseas visit,” you’ve just passed.

4
Q

What is the current status of NLP systems on the GLUE benchmark?

A

State of the art was around 60% before 2018.

Then BERT-style pretrained models pushed scores up rapidly; within about a year the best systems had passed the estimated human baseline, which motivated the harder SuperGLUE benchmark.

5
Q

Who developed GPT and GPT-2?

A

OpenAI

6
Q

In non-technical terms: what are the 3 main ingredients of BERT?

A

A deep pretrained language model, attention, and bidirectionality.

All three existed independently before BERT. But until Google released its recipe in late 2018, no one had combined them in such a powerful way.

7
Q

Say more, in non-technical terms, about bidirectionality in BERT.

A

The third ingredient in BERT’s recipe takes nonlinear reading one step further.

Unlike other pretrained language models, many of which are created by having neural networks read terabytes of text from left to right, BERT’s model reads left to right and right to left at the same time, and learns to predict words in the middle that have been randomly masked from view.

8
Q

In non-technical terms, what is attention?

A

Attention is the ability to figure out which features of a sentence are most important.

Earlier state-of-the-art neural networks suffered from a built-in constraint: they all looked through the sequence of words one by one.

Attention is a mechanism that lets each layer of the network assign more weight to some specific features of the input than to others. This new attention-focused architecture, called a transformer, could take a sentence like “a dog bites the man” as input and encode each word in many different ways in parallel. For example, a transformer might connect “bites” and “man” together as verb and object, while ignoring “a”.

This treelike representation of sentences gave transformers a powerful way to model contextual meaning, and also to efficiently learn associations between words that might be far away from each other in complex sentences.

9
Q

What were the 2 main word embeddings used by older NLP models?

A

Word2Vec and GloVe

10
Q

What is the main limitation of using pre-trained NLP word embeddings?

A

Though these pretrained word embeddings have been immensely influential, they have a major limitation: they only incorporate previous knowledge in the first layer of the model—the rest of the network still needs to be trained from scratch.

Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words.

11
Q

In what sense can ULMFiT, ELMo, the OpenAI GPT and BERT be considered the ImageNet of language?

A

One key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.

They serve as an “ImageNet for language”: a task that enables models to learn higher-level nuances of language, similarly to how ImageNet has enabled training of CV models that learn general-purpose features of images.

12
Q

What are the main aspects that are true both for ImageNet-era deep CV and for BERT & co. in NLP?

A

* Training data is as important as the algorithm

* Transfer learning from networks trained on huge datasets is essential

13
Q

What are some open problems in face recognition?

A

Open problems include multi-camera tracking, re-identification (when someone exits the frame and then re-enters), robustness to occasional camera outages, and automatic multi-camera calibration. Such capabilities will advance significantly in the next few years.

14
Q

What is self-supervised learning? Give an example of a field where it is used a lot.

A

Self-supervised learning is similar to supervised learning, but instead of training the system to map data examples to a classification, we mask some examples and ask the machine to predict the missing pieces. For instance, we might mask some frames of a video and train the machine to fill in the blanks based on the remaining frames.

It is the key to modern NLP: models such as BERT, RoBERTa, XLNet, and XLM are trained in a self-supervised manner to predict words missing from a text.
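A minimal Python sketch of the idea (a hypothetical toy helper, not any particular model's preprocessing): hide a fraction of the tokens and keep the hidden originals as the prediction targets.

import random

def make_self_supervised_example(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide a fraction of the tokens; the hidden originals become the targets."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(mask_token)
            targets.append(tok)       # the model must predict this
        else:
            inputs.append(tok)
            targets.append(None)      # nothing to predict here
    return inputs, targets

print(make_self_supervised_example("the cat sat on the mat".split()))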

15
Q

What is one of the most fascinating things about reinforcement learning?

A

But there’s a problem here: to be able to collect rewards, some “non-special” actions are needed to be taken — you have to walk towards the coins before you can collect them. So an Agent must learn how to handle postponed rewards by learning to link those to the actions that really caused them. In my opinion, this is the most fascinating thing in Reinforcement Learning.

16
Q

What was the main highlight for NLP in 2019?

A

Definitely Transformers

17
Q

What does BERT stand for?

A

BERT stands for Bidirectional Encoder Representations from Transformers.

This model is basically a multi-layer bidirectional Transformer encoder (Devlin, Chang, Lee, & Toutanova, 2019), and there are multiple excellent guides about how it works generally, including the Illustrated Transformer. What we focus on is one specific component of Transformer architecture known as self-attention. In a nutshell, it is a way to weigh the components of the input and output sequences so as to model relations between them, even long-distance dependencies.
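As a rough illustration of that weighting, here is a minimal numpy sketch of scaled dot-product self-attention (toy dimensions and random projection matrices, not BERT's actual parameters):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # how much each token attends to every other token
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                 # new representations + the attention map

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                     # 5 tokens, 8-dimensional embeddings
out, attn = self_attention(X, *(rng.normal(size=(8, 8)) for _ in range(3)))
print(attn.shape)                               # (5, 5): one row of weights per token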

18
Q

Is it certain that BERT's success is due to self-attention?

A

It is still under debate; see the blog post here:

https://text-machine-lab.github.io/blog/2020/bert-secrets/?utm_campaign=NLP%20News&utm_medium=email&utm_source=Revue%20newsletter

Not just in practice (judging by which heads are actually activated): even in principle it might not be the case.

19
Q

Regarding the number of parameters in BERT: is that too few, just enough, or too many for the usual tasks?

A

BERT is heavily overparametrized.

In our experiments we disabled only one head at a time, and the fact that in most cases the model performance did not suffer suggests that many heads have functional duplicates, i.e. disabling one head would not harm the model because the same information is available elsewhere.

20
Q

In general, do weights change much during fine-tuning?

A

While accuracy increases a lot during fine-tuning, the weights do not change that much.

We see that most attention weights do not change all that much, and for most tasks, the last two layers show the most change. These changes do not appear to favor any specific types of meaningful attention patterns.

21
Q

In the context of Transformer heads (e.g. in BERT), what is a self-attention map?

A

As a brief example, let’s say we need to create a representation of the sentence “Tom is a black cat”. BERT may choose to pay more attention to “Tom” while encoding the word “cat”, and less attention to the words “is”, “a”, “black”. This could be represented as a vector of weights (for each word in the sentence). Such vectors are computed when the model encodes each word in the sequence, yielding a square matrix which we refer to as the self-attention map.

22
Q

What are the types of self attention patterns that are learned by BERT?

A

The vertical pattern indicates attention to a single token, which usually is either the [SEP] token (special token representing the end of a sentence), or [CLS] (special BERT token that is used as full sequence representation fed to the classifiers).

The diagonal pattern indicates the attention to previous/next words;

The block pattern indicates more-or-less uniform attention to all tokens in a sequence;

The heterogeneous pattern is the only pattern that theoretically could correspond to anything like meaningful relations between parts of the input sequence (although not necessarily so).

23
Q

How many heads/layers does BERT actually use at inference time? A lot? Do they differ for different tasks?

A

Again, BERT is probably overparametrized!

It is clear that while the overall pattern varies between tasks, on average we are better off removing a random head - including those that we identified as encoding meaningful information that should be relevant for most tasks.

Many of the heads can also be switched off without any effect on performance, again pointing at the fact that even the base BERT is severely overparametrized.

24
Q

In actual fact, does BERT need a lot of pre-training for the usual tasks it is used for, like the GLUE ones?

In other words, does it need a lot of linguistic knowledge?

A

BERT does not need to be all that smart for these tasks. The fact that BERT can do so well on most GLUE tasks without pre-training suggests that, to a large degree, they can be solved without much language knowledge. Instead of verbal reasoning, it may learn to rely on various shortcuts, biases and artifacts in the datasets to arrive at the correct prediction. In that case its self-attention maps do not necessarily have to be meaningful to us.

25
Q

What is model compression in neural networks?

How does it perform?

What are the main limitations?

How may this be solved?

A

Model compression is a technique that shrinks trained neural networks.

Compressed models often perform similarly to the original while using a fraction of the computational resources.

The bottleneck in many applications, however, turns out to be training the original, large neural network before compression.

Training a smaller model from scratch may be the solution.

26
Q

Why is overparametrization sometimes necessary when training a model?

A

Because the model is easier to train with gradient descent (and we can prevent overfitting with regularization).

This is probably because, by sufficiently over-parameterizing our neural networks, we make the optimization landscape effectively convex.

27
Q

What are the 4 main techniques that model compression uses to achieve compression?

A

Many weights are close to zero (Pruning)

Weight matrices are low rank (Weight Factorization)

Weights can be represented with only a few bits (Quantization)

Layers typically learn similar functions (Weight Sharing)
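As an example of the first technique listed above, here is a toy magnitude-pruning sketch (an illustration of the idea only, not a production compression recipe):

import numpy as np

def magnitude_prune(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights of a trained weight matrix."""
    threshold = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) < threshold, 0.0, W)

W = np.random.randn(256, 256)
W_pruned = magnitude_prune(W, sparsity=0.9)
print((W_pruned == 0).mean())  # roughly 0.9 of the weights are now zero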

28
Q

What type of NN are LSTMs part of?

A

Recurrent neural network

29
Q

What is the most important type of RNN?

A

The LSTM; it is the main reason RNNs are so widely used.

30
Q

What was an essential part of the seq2seq architecture that revolutionized machine translation?

A

Attention

31
Q

What is the basic structure of a seq2seq model?

A

It is basically made up of an encoder and a decoder. These are usually RNNs.
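A minimal PyTorch sketch of that encoder-decoder shape (class and argument names are made up; no attention, teacher forcing assumed):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder's final hidden state seeds the decoder."""
    def __init__(self, src_vocab, tgt_vocab, emb=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.src_emb(src_ids))        # context = last hidden state
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_states)                              # logits over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # (batch=2, tgt_len=5, tgt_vocab=1000)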

32
Q

Explain in detail how a seq2seq model for machine translation works, without attention.

A

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

33
Q

Explain in detail how a simple seq2seq model for machine translation works, with attention.

A

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

34
Q

What does the Transformer improve the most: the accuracy/ability to learn, or the training speed of NNs?

A

The Transformer is a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how the Transformer lends itself to parallelization.

35
Q

Where does the ELMo model come from?

A

Allen Institute for AI

36
Q

What are “Transformer models”?

A

Transformers are a type of neural network architecture that has been gaining popularity. Transformers were recently used by OpenAI in their language models; they are sequence-to-sequence models.

37
Q

What is the structure of the Transformer architecture?

A

It is made of an encoder stack, in which each layer consists of a self-attention sub-layer followed by a feed-forward (linear) network, and a decoder stack whose layers are very similar but contain an additional encoder-decoder attention sub-layer.

38
Q

What is the difference in structure (constituents) between the encoder and the decoder in a Transformer architecture?

A

Essentially they are the same, except that each decoder layer has an additional encoder-decoder attention sub-layer.

39
Q

Describe the multi-headed attention introduced in the “Attention Is All You Need” paper and used in BERT, etc.

A

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself.

It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized.
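A compact PyTorch sketch of the idea (hypothetical module, toy sizes; the single combined Q/K/V projection is a common implementation shortcut, and masking and dropout are omitted for brevity):

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Split the model dimension into several heads, each attending independently."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # all heads' Q, K, V at once
        self.proj = nn.Linear(d_model, d_model)      # recombine the heads

    def forward(self, x):                            # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, d_head) so each head gets its own subspace
        split = lambda t: t.reshape(b, s, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        return self.proj(out)

x = torch.randn(2, 10, 512)
print(MultiHeadSelfAttention()(x).shape)  # (2, 10, 512)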

40
Q

What is positional encoding and how is it used in the Transformer?

A

One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence.
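For concreteness, here is the fixed sinusoidal variant from the original paper sketched in numpy (the learned-positional-embedding alternative is just a trainable lookup table); sizes here are toy values.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sin/cos pattern from 'Attention Is All You Need'; added to the input embeddings (d_model must be even)."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

embeddings = np.random.randn(50, 512)                 # 50 tokens, d_model = 512
inputs = embeddings + sinusoidal_positional_encoding(50, 512)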

41
Q

Do Transformers have residual connections in their architecture (ResNet-style)?

A

YES

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
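A small sketch of that sub-layer wrapper (post-norm, as in the original Transformer; class and variable names are made up):

import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection around a sub-layer, followed by layer normalization."""
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))   # residual add, then layer norm

# one encoder layer = self-attention sub-layer + feed-forward sub-layer, each wrapped this way
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
x = torch.randn(2, 10, 512)
x = SublayerConnection()(x, ffn)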

42
Q

How does the Transformer turn the output of the last decoder layer into words?

A

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
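A toy sketch of that final projection step (made-up vocabulary and sizes; greedy picking rather than beam search):

import torch
import torch.nn as nn

vocab = ["a", "dog", "bites", "the", "man", "<eos>"]        # toy "output vocabulary"
d_model, vocab_size = 512, len(vocab)

decoder_output = torch.randn(d_model)                       # vector from the top decoder layer
logits = nn.Linear(d_model, vocab_size)(decoder_output)     # one score per vocabulary word
probs = torch.softmax(logits, dim=-1)                       # scores -> probabilities (sum to 1)
print(vocab[probs.argmax().item()])                         # word chosen for this time step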

43
Q

How does the decoder side of a Transformer work?

A

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence:

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

44
Q

How do the encoder-decoder attention units work?

A

An attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following (a code sketch follows):

Look at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence.

Give each hidden state a score (let's ignore how the scoring is done for now).

Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.
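Those three steps in a minimal numpy sketch (dot-product scoring is assumed for simplicity; real models may use other scoring functions):

import numpy as np

def attend(decoder_state, encoder_states):
    """Score each encoder hidden state, softmax the scores, take the weighted sum (context vector)."""
    scores = encoder_states @ decoder_state               # dot-product scoring
    weights = np.exp(scores) / np.exp(scores).sum()       # softmax
    return weights @ encoder_states, weights              # context vector + attention weights

encoder_states = np.random.randn(6, 256)   # one hidden state per input word
decoder_state = np.random.randn(256)       # current decoder hidden state
context, weights = attend(decoder_state, encoder_states)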

45
Q

Within the Transformer framework, what is BERT?

A

The paper presents two model sizes for BERT:

BERT BASE – Comparable in size to the OpenAI Transformer in order to compare performance

BERT LARGE – A ridiculously huge model which achieved the state of the art results reported in the paper

BERT is basically a trained Transformer Encoder stack.

46
Q

What was the radical difference between previous embeddings like GloVe and word2vec compared to ELMo?

What kind of network is it based on?

A

Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding. It uses a bi-directional LSTM trained on a specific task to be able to create those embeddings.

47
Q

How is the ELMo embedding trained?

A

ELMo gained its language understanding from being trained to predict the next word in a sequence of words - a task called language modeling. This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.

ELMo comes up with the contextualized embedding by grouping together the hidden states of a bi-directional LSTM (and the initial embedding) in a certain way (concatenation followed by weighted summation).
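A sketch of the weighted-summation step (assuming layer_states already holds the concatenated forward/backward states per layer; the softmax-normalized per-layer weights and gamma scaling follow the ELMo paper's description, but this is an illustration, not the reference implementation):

import numpy as np

def elmo_style_embedding(layer_states, task_weights, gamma=1.0):
    """Task-specific, softmax-weighted sum of the biLSTM layer outputs."""
    w = np.exp(task_weights) / np.exp(task_weights).sum()   # one weight per layer
    return gamma * sum(wi * h for wi, h in zip(w, layer_states))

layers = [np.random.randn(7, 1024) for _ in range(3)]  # initial embedding + 2 biLSTM layers, 7 tokens
emb = elmo_style_embedding(layers, task_weights=np.zeros(3))
print(emb.shape)  # (7, 1024): one contextualized vector per token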

48
Q

Very broadly, what did ULM-FiT add to NLP progress?

A

ULM-FiT introduced methods to effectively utilize a lot of what the model learns during pre-training – more than just embeddings, and more than contextualized embeddings. ULM-FiT introduced a language model and a process to effectively fine-tune that language model for various tasks.

NLP finally had a way to do transfer learning probably as well as Computer Vision could.

49
Q

What was the OpenAI Transformer?

A

A pre-trained model for transfer learning to other tasks!

It turns out we don’t need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. We can do with just the decoder of the transformer. The decoder is a good choice because it’s a natural choice for language modeling (predicting the next word) since it’s built to mask future tokens – a valuable feature when it’s generating a translation word by word.

50
Q

What was the problem with the OpenAI Transformer as a pre-trained model for transfer learning to other NLP tasks?

A

It uses a decoder that can only see previous tokens (it is trained, as in machine translation, to write words one by one). The OpenAI Transformer gave us a fine-tunable pre-trained model based on the Transformer, but something went missing in this transition from LSTMs to Transformers: ELMo's language model was bi-directional, while the OpenAI Transformer only trains a forward language model. Could we build a transformer-based model whose language model looks both forwards and backwards (in the technical jargon, "is conditioned on both left and right context")?

BERT is the answer

51
Q

What was the intuition of BERT compared to previous models for transfer learning and embeddings in NLP, for example the OpenAI Transformer?

A

BERT pre-trains the Transformer encoder instead of the decoder. This gives a model that knows about both the previous and the following context.

But with a bidirectional encoder, each word could effectively see itself through the surrounding context, so plain next-word prediction no longer works as a training task. So what can you do?

Finding the right task to train a Transformer stack of encoders is a complex hurdle that BERT resolves by adopting a “masked language model” concept from earlier literature (where it is called a Cloze task).

Beyond masking 15% of the input, BERT also mixes things up a bit in order to improve how the model later fine-tunes. Sometimes it randomly replaces a word with another word and asks the model to predict the correct word in that position.
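A toy sketch of that masking scheme (the 80% [MASK] / 10% random replacement / 10% unchanged split of the selected tokens is the rule described in the BERT paper; the helper itself is made up):

import random

def bert_style_mask(tokens, vocab, select_rate=0.15):
    """Of the selected tokens: 80% -> [MASK], 10% -> random word, 10% kept as-is; all must be predicted."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < select_rate:
            labels[i] = tok                          # the model must recover the original token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # random replacement
            # else: leave the token unchanged, but still predict it
    return corrupted, labels

vocab = "a dog bites the man cat sat on mat".split()
print(bert_style_mask("the dog bites the man".split(), vocab))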

52
Q

How can you use pre-trained BERT to create word embeddings?

A

You can use the pre-trained BERT to create contextualized word embeddings, then feed these embeddings to your existing model, a process the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition.

Which layer gives the best embeddings? It depends on the task.
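One common way to do this in practice, using the Hugging Face transformers library (a sketch; the model name and the last-four-layers combination are just one of the options explored in the paper's feature-based experiments):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Tom is a black cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_layer = outputs.last_hidden_state     # (1, num_tokens, 768): one vector per token
all_layers = outputs.hidden_states         # tuple: input embeddings + one tensor per layer
# e.g. combine the last four layers (summing here; the paper also tries concatenation)
features = torch.stack(all_layers[-4:]).sum(dim=0)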