GENERAL_ML_CULTURE Flashcards
Here we collect info from blogs and posts: where the world of AI is going, hot topics, etc.
What are some things it would be great to do in NLP in 2020?
- Learning from few samples rather than from large datasets
- Compact and efficient rather than huge models
- Evaluate on at least another language (from a different language family)
- New datasets should contain at least one other language
- Characterize wrong utterances: what is the linguistic cause of the error?
NeurIPS 2019
What is GLUE in NLP?
In an April 2018 paper coauthored with collaborators from the University of Washington and DeepMind, the Google-owned artificial intelligence company, Bowman introduced a battery of nine reading-comprehension tasks for computers called GLUE (General Language Understanding Evaluation). The test was designed as “a fairly representative sample of what the research community thought were interesting challenges,” said Bowman, but also “pretty straightforward for humans.” For example, one task asks whether a sentence is true based on information offered in a preceding sentence. If you can tell that “President Trump landed in Iraq for the start of a seven-day visit” implies that “President Trump is on an overseas visit,” you’ve just passed.
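As a toy illustration of that entailment-style judgment (not GLUE itself), here is a minimal sketch assuming the Hugging Face transformers library and the publicly available roberta-large-mnli checkpoint:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical choice of checkpoint: a RoBERTa model trained on MNLI-style entailment.
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = "President Trump landed in Iraq for the start of a seven-day visit."
hypothesis = "President Trump is on an overseas visit."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The model scores contradiction / neutral / entailment for the sentence pair.
print(model.config.id2label[logits.argmax().item()])  # expected: something like "ENTAILMENT"
```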
What is the current status of NLP systems on the GLUE challenge?
State of the art was around 60% before 2018. Then, after BERT and its successors arrived in late 2018, scores rose rapidly and soon surpassed the human baseline.
Who developed GPT and GPT-2?
OpenAI
In non-technical terms: what are the 3 main ingredients of BERT?
A deep pretrained language model, attention, and bidirectionality.
They existed independently before BERT, but until Google released its recipe in late 2018, no one had combined them in such a powerful way.
Tell me more, in non-technical terms, about bidirectionality in BERT.
Finally, the third ingredient in BERT’s recipe takes nonlinear reading one step further.
Unlike other pretrained language models, many of which are created by having neural networks read terabytes of text from left to right, BERT’s model reads left to right and right to left at the same time, and learns to predict words in the middle that have been randomly masked from view.
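A minimal sketch of that masked-word objective, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; the model uses context on both sides of the mask when ranking candidates:

```python
from transformers import pipeline

# Fill-mask pipeline wraps BERT's masked-language-model head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT reads the words BEFORE and AFTER [MASK] at the same time to rank candidates.
for candidate in fill_mask("The doctor told the [MASK] to take the medicine twice a day."):
    print(candidate["token_str"], round(candidate["score"], 3))
```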
In non-technical terms, what is attention?
Attention is the ability to figure out which features of a sentence are most important.
Before attention, state-of-the-art neural networks suffered from a built-in constraint: they all read through the sequence of words one by one.
Attention is a mechanism that lets each layer of the network assign more weight to some specific features of the input than to others. This new attention-focused architecture, called a transformer, could take a sentence like “a dog bites the man” as input and encode each word in many different ways in parallel. For example, a transformer might connect “bites” and “man” together as verb and object, while ignoring “a”.
This treelike representation of sentences gave transformers a powerful way to model contextual meaning, and also to efficiently learn associations between words that might be far away from each other in complex sentences.
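For intuition, here is a toy sketch (plain NumPy, made-up random vectors, not the actual transformer code) of the scaled dot-product self-attention that produces those per-word weightings:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # project every token three ways
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # how much each token "looks at" each other token
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                   # weighted mix of values + the attention map

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens ("a dog bites the man"), 8-dim vectors
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(attn.shape)                                 # (5, 5): one weight per (token, token) pair
```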
What were the 2 main word embeddings used by the old NLP models?
Word2Vec and GloVe
What is the main limitation of using pre-trained NLP word embedding?
Though these pretrained word embeddings have been immensely influential, they have a major limitation: they only incorporate previous knowledge in the first layer of the model—the rest of the network still needs to be trained from scratch.
Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words.
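A small hypothetical PyTorch classifier makes the limitation concrete: only the first (embedding) layer receives pretrained knowledge, while everything above it still trains from scratch. The shapes and stand-in vectors below are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, num_classes = 10_000, 300, 2
pretrained_vectors = torch.randn(vocab_size, emb_dim)   # stand-in for Word2Vec/GloVe vectors

model = nn.Sequential(
    nn.EmbeddingBag(vocab_size, emb_dim),  # layer 0: the only place pretrained knowledge enters
    nn.Linear(emb_dim, 128),               # everything from here up still learns from scratch
    nn.ReLU(),
    nn.Linear(128, num_classes),
)
model[0].weight.data.copy_(pretrained_vectors)  # transfer learning stops at the first layer
```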
In what sense can ULMFiT, ELMo, the OpenAI GPT, and BERT be considered the ImageNet of language?
one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.
“ImageNet for language”—that is, a task that enables models to learn higher-level nuances of language, similarly to how ImageNet has enabled training of CV models that learn general-purpose features of images.
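By contrast, here is a hedged sketch of the newer paradigm, assuming the Hugging Face transformers library: all pretrained transformer layers are reused, and only a small task head is new:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Every transformer layer arrives pretrained; only the small classification head on top
# is newly initialized, and the whole stack is then fine-tuned on the downstream task.
```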
What are the main aspects that are true both for ImageNet-based deep CV and for BERT & co. in NLP?
* training data are as important as the algorithm
* transfer learning from networks trained on huge datasets is essential
What are some open problems in face recognition?
Open problems include multi-camera tracking, re-identification (when someone exits the frame and then re-enters), robustness to occasional camera outages, and automatic multi-camera calibration. Such capabilities will advance significantly in the next few years.
What is self-supervised learning? Give an example of a field where it is used a lot.
self-supervised learning. It’s similar to supervised learning, but instead of training the system to map data examples to a classification, we mask some examples and ask the machine to predict the missing pieces. For instance, we might mask some frames of a video and train the machine to fill in the blanks based on the remaining frames.
It is the key to NLP: models such as BERT, RoBERTa, XLNet, and XLM are trained in a self-supervised manner to predict words missing from a text.
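A toy sketch of the idea: hide part of the data and ask the model to reconstruct it. The helper and masking rate below are illustrative only, loosely following BERT's ~15% masking:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide ~mask_rate of the tokens; the hidden originals become the training targets."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(mask_token)   # what the model sees
            targets.append(tok)         # what it must reconstruct
        else:
            inputs.append(tok)
            targets.append(None)        # no loss on unmasked positions
    return inputs, targets

print(mask_tokens("the cat sat on the mat".split()))
```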
What is one of the most fascinating things about reinforcement learning?
But there’s a problem here: to be able to collect rewards, some “non-special” actions need to be taken: you have to walk towards the coins before you can collect them. So an agent must learn how to handle postponed rewards by learning to link them to the actions that really caused them. In my opinion, this is the most fascinating thing in Reinforcement Learning.
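One standard way this credit assignment is formalized is the discounted return, sketched below with made-up rewards: a coin collected late still credits the earlier "walking" steps through the discount factor gamma:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# No reward while walking, then a coin at the end: the earlier steps still receive credit.
print(discounted_returns([0, 0, 0, 0, 1]))   # [~0.96, ~0.97, ~0.98, 0.99, 1.0]
```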
What was the main highlight for NLP in 2019?
Definitely Transformers
What does BERT stand for?
BERT stands for Bidirectional Encoder Representations from Transformers.
This model is basically a multi-layer bidirectional Transformer encoder (Devlin, Chang, Lee, & Toutanova, 2019), and there are multiple excellent guides about how it works generally, including the Illustrated Transformer. What we focus on is one specific component of Transformer architecture known as self-attention. In a nutshell, it is a way to weigh the components of the input and output sequences so as to model relations between them, even long-distance dependencies.
Is it certain that BERT's success is due to self-attention?
It is still under debate; see the blog post here:
https://text-machine-lab.github.io/blog/2020/bert-secrets/?utm_campaign=NLP%20News&utm_medium=email&utm_source=Revue%20newsletter
Not just in practice (looking at which heads are actually activated): even in principle, it might not be the case.
What about the number of parameters in BERT: is it too few, enough, or too many for the usual tasks?
BERT is heavily overparametrized.
In our experiments we disabled only one head at a time, and the fact that in most cases the model performance did not suffer suggests that many heads have functional duplicates, i.e. disabling one head would not harm the model because the same information is available elsewhere.
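A hedged sketch of that kind of head-ablation experiment, assuming the Hugging Face transformers library (which exposes a prune_heads helper); the layer and head indices below are arbitrary:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# Remove head 0 of layer 2 and heads 1 and 3 of layer 5 (indices chosen arbitrarily here),
# then re-evaluate the model on the task to see whether accuracy actually drops.
model.prune_heads({2: [0], 5: [1, 3]})
```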
In general, do the weights change that much during fine-tuning?
While accuracy increases a lot, during fine-tuning the weights do not change that much.
We see that most attention weights do not change all that much, and for most tasks, the last two layers show the most change. These changes do not appear to favor any specific types of meaningful attention patterns.
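A sketch of how one could measure this, assuming two checkpoints are available: the original pretrained BERT and a fine-tuned copy (the local path below is hypothetical):

```python
from transformers import BertModel

pretrained = BertModel.from_pretrained("bert-base-uncased")
finetuned = BertModel.from_pretrained("path/to/finetuned-bert")  # hypothetical local checkpoint

# Relative change per parameter tensor: ||w_finetuned - w_pretrained|| / ||w_pretrained||
for (name, p0), (_, p1) in zip(pretrained.named_parameters(), finetuned.named_parameters()):
    rel_change = ((p1 - p0).norm() / p0.norm()).item()
    print(f"{name}: {rel_change:.4f}")   # typically small, with the last layers changing most
```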

In the transformer context (e.g., BERT and its attention heads), what is a self-attention map?
As a brief example, let’s say we need to create a representation of the sentence “Tom is a black cat”. BERT may choose to pay more attention to “Tom” while encoding the word “cat”, and less attention to the words “is”, “a”, “black”. This could be represented as a vector of weights (for each word in the sentence). Such vectors are computed when the model encodes each word in the sequence, yielding a square matrix which we refer to as the self-attention map.
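A hedged sketch (Hugging Face transformers, bert-base-uncased) of extracting those square self-attention maps for the example sentence, one (seq_len x seq_len) matrix per head per layer:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Tom is a black cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

attentions = outputs.attentions   # tuple with one tensor per layer
print(attentions[0].shape)        # (batch, num_heads, seq_len, seq_len): the self-attention maps
```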
What are the types of self-attention patterns learned by BERT?
The vertical pattern indicates attention to a single token, which usually is either the [SEP] token (special token representing the end of a sentence), or [CLS] (special BERT token that is used as full sequence representation fed to the classifiers).
The diagonal pattern indicates the attention to previous/next words;
The block pattern indicates more-or-less uniform attention to all tokens in a sequence;
The heterogeneous pattern is the only pattern that theoretically could correspond to anything like meaningful relations between parts of the input sequence (although not necessarily so).

How many heads/layers does BERT actually use at inference time? A lot? Do they differ for different tasks?
Again, BERT is probably overparametrized!
It is clear that while the overall pattern varies between tasks, on average we are better off removing a random head - including those that we identified as encoding meaningful information that should be relevant for most tasks.
Many of the heads can also be switched off without any effect on performance, again pointing at the fact that even the base BERT is severely overparametrized.
In actual fact, does BERT need a lot of pre-training for the usual tasks it is used for, like the GLUE ones?
In other words, does it need a lot of linguistic knowledge?
BERT does not need to be all that smart for these tasks. The fact that BERT can do so well on most GLUE tasks without pre-training suggests that, to a large degree, they can be solved without much language knowledge. Instead of verbal reasoning, it may learn to rely on various shortcuts, biases and artifacts in the datasets to arrive at the correct prediction. In that case, its self-attention maps do not necessarily have to be meaningful to us.