GENERAL_ML_CULTURE Flashcards
Here we collect info from blogs and posts: where the world of AI is going, hot topics, etc.
What are things it would be great to do in NLP in 2020?
- Learning from few samples rather than from large datasets
- Compact and efficient rather than huge models
- Evaluate on at least another language (from a different language family)
- New datasets should contain at least one other language
- Characterize wrong utterances: what is the linguistic cause of the errors?
NeurIPS 2019
What is GLUE in NLP?
In an April 2018 paper coauthored with collaborators from the University of Washington and DeepMind, the Google-owned artificial intelligence company, Bowman introduced a battery of nine reading-comprehension tasks for computers called GLUE (General Language Understanding Evaluation). The test was designed as “a fairly representative sample of what the research community thought were interesting challenges,” said Bowman, but also “pretty straightforward for humans.” For example, one task asks whether a sentence is true based on information offered in a preceding sentence. If you can tell that “President Trump landed in Iraq for the start of a seven-day visit” implies that “President Trump is on an overseas visit,” you’ve just passed.
What is the current status of NLP systems on the GLUE challenge?
State of the art was around 60% before 2018.
Then BERT and its successors arrived in late 2018 and scores rose rapidly; by 2019 the best systems had surpassed the human baseline on GLUE.
Who developed GPT and GPT-2?
OpenAI
In non-technical terms: what are the 3 main ingredients of BERT?
A deep pretrained language model, attention, and bidirectionality.
They existed independently before BERT, but until Google released its recipe in late 2018, no one had combined them in such a powerful way.
Tell me more, in non-technical terms, about bidirectionality in BERT.
Finally, the third ingredient in BERT’s recipe takes nonlinear reading one step further.
Unlike other pretrained language models, many of which are created by having neural networks read terabytes of text from left to right, BERT’s model reads left to right and right to left at the same time, and learns to predict words in the middle that have been randomly masked from view.
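A minimal sketch of this masked-word prediction in action, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint are available:

```python
# Minimal sketch of BERT-style masked-word prediction, assuming the
# Hugging Face `transformers` library and the `bert-base-uncased` checkpoint.
from transformers import pipeline

# We supply a sentence with a [MASK] token; BERT, having read the context
# on BOTH sides of the gap, proposes the most likely missing words.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for candidate in unmasker("The dog [MASK] the man on the street."):
    print(f'{candidate["token_str"]:>10}  score={candidate["score"]:.3f}')
```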
In non-technical terms, what is attention?
Attention is the ability to figure out which features of a sentence are most important.
Earlier state-of-the-art neural networks also suffered from a built-in constraint: they all looked through the sequence of words one by one.
Attention is a mechanism that lets each layer of the network assign more weight to some specific features of the input than to others. This new attention-focused architecture, called a transformer, could take a sentence like “a dog bites the man” as input and encode each word in many different ways in parallel. For example, a transformer might connect “bites” and “man” together as verb and object, while ignoring “a”.
This treelike representation of sentences gave transformers a powerful way to model contextual meaning, and also to efficiently learn associations between words that might be far away from each other in complex sentences.
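As a rough sketch of the idea (not the actual transformer implementation), single-head scaled dot-product attention can be written in a few lines of NumPy; the token vectors below are random placeholders:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy single-head attention: each row of Q asks 'which other positions
    matter to me?', softmax turns the answers into weights, and the output
    is a weighted mix of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # pairwise similarity of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V, weights

# 4 tokens ("a", "dog", "bites", "man") with random 8-dim representations
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)
print(w.round(2))   # each row sums to 1: how much each token attends to the others
```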
What were the 2 main word embeddings in the old NLP models?
Word2Vec and GloVe
What is the main limitation of using pre-trained NLP word embeddings?
Though these pretrained word embeddings have been immensely influential, they have a major limitation: they only incorporate previous knowledge in the first layer of the model—the rest of the network still needs to be trained from scratch.
Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words.
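To make the limitation concrete, here is a hedged PyTorch sketch of the old style: only the embedding layer is initialized from pretrained vectors (the `pretrained_vectors` tensor below is a random stand-in for GloVe/word2vec), while the encoder and classifier above it start from scratch:

```python
import torch
import torch.nn as nn

# Stand-in for a real pretrained lookup table of shape (vocab, dim).
vocab_size, embed_dim, hidden_dim, num_classes = 10_000, 300, 128, 2
pretrained_vectors = torch.randn(vocab_size, embed_dim)

class OldStyleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Only this first layer benefits from pretraining...
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        # ...while the encoder and classifier are randomly initialized and
        # must learn to compose word meanings from scratch.
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, (h_n, _) = self.encoder(embedded)
        return self.classifier(h_n[-1])

model = OldStyleClassifier()
logits = model(torch.randint(0, vocab_size, (4, 20)))  # batch of 4 sentences, 20 tokens
print(logits.shape)                                    # torch.Size([4, 2])
```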
In what sense can ULMFiT, ELMo, OpenAI GPT, and BERT be considered the ImageNet for language?
The key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.
“ImageNet for language”—that is, a task that enables models to learn higher-level nuances of language, similarly to how ImageNet has enabled training of CV models that learn general-purpose features of images.
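By contrast, a hedged sketch of the new paradigm with the Hugging Face transformers library: the whole pretrained encoder is loaded and only a small task head is new (the model name and label count are illustrative):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Every layer below the small classification head comes with pretrained
# weights; fine-tuning then adjusts the full hierarchy of features,
# not just a word-embedding lookup table.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

batch = tokenizer(["President Trump is on an overseas visit."],
                  return_tensors="pt", padding=True)
outputs = model(**batch)
print(outputs.logits.shape)   # torch.Size([1, 2]); train further with task labels
```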
What are the main aspects that are true both for ImageNet deep CV models and for BERT and co. in NLP?
* training data is as important as the algorithm
* transfer learning from networks trained on huge datasets is essential
What are some open problems in face recognition?
Open problems include multi-camera tracking, re-identification (when someone exits the frame and then re-enters), robustness to occasional camera outages, and automatic multi-camera calibration. Such capabilities will advance significantly in the next few years.
What is self-supervised learning? Give an example of a field where it is used a lot.
Self-supervised learning is similar to supervised learning, but instead of training the system to map data examples to a classification, we mask some examples and ask the machine to predict the missing pieces. For instance, we might mask some frames of a video and train the machine to fill in the blanks based on the remaining frames.
It is key to modern NLP: models such as BERT, RoBERTa, XLNet, and XLM are trained in a self-supervised manner to predict words missing from a text.
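A minimal sketch of how such a self-supervised objective can be set up on plain text (the 15% masking rate mirrors BERT's recipe; everything else here is illustrative):

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def make_masked_example(tokens, mask_prob=MASK_PROB):
    """Self-supervision needs no human labels: the text itself supplies them.
    We hide a fraction of tokens and keep the originals as the targets."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # the model must reconstruct this token
        else:
            inputs.append(tok)
            labels.append(None)     # no loss computed at this position
    return inputs, labels

random.seed(3)
print(make_masked_example("the dog bites the man on the street".split()))
```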
What is one of the most fascinating things about reinforcement learning?
But there’s a problem here: to be able to collect rewards, some “non-special” actions need to be taken — you have to walk towards the coins before you can collect them. So an agent must learn how to handle postponed rewards by learning to link them to the actions that really caused them. In my opinion, this is the most fascinating thing in Reinforcement Learning.
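A tiny sketch of how this delayed-credit problem is usually formalized: the reward collected at the end is propagated back to the earlier "non-special" actions through a discounted return (gamma = 0.99 is just a common choice):

```python
def discounted_returns(rewards, gamma=0.99):
    """Each step's return includes all future rewards, discounted by gamma,
    so an early 'non-special' action (walking toward the coin) still gets
    credit for the reward collected several steps later."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# No reward while walking, then the coin is collected on the last step.
print(discounted_returns([0, 0, 0, 0, 1.0]))
# The first action already has a non-zero return, linking it to the reward.
```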
What was the main highlight for NLP in 2019?
Definitely Transformers
What does BERT stand for?
BERT stands for Bidirectional Encoder Representations from Transformers.
This model is basically a multi-layer bidirectional Transformer encoder (Devlin, Chang, Lee, & Toutanova, 2019), and there are multiple excellent guides about how it works generally, including the Illustrated Transformer. What we focus on is one specific component of Transformer architecture known as self-attention. In a nutshell, it is a way to weigh the components of the input and output sequences so as to model relations between them, even long-distance dependencies.
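To inspect these self-attention weights yourself, one option (a sketch assuming the Hugging Face transformers library) is to ask the model to return its attention maps at inference time:

```python
import torch
from transformers import BertModel, BertTokenizer

# Load BERT and request the attention maps of every layer and head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("a dog bites the man", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of 12 layers, each of shape (batch, heads, seq, seq)
print(len(outputs.attentions), outputs.attentions[0].shape)
```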
Is it certain that BERT's success is due to self-attention?
It is still under debate; see the blog post here:
https://text-machine-lab.github.io/blog/2020/bert-secrets/
Not just in practice (looking at which heads are activated), but even in principle it might not be the case.
Regarding the number of parameters in BERT: is it too few, just enough, or too many for the usual tasks?
BERT is heavily overparametrized.
In our experiments we disabled only one head at a time, and the fact that in most cases the model performance did not suffer suggests that many heads have functional duplicates, i.e. disabling one head would not harm the model because the same information is available elsewhere.
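The experiments in the post disable heads at evaluation time; a related, easy-to-reproduce ablation (not necessarily what the authors did) is the head-pruning utility in the transformers library. The choice of heads below is arbitrary, for illustration:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Remove two heads in layer 0 and one head in layer 11; if the observation
# above holds, downstream performance should barely change because other
# heads carry duplicate information.
model.prune_heads({0: [0, 1], 11: [5]})

print(model.config.pruned_heads)   # records which heads were disabled
```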
In general, do weights change that much during fine-tuning?
While accuracy increases a lot during fine-tuning, the weights do not change that much.
We see that most attention weights do not change all that much, and for most tasks, the last two layers show the most change. These changes do not appear to favor any specific types of meaningful attention patterns.
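One hedged way to check this yourself: load the original checkpoint and a checkpoint you fine-tuned (the `./my-finetuned-bert` path below is hypothetical) and compare the attention weight matrices layer by layer:

```python
import torch
from transformers import BertModel

pretrained = BertModel.from_pretrained("bert-base-uncased")
# Hypothetical path to a checkpoint you fine-tuned yourself.
finetuned = BertModel.from_pretrained("./my-finetuned-bert")

for (name, p_before), (_, p_after) in zip(
    pretrained.named_parameters(), finetuned.named_parameters()
):
    if "attention" in name and name.endswith("weight"):
        cos = torch.nn.functional.cosine_similarity(
            p_before.flatten(), p_after.flatten(), dim=0
        )
        print(f"{name}: cosine similarity {cos.item():.4f}")
# Values close to 1.0 indicate that fine-tuning barely moved these weights.
```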