CS4051 Natural Language Processing Flashcards
What does it mean for an LLM to exhibit priming effects?
After seeing a sentence of a particular structure, the LLM is less surprised by an upcoming sentence of a similar structure
Define the following terms: Morphology, Pragmatics, Semantics.
Morphology is the study of the internal structure of words, i.e. how words are built up from smaller meaningful units such as stems and affixes.
Pragmatics is the study of meaning in context.
Semantics is the study of literal meaning, reference, and truth conditions.
What does Zipf’s Law state and what does it imply in model development?
Zipf’s law states that the frequency of a word within a corpus is inversely proportional to its frequency rank, implying that it is important for a model to be trained on a representative corpus, so that rarer words within the test set do not surprise the model.
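A minimal sketch of checking Zipf’s law empirically (corpus.txt is a hypothetical local file standing in for any large tokenised corpus):

```python
from collections import Counter

# Hypothetical corpus file; any large, representative tokenised corpus works.
tokens = open("corpus.txt").read().lower().split()
counts = Counter(tokens)

# Under Zipf's law, frequency is roughly proportional to 1/rank,
# so freq * rank should stay in the same order of magnitude.
for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    print(f"{rank:>3}  {word:<15} freq={freq:<8} freq*rank={freq * rank}")
```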
Define the following terms: Lexicon, Homonym, Word senses.
A lexicon is a collection of lexical entries (words).
Homonyms are words which have the same spelling/pronunciation but different meanings, for example “saw” (the cutting tool) and “saw” (the past tense of see).
Word senses are all the different meanings of a word, for example the senses of “saw” would include the meanings mentioned above.
What are stopwords (give an example) and how can they be identified in a corpus?
Stopwords are commonly occurring words which are usually irrelevant to most NLP tasks, for example “of”, “the”, “a”. They can usually be identified as the terms with the highest frequency in a corpus, or by consulting a pre-computed list of stopwords.
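A short sketch of both identification strategies (corpus.txt is a hypothetical corpus file; the NLTK list is optional):

```python
from collections import Counter

tokens = open("corpus.txt").read().lower().split()  # hypothetical corpus file

# Strategy 1: the highest-frequency terms are usually function words ("the", "of", "a").
stopword_candidates = {w for w, _ in Counter(tokens).most_common(30)}

# Strategy 2: a pre-computed list, e.g. NLTK's (requires nltk.download("stopwords")).
# from nltk.corpus import stopwords
# stopword_candidates = set(stopwords.words("english"))

content_tokens = [w for w in tokens if w not in stopword_candidates]
```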
Identify and explain the text pre-processing technique concerned with breaking down text into larger chunks of words?
Sentence splitting consists of breaking the corpus down into sentences. A good heuristic used is to look at punctuation symbols. Abbreviation dictionaries can also be used to determine whether a punctuation symbol marks the end of a sentence.
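A sketch of the punctuation heuristic combined with a small abbreviation dictionary (the abbreviation set here is illustrative, not exhaustive):

```python
text = "Dr. Smith arrived at 5 p.m. and left early. The meeting was over."

# Split after ., ! or ? unless the token is a known abbreviation.
abbreviations = {"dr.", "p.m.", "a.m.", "e.g.", "i.e."}
sentences, current = [], []
for token in text.split():
    current.append(token)
    if token[-1] in ".!?" and token.lower() not in abbreviations:
        sentences.append(" ".join(current))
        current = []
if current:
    sentences.append(" ".join(current))

print(sentences)
# ['Dr. Smith arrived at 5 p.m. and left early.', 'The meeting was over.']
```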
Identify and explain the text pre-processing technique concerned with breaking down text into words? Also define the vocabulary size.
Tokenisation is the task of converting a sentence/corpus into a sequence of tokens/features. Common methods include using spaces as token boundaries and handling edge cases using heuristics - for example “Mary’s” could be split into “Mary” and “s” or regarded as a single token.
The vocabulary size is the number of distinct token types in the corpus. It can vary based on how such edge cases are handled, as shown in the given example.
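A sketch of the two tokenisation choices and their effect on vocabulary size (the regexes are illustrative heuristics, not a standard tokeniser):

```python
import re

sentence = "Mary's sister didn't visit New York in 2023."

# Choice 1: keep clitics attached ("Mary's" stays one token), split off punctuation.
tokens_attached = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

# Choice 2: split clitics off ("Mary's" becomes "Mary" and "'s").
tokens_split = re.findall(r"'\w+|\w+|[^\w\s]", sentence)

# Vocabulary size = number of distinct token types; it changes with the choice made.
print(len(set(tokens_attached)), len(set(tokens_split)))  # 9 vs 11
```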
Identify and explain the text pre-processing technique concerned with identifying common names, places, etc.?
Named Entity Recognition is concerned with identifying tokens which represent countries, people, titles, etc. Each named entity should be treated as a single token, regardless of the number of words it is made up of. For example “New Mexico” should not become “New” and “Mexico”.
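A minimal NER sketch, assuming spaCy and its small English model are installed (pip install spacy, then python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alice moved from New Mexico to the United Kingdom in March.")

# Multi-word names come back as single entity spans, e.g. "New Mexico".
for ent in doc.ents:
    print(ent.text, ent.label_)
```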
What is a common set-up for model development? Please explain the advantages and disadvantages of this setup in regard to model performance. Hint: three datasets are usually involved.
A common setup is the training, dev-test, and test dataset split. The training set is used for model training/fitting, the dev-test set is used to compute the prediction error during model selection, and the test set is used to compute the generalization error before model deployment.
An advantage of using the additional dev-test set is that the test set is kept hidden from the model for as long as possible, meaning that there is a lower risk of over-fitting.
A requirement of this approach is that the dev-test set must be disjoint from the test set to avoid over-fitting, but still be representative of it to avoid bias.
What is a useful technique for model evaluation when a good training, dev-test, test split is not achievable? Hint: give the answer in terms of “k-fold”.
Cross-validation allows a model to be evaluated when there is not enough quality data for a three-part split. The general approach is to conduct a k-fold cross-validation on the training dataset, where the training set is split into k equal chunks: k-1 chunks are used for model training and the remaining chunk for calculating the prediction error. The process is repeated k times and the k prediction errors are averaged together.
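A minimal sketch of k-fold cross-validation; train_and_score is a hypothetical callback that fits a model on the training chunks and returns its prediction error on the held-out chunk:

```python
import random

def k_fold_chunks(n_items, k=5, seed=0):
    """Shuffle item indices and split them into k roughly equal, disjoint chunks."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(items, labels, train_and_score, k=5):
    """Train on k-1 chunks, score on the held-out chunk, average the k scores."""
    folds = k_fold_chunks(len(items), k)
    scores = []
    for i, held_out in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(train_and_score(
            [items[j] for j in train_idx], [labels[j] for j in train_idx],
            [items[j] for j in held_out], [labels[j] for j in held_out],
        ))
    return sum(scores) / k
```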
What are 4 common intrinsic evaluation metrics when evaluating a machine learning model? Provide formulae for all of them and explain them.
Accuracy measures the percentage of correctly classified inputs in the test set: Accuracy = (correctly classified instances) / (total instances). The “baseline” accuracy is the number of occurrences of the majority class in the test set divided by the size of the test set. For unbalanced datasets, a model accuracy above the baseline suggests that the model is likely not just guessing the most probable class each time but is actually using the features for the classification task.
Precision is the ratio of items predicted to belong to a certain class which actually belong to that class. It is calculated by P = TP / (TP + FP).
Recall is the ratio of items of a class in the gold standard which the model correctly identifies. It is calculated by R = TP / (TP + FN).
F1 combines precision and recall into a single metric (their harmonic mean). It is computed as F1 = 2PR / (P + R).
How would you interpret a Confusion Matrix to extract the accuracy of the model, and precision and recall for a particular class?
Assuming rows are indexed by the gold labels and columns by the predicted labels:
The accuracy is given by sum(numbers along the top-left to bottom-right diagonal) / (total number of instances).
The precision for a class is the number appearing on the diagonal / the sum of the column indexed by that class.
The recall for a class is the number appearing on the diagonal / the sum of the row indexed by that class.
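A small sketch that extracts accuracy, per-class precision/recall, and F1 from a confusion matrix using the convention above (rows = gold labels, columns = predictions); the matrix values are made up:

```python
import numpy as np

cm = np.array([[50,  5,  0],    # rows: gold labels
               [10, 30,  5],    # columns: predicted labels
               [ 0,  5, 45]])

accuracy  = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # per class: diagonal / column sum
recall    = np.diag(cm) / cm.sum(axis=1)   # per class: diagonal / row sum
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```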
Decision Trees are a popular machine learning classifier. Explain how they work, along with their advantages and disadvantages.
Decision trees are flowcharts composed of decision nodes, which check feature values, and leaf nodes, which assign labels to instances. The tree is built by repeatedly choosing the feature which best splits the data.
Advantages include high interpretability and suitability for classifying hierarchical data.
Disadvantages are that the amount of training data decreases at lower levels, making the model prone to over-fitting, and that it forces features to be checked in a particular order even if these are independent of each other.
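A toy sketch assuming scikit-learn is installed; the binary “features” (e.g. whether an email contains certain words) and labels are made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [contains "free", contains "offer", sender is known].
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y = ["spam", "spam", "ham", "ham", "spam", "ham"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[1, 0, 1]]))

# High interpretability: the learned flowchart can be printed directly.
print(export_text(tree, feature_names=["free", "offer", "known_sender"]))
```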
Naive Bayes is a popular machine learning classifier. Explain how it works, including the underlying Naive Bayes assumption, how smoothing works, and what issue it solves.
The prior probability for each label is estimated from the label frequency, P(label). The contribution of each feature is then combined with the prior to obtain a likelihood estimate for each label: P(label) × P(feature1 | label) × … × P(featureN | label). Finally, the label with the highest resulting likelihood is chosen as the classification result.
The Naive Bayes assumption states that given a label, all features are statistically independent of each other, meaning that the classifier treats the corpus as a “bag of words”.
Smoothing solves the issue of zero counts, which occur when a feature never appears with a given label in the training data. By adding a small constant to every count (e.g. add-one/Laplace smoothing), no estimated probability is ever exactly 0.
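A compact sketch of a Naive Bayes classifier with add-one smoothing over bag-of-words features (the training data format is assumed: a list of token lists plus a parallel list of labels):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    label_counts = Counter(labels)                 # for the priors P(label)
    word_counts = defaultdict(Counter)             # for P(feature | label)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(tokens, label_counts, word_counts, vocab):
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, float("-inf")
    for label, count in label_counts.items():
        logp = math.log(count / total_docs)                    # log P(label)
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for tok in tokens:
            logp += math.log((word_counts[label][tok] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```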
Explain what causal language modelling is.
Causal language modelling predicts the next token in a sequence of tokens incrementally, usually basing the prediction off of a limited history of tokens
Explain how n-gram language models work? Also mention the underlying assumption used to limit the required generated token history.
N-gram models select the most likely token based on the previous n-1 tokens in the sequence, using the Markov assumption, which implies that only a limited history of tokens is relevant for selecting the next one.
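A bigram (n=2) sketch: under the Markov assumption, only the single previous token is used to predict the next one. The tiny corpus is made up:

```python
from collections import Counter, defaultdict

corpus = ["<s> the cat sat </s>", "<s> the cat slept </s>", "<s> the dog sat </s>"]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_token_probs(prev):
    """MLE estimate of P(next | prev) = count(prev, next) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return {w: c / total for w, c in bigram_counts[prev].items()}

print(next_token_probs("cat"))  # {'sat': 0.5, 'slept': 0.5}
```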
Explain the problem of Sparsity in Language, relating your answer to n-gram models?
The capability of the model to generalize depends on how representative the training data is. According to Zipf’s law, many n-grams can appear in the test set but not in the training set. A possible solution for this issue could be to use smoothing.
Explain what the Maximum Likelihood Estimation (MLE) is, relating it back to the Markov assumption in language models?
MLE refers to the process by which a model estimates the probability of a sequence of tokens from relative counts in the training corpus, e.g. P(wn | wn-1) = count(wn-1 wn) / count(wn-1) for a bigram model. The Markov assumption reduces the number of parameters required in this computation.
Perplexity is a common evaluation metric for Language Models. Please explain what it is and how it can be interpreted?
Perplexity is the inverse probability of the test corpus according to the model, normalized by the corpus size: PP(W) = P(w1 … wN)^(-1/N). Lower perplexity means that the model can more easily predict the next word in the test corpus.
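A short sketch computing perplexity from per-token probabilities; the probabilities here are hypothetical outputs of some language model on a five-token test corpus:

```python
import math

token_probs = [0.2, 0.1, 0.25, 0.05, 0.3]   # P(token_i | history), per test token

n = len(token_probs)
cross_entropy = -sum(math.log2(p) for p in token_probs) / n   # bits per token
perplexity = 2 ** cross_entropy                               # = P(corpus) ** (-1/n)
print(cross_entropy, perplexity)
```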
Word embedding is a common prerequisite for auto-completion tasks. Name and explain two auto-completion tasks. Do these preserve word-order information? Finally, name a popular pre-trained word embedding model.
The two tasks are skip-gram (given a word, predict its surrounding context) and CBOW (given a context, predict the missing word). Neither of these preserves word-order information, which is a limitation. Popular pre-trained embeddings include GloVe and Word2Vec.
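A minimal sketch assuming gensim (4.x) is installed; the toy sentences are far too small for useful embeddings, but they show the skip-gram/CBOW switch:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"],
             ["the", "cat", "chased", "the", "dog"]]

# sg=1 trains skip-gram; sg=0 (the default) trains CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])           # first few dimensions of the embedding
print(model.wv.most_similar("cat"))  # nearest neighbours in the vector space
```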
Sentence embedding is used for clustering and retrieval tasks. Explain the two naive approaches used, as well as the 3 families of neural language models used for sentence embeddings.
The first naive approach is to treat each sentence in the corpus as a word and run a CBOW-style model on it to encode its context. This is challenging due to extreme data sparsity.
The second naive approach is to embed each sentence as the mean of the embeddings of all words within that sentence (sketched below). Limitations include the need to weight down stopwords, which otherwise dilute the average, and the fact that word ordering is ignored by the embedding.
The 3 families of models are Long Short-Term Memory networks (LSTMs), Transformers, and LLMs.
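A sketch of the second naive approach (mean of word embeddings), using made-up 2-dimensional vectors in place of real pre-trained embeddings; note how word order is lost:

```python
import numpy as np

# Hypothetical word -> vector lookup (in practice, e.g. loaded GloVe vectors).
embeddings = {"cats": np.array([0.2, 0.8]),
              "chase": np.array([0.5, 0.1]),
              "mice": np.array([0.3, 0.7])}

def sentence_embedding(tokens):
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

# Both sentences receive exactly the same embedding: word order is ignored.
print(sentence_embedding(["cats", "chase", "mice"]))
print(sentence_embedding(["mice", "chase", "cats"]))
```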
Name and explain the main usage of two popular LLMs used for sentence embedding?
GPT is an autoregressive model used for text generation, while BERT is used for masked word and next-sentence prediction.
Define morphemes and lexemes?
A morpheme is the smallest meaningful unit of language, even if it cannot stand on its own. For example “un” in “unusual” is a morpheme.
Lexemes are units of lexical meaning underlying a set of related word forms. For example the lexeme of “running”, “ran”, and “runner” is “run”.
Explain what a Stemmer and Lemmatiser do?
A stemmer removes affixes from words, leaving the stem behind. For example stemming “unpleasantly” yields “pleasant”.
A lemmatiser maps a word to its lemma, i.e. its dictionary entry. For example lemmatising “leaving” yields “leave”.
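A quick sketch assuming NLTK is installed and its WordNet data has been downloaded (nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

print(stemmer.stem("running"))               # 'run'   (suffix stripped, no dictionary used)
print(lemmatiser.lemmatize("leaving", "v"))  # 'leave' (mapped to its dictionary entry)
```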
Explain what POS tagging consists of? Explain the main challenge of POS tagging?
Part-of-speech tagging involves assigning a lexical category to each word in a corpus. POS tags can then be used to define rules about the syntax of a sentence and to generate other sentences of the same structure. The main challenge is ambiguity: many words can belong to several categories (e.g. “flies” as a noun or a verb), so the correct tag depends on context.
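A sketch assuming NLTK plus its tokeniser and tagger data (nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')):

```python
import nltk

tokens = nltk.word_tokenize("Time flies like an arrow")
print(nltk.pos_tag(tokens))
# "flies" could be a noun or a verb here; the tagger has to use context to decide,
# which is exactly the ambiguity that makes POS tagging hard.
```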
Explain what syntax trees are?
Syntax trees represent how grammar rules combine to form sentences
Explain what dependency parsing consists of?
Dependency parsing is the task of obtaining the grammatical structure of a sentence by analyzing the relationships (dependencies) between words. Verbs are often the root of the tree, since they are central to their clause (sentence).
What does it mean for a grammar to be context free?
In a context-free grammar, a production/rule can be applied to expand a non-terminal regardless of the surrounding context in which it appears.
Explain the most basic version of a language model? Include “cross-entropy” in your answer.
The idea behind language models is to assign probabilities to the symbols of a language: the probability of a sequence is the product of the probabilities of each token given the preceding tokens. Cross-entropy measures how close the distribution of tokens learned by the model is to the true token distribution of the language.
What is one major issue with word embedding models such as Word2Vec?
Word2Vec embeds all senses of a word to the same vector, so the past tense of “see” and the cutting tool “saw” are both embedded the same.
What are two desirable properties of word embeddings?
Words with similar meanings should be close in the vector space (for all languages), and syntactic/morphological relationships should ideally be preserved as well.
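A tiny sketch of the first property using cosine similarity; the 3-dimensional vectors are made-up stand-ins for real embeddings of “king”, “queen”, and “banana”:

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king   = np.array([0.9, 0.8, 0.1])
queen  = np.array([0.85, 0.9, 0.05])
banana = np.array([0.1, 0.0, 0.95])

print(cosine_similarity(king, queen))   # close to 1: similar meanings, nearby vectors
print(cosine_similarity(king, banana))  # much lower: unrelated meanings
```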