CS4051 Natural Language Processing Flashcards

1
Q

What does it mean for an LLM to exhibit priming effects?

A

After seeing a sentence of a particular structure, the LLM is less surprised by an upcoming sentence of a similar structure

2
Q

Define the following terms: Morphology, Pragmatics, Semantics.

A

Morphology is the study of the internal structure of words, i.e., how morphemes combine to form words.
Pragmatics is the study of meaning in context.
Semantics is the study of literal meaning, reference, and truth.

3
Q

What does Zipf’s Law state and what does it imply for model development?

A

Zipf’s law states that the frequency of a word within a corpus is inversely proportional to its rank in the frequency table: the r-th most frequent word occurs roughly 1/r times as often as the most frequent one. This implies that it is important for a model to be trained on a representative corpus, so that rarer words within the test set do not surprise the model.

4
Q

Define the following terms: Lexicon, Homonym, Word senses.

A

A lexicon is a collection of lexical entries (words).
Homonyms are words which have the same spelling/pronunciation but different meanings, for example “saw” (the cutting tool) and “saw” (the past tense of see).
Word senses are all the different meanings of a word, for example the senses of “saw” would include the meanings mentioned above.

5
Q

What are stopwords (give an example) and how can they be identified in a corpus?

A

Stopwords are commonly occurring words which are usually irrelevant to most NLP tasks, for example “of”, “the”, and “a”. They can usually be identified as the highest-frequency terms in a corpus, or by consulting a pre-compiled stopword list.

6
Q

Identify and explain the text pre-processing technique concerned with breaking text down into larger chunks of words (sentences).

A

Sentence splitting consists of breaking the corpus down into sentences. A good heuristic is to look at punctuation symbols; abbreviation dictionaries can also be used to determine whether a full stop marks the end of a sentence or an abbreviation (e.g., “Dr.”).

7
Q

Identify and explain the text pre-processing technique concerned with breaking text down into words. Also define the vocabulary size.

A

Tokenisation is the task of converting a sentence/corpus into a sequence of tokens/features. Common methods include using spaces as token boundaries and handling edge cases with heuristics; for example, “Mary’s” could be split into “Mary” and “’s” or regarded as a single token.

The vocabulary size is the number of distinct tokens (types) in the text. It can vary based on edge-case handling, as the example above shows.

8
Q

Identify and explain the text pre-processing technique concerned with identifying common names, places, etc.?

A

Named Entity Recognition is concerned with identifying tokens which represent countries, people, titles, etc. These should be treated as a single token, regardless of the number of words they are made up of. For example, “New Mexico” should not be split into “New” and “Mexico”.

9
Q

What is a common set-up for model development? Please explain the advantages and disadvantages of this set-up with regard to model performance. Hint: three datasets are usually involved.

A

A common set-up is the split into training, dev-test, and test datasets. The training set is used for model training/fitting, the dev-test set is used to compute the prediction error during model selection, and the test set is used to compute the generalisation error before model deployment.

An advantage of using the additional dev-test set is that the test set is kept hidden from the model for as long as possible, meaning that there is a lower risk of over-fitting.

A requirement of this approach is that the dev-test set must be disjoint from the test set to avoid over-fitting, while still being representative to avoid bias.

10
Q

What is a useful technique for model evaluation when a good training/dev-test/test split is not achievable? Hint: give the answer in terms of “k-fold”.

A

Cross-validation allows a model to be evaluated when there is not enough quality data for a three-way split. The general approach is to conduct k-fold cross-validation on the training dataset: the training set is split into k equal chunks, k-1 chunks are used for model training, and the remaining chunk is used for calculating the prediction error. The process is repeated k times and the prediction errors are averaged.
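A minimal sketch in Python (train_fn and error_fn are hypothetical placeholders, not from the course):

    # k-fold cross-validation: average prediction error over k held-out chunks
    def k_fold_error(data, k, train_fn, error_fn):
        fold_size = len(data) // k
        errors = []
        for i in range(k):
            held_out = data[i * fold_size:(i + 1) * fold_size]            # 1 chunk for validation
            training = data[:i * fold_size] + data[(i + 1) * fold_size:]  # remaining k-1 chunks
            model = train_fn(training)
            errors.append(error_fn(model, held_out))
        return sum(errors) / k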

11
Q

What are four common intrinsic evaluation metrics when evaluating a machine learning model? Provide a formula for each and explain them.

A

Accuracy measures the percentage of correctly classified inputs in the test set: Accuracy = (correctly classified inputs) / (size of the test set). The “baseline” accuracy is the number of occurrences of the majority class in the test set divided by the size of the test set. For unbalanced datasets, a model accuracy above the baseline suggests that the model is not simply guessing the most probable class each time but is actually using the features for classification.

Precision is the proportion of items predicted to belong to a class which actually belong to that class. It is calculated as P = TP / (TP + FP).

Recall is the proportion of items of a class in the gold standard which the model correctly identifies. It is calculated as R = TP / (TP + FN).

F1 combines precision and recall into a single metric. It is computed as F1 = 2PR / (P + R).
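A small sketch computing these three count-based metrics, with a worked example:

    # Precision, recall, and F1 from true/false positive and false negative counts
    def precision_recall_f1(tp, fp, fn):
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return p, r, 2 * p * r / (p + r)

    # e.g. tp=8, fp=2, fn=4 gives P = 0.8, R = 0.667, F1 = 0.727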

12
Q

How would you interpret a Confusion Matrix to extract the accuracy of the model, and precision and recall for a particular class?

A

Assuming rows are indexed by the gold label and columns by the predicted label:

The accuracy is the sum of the entries along the top-left to bottom-right diagonal divided by the total number of instances.

The precision for a class is its diagonal entry divided by the sum of the column indexed by that class.

The recall for a class is its diagonal entry divided by the sum of the row indexed by that class.
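A minimal sketch under the row = gold label, column = prediction convention above:

    # Accuracy plus precision/recall for class index c from a confusion matrix m
    def confusion_metrics(m, c):
        total = sum(sum(row) for row in m)
        accuracy = sum(m[i][i] for i in range(len(m))) / total
        precision_c = m[c][c] / sum(m[i][c] for i in range(len(m)))  # column sum
        recall_c = m[c][c] / sum(m[c])                               # row sum
        return accuracy, precision_c, recall_c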

13
Q

Decision Trees are a popular machine learning classifier. Explain how they work and their advantages and disadvantages.

A

Decision trees are flowcharts composed of decision nodes, which check feature values, and leaf nodes, which assign labels to instances. At each decision node, the feature which best splits the data is used.

Advantages include high interpretability and suitability for classifying hierarchical data.

Disadvantages are that the amount of training data decreases at lower levels, making the model prone to over-fitting, and that features are forced to be checked in a particular order even if they are independent of each other.

14
Q

Naive Bayes is a popular machine learning classifier. Explain how it works, including the underlying Naive Bayes assumption, how smoothing works, and what issue smoothing solves.

A

The prior probability for each label is computed from the label frequency, P(label). The contribution of each feature is then combined with the prior to obtain a likelihood estimate for each label: P(label) × P(feature1 | label) × … × P(featureN | label). Finally, the label with the highest likelihood is chosen as the classification result.

The Naive Bayes assumption states that given a label, all features are statistically independent of each other, meaning that the classifier treats the corpus as a “bag of words”.

Smoothing solves the issue of zero counts, which arises when a feature never occurs with a given label in the training corpus, making the whole product zero. By adding a constant value to every count, no probability is ever 0.
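A one-line sketch of add-alpha smoothing for the per-feature likelihoods (alpha = 1 is Laplace smoothing):

    # P(feature | label) with add-alpha smoothing
    def smoothed_likelihood(count_feature_label, count_label, vocab_size, alpha=1):
        # an unseen feature-label pair now gets a small non-zero probability
        return (count_feature_label + alpha) / (count_label + alpha * vocab_size)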

15
Q

Explain what causal language modelling is.

A

Causal language modelling predicts the next token in a sequence incrementally, usually basing the prediction on a limited history of tokens.

16
Q

Explain how n-gram language models work. Also mention the underlying assumption used to limit the required token history.

A

N-gram models select the most likely token based on the previous n-1 tokens in the sequence, using the Markov assumption: only a limited history of tokens is relevant for selecting the next one.
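A bigram (n = 2) sketch of the MLE estimate P(w | w_prev) = count(w_prev, w) / count(w_prev) on a toy corpus:

    from collections import Counter

    tokens = "the cat sat on the mat".split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)

    def p_next(w_prev, w):
        return bigrams[(w_prev, w)] / unigrams[w_prev]

    # p_next("the", "cat") == 0.5: "the" occurs twice, once followed by "cat"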

17
Q

Explain the problem of sparsity in language, relating your answer to n-gram models.

A

The capability of the model to generalise depends on how representative the training data is. According to Zipf’s law, many n-grams will appear in the test set but not in the training set. A possible solution to this issue is smoothing.

18
Q

Explain what Maximum Likelihood Estimation (MLE) is, relating it back to the Markov assumption in language models.

A

MLE refers to the process by which a model estimates the probability of a sequence of tokens from relative frequencies in the training corpus, using the Markov assumption to reduce the number of parameters required in the computation.

19
Q

Perplexity is a common evaluation metric for language models. Please explain what it is and how it can be interpreted.

A

Perplexity is the inverse probability of the test corpus according to the model, normalized by the corpus size. Lower perplexity means that the model can more easily predict the next word in the test corpus.
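In formula form, for a test corpus W = w1 … wN: PP(W) = P(w1 … wN)^(-1/N). A minimal sketch computing this from per-token log probabilities (log space avoids numerical underflow):

    import math

    def perplexity(log2_probs):  # one log2 P(w_i | history) per token
        return 2 ** (-sum(log2_probs) / len(log2_probs))

    # A model assigning each of 4 tokens probability 0.25 has perplexity 4:
    # perplexity([math.log2(0.25)] * 4) == 4.0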

20
Q

Word embedding is a common prerequisite for auto-completion tasks. Name and explain two auto-completion tasks. Do these preserve the order information of the generated output? Finally, name a popular pre-trained word embedding model.

A

Two word embedding tasks are Skip-gram (given a word, predict its surrounding context) and CBOW (given a context, predict the missing word). Neither preserves order information, which is a limitation. Popular pre-trained embeddings include GloVe and Word2Vec.

21
Q

Sentence embedding is used for clustering and retrieval tasks. Explain the two naive approaches used, as well as the three families of neural language models used for sentence embeddings.

A

The first naive approach is to treat each sentence in a corpus as a word and run a CBOW model on it to encode the context. This is challenging due to extreme data sparsity.

The second naive approach is to embed each sentence as the mean of the embeddings of the words within it (see the sketch below). Limitations are that stopwords must be weighted down to stop them dominating the average, and that word ordering is ignored by the embedding.

The three families of models are Long Short-Term Memory networks (LSTMs), Transformers, and LLMs.
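A minimal sketch of the mean-of-word-embeddings approach, assuming a hypothetical word_vectors lookup table:

    import numpy as np

    def mean_pool(sentence, word_vectors):
        # word_vectors: dict mapping token -> np.ndarray (hypothetical)
        vectors = [word_vectors[w] for w in sentence.split() if w in word_vectors]
        return np.mean(vectors, axis=0)  # assumes at least one known word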

22
Q

Name and explain the main usage of two popular LLMs used for sentence embedding.

A

GPT is an autoregressive model used for text generation, while BERT is used for masked word and next-sentence prediction.

23
Q

Define morphemes and lexemes.

A

Morphemes are the smallest meaningful units of language, even if they cannot stand on their own. For example, “un” in “unusual” is a morpheme.
Lexemes are units of lexical meaning underlying a set of related word forms. For example, the lexeme of “running”, “ran”, and “runner” is “run”.

24
Q

Explain what a stemmer and a lemmatiser do.

A

A stemmer removes affixes from words, leaving the stem behind. For example, stemming “unpleasantly” yields “pleasant”.
A lemmatiser maps a word to its lemma, i.e., its dictionary entry. For example, lemmatising “leaving” yields “leave”.

25
Q

Explain what POS tagging consists of and its main challenge.

A

Part-of-speech tagging involves assigning a lexical category to each word in a corpus. POS tags can then be used to define rules about the syntax of a sentence and to generate other sentences of the same structure. The main difficulty arises from ambiguity: the correct tag for a word depends on its context (e.g., “book” can be a noun or a verb).

26
Q

Explain what syntax trees are.

A

Syntax trees represent how grammar rules combine to form sentences.

27
Q

Explain what dependency parsing consists of.

A

Dependency parsing is the task of obtaining the grammatical structure of a sentence by analysing the relationships between its words. Verbs are often the root of the tree, since they are central to their clause.

28
Q

What does it mean for a grammar to be context free?

A

Context-free grammars allow productions/rules to be expanded regardless of the context in which a non-terminal appears.

29
Q

Explain the most basic version of a language model. Include “cross-entropy” in your answer.

A

The idea behind language models is to assign probabilities to sequences of symbols of a language, computed as the product of each token’s probability given the preceding tokens. Cross-entropy measures how close the distribution of tokens learned by the model is to the true token distribution of the language.

30
Q

What is one major issue with word embedding models such as Word2Vec?

A

Word2Vec embeds all senses of a word to the same vector, so “saw” as the past tense of “see” and “saw” as the cutting tool are embedded identically.

31
Q

What are two desirable properties of word embeddings?

A

Words with similar meanings should be close in the vector space (in any language), and syntactic/morphological regularities should sometimes be preserved as well.

32
Q

What is a metric used to measure similarity between embeddings? Provide the formula.

A

Cosine similarity, measured as the dot product of the two vectors divided by the product of their magnitudes: cos(u, v) = (u · v) / (|u| |v|).
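A direct sketch of the formula:

    import math

    def cosine_similarity(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # cosine_similarity([1, 0], [1, 1]) ~= 0.707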

33
Q

What is contextualized word embedding and what model is capable of producing these?

A

Contextualized embedders such as BERT produce a different vector for a word depending on the context it appears in.

34
Q

Name and explain a heuristic used for document embedding.

A

TF*IDF (term frequency × inverse document frequency) is a heuristic which, given a term and a collection of documents, assigns a high weight to a term within a document when the term is frequent in that document but uncommon in the rest of the collection.
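A sketch of one common TF*IDF weighting variant (documents as token lists; assumes the term occurs in at least one document):

    import math

    def tf_idf(term, doc, collection):
        tf = doc.count(term) / len(doc)               # term frequency in this document
        df = sum(1 for d in collection if term in d)  # documents containing the term
        return tf * math.log(len(collection) / df)    # down-weight widespread terms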

35
Q

What is a technique in bilingual embeddings which can help improve monolingual embeddings?

A

Direct Transfer consists of training an embedding model on a resource-rich language and applying it to a resource-poor language. A common limitation of this approach is out-of-vocabulary words.

36
Q

Explain the naive approach for training bilingual embeddings and its main bottleneck.

A

The common approach is to learn a monolingual embedding in one language and map it to the target language using a dictionary. A bottleneck arises if there is no entry in the dictionary for a particular token, meaning a mapping cannot be achieved.

37
Q

Name and explain how two common bilingual embedding models work.

A

BiSkip uses word-aligned and sentence-aligned texts, then runs a skip-gram model using words from both languages as context.

BiVCD merges and shuffles aligned documents, then runs a monolingual embedding on this output.

38
Q

What is the task of natural language generation?

A

Turning structured data into human-readable text.

39
Q

What are the two approaches to natural language generation? Discuss their advantages and disadvantages.

A

Rule-based generation requires manual effort to define rules. It does not generalise well; however, it produces high-quality outputs for the target domain.
Data-driven generation requires no rule definition and generalises well; however, output quality is not guaranteed and there is a risk of picking up bias from the training data.

40
Q

What is the task of decoding in LLM generation? Name the most common decoding strategy.

A

Decoding is the task of choosing a word to generate based on the model’s probabilities. Sampling is a common decoding approach.

41
Q

Name and briefly describe the six sampling methods used in decoding during LLM generation.

A

Random sampling: randomly choose a word from the model’s probability distribution.
Greedy sampling: always choose the most probable word.
Beam search: expand the X most probable sequences in parallel and pick the one with the highest cumulative probability.
Top-k sampling: choose a word at random from the k most probable ones.
Nucleus (top-p) sampling: choose a word at random from the smallest set of words whose cumulative probability mass exceeds p.
Temperature sampling: divide the logits by a temperature before applying the soft-max function. As the temperature decreases, higher-probability words are pushed towards 1 and lower-probability ones towards 0.
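A minimal sketch of temperature sampling over raw logits:

    import math, random

    def sample_with_temperature(logits, temperature=1.0):
        scaled = [l / temperature for l in logits]   # divide logits by the temperature
        z = max(scaled)                              # subtract max for numerical stability
        exps = [math.exp(s - z) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]            # soft-max
        return random.choices(range(len(probs)), weights=probs)[0]

    # As temperature -> 0 this approaches greedy sampling; large values flatten
    # the distribution towards random sampling.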

42
Q

Explain what in-context learning and few-shot prompting are.

A

In-context learning (prompting) is the task of improving the model’s performance without any gradient-descent-based updates to its parameters, usually by using prompts to provide a context for the LLM to work in.

Few-shot prompting consists of including labelled example answers in the prompt to guide the model.

43
Q

Explain broadly how fine-tuning works and what it is used for.

A

Fine-tuning consists of adapting the parameters of a pre-trained model to complete tasks in a specific domain.

44
Q

Explain what reference metrics are and provide two example metrics.

A

Reference metrics assess the similarity between the text generated for a task and a human-written reference output. Two common metrics are BLEU, used to evaluate cross-language machine translations, and BERTscore, which computes token-wise similarity, taking semantics into account.

45
Q

Explain the task of reference resolution. What are the two things required to accomplish this task?

A

Reference resolution is the task of determining which preceding entity or topic is being referred to in a dialogue context. It relies on a correct understanding of speaker common ground and speaker intent.

46
Q

What are speech acts? Give some examples.

A

Speech acts convey speaker intent. Examples of these are warnings, requests, and invitations.

47
Q

Explain what we mean by abductive reasoning.

A

Inference to the most plausible explanation.

48
Q

Explain what Information Retrieval is and the two approaches used to achieve it.

A

Information Retrieval refers to the task of returning documents from a corpus which are relevant to answering a user query.

The first approach is to encode the documents as sparse vectors, weight them using TF-IDF, and then find the most relevant one by cosine similarity with the query.

The second approach involves encoding the documents and the query as dense vectors using BERT and then computing the similarity within the embedding space.

49
Q

What is RAG? Explain what it is and its main usage.

A

Retrieval Augmented Generation uses information retrieval to gather relevant documents, then feeds them to an LLM to generate an answer.

The main benefit of RAG is that answers are more likely to be grounded in truth.

Question-Answering systems use RAG in the Retriever-Reader architecture.

50
Q

What is alignment in dialogue?

A

The phenomenon by which speakers adapt to each other to aid mutual understanding.

51
Q

What are dialogue acts?

A

These are generalised speech acts which also represent grounding and may require inference by one of the actors in the dialogue.

52
Q

What is the most trivial approach to dialogue generation?

A

Annotate each utterance/turn with the speaker and concatenate these together to generate a dialogue context.

53
Q

What is the main challenge of dialogue?

A

The main challenge in dialogue is exploiting common ground to understand context in order to generate an utterance/response.

54
Q

What is Surprisal in dialogue, and what can be said about surprisal between speakers?

A

A measure of a speaker’s processing effort. Surprisal has been observed to converge between speakers as the dialogue progresses.

55
Q

What are constructions in dialogues, and what is facilitated by construction repetition in dialogue?

A

Constructions are utterances of 3+ tokens which occur 3+ times within the dialogue.

Construction repetition facilitates processing in task-oriented dialogues, in particular by supporting a higher information delivery rate between speakers.

56
Q

What is one of the most significant indicators of task success in dialogue?

A

Dialogue length

57
Q

What is the task of machine translation? Explain it and name a few common challenges.

A

Machine translation is the process of using computational techniques to translate text or speech from one language to another. Common challenges include interpreting creativity and style (especially when translating creative pieces such as poetry) and translating from/to low-resource languages.

58
Q

Give an example of when biases may be learned when translating from a language like Hungarian to English.

A

A possible bias can occur when translating from a gender-neutral language such as Hungarian to a gendered language such as English: a gender-neutral pronoun may be rendered as “she is a nurse” but “he is a doctor”, reflecting stereotypes in the training data.

59
Q

Name and explain the two extrinsic machine translation evaluation metrics.

A

Two common metrics are Adequacy (how well the translation captures the meaning of the source sentence) and Fluency (how fluent the translation is in the target language).

60
Q

Explain the main idea behind intrinsic metrics for the evaluation of machine translation. Name two common scoring metrics and broadly explain what they do.

A

These measure the similarity between the translation output and a human-generated gold standard by calculating the overlap between the two.

BLEU is a scoring metric which captures lexical relatedness and penalizes translations which are too short.

BERTscore captures semantic relatedness. It works by encoding both translations using BERT and computing the token-wise cosine similarity.

61
Q

What is the main objective of summarisation? Explain the three stages of a typical summarisation process.

A

Summarisation aims to produce a condensed version of a document which contains information relevant to answering a user query.

The three stages of a summarisation task are
(1) pre-processing - parse document and extract required contents
(2) feature design and sentence scoring - extract features such as length and use them to score sentences
(3) post-processing - sentence simplification, redundancy removal, and reordering.

62
Q

Explain the difference between extractive and abstractive summarisation.

A

Extractive summarisation produces summaries consisting entirely of material copied from the input document.

Abstractive summarisation produces summaries containing material not in the input document, i.e., paraphrased content.

63
Q

Name and explain a popular metric for the intrinsic evaluation of summarisation outputs.

A

ROUGE is regarded as a convenient metric to use when human evaluation is not possible. The core idea is to compute the percentage of bigrams in the human-defined gold standard which also appear in the generated summary.
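A simplified sketch of this bigram-recall idea using sets (the official ROUGE-2 counts repeated bigrams rather than deduplicating them):

    def rouge_2_recall(reference_tokens, generated_tokens):
        ref_bigrams = set(zip(reference_tokens, reference_tokens[1:]))
        gen_bigrams = set(zip(generated_tokens, generated_tokens[1:]))
        return len(ref_bigrams & gen_bigrams) / len(ref_bigrams)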

64
Q

What is temporality?

A

Temporality is defined as knowledge change over time.

65
Q

Name and explain the three categories of temporal hallucination.

A

Omission occurs when LLMs provide an incomplete or partly correct answer for a query with multiple correct answers.
Fabrication occurs when LLMs provide a made-up answer for queries with no answer.
Misattribution occurs when LLMs answer with the wrong entity for a query whose answer is a proper noun.

66
Q

What is the difference between 1-hop and 2-hop temporal complexity?

A

In 1-hop, the timestamp of the answer is explicitly included in the query, while in 2-hop this is implicit.

67
Q

Name and explain the three intrinsic evaluation metrics for temporal queries.

A

Exact match yields true if the ground truth is contained within the model answer.
F1 measures overlap between the LLM’s answer and the ground truth.
PEDANTS measures the correlation of the model output with a human-labelled QA dataset.

68
Q

Explain what the following evaluation results might tell us about the answer of an LLM to a temporal query:

  1. F1 score = PEDANTS score and both F1 and PEDANTS are larger than the EM score
  2. PEDANTS > F1 > EM
  3. F1 > PEDANTS > EM
  4. F1 = EM and both F1 and EM are larger than PEDANTS
A
  1. The query doesn’t support order swapping: the ground truth expects “x … y”, therefore “y … x” is not allowed.
  2. The model’s answer is correct but verbose.
  3. The model is likely guilty of omitting information from the answer.
  4. The model is likely guilty of fabricating the answer.
69
Q

What is a possible explanation for worse results on temporal queries concerned with the distant past, but better results for queries concerned with more recent times?

A

A possible data shortage about what happened a long time ago: there could be gaps in the data, or different formats of data may have been gathered. In general, increased data availability results in better answers.

70
Q

LLMs struggle to understand dates when answering temporal queries. Name a technique by which information can be obtained from elsewhere and used to improve the semantics of the output.

A

RAG (Retrieval Augmented Generation).

71
Q

What are some reasons for carrying out intrinsic evaluation in NLP? Name 3 ways in which it can be carried out.

A

Reasons for intrinsic evaluation include assessing performance, assessing generalisation, and identifying bias and fairness issues. Common methods for intrinsic evaluation include performance metrics (BLEU, perplexity, etc.), dataset splits, and human evaluation such as labelling.

72
Q

What are some reasons for carrying out extrinsic evaluation in NLP? Name some common task success measures when evaluating with participants.

A

Reasons for extrinsic evaluation include ensuring ethical and safe deployment, assessing levels of trust and UX quality, and ensuring real-world applicability. Common success measures include user satisfaction scores, task outcome, and participant behaviour during the interaction with the model.

73
Q

Explain the difference between the three types of bias in NLP systems: data bias, model bias, and evaluation bias.

A

Data bias is embedded in the training data, Model bias is developed or amplified during learning, and Evaluation bias is usually induced by human annotators or biased evaluation metrics.

74
Q

What is a procedure used to tackle data issues in NLP datasets? Hint: this is something that comes included with the training dataset.

A

A Datasheet attached to a dataset outlines its motivation, composition, and collection process.

75
Q

Data labelling, when done by human annotators, can induce bias. Explain how this might occur and name a metric used to quantify the level of agreement between annotators. What is the threshold value for this metric above which the annotation counts as agreed upon?

A

Humans may disagree about what is considered “harmful”, for example; if a group of annotators is under-sensitive, the data will end up containing this preference/bias. Cohen’s Kappa is a statistical measure which quantifies the agreement level between two annotators while correcting for chance agreement. A kappa >= 0.44 is considered acceptable.
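A minimal sketch of Cohen’s kappa for two annotators labelling the same items:

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        n = len(labels_a)
        p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        # chance agreement: both annotators independently pick the same label
        p_expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
        return (p_observed - p_expected) / (1 - p_expected)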

76
Q

What is the main concern with regard to sustainability when developing LLMs? What is an unusual technique used to combat growing computing-power requirements during model training?

A

Developing LLMs requires a lot of power and resources, so it can be harmful to the environment and very costly. A popular technique is the student-teacher training architecture, where a smaller model (the student) distils knowledge from the larger model (the teacher) during training.

77
Q

What is interpretability and why is it important? Give examples of interpretable models.

A

Interpretability is defined as the degree to which a human can consistently predict a model’s output/decision process. Interpretability leads to increased model transparency. KNNs and Decision Trees are known for their interpretability.

78
Q

Black-box models can be hard to interpret. Name a typical example of a black box model, and name a common post-hoc technique to improve black-box model interpretability.

A

Neural networks and ensemble models are typically hard to interpret. Local interpretation techniques such as local surrogate models (LIME) can help by approximating the black-box model’s behaviour around an individual prediction with an interpretable model, for example approximating a Random Forest locally with a Decision Tree (known for its interpretability).

79
Q

What is an advantage of using attribution techniques when calculating the interpretability of a model?

A

Attribution techniques allow us to investigate which aspects of the input context lead to a specific output, often achieved by weighting the influence of each input token on the output.