Word-level processing Flashcards

1
Q

intro

What are 5 applications of NLP?

A
  1. Machine translation
  2. Information retrieval
  3. Sentiment Analysis
  4. Information Extraction
  5. Question Answering
3
Q

What are the 8 levels of the classical NLP pipeline?

A
  1. Tokenization
  2. Sentence splitting
  3. Part-of-speech tagging
  4. Morphological analysis
  5. Named entity recognition
  6. Syntactic parsing
  7. Coreference resolution
  8. Other annotators
4
Q

What is the symbolic way to build a question-answering system? What are the pros and cons?

A

Pros
* Transparency: any prediction is grounded in a rule or dictionary entry
* Generalization by default, thanks to the recursion of rules
Cons
* Creating rules is labor-intensive
* Systems generalize only within their own scope

6
Q

What is ELIZA?

A

An NLP system designed by Joseph Weizenbaum in 1966. Its goal is to simulate a psychotherapist. It is responsive (it essentially asks questions back at the user), pattern-matching the input to generate a substitution-based output.
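A minimal sketch of this pattern-matching-plus-substitution idea (the two rules below are invented for illustration; ELIZA's actual keyword-ranked decomposition rules were far richer):

```python
import re

# Illustrative ELIZA-style rules: a regex pattern paired with a
# substitution template that reuses the captured text.
RULES = [
    (re.compile(r"I am (.*)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"I feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
]

def respond(utterance):
    """Return the first matching substitution-based response, else a fallback."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."
```

Note how the system never needs to "understand" the input: the captured text is simply echoed back inside a question template.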

7
Q

What is the statistical/machine-learning way to build an answer-generating system? Name the pros and cons.

A

Pros:
* Interpretability, as the statistics reflect whatever data we processed
* Generalization by default, thanks to the grounding in a symbolic format
* Makes use of large annotated corpora
Cons:
* A reliance on handcrafted features
* Often makes too many independence assumptions to be robust
* Predictions are not always spot-on

8
Q

Give one example of statistical/machine-learning NLP

A

Autocorrect

9
Q

What is the neural way to create a question-answering NLP system? Give the pros and cons.

A

Pros:
* Can model statistical dependence
* Little to no feature engineering required
* Makes use of large corpora
* Very successful in a wide array of typical NLP tasks
Cons:
* Very limited transparency
* Limited theoretical insights
* Need to rediscover the features/knowledge encoded in the network, if at all possible

10
Q

What are the levels of linguistic analysis in the classical NLP pipeline?

A
  1. Morphology
  2. Syntax
  3. Lexical semantics
  4. Compositional semantics
  5. Pragmatics
11
Q

What is morphology?

A

Tokenization, lemmatization

12
Q

What is syntax?

A

Part-of-speech tagging, grammars, and parsing

13
Q

What is lexical semantics?

A

Logical forms, word embeddings

14
Q

What is compositional semantics?

A

Sentence embeddings, natural language inference

15
Q

What is pragmatics?

A

Question answering, dialogue modelling

16
Q

What is the motivation for word-level processing?

A

Preprocessing: before we can do meaningful work, we need to preprocess the input data into text.

17
Q

What is segmentation in word-level processing?

A

Splitting a document into a list of sentences.

18
Q

What is lemmatization?

A

Mapping words to their root, so that words with the same root are recognized as such:
cars, car, car's -> car

19
Q

What is stemming?

A

Reducing words to their textual stem by removing affixes:
low, lower, lowest -> low

20
Q

What is a Porter stemmer?

A

A rule-based stemmer that repeatedly applies a set of suffix-rewriting rules
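A toy stemmer in this spirit, assuming a single ordered pass over suffix rules (the rules below are simplified illustrations, not Porter's actual rule sets with their measure conditions):

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# These are illustrative rules, far fewer than the real Porter stemmer's.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Strip or rewrite the first matching suffix; leave the word otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

The rule ordering matters: "caresses" must hit the "sses" rule before the bare "s" rule, which is why more specific suffixes come first.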

21
Q

Why will normalization not work the same way across different languages?

A

Normalization may not work in the same way for different languages, since they have different morphology.

22
Q

What is the goal of byte-pair encoding?

A

Automatically gather a fixed-size, frequency-based vocabulary

23
Q

What is the process of byte-pair encoding?

A

Method: after pretokenizing, start with a vocabulary of all characters:
1. Choose the most frequent token pair and add it to the vocabulary
2. Retokenize the words with the new vocabulary
3. Repeat until the desired vocabulary size is reached
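The steps above can be sketched as follows (the toy corpus and the number of merges are made up for illustration; each word is a tuple of symbols with a frequency):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs over a corpus {word_tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    """Retokenize every word, fusing each occurrence of `pair` into one symbol."""
    merged = pair[0] + pair[1]
    new_corpus = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

# Toy corpus after pretokenization: start with a vocabulary of all characters.
corpus = {tuple("lower"): 2, tuple("low"): 5, tuple("newest"): 6}
vocab = {c for word in corpus for c in word}
for _ in range(3):  # repeat until the desired vocabulary size is reached
    pair = most_frequent_pair(corpus)
    vocab.add(pair[0] + pair[1])
    corpus = merge_pair(corpus, pair)
```

On this corpus the first merge is ("w", "e"), since "we" occurs in "lower" (2×) and "newest" (6×); each merge adds exactly one new symbol to the vocabulary.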

24
Q

How do we perform tokenization once a vocabulary is learned?

A

We incrementally try taking the longest subword that is in the vocabulary
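A minimal sketch of this greedy longest-match-first strategy (the "[UNK]" fallback for out-of-vocabulary characters is an assumption here, in the style of WordPiece):

```python
def tokenize(word, vocab):
    """Repeatedly take the longest prefix of the remaining word found in vocab."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append("[UNK]")  # no subword matches; emit an unknown marker
            i += 1
    return tokens
```

Because single characters are normally in a BPE vocabulary, the loop can always make progress; the "[UNK]" branch only fires for characters never seen in training.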

25
Q

What are the three paradigms of NLP?

A

Approaches in NLP are rule-based, statistical, and neural.

26
Q

What is word-level processing?

A

often the first step in the NLP pipeline

27
Q

What is tokenization?

A

the task of splitting text into sensible subunits

28
Q

Rule-based vs. statistical tokenization?

A

tokenization with regular expressions, or using BPE/WordPiece
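A sketch of the rule-based option: a single regular expression whose alternatives cover contractions, plain words, and punctuation (the pattern below is a simplistic illustration, not a production-grade tokenizer):

```python
import re

# Alternatives, tried left to right at each position:
#   \w+'\w+   words with an internal apostrophe, e.g. "Don't"
#   \w+       plain words and numbers
#   [^\w\s]   any single punctuation character
TOKEN_RE = re.compile(r"\w+'\w+|\w+|[^\w\s]")

def tokenize(text):
    """Return all non-overlapping matches, in order, as the token list."""
    return TOKEN_RE.findall(text)
```

The statistical alternative (BPE/WordPiece) replaces such hand-written patterns with a vocabulary learned from corpus frequencies.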