Word-level processing Flashcards
intro
What are 5 aspects of NLP
- Machine translation
- Information retrieval
- Sentiment Analysis
- Information Extraction
- Question Answering
What are the 8 levels of the classical NLP pipeline
- Tokenization
- Sentence splitting
- Part-of-speech tagging
- Morphological analysis
- Named entity recognition
- Syntactic parsing
- Coreference resolution
- Other annotators
Symbolic way to build a question-answering system. What are the pros and cons
Pros
* transparency, any prediction is grounded in a rule or dictionary entry
* generalization-by-default, thanks to recursion of rules
Cons
* creating rules is labor intensive
* systems generalize only within their own scope
What is Eliza
An NLP system designed by Joseph Weizenbaum in 1966. Its goal is to simulate a psychotherapist. It is responsive (it essentially asks questions back at the user): it pattern-matches the input to generate a substitution-based output.
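A minimal sketch of the pattern-match-and-substitute idea, to make the mechanism concrete. The rules, the reflection table and the fallback reply are hypothetical illustrations, not Weizenbaum's original script.

```python
import re

# Hypothetical rules: (input pattern, response template).
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (.+)", re.IGNORECASE), "Tell me more about your {0}."),
]

# Swap first and second person so the echoed fragment reads naturally.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

def reflect(fragment):
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in fragment.split())

def respond(utterance):
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Substitute the captured fragment into the response template.
            return template.format(reflect(match.group(1)))
    return "Please go on."  # fallback when no rule matches

print(respond("I am sad about my job."))
# Why do you say you are sad about your job?
```

The first matching rule wins, so rule order acts as a crude priority.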
What is a statistical/machine learning way to build an answer-generating machine. Name pros and cons
Pros:
* Interpretability, as the statistics reflect whatever data we processed
* Generalisation-by-default, thanks to the grounding in a symbolic format
* Makes use of large annotated corpora
Cons:
* A reliance on handcrafted features
* Often makes too many independence assumptions to be robust
* not always spot on
Give one example of statistics/machine learning NLP
autocorrect
What is the neural way to create a question-answering NLP system. Give pros and cons
Pros:
* Can model statistical dependence
* Little to no feature engineering required
* Makes use of large corpora
* Very successful in a wide array of typical NLP tasks
Cons
* very limited transparency
* limited theoretical insights
* need to rediscover features/knowledge encoded in the network, if at all
What is the classical NLP pipeline
- Morphology
- Syntax
- Lexical semantics
- Compositional semantics
- Pragmatics
What is morphology
Tokenization, lemmatization
What is syntax
part of speech tagging, grammars and parsing
What is lexical semantics
logical forms, word embedding
What is compositional semantics
sentence embeddings, natural language inference
what is pragmatics
Question answering, dialogue modelling
what is motivation of word-level processing
Preprocessing: before we can do meaningful work, we need to preprocess the input data into text.
what is segmentation in word-level processing
Splitting a document into a list of sentences.
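A rough rule-based sketch of sentence splitting. The function name and the splitting rule (sentence-final punctuation, then whitespace, then an uppercase letter) are assumptions for illustration.

```python
import re

def split_sentences(document):
    # Split on whitespace that follows ., ! or ? and precedes an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", document.strip())
    return [p for p in parts if p]

print(split_sentences("It rained. We stayed in! Did you go out?"))
# ['It rained.', 'We stayed in!', 'Did you go out?']
```

This rule wrongly splits after abbreviations ("Dr. Smith" becomes two pieces), which is one reason practical segmenters add abbreviation lists or learned models.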
what is lemmatization
mapping words to their root, so that words with the same root are recognized as such
cars,car,car’s -> car
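A toy illustration of lemmatization as dictionary lookup. The lexicon below is a hypothetical stand-in; real lemmatizers use a large lexicon plus morphological rules.

```python
# Hypothetical toy lexicon mapping inflected forms to their lemma.
LEMMAS = {"cars": "car", "car's": "car", "went": "go"}

def lemmatize(word):
    # Lowercase and normalise curly apostrophes before the lookup.
    key = word.lower().replace("\u2019", "'")
    return LEMMAS.get(key, key)

print([lemmatize(w) for w in ["cars", "car", "car\u2019s"]])
# ['car', 'car', 'car']
```

Because it is a lookup, it can also handle irregular forms such as "went" -> "go", which suffix stripping cannot.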
what is stemming
reducing words to their textual stem by removing affixes
low,lower,lowest -> low
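A naive suffix-stripping sketch that reproduces the example above. The suffix list and the minimum stem length are assumptions chosen for illustration, not a standard algorithm.

```python
# Hypothetical suffix list, tried in this order.
SUFFIXES = ("est", "er", "ing", "ed", "s")

def naive_stem(word, min_stem=3):
    for suffix in SUFFIXES:
        # Only strip if enough of the word remains to count as a stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["low", "lower", "lowest"]])
# ['low', 'low', 'low']
```

The result is a textual stem, not necessarily a word: naive_stem("running") gives "runn".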
what is a porter stemmer
a rule-based stemmer that repeatedly applies a set of rules
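To give a flavour of the rules, here is a sketch of only step 1a of the Porter algorithm (plural handling). The full stemmer has several further steps, many with conditions on the remaining stem, which are omitted here; the function name is an assumption.

```python
# Step 1a rewrite rules: the first (longest) matching suffix is applied.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def porter_step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print([porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]])
# ['caress', 'poni', 'caress', 'cat']
```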
why will normalization not work for different languages
Normalization may not work in the same way for different languages, since they have different morphology
what is the goal of byte-pair encoding
Automatically gather a fixed-size, frequency-based vocabulary
what is the process of byte-pair encoding
Method: after pretokenizing, start with a vocabulary of all characters:
1. Choose the most frequent pair of adjacent tokens and add the merged token to the vocabulary
2. Retokenize the words with the new vocabulary
3. Repeat until the desired vocabulary size is reached
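The three steps above can be sketched as follows. This is a minimal sketch, assuming pretokenization has already produced a word-to-count dictionary; the function name, the parameters and the tie-breaking rule are assumptions, and real implementations differ in details such as end-of-word markers.

```python
from collections import Counter

def learn_bpe(word_counts, vocab_size):
    # Each word starts out as a tuple of single-character tokens.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    vocab = {ch for tokens in corpus for ch in tokens}
    merges = []
    while len(vocab) < vocab_size:
        # Step 1: count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for tokens, count in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is a single token, nothing left to merge
        # Assumed tie-break: among equally frequent pairs, take the largest.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Step 2: retokenize every word using the new merged token.
        new_corpus = {}
        for tokens, count in corpus.items():
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
        # Step 3: the loop condition repeats until the target size is reached.
    return vocab, merges

vocab, merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, vocab_size=9)
print(merges)
# [('o', 'w'), ('l', 'ow')]
```

Starting from 7 characters, two merges reach the target size of 9 and add "ow" and "low" to the vocabulary.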
how do we perform tokenization
Greedily, from left to right: we repeatedly take the longest subword that is in the vocabulary
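A small sketch of this longest-match strategy for a single word. The function name, the example vocabulary and the "[UNK]" placeholder for unmatched characters are assumptions for illustration.

```python
def tokenize_word(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # Not even the single character is known: emit the placeholder.
            tokens.append(unk)
            start += 1
    return tokens

vocab = {"l", "o", "w", "e", "r", "s", "t", "ow", "low", "est"}
print(tokenize_word("lowest", vocab))
# ['low', 'est']
print(tokenize_word("lower", vocab))
# ['low', 'e', 'r']
```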
what are the three paradigms
approaches in NLP are rule-based, statistical and neural
What is word-level processing
often the first step in the NLP pipeline
What is tokenization
the task of splitting text into sensible subunits
rule-based vs. statistical
tokenization with regular expressions, or using BPE/WordPiece
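For the rule-based side, a minimal regular-expression tokenizer might look like this. The pattern is an assumption: it keeps word characters together, allows one internal apostrophe, and emits each punctuation mark as its own token.

```python
import re

# Words (optionally with an apostrophe part), or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Don't panic, it's 9am!"))
# ["Don't", 'panic', ',', "it's", '9am', '!']
```

The rules are written by hand, whereas BPE/WordPiece derive their subword units from corpus statistics.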