Word-level processing Flashcards

1
Q

intro

What are 5 applications of NLP?

A
  1. Machine translation
  2. Information retrieval
  3. Sentiment Analysis
  4. Information Extraction
  5. Question Answering
3
Q

What are the 8 levels of the classical NLP pipeline?

A
  1. Tokenization
  2. Sentence splitting
  3. Part-of-speech tagging
  4. Morphological analysis
  5. Named entity recognition
  6. Syntactic parsing
  7. Coreference resolution
  8. Other annotators
4
Q

What is the symbolic way to build a question-answering system? What are the pros and cons?

A

Pros
* Transparency: any prediction is grounded in a rule or dictionary entry
* Generalization by default, thanks to the recursion of rules
Cons
* Creating rules is labor-intensive
* Systems generalize only within their own scope

6
Q

What is ELIZA?

A

An NLP system designed by Joseph Weizenbaum in 1966. Its goal is to simulate a psychotherapist. It is responsive (it essentially asks questions back at the user), pattern-matching the input to generate a substitution-based output.
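A minimal sketch of this pattern-matching-plus-substitution idea (the two rules below are invented for illustration; ELIZA's actual keyword-ranked decomposition rules were far richer):

```python
import re

# Illustrative ELIZA-style rules: a regex pattern paired with a
# substitution template that reuses the captured text.
RULES = [
    (re.compile(r"I am (.*)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"I feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
]

def respond(utterance):
    """Return the first matching substitution-based response, else a fallback."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."
```

Note how the system never needs to "understand" the input: the captured text is simply echoed back inside a question template.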

7
Q

What is the statistical/machine-learning way to build an answer-generating system? Name the pros and cons.

A

Pros:
* Interpretability, as the statistics reflect whatever data we processed
* Generalization by default, thanks to the grounding in a symbolic format
* Makes use of large annotated corpora
Cons:
* A reliance on handcrafted features
* Often makes too many independence assumptions to be robust
* Predictions are not always spot-on

8
Q

Give one example of statistical/machine-learning NLP

A

Autocorrect

9
Q

What is the neural way to create a question-answering NLP system? Give the pros and cons.

A

Pros:
* Can model statistical dependence
* Little to no feature engineering required
* Makes use of large corpora
* Very successful in a wide array of typical NLP tasks
Cons:
* Very limited transparency
* Limited theoretical insights
* Need to rediscover the features/knowledge encoded in the network, if at all possible

10
Q

What are the levels of linguistic analysis in the classical NLP pipeline?

A
  1. Morphology
  2. Syntax
  3. Lexical semantics
  4. Compositional semantics
  5. Pragmatics
11
Q

What is morphology?

A

Tokenization, lemmatization

12
Q

What is syntax?

A

Part-of-speech tagging, grammars, and parsing

13
Q

What is lexical semantics?

A

Logical forms, word embeddings

14
Q

What is compositional semantics?

A

Sentence embeddings, natural language inference

15
Q

What is pragmatics?

A

Question answering, dialogue modelling

16
Q

What is the motivation for word-level processing?

A

Preprocessing: before we can do meaningful work, we need to preprocess the input data into text.

17
Q

What is segmentation in word-level processing?

A

Splitting a document into a list of sentences.

18
Q

What is lemmatization?

A

Mapping words to their root, so that words with the same root are recognized as such:
cars, car, car's -> car

19
Q

What is stemming?

A

Reducing words to their textual stem by removing affixes:
low, lower, lowest -> low

20
Q

What is a Porter stemmer?

A

A rule-based stemmer that repeatedly applies a set of suffix-rewriting rules
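A toy stemmer in this spirit, assuming a single ordered pass over suffix rules (the rules below are simplified illustrations, not Porter's actual rule sets with their measure conditions):

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# These are illustrative rules, far fewer than the real Porter stemmer's.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Strip or rewrite the first matching suffix; leave the word otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

The rule ordering matters: "caresses" must hit the "sses" rule before the bare "s" rule, which is why more specific suffixes come first.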

21
Q

Why will normalization not work the same way across different languages?

A

Normalization may not work in the same way for different languages, since they have different morphology.

22
Q

What is the goal of byte-pair encoding?

A

Automatically gather a fixed-size, frequency-based vocabulary

23
Q

What is the process of byte-pair encoding?

A

Method: after pretokenizing, start with a vocabulary of all characters:
1. Choose the most frequent token pair and add it to the vocabulary
2. Retokenize the words with the new vocabulary
3. Repeat until the desired vocabulary size is reached
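The steps above can be sketched as follows (the toy corpus and the number of merges are made up for illustration; each word is a tuple of symbols with a frequency):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs over a corpus {word_tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    """Retokenize every word, fusing each occurrence of `pair` into one symbol."""
    merged = pair[0] + pair[1]
    new_corpus = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

# Toy corpus after pretokenization: start with a vocabulary of all characters.
corpus = {tuple("lower"): 2, tuple("low"): 5, tuple("newest"): 6}
vocab = {c for word in corpus for c in word}
for _ in range(3):  # repeat until the desired vocabulary size is reached
    pair = most_frequent_pair(corpus)
    vocab.add(pair[0] + pair[1])
    corpus = merge_pair(corpus, pair)
```

On this corpus the first merge is ("w", "e"), since "we" occurs in "lower" (2×) and "newest" (6×); each merge adds exactly one new symbol to the vocabulary.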

24
Q

How do we perform tokenization once a vocabulary is learned?

A

We incrementally try taking the longest subword that is in the vocabulary
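A minimal sketch of this greedy longest-match-first strategy (the "[UNK]" fallback for out-of-vocabulary characters is an assumption here, in the style of WordPiece):

```python
def tokenize(word, vocab):
    """Repeatedly take the longest prefix of the remaining word found in vocab."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append("[UNK]")  # no subword matches; emit an unknown marker
            i += 1
    return tokens
```

Because single characters are normally in a BPE vocabulary, the loop can always make progress; the "[UNK]" branch only fires for characters never seen in training.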

25
Q

What are the three paradigms of NLP?

A

Approaches in NLP are rule-based, statistical, and neural.

26
Q

What is word-level processing?

A

often the first step in the NLP pipeline

27
Q

What is tokenization?

A

the task of splitting text into sensible subunits

28
Q

Rule-based vs. statistical tokenization?

A

tokenization with regular expressions, or using BPE/WordPiece
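A sketch of the rule-based option: a single regular expression whose alternatives cover contractions, plain words, and punctuation (the pattern below is a simplistic illustration, not a production-grade tokenizer):

```python
import re

# Alternatives, tried left to right at each position:
#   \w+'\w+   words with an internal apostrophe, e.g. "Don't"
#   \w+       plain words and numbers
#   [^\w\s]   any single punctuation character
TOKEN_RE = re.compile(r"\w+'\w+|\w+|[^\w\s]")

def tokenize(text):
    """Return all non-overlapping matches, in order, as the token list."""
    return TOKEN_RE.findall(text)
```

The statistical alternative (BPE/WordPiece) replaces such hand-written patterns with a vocabulary learned from corpus frequencies.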