Natural Language Processing Flashcards

Question 1

Q

What is NLP? What does a typical workflow look like?

Answer

A

NLP

NLP stands for Natural Language Processing, which is defined as the application of computational techniques to the analysis and synthesis of natural language and speech.
In ML the source data is known as a dataset, but in NLP we usually talk about a corpus.

Workflow

Whatever data you have, most NLP problems can be approached with a methodical workflow
made up of a sequence of steps.

We usually start with a collection of documents
We preprocess those documents such that we can do exploratory data analysis
We represent relevant features in some usable vector space
We apply a given model
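
The four steps above can be sketched end to end in a few lines. This is only a minimal illustration using the standard library: the toy corpus, the `preprocess` helper and the dot-product "model" are assumptions made for the example, not a prescribed pipeline.

```python
import re
from collections import Counter

# Step 1: a collection of documents (hypothetical toy corpus).
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]

def preprocess(doc):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", doc.lower())

# Step 2: preprocess every document.
tokenized = [preprocess(doc) for doc in corpus]

# Step 3: represent each document as a count vector over a shared vocabulary.
vocabulary = sorted({tok for doc in tokenized for tok in doc})
vectors = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

# Step 4: apply a model. Here a stand-in: the dot product as a crude overlap score.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

similarity = dot(vectors[0], vectors[1])
```

In practice each step would be replaced by something more capable, but the order of the steps stays the same.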

Question 2

Q

What challenges are faced when working with NLP?

Answer

A

The main challenge of NLP is the understanding and modelling of elements within a variable context. In language, words are unique but can have different meanings depending on the context in which they are being evaluated.

Challenges that arise:

Ambiguity: e.g. «Did you see her dress?» (the act of getting dressed, or the garment).
Synonymy: we can express the same idea with different terms.
Syntax: another peculiarity of natural language is its structure.
Coreference: we refer to concepts mentioned earlier in the conversation without using the exact phrase/name again.
Normalization vs information: depending on the task, we may lowercase all words or convert plural terms into singular ones, unless we want to treat dog and dogs as different terms.
Representation: it is easier to process data when it has continuous features, since we can obtain a number that is close to the value we want, within a certain error. But we can’t approximate the word ‘tree’ within a certain error.
Style: the same idea can be expressed in different styles depending on personality or on the intention in a specific scenario. Sarcasm and irony add further difficulty.
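
The representation challenge can be made concrete with one-hot encoding, one simple way of turning words into vectors. The sketch below assumes a hypothetical three-word vocabulary and shows that every pair of distinct words ends up equally far apart, so there is no notion of a word being "close to" ‘tree’.

```python
# Hypothetical three-word vocabulary, chosen only for illustration.
vocabulary = ["tree", "bush", "car"]

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary position."""
    return [1 if word == term else 0 for term in vocabulary]

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# 'bush' is semantically nearer to 'tree' than 'car' is,
# yet the one-hot distances cannot tell them apart.
d_tree_bush = squared_distance(one_hot("tree"), one_hot("bush"))
d_tree_car = squared_distance(one_hot("tree"), one_hot("car"))
```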

Question 3

Q

What is the process to clean textual data?

Answer

A

To clean a text efficiently we need to think schematically:
Start with easy stuff and clean it out
Review the text, find other stuff you don’t want to have
Clean it and keep going till the text becomes human readable…almost
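
One possible cleaning pass, written as a sequence of regular-expression substitutions that go from the easy cases to the leftovers. The specific patterns (HTML tags, URLs, stray symbols) are assumptions about what the "stuff you don’t want" might be; a real corpus needs its own list, found by reviewing the text.

```python
import re

def clean_text(raw):
    """Apply cleaning rules one at a time, easiest first."""
    text = re.sub(r"<[^>]+>", " ", raw)               # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)         # drop URLs
    text = re.sub(r"[^A-Za-z0-9.,!?' ]+", " ", text)  # drop leftover symbols
    text = re.sub(r"\s+", " ", text)                  # collapse whitespace
    return text.strip()

# Hypothetical noisy input.
sample = "<p>Great   read!! See https://example.com ###</p>"
cleaned = clean_text(sample)
```

After each pass it helps to print a few cleaned samples, spot what is still unwanted, and add another rule.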

Question 4

Q

Describe the following NLP techniques:

Bag of words
Keyness
Lexical dispersion

Answer

A

Bag of words

Let’s assume we have a corpus, a collection of n documents. Let’s now treat each document as an unordered collection of individual words. Bag of words counts how often each word occurs, which lets us analyze the most frequently used words.

Useful because:

This is the simplest representation
It’s inexpensive
It gives interesting initial results
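
A minimal bag-of-words sketch using only the standard library. The two documents are made up, and whitespace splitting stands in for proper tokenization.

```python
from collections import Counter

# Hypothetical two-document corpus.
documents = ["the cat sat on the mat", "the dog sat"]

# One bag (word -> count) per document; word order is discarded.
bags = [Counter(doc.split()) for doc in documents]

# Summing the bags gives corpus-wide frequencies.
corpus_counts = sum(bags, Counter())
top_two = corpus_counts.most_common(2)
```

Libraries such as scikit-learn offer the same idea at scale, but the underlying representation is just these counts.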

Keyness

A measure associated with features that occur differentially across different categories. In other words, keyness gives the distinguishing features of the corpora.
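
Keyness can be scored in several ways. The sketch below assumes the chi-square statistic on a 2x2 table (term vs all other tokens, target corpus vs reference corpus), which is one common choice rather than the only one. The function name and the counts are hypothetical.

```python
def keyness_chi2(target_count, target_total, reference_count, reference_total):
    """Chi-square statistic for a term's frequency in target vs reference corpus."""
    a = target_count                         # term occurrences in target
    b = reference_count                      # term occurrences in reference
    c = target_total - target_count          # other tokens in target
    d = reference_total - reference_count    # other tokens in reference
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denominator

# Hypothetical counts: 30 hits in 1000 target tokens vs 10 in 1000 reference tokens.
score = keyness_chi2(30, 1000, 10, 1000)

# A term used equally often in both corpora is not distinctive at all.
neutral = keyness_chi2(10, 1000, 10, 1000)
```

The larger the score, the more strongly the term distinguishes one corpus from the other.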

Lexical dispersion

A measure that shows where in the text a term has been used. One way to visualize it is with «x-ray» plots.
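
A rough text-only version of the idea: collect the token offsets at which a term occurs, then mark them on a fixed-width line. The helper names and the sample sentence are assumptions for illustration; plotting libraries draw the same information graphically.

```python
def dispersion_offsets(tokens, term):
    """Return the token positions at which the term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def xray_line(tokens, term, width=20):
    """Draw a one-line plot: '|' where the term occurs, '.' elsewhere."""
    cells = ["."] * width
    if not tokens:
        return "".join(cells)
    for i in dispersion_offsets(tokens, term):
        # Scale the token offset to the width of the line.
        cells[i * width // len(tokens)] = "|"
    return "".join(cells)

# Hypothetical text.
tokens = "the whale swam and the whale dived then the ship sailed".split()
offsets = dispersion_offsets(tokens, "whale")
line = xray_line(tokens, "whale", width=11)
```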

Question 5

Q

In what ways can we carry out text normalization?

Answer

A

Stemming

Reduces the morphological variants of a word to a common root/base form (the stem). It’s a crude way of categorizing words.
It just chops off letters from the end until the stem is reached, so the result is not always a real word.
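
A deliberately naive suffix-stripping stemmer, to show what "chopping off letters" means. The suffix list and the minimum stem length are assumptions made for this toy. Established stemmers such as Porter's use a much richer rule set.

```python
# Hypothetical, deliberately small suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["jumping", "jumped", "jumps", "studies", "was"]]
```

The stem of "studies" comes out as "studi", which illustrates why stemming is called crude: the output groups related words but need not be a dictionary word.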

Lemmatization

This process looks beyond the truncation of a word and considers the language vocabulary instead
For example, the lemma of «was» is «be»
Usually much more informative
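
Lemmatization can be pictured as a vocabulary lookup rather than a truncation. The sketch below uses a tiny hand-written table, which is purely an assumption for the example. Real lemmatizers rely on a full lexicon and usually on the word's part of speech.

```python
# Hypothetical miniature lemma table; a real one covers the whole vocabulary.
LEMMA_TABLE = {
    "was": "be",
    "were": "be",
    "is": "be",
    "mice": "mouse",
    "better": "good",
}

def lemmatize(word):
    """Look the word up in the table; fall back to the lowercased word itself."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

lemmas = [lemmatize(w) for w in ["Was", "mice", "better", "tree"]]
```

No amount of suffix chopping turns «was» into «be», which is why the lookup is more informative.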

Question 6

Q

What is syntactic parsing?

Answer

A

It’s a way to analyze the structure of a given text. Syntactic parsing analyzes the words in a sentence according to the grammar and arranges them in a way that shows the relationships among the words.

Different ways to achieve syntactic parsing:

Part-Of-Speech Tagging (POS)
Dependency Parsing
Named Entity Recognition (NER)
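
To give a feel for the first item, here is a toy part-of-speech tagger based on a lexicon lookup. The lexicon, the tag names and the default-to-noun fallback are all assumptions for illustration. Practical taggers are trained on annotated corpora and use context to resolve ambiguous words.

```python
# Hypothetical miniature lexicon mapping words to coarse POS tags.
LEXICON = {
    "the": "DET",
    "a": "DET",
    "dog": "NOUN",
    "cat": "NOUN",
    "chased": "VERB",
    "sees": "VERB",
    "quickly": "ADV",
}

def tag(sentence):
    """Tag each whitespace-separated token; unknown words default to NOUN."""
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in sentence.split()]

tagged = tag("The dog chased a cat")
```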

Question 7

Q

How can topics be modeled and how can one find the optimal number of topics for a given corpus?

Answer

A

Topic modeling

Topic modeling is the detection and recognition of patterns, topics and arguments in a corpus.

One commonly used model is LDA (Latent Dirichlet Allocation), a generative statistical model that explains observations through unobserved (latent) characteristics.

Optimal topics

Detecting the optimal model by computing its incremental power (perplexity index)
- NOT RIGOROUS
Defining a chi-square test
- RIGOROUS
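
For reference, perplexity is the exponential of the negative average log-likelihood per token, usually measured on held-out text. Lower values mean the model is less "surprised". The sketch below assumes the per-token probabilities have already been obtained from some fitted model; the numbers are made up.

```python
import math

def perplexity(token_probabilities):
    """exp(-(1/N) * sum(log p_i)) over the N held-out tokens."""
    log_likelihood = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_likelihood / len(token_probabilities))

# Hypothetical model that assigns probability 0.25 to every held-out token:
# it is as uncertain as a uniform choice among 4 words.
uniform_score = perplexity([0.25] * 8)

# Hypothetical model that is more confident on the same tokens.
sharper_score = perplexity([0.5] * 8)
```

When comparing topic counts, one would fit a model per candidate number of topics and watch how much the held-out perplexity improves with each increment.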