Natural Language Processing Flashcards
What is NLP? What does a typical workflow look like?
NLP
- NLP stands for Natural Language Processing, which is defined as the application of computational techniques to the analysis and synthesis of natural language and speech.
- In ML the data source is usually called a dataset, but in NLP we usually talk about a corpus.
Workflow
Whatever data you have, most NLP problems can be approached with a methodical workflow consisting of a sequence of steps:
- We usually start with a collection of documents
- We preprocess those documents such that we can do exploratory data analysis
- We represent relevant features in some usable vector space
- We apply a given model
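The four steps above can be sketched end to end with only the Python standard library. This is a minimal illustration, not a recommended pipeline: the three-document corpus, the helper names (preprocess, vectorize, cosine), and the choice of cosine similarity as the "model" are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# 1. Start with a collection of documents (a toy corpus, invented for this sketch).
corpus = [
    "The cat sat on the mat.",
    "A cat and a dog sat together.",
    "Stock prices fell sharply today.",
]

# 2. Preprocess: lowercase and keep only alphabetic tokens.
def preprocess(text):
    return re.findall(r"[a-z]+", text.lower())

tokens = [preprocess(doc) for doc in corpus]

# 3. Represent each document as a vector of word counts over a shared vocabulary.
vocabulary = sorted({word for doc in tokens for word in doc})

def vectorize(doc_tokens):
    counts = Counter(doc_tokens)
    return [counts[word] for word in vocabulary]

vectors = [vectorize(doc) for doc in tokens]

# 4. Apply a model: here, simply cosine similarity between document vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim_cat_docs = cosine(vectors[0], vectors[1])   # two documents about cats
sim_unrelated = cosine(vectors[0], vectors[2])  # documents sharing no words
```

With this corpus the two documents that mention cats score higher than the unrelated pair, which share no words at all.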
What challenges are faced when working with NLP?
The main challenge of NLP is the understanding and modelling of elements within a variable context. In language, words are unique but can have different meanings depending on the context in which they are being evaluated.
Challenges that arise:
- Ambiguity: E.g. Did you see her dress? (getting dressed or talking about the clothes)
- Synonymy: we can express the same idea with different terms.
- Syntax: Another peculiarity of natural language is its structure.
- Coreference: We refer to concepts mentioned earlier in the conversation without using the exact phrase/name again.
- Normalization vs information: Depending on the task, we may lowercase all words or convert plural terms into singular ones, unless we need to treat dog and dogs as different. Normalizing simplifies the data but discards information.
- Representation: It is easier to process data when it has continuous features, since we can obtain a number that is close to the value we want, within a certain error. But we can’t approximate the word ‘tree’ with a certain error.
- Style: The same idea can be expressed in different styles depending on the speaker’s personality or intention in a specific scenario. Sarcasm and irony add further difficulty.
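The representation challenge can be made concrete with one-hot vectors, a common baseline encoding for words. The sketch below uses an invented four-word vocabulary and hypothetical helper names. It shows that every pair of distinct words ends up equally far apart, so the encoding has no notion of one word being "close to" another within some error.

```python
import math

# Hypothetical toy vocabulary, chosen only for illustration.
vocabulary = ["tree", "bush", "car", "truck"]

def one_hot(word):
    # A vector with a single 1 at the word's index in the vocabulary.
    return [1 if word == entry else 0 for entry in vocabulary]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# 'tree' is intuitively closer to 'bush' than to 'truck',
# but the one-hot encoding cannot express that.
d_tree_bush = euclidean(one_hot("tree"), one_hot("bush"))
d_tree_truck = euclidean(one_hot("tree"), one_hot("truck"))
```

Both distances come out identical, which is one motivation for learning dense vector representations of words instead.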
What is the process to clean textual data?
- To clean a text efficiently we need to think schematically
- Start with easy stuff and clean it out
- Review the text, find other stuff you don’t want to have
- Clean it and keep going till the text becomes human readable…almost
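The iterative process above can be sketched as a chain of small regular-expression passes, starting with the easy cases. The specific rules (HTML tags, URLs, stray symbols) and the sample string are assumptions for illustration; a real corpus needs its own rules, found by reviewing the text after each pass.

```python
import re

def clean_text(text):
    # Pass 1: the easy stuff, such as HTML tags and URLs.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    # Pass 2: after reviewing the output, drop characters that are not
    # letters, digits, whitespace, or basic punctuation.
    text = re.sub(r"[^A-Za-z0-9\s.,!?']", " ", text)
    # Pass 3: collapse the whitespace left behind by the earlier passes.
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample input.
raw = "<p>Hello,   world!</p> Read more at https://example.com/page ~~~"
cleaned = clean_text(raw)
```

Here cleaned is "Hello, world! Read more at": human readable, almost, which is the cue to review the text again and add the next rule.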
Describe the following NLP techniques:
- Bag of words
- Keyness
- Lexical dispersion
Bag of words
Let’s assume we have a corpus, a collection of n documents. Let’s now treat each and every document as a collection of individual words, ignoring their order. Bag of words represents each document by how often each word occurs, which makes it easy to find the most frequently used words.
Useful because:
- This is the simplest representation
- It’s inexpensive
- It gives interesting initial results
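A minimal sketch of a bag of words, using only the standard library. The three-document corpus and the function name bag_of_words are invented for the example.

```python
import re
from collections import Counter

# Toy corpus of n = 3 documents, invented for this sketch.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

def bag_of_words(document):
    # Word order is discarded; only the count of each word is kept.
    return Counter(re.findall(r"[a-z]+", document.lower()))

bags = [bag_of_words(doc) for doc in corpus]

# Summing the per-document bags gives corpus-wide frequencies.
corpus_counts = sum(bags, Counter())
top_words = corpus_counts.most_common(2)
```

The two most frequent words here are "the" (4 occurrences) and "cat" (2). That a function word tops the list is a typical first result, and a reason stop words are often removed before counting.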
Keyness
A measure associated with features that occur differentially across different categories. In other words, keyness gives the distinguishing features of the corpora.
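The notes do not say which statistic is used for keyness. One common choice is a chi-square score on a 2x2 contingency table (the word vs. all other words, in a target corpus vs. a reference corpus), so the sketch below assumes that. The function name, the sign convention, and the two tiny corpora are made up for illustration.

```python
from collections import Counter

def keyness_chi2(word, target_counts, reference_counts):
    a = target_counts[word]                  # the word, in the target corpus
    b = reference_counts[word]               # the word, in the reference corpus
    c = sum(target_counts.values()) - a      # all other words, target
    d = sum(reference_counts.values()) - b   # all other words, reference
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Sign the score: positive if the word is relatively more frequent
    # in the target corpus, negative if it favours the reference corpus.
    return chi2 if a * d >= b * c else -chi2

# Made-up corpora: sports text as target, politics text as reference.
sports = Counter("goal match goal team goal referee".split())
politics = Counter("vote team election vote party minister".split())

goal_score = keyness_chi2("goal", sports, politics)
vote_score = keyness_chi2("vote", sports, politics)
team_score = keyness_chi2("team", sports, politics)
```

In this toy data "goal" gets a positive score (distinctive of the target), "vote" a negative one (distinctive of the reference), and "team", which is equally frequent in both, scores zero.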
Lexical dispersion
An informative measure of where in the text a term has been used. One way to visualize it is with «x-ray» plots, which mark each occurrence of the term along the length of the text.
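A rough sketch of the idea behind such a plot, using a text-only stand-in instead of a real chart. The function names, the sample sentence, and the character-cell rendering are assumptions for illustration.

```python
def dispersion_offsets(tokens, term):
    # Token positions at which the term occurs.
    return [i for i, token in enumerate(tokens) if token == term]

def xray_line(tokens, term, width=40):
    # Text-only stand-in for an x-ray plot: the text is squeezed into
    # `width` cells, and a cell is marked '|' if the term occurs in it.
    cells = ["."] * width
    if not tokens:
        return "".join(cells)
    for i in dispersion_offsets(tokens, term):
        cells[i * width // len(tokens)] = "|"
    return "".join(cells)

# Made-up 20-token text.
tokens = ("the whale swam and the ship followed the whale until "
          "the storm came and the crew lost the whale again").split()

offsets = dispersion_offsets(tokens, "whale")
line = xray_line(tokens, "whale", width=10)
```

Here offsets is [1, 8, 18] and line is "|...|....|": the term appears near the start, around the middle, and near the end of the text.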
In what ways can we carry out text normalization?
Stemming
- Reduces the morphological variants of a word to a common root/base form (the stem). It’s a crude way of categorizing words.
- It just chops off letters from the end until the stem is reached.
Lemmatization
- This process looks beyond the truncation of a word and considers the language vocabulary instead
- For example, the lemma of «was» is «be»
- Usually much more informative
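The difference between the two approaches can be sketched with a deliberately naive stemmer and a lookup-table lemmatizer. The suffix list, the lemma table, and both function names are invented for this example; real stemmers (such as Porter's) and real lemmatizers use far richer rules and vocabularies.

```python
# Toy suffix list and lemma table, invented for this sketch.
SUFFIXES = ["ing", "ed", "es", "s"]
LEMMAS = {"was": "be", "were": "be", "is": "be", "mice": "mouse"}

def crude_stem(word):
    # Chop the first matching suffix off the end, keeping at least 3 letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    # Consult the vocabulary; fall back to the word itself if it is unknown.
    return LEMMAS.get(word, word)

stems = [crude_stem(w) for w in ["jumping", "jumped", "dogs", "was", "mice"]]
lemmas = [lookup_lemma(w) for w in ["was", "mice", "jumping"]]
```

The stemmer maps "jumping" and "jumped" to "jump" but can do nothing with "was" or "mice", while the lookup returns "be" and "mouse" for them. The lookup in turn leaves "jumping" unchanged, because the toy table does not contain it.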
What is syntactic parsing?
It’s a way to analyze the structure of a given text. Syntactic parsing analyzes the words in a sentence according to the grammar and arranges them in a manner that shows the relationships among the words.
Different ways to achieve syntactic parsing:
- Part-Of-Speech Tagging (POS)
- Dependency Parsing
- Named Entity Recognition (NER)
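To show what the output of POS tagging and dependency parsing looks like, the sketch below stores a parse of one sentence as plain Python data. The tag and relation names follow Universal Dependencies conventions, but the annotation was written by hand for illustration and is not the output of any parser. The Token class and the dependents helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Token:
    index: int   # 1-based position in the sentence
    text: str
    pos: str     # part-of-speech tag
    head: int    # index of the syntactic head (0 = root of the sentence)
    rel: str     # dependency relation to the head

# Hand-annotated parse of "The cat chased a mouse" (not parser output).
sentence = [
    Token(1, "The", "DET", 2, "det"),
    Token(2, "cat", "NOUN", 3, "nsubj"),
    Token(3, "chased", "VERB", 0, "root"),
    Token(4, "a", "DET", 5, "det"),
    Token(5, "mouse", "NOUN", 3, "obj"),
]

def dependents(tokens, head_index):
    # All tokens whose syntactic head is the given token.
    return [t.text for t in tokens if t.head == head_index]

pos_tags = [(t.text, t.pos) for t in sentence]
verb_arguments = dependents(sentence, 3)
```

The POS tags label each word in isolation, while the head and relation fields encode the relationships: the verb "chased" governs "cat" (its subject) and "mouse" (its object).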
How can topics be modeled and how can one find the optimal number of topics for a given corpus?
Topic modeling
Topic modeling is the detection and recognition of patterns, topics and arguments in a corpus.
One commonly used model is LDA (Latent Dirichlet Allocation), a generative statistical model that explains observations (the words in each document) through unobserved characteristics (the topics).
Optimal topics
- Detecting the optimal model by computing incremental power (perplexity index)
- NOT RIGOROUS
- Defining a chi-square test
- RIGOROUS
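Perplexity is the exponential of the negative average log-likelihood per token on held-out text, and lower values are better. The sketch below computes it from a list of per-token log-probabilities. How those are obtained from a fitted topic model depends on the library, so the input here is a made-up sanity check, not model output.

```python
import math

def perplexity(token_log_probs):
    # exp of the negative mean log-likelihood per token (natural log).
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Sanity check with made-up inputs: a model that gives every held-out token
# probability 1/4 is exactly as "perplexed" as a fair four-way choice.
uniform = [math.log(0.25)] * 8
uniform_perplexity = perplexity(uniform)
```

To compare candidate numbers of topics, one would fit a model for each candidate, compute the held-out perplexity of each, and look for the point where adding topics stops improving it. This reading of the curve is a judgment call, which is consistent with the notes calling the method not rigorous.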