Levels of Analysis in NLP Flashcards

1
Q

What are the levels of language?

A

Words to Grammar to Meanings to Inferences. The NLP levels are basically the same.

2
Q

In NLP the levels are…?

A
Words = Lexical Analysis
Grammar = Syntactic analysis
Meanings = Semantic analysis
Inferences = Discourse/entailment analysis
  • Lexical analysis (list of all words in the vocabulary). Not trivial in conversational AI.
  • Syntactic analysis: Grammar: structure, parse sentences, tag POS. Assumes you’ve normalized the words. The parser is based on rules derived from linguistics and statistics.
  • Semantic analysis: Meanings: What does this verb mean compared to other verbs or nouns?
  • Discourse/entailment analysis: Inferences: Highest level of NLP where implied meanings are not explicitly spelled out.
  • 95% of data science projects that need NLP will be lexical, syntactic, or semantic.
  • The process is asymmetrical going back and forth between layers.
3
Q

What is lexical analysis?

A

-Words: We are dealing with words here. What counts as a word? Is it singular or plural?
-Text and corpus analytics:
Spell correction or OCR (scanning image to text): Lexical analysis can correct errors by inferring the most likely word that was meant, then either correcting automatically or suggesting other options.

Terminology extraction: Extract key terms from documents or collection of documents. Look at word frequency. We don’t have to look at grammar or syntax or semantics. Extract the most prominent terminology from a collection of documents.
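
A minimal sketch of frequency-based terminology extraction, assuming a tiny hand-written stopword list and a simple regex tokenizer (both are illustrative choices, not part of the original notes):

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}

def top_terms(documents, k=3):
    """Return the k most frequent non-stopword tokens across all documents."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(k)

docs = ["The parser tags the tokens", "A parser needs tokens", "The parser is fast"]
print(top_terms(docs, 2))
```

No grammar or semantics is involved here; the ranking relies on word counts alone.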

Lexical diversity: The size of the unique vocabulary compared to the total length of the text (e.g., a book). Useful for automating and scaling language education.
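
One common way to measure lexical diversity is the type-token ratio: unique words divided by total words. The sketch below assumes a simple regex tokenizer; the function name is made up for illustration.

```python
import re

def lexical_diversity(text):
    """Type-token ratio: unique tokens divided by total tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(lexical_diversity("the cat and the dog"))  # 4 unique words out of 5
```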

Respect the lexical level of analysis! Don’t skip over this!

  • Look at Peter Norvig’s classic post
  • Multi-Candidate Ranking Algorithm Based Spell Correction

Actual process (2 models)

Offline
Indexing Tokens
Build a language model
Build an error model: how many transformations does it take to turn one word into another (addition/deletion/replacement of single characters)?
Find the words that are 1 or 2 edits away and check whether they are real words

Query Time
Candidate generation
Scoring
Presenting suggestions
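
The offline and query-time steps above can be sketched roughly in the style of Peter Norvig's post. This is a toy version under stated assumptions: the corpus is a single made-up sentence, the language model is plain word counts, and transposition is included as an edit type (as in Norvig's post) even though the notes above list only addition, deletion, and replacement.

```python
import re
from collections import Counter

# Toy corpus standing in for the indexed tokens (an assumption for illustration).
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks"

# Offline: index tokens and build a unigram language model (word counts).
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away: deletion, transposition, replacement, insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two edits away."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def known(words):
    """Keep only the candidates that are real words in the indexed vocabulary."""
    return {w for w in words if w in WORD_COUNTS}

def suggestions(word, limit=3):
    """Query time: generate candidates, score by frequency, present the top few."""
    candidates = (known([word]) or known(edits1(word))
                  or known(edits2(word)) or {word})
    return sorted(candidates, key=lambda w: (-WORD_COUNTS[w], w))[:limit]

print(suggestions("teh"))
print(suggestions("dgo"))
```

Preferring closer edits before farther ones is a crude stand-in for a real error model, which would assign probabilities to individual edits.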

4
Q

What is syntactical analysis?

A

Grammar: Tags the parts of speech and the structure of the grammar.

  • Sentence by sentence
  • Dividing documents into sentences (can’t just use a period because of abbreviations).
  • If you can’t divide into sentences, you can’t really do grammar parsing.
  • POS tagging (noun, verb, preposition, determiner)
  • Penn Treebank Tagset from UPenn has tags for POS. Different kinds of adjectives, nouns, etc. Dozens of parts of speech.
  • Grammar parsing is hierarchical (parse trees)
  • What happens if you feed grammatically incorrect sentences into a grammar parser?
  • Lemmatization
  • Discrete text field analysis (scraping info off web pages). Use syntax to unitize and normalize data. Unitizing means breaking apart data points that have been jumbled into one field. Normalizing means using consistent nomenclature and units of measure (inches vs. cm). – Smart ETL.
  • The need to unitize comes up all the time because marketers want to improve SEO.
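
As noted above, a period alone is not a reliable sentence boundary because of abbreviations. A minimal sketch of one workaround, assuming a small hand-written abbreviation list (hypothetical and far from complete):

```python
# Hypothetical abbreviation list; a real splitter would need a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "inc.", "vs."}

def split_sentences(text):
    """Split on ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He sat down."))
```

A rule like this still fails when an abbreviation really does end a sentence, which is one reason trained sentence splitters exist.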
5
Q

What is discourse analysis?

A

Gets into the nuances of human conversation.

6
Q

What is a lexicon?

A

A machine readable dictionary.

7
Q

What is morphology?

A

The study of morphemes

8
Q

What are morphemes?

A

Units that our words are made of, e.g., Run is a morpheme and ing is a morpheme for the word running.

9
Q

What is a stemmer?

A

Stemming is reducing a word to its root. For the “running” example, the “run” portion of the word is the root, or stem. It is roughly the content portion of the word. A stemmer strips off the morphemes that are not the root, such as “ing” or “ed.” A word cloud is a good use case for stemming: stem first and count the stemmed versions so that you don’t get separate entries for moral, morals, and morality.
-If both toys and toy appear, we just want to count them together as toy.
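
A deliberately naive suffix-stripping sketch of the stem-then-count idea. The suffix list and function names are hypothetical; real stemmers such as Porter's apply many more rules (this one turns "running" into "runn", for example).

```python
from collections import Counter

# Hypothetical suffix list, checked in order; far from complete.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stemmed_counts(words):
    """Stem first, then count, so related forms share one entry."""
    return Counter(naive_stem(w.lower()) for w in words)

print(stemmed_counts(["toy", "toys", "Toys", "walked", "walking"]))
```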

10
Q

What are collocations?

A

Words commonly occurring together.
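
A minimal sketch of finding candidate collocations by counting adjacent word pairs (bigrams). Raw counts are only a starting point; association measures are often used on top of them, and those are not shown here.

```python
from collections import Counter

def top_bigrams(tokens, k=3):
    """Count adjacent token pairs and return the k most frequent."""
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(k)

tokens = "new york city and new york state".split()
print(top_bigrams(tokens, 1))
```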

11
Q

What is WordNet?

A

A lexical database of English. It enumerates word senses for disambiguation and can help identify that words have multiple meanings.

12
Q

What is Lemmatization?

A
  • A lemma is the form that represents a set of related word forms, e.g., for run, runs, ran, running, the lemma is “run”
  • The lemma is semantically the pure form of a word
  • “Better” is an exception, where the lemma is “good”
  • A lemmatizer has to do more than a stemmer
  • Uses lexical knowledge bases such as WordNet to obtain word base forms. This is the main difference between lemmatization and stemming.
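
A toy sketch of why a lemmatizer needs lexical knowledge. The lookup table below is a hypothetical stand-in for a resource such as WordNet, and the fallback rules are deliberately crude.

```python
# Hypothetical mini knowledge base for irregular forms.
IRREGULAR_LEMMAS = {"ran": "run", "better": "good", "mice": "mouse", "went": "go"}

def naive_lemmatize(word):
    """Look up irregular forms first, then fall back to simple suffix rules."""
    word = word.lower()
    if word in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[word]
    if word.endswith("ing") and len(word) > 5:
        base = word[:-3]
        # Undo consonant doubling, e.g. "runn" -> "run".
        if len(base) >= 2 and base[-1] == base[-2]:
            base = base[:-1]
        return base
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

print([naive_lemmatize(w) for w in ["run", "runs", "ran", "running", "better"]])
```

No suffix rule could map "better" to "good" or "ran" to "run"; only the lookup can, which is the point of the card above.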
13
Q

What is semantic analysis?

A

Meanings
The process of understanding natural language – meaning of words and their relations to other words
“You shall know a word by the company it keeps”
-Learning what a document is about
-Examples: Named-entity recognition, relationship extraction, Word Sense Disambiguation (WSD), Classification, Tagging, Topic segmentation, sentiment analysis.
-E.g., spotted is ambiguous, but in the context of Dalmatian it becomes clearer
-For ambiguous terms like jaguar, search engines check whether you are logged in. If you are, they personalize the results for you (personalization). If you are not, they try to find a trend based on location or geography.
-Model success is based on top 3 or top 5
-Search for things not strings!
-Semantic types and semantic roles

14
Q

What is named entity extraction (NEE) or named entity recognition (NER)?

A

-Named entity extraction and named entity recognition are two terms for the same thing. Named entities are strings of words that refer to a person, place, or thing. The task tells us whether a string of words in a sentence is a named entity or not.
-Walmart example: Query Samsung LEDTV
First match tokens
Then we match entity types (Product, Brand, Display Type, etc) so that we don’t end up with matches like Remote control for TV just because the tokens match.
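
A rough sketch of the two-step matching described above. The gazetteer, catalog, and attribute names are all hypothetical and stand in for whatever a real product search system uses; the query is written as three separate tokens for simplicity.

```python
# Hypothetical gazetteer mapping query tokens to (entity type, value).
GAZETTEER = {
    "samsung": ("brand", "Samsung"),
    "led": ("display_type", "LED"),
    "tv": ("product", "TV"),
}

# Hypothetical catalog whose items already carry typed attributes.
CATALOG = [
    {"title": "Samsung 55 inch LED TV", "brand": "Samsung",
     "display_type": "LED", "product": "TV"},
    {"title": "Remote control for TV", "brand": None,
     "display_type": None, "product": "Remote control"},
]

def token_matches(query):
    """Step 1: keep items that share at least one token with the query."""
    query_tokens = set(query.lower().split())
    return [item["title"] for item in CATALOG
            if query_tokens & set(item["title"].lower().split())]

def tag_entities(query):
    """Label each recognized query token with its entity type."""
    return dict(GAZETTEER[t] for t in query.lower().split() if t in GAZETTEER)

def entity_matches(query):
    """Step 2: among token matches, keep items whose typed attributes agree."""
    entities = tag_entities(query)
    candidates = token_matches(query)
    return [item["title"] for item in CATALOG
            if item["title"] in candidates
            and all(item.get(kind) == value for kind, value in entities.items())]

print(token_matches("Samsung LED TV"))
print(entity_matches("Samsung LED TV"))
```

Token matching alone lets the remote control through because its title contains "TV"; the entity-type check drops it because its product type is not TV.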

15
Q

What is relationship extraction?

A
Once you have named entities, this will tell you what the relationship is between them. For example, Apple and Tim Cook are entities and CEO is the relationship.
Requires semantics (including ontology) and syntax analysis.
16
Q

What is Ontology?

A

An organized view of types of beings and the relations they have to one another. E.g., animal, mineral, vegetable is an ontology. Topology vs. Ontology. Ontology is a tree, top-down, parent-child relationship. Arrows go in one direction. Amazon has built knowledge graphs (graph database). IMDB is owned by Amazon.

17
Q

What is classification?

A
  • Tree structured graph to place documents into categories.
  • Uses ML models like SVM
  • Preexisting taxonomy or hierarchy where all the documents are classified
  • Ideally there is a strict taxonomy (only appearing in 1 place)
  • Curlie is an existing classification schema for web data. A lot of documents belong in two places at once.
  • Examples of taxonomies: Amazon, Interactive Advertising Bureau (IAB), about.com, CNN, Google Product taxonomy
18
Q

What is topic segmentation?

A

Identifies segments of a document related to one topic based on topic shifts. It’s not the same as a paragraph break because you can have three paragraphs with the same topic for example.

19
Q

What is tagging?

A

Identifying keywords for a document, e.g., sports, injuries, coaching, soccer

20
Q

What is discourse analysis?

A

Highest level of NLP where implied meanings are not explicitly spelled out. The major types are

21
Q

What is WSD?

A
  • Word sense disambiguation
  • Identifying which sense is meant for a polysemous word, which is a word that has multiple meanings.
  • Try to get as many clues (context words) as possible from the words and sentences surrounding it.
  • Develop a basket of context clues for each word to get a sense of the word. You could look 10 words before and after to find the context words.
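
A minimal sketch of the context-basket idea, assuming hand-written clue sets for each sense (the senses and clue words below are invented for illustration). It scores each sense by how many of its clues appear within a window of 10 words on either side.

```python
# Hypothetical baskets of context clues per sense.
SENSE_CLUES = {
    "jaguar": {
        "animal": {"cat", "jungle", "prey", "spotted", "wild"},
        "car": {"engine", "drive", "dealer", "luxury", "sedan"},
    }
}

def disambiguate(tokens, index, window=10):
    """Pick the sense whose clue basket overlaps most with nearby words."""
    target = tokens[index].lower()
    nearby = tokens[max(0, index - window):index] + tokens[index + 1:index + 1 + window]
    context = {t.lower() for t in nearby}
    scores = {sense: len(clues & context)
              for sense, clues in SENSE_CLUES[target].items()}
    # Ties fall back to the first sense listed.
    return max(scores, key=scores.get)

sentence = "the jaguar stalked its prey in the jungle".split()
print(disambiguate(sentence, 1))
```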
22
Q

What is anaphora resolution?

A

Deals with pronouns mostly. Other items are this/that.

-Not straightforward because the pronoun does not necessarily refer to the nearest preceding noun; the correct antecedent has to be identified.

23
Q

What is discourse modeling?

A

Inferences
Being able to predict what’s going to come next
-Based on a theory from Roger Schank’s work on “scripts” (or, earlier, Claude Lévi-Strauss’s “structuralism” in the 40s or Wittgenstein’s language games) that most interactions fall into a script, e.g., waiter/customer interaction, new neighbor moves in, job interview

24
Q

What is question answering?

A

Providing answers to questions

  • Pre-existing FAQs
  • For questions like “what’s the difference between…”, step 1 would be presenting the answers to both questions
  • Step 2 would be to have an algorithm to determine what’s the same and what’s different. Pull out sentences to describe what’s different.
25
Q

What is textual entailment?

A

Drawing logical conclusions from the text

  • “All men are mortal”
  • “Socrates is a man”
  • Inference: Socrates is mortal

Needs to be done to understand the discourse. We do this naturally in conversation. Imply without spelling out.
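
The Socrates example can be written as a tiny symbolic inference. This is only an illustration of the logical step, assuming rules and facts have already been extracted from text as tuples; practical entailment systems work on raw sentences and are far more involved.

```python
def entails(rules, facts, hypothesis):
    """Forward-chain 'all <category> are <property>' rules over known facts.

    rules: set of (category, property) pairs.
    facts: set of (entity, category) pairs.
    hypothesis: an (entity, property) pair to check.
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for entity, category in list(derived):
            for rule_category, prop in rules:
                if category == rule_category and (entity, prop) not in derived:
                    derived.add((entity, prop))
                    changed = True
    return hypothesis in derived

rules = {("man", "mortal")}        # "All men are mortal"
facts = {("socrates", "man")}      # "Socrates is a man"
print(entails(rules, facts, ("socrates", "mortal")))
```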

26
Q

What is pragmatic analysis?

A

Practical reason that allows people to make inferences beyond explicit text.

  • Inference that does not follow from strict logic like textual entailment does (“It was um, interesting” means he didn’t like it).
  • Uses implicature
  • Very challenging because we have to model unwritten rules of social interactions
27
Q

What is implicature?

A

A conclusion drawn from what was said plus unspoken interaction rules. Distinguished from the straight implication that’s used in textual entailment.

28
Q

What is a corpus (plural: corpora)?

A

A large body of linguistic data or text. Some useful text corpora are:

  • Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/
  • For less formal language there is web text that includes Overheard in New York, a Firefox discussion forum, the movie script for Pirates of the Caribbean, etc.: from nltk.corpus import webtext
  • Corpus of IM chat sessions: from nltk.corpus import nps_chat
  • Brown corpus: the first million-word electronic corpus of English - from nltk.corpus import brown
  • Reuters news corpus: from nltk.corpus import reuters
  • Inaugural address corpus: from nltk.corpus import inaugural
  • Universal declaration of human rights: from nltk.corpus import udhr
29
Q

What is concordancing?

A

An alphabetical index of the principal words in a book or the works of an author with their immediate contexts.
Python example: emma.concordance("surprize")
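
To show what a concordance produces without depending on NLTK, here is a standard-library sketch. The function and its bracketed output format are made up for illustration and do not reproduce NLTK's output.

```python
def concordance(tokens, keyword, width=3):
    """Return each occurrence of keyword with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = "she felt great surprize at the news".split()
print(concordance(tokens, "surprize", width=2))
```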

30
Q

What are stopwords?

A

Common function words such as pronouns, articles, conjunctions, and prepositions, e.g., the, and, of, to, in, for.

31
Q

What is a token?

A

A word, a sequence of characters. It can include alphanumeric and special characters (e.g., 2005-2006).

32
Q

What is tokenization?

A

Converting a string of characters (e.g., sentences) into a sequence of tokens
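
A minimal regex tokenizer sketch. The pattern is one possible choice, not a standard: it keeps hyphenated or apostrophe-joined runs such as 2005-2006 together and emits other punctuation as separate tokens.

```python
import re

# Word characters, optionally joined by "-" or "'", or any single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Sales rose in 2005-2006, didn't they?"))
```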

33
Q

What is a separator?

A

Tabs, spaces, punctuation, hyphens, apostrophes, periods?

34
Q

Recall and precision for aggressive stemming?

A

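Low precision and high recall. As we over-reduce words to their roots we lose meaning, so unrelated words collapse to the same stem and more irrelevant matches are returned; recall rises because more word forms match.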
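
A small worked example of the trade-off, using made-up document ids. Assume documents 1 and 2 are truly relevant, an exact-match search retrieves only document 1, and an aggressively stemmed search retrieves documents 1 through 4.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2}
print(precision_recall({1}, relevant))           # exact match
print(precision_recall({1, 2, 3, 4}, relevant))  # aggressive stemming
```

In this toy case the stemmed search finds every relevant document, but half of what it returns is noise.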