Module 9 Flashcards

1
Q

What is NLP?

A

NLP (natural language processing) produces machine-driven analyses of natural-language text

2
Q

Why is NLP a hard problem?

A

Language is ambiguous; different people may interpret the same text differently

3
Q

Applications of NLP (amn)

A
  • automatic summarization
  • machine translation
  • named entity recognition
4
Q

What is a corpus?

A

a collection of written texts that serves as a dataset

5
Q

What are tokens and tokenization?

A

a token is a string of contiguous characters between two spaces; it can also be an integer, a real number, or a number containing a colon (e.g., a time such as 2:00)

tokenization is the process of converting text into tokens (see the sketch below)
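A minimal sketch in Python: the whitespace split implements the between-two-spaces definition above, and NLTK's `word_tokenize` (an assumption: `nltk` installed with its 'punkt' data) is a common smarter alternative.

```python
# Whitespace tokenization: strings of contiguous characters between spaces.
text = "The match starts at 2:30 and costs 15.50 dollars."
print(text.split())
# -> ['The', 'match', 'starts', 'at', '2:30', 'and', 'costs', '15.50', 'dollars.']

# A smarter tokenizer (assumes nltk is installed with its 'punkt' data):
# import nltk
# print(nltk.word_tokenize(text))  # also splits off the trailing period
```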

6
Q

What is text preprocessing + 3 steps

A
text data is not analyzable without pre-processing
steps:
- Noise removal
- Lexicon normalization
- Object standardization
7
Q

What is noise removal?

A

removal of all noisy entities from the text, i.e., everything not relevant to the data analysis

8
Q

What are stopwords?

A

common words that carry little meaning, e.g., "is", "am"

9
Q

What is a general approach to noise removal?

A
  • prepare a dictionary of noisy entities and iterate over the text object word by word, eliminating the words present in both (see the sketch below)
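A minimal sketch of this dictionary-based approach; the `noise_words` entries are illustrative assumptions.

```python
# Dictionary of noisy entities (illustrative entries, e.g. stopwords, "rt").
noise_words = {"is", "a", "this", "the", "rt"}

def remove_noise(text):
    # iterate the text object word by word, dropping dictionary hits
    return " ".join(w for w in text.split() if w.lower() not in noise_words)

print(remove_noise("RT this is a sample tweet"))  # -> 'sample tweet'
```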
10
Q

What is lexicon normalization?

A

converts all disparities (variant forms) of a word to its normal form
reduces high dimensionality to low dimensionality (many surface forms map to one feature)
player, played -> play

11
Q

What are the most common normalization practices?

A

Stemming and lemmatization

12
Q

What is lemmatization?

A

reduces a word to its root -> the dictionary headword form (see the sketch below)
am, are, is -> be
car, cars, car's -> car
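A small sketch with NLTK's WordNet lemmatizer (assumes `nltk` is installed and its 'wordnet' data downloaded):

```python
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lem = WordNetLemmatizer()
print(lem.lemmatize("cars"))           # -> 'car' (default POS is noun)
print(lem.lemmatize("are", pos="v"))   # -> 'be' (verbs need a POS hint)
print(lem.lemmatize("is", pos="v"))    # -> 'be'
```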

13
Q

What are morphemes?

A

small meaningful units that make up words

14
Q

What is stemming?

A

stemming is a rudimentary rule-based process of stripping suffixes from words (see the sketch below)
- automate(s), automatic, automation reduced to automat
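A small sketch with NLTK's rule-based Porter stemmer (assumes `nltk` is installed); note the stems need not be dictionary words:

```python
from nltk.stem import PorterStemmer

stem = PorterStemmer()
for word in ["automates", "automatic", "automation"]:
    # suffixes are stripped by rule; outputs are stems such as 'automat'
    print(word, "->", stem.stem(word))
```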

15
Q

Other text preprocessing steps (egs)

A

encoding-decoding noise removal
grammar checking
spelling correction

16
Q

What are text-to-features used for, and what are the techniques? (SESW)

A
  • To analyze pre-processed data
  • techniques
    1. Syntactical Parsing
    2. Entities / N-gram / word-based features
    3. Statistical features
    4. Word embeddings
17
Q

What is syntactical parsing, what does it involve, and which attributes are important?

A
  • involves the analysis of the words in a sentence for grammar, and their arrangement in a manner that shows the relationships among the words
  • Dependency grammar and part-of-speech (POS) tags are the important attributes
18
Q

What is dependency grammar?

A
  • a class of syntactic text analysis that deals with binary relations between two words
  • every relation can be represented as a triplet: (relation, governor, dependent); see the sketch below
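A hedged sketch with spaCy (assumes `spacy` and its small English model `en_core_web_sm` are installed), printing each binary relation as a (relation, governor, dependent) triplet:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is downloaded
doc = nlp("Ben bought a new car")
for token in doc:
    # one triplet per word: (relation, governor/head, dependent)
    print((token.dep_, token.head.text, token.text))
```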
19
Q

What is POS tagging?

A
  • assigning each word a tag that defines its usage and function in the sentence
20
Q

Describe the POS tagging problem

A
  • to determine the correct POS tag for a particular instance of a word

- this is hard because words often have more than one possible POS (see the sketch below)
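A small sketch with NLTK's tagger (assumes `nltk` with its 'punkt' and perceptron-tagger data) illustrating the ambiguity; the exact tags depend on the trained model:

```python
import nltk  # requires the 'punkt' and averaged perceptron tagger data

# 'book' can be a verb or a noun; the tagger must pick one per context
print(nltk.pos_tag(nltk.word_tokenize("book a flight to Boston")))
print(nltk.pos_tag(nltk.word_tokenize("I read a good book")))
```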

21
Q

Where can POS tagging be used? (WINE)

A

Word sense disambiguation (e.g., "book" as noun vs. verb)
Improving word-based features
Normalization and lemmatization
Efficient stopword removal

22
Q

What are the most important chunks of a sentence?

Which algorithms are generally ensemble models of rule-based parsing, etc.?

A

Entities, Entity Detection algorithms

23
Q

What is Named Entity Recognition (NER)

A
  • the process of detecting named entities such as persons, locations, etc. from text
    example — {“person”: “Ben”}
24
Q

What are the three blocks of NER? (NPE)

A
  1. Noun phrase identification - extracts all noun phrases using dependency parsing and POS tagging
  2. Phrase classification - all extracted noun phrases are classified (location, name, etc.)
  3. Entity disambiguation - a validation layer on top of the results
25
Q

What is topic modeling, and what does it derive?

A
  • the process of automatically identifying topics in a text corpus
  • derives the hidden patterns among words in an unsupervised manner
26
Q

Describe N-grams as features: which ones are more informative, and which is most important?

A
  • a combination of N words together is called an n-gram
  • n-grams with N > 1 are more informative than single words
  • bigrams (N = 2) are considered the most important
27
Q

What operations does Bag of Words involve?

A
  1. Tokenization: all words are tokenized
  2. Vocabulary creation: the unique words form the vocabulary
  3. Vector creation: each vector row represents a sentence and the columns span the vocabulary; entries are word counts (see the sketch below)
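A sketch of the three operations with scikit-learn's CountVectorizer (assumed installed; older sklearn versions use `get_feature_names()` instead):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vec = CountVectorizer()             # handles tokenization internally
X = vec.fit_transform(docs)         # vocabulary creation + vector creation
print(vec.get_feature_names_out())  # unique words = the vocabulary
print(X.toarray())                  # one row per sentence, one column per word
```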
28
Q

What is TF-IDF, and what does it convert?

A
  • a weighted model used for information retrieval
  • converts text documents into vector models

29
Q

What is TF?

A
  • Term frequency = frequency of word in doc / total number of words in doc
30
Q

What is IDF?

A

Inverse document frequency = log(total number of documents / number of documents containing word W)

31
Q

What is significant about TF-IDF?

A

gives the relative importance of a term in the corpus (see the sketch below)
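A manual sketch of the TF and IDF formulas from cards 29-31, using a toy two-document corpus:

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "on", "the", "mat"]]

def tf(word, doc):
    # frequency of word in doc / total number of words in doc
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(total number of documents / documents containing the word)
    return math.log(len(docs) / sum(1 for d in docs if word in d))

print(tf("cat", docs[0]) * idf("cat", docs))  # 'cat' is distinctive: idf > 0
print(tf("the", docs[0]) * idf("the", docs))  # 'the' is everywhere: idf = 0
```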

32
Q

What is text classification?

A

a technique to systematically classify a text object into predefined categories

33
Q

What is text matching / similarity?

A

matching text objects to find similarities

34
Q

What is Levenshtein distance? List the edit operations

A

the minimum number of edits required to transform one string into another (see the sketch below)

edit operations: insertion, deletion, or substitution of a single character
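A classic dynamic-programming sketch of the distance:

```python
def levenshtein(a, b):
    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(len(b) + 1)]
            for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                           # deletion
                dist[i][j - 1] + 1,                           # insertion
                dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitution
            )
    return dist[-1][-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```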

35
Q

What is phonetic matching?

A

takes a keyword as input and produces a character string that identifies words that are phonetically similar to it

helps in searching large text corpora, correcting spelling errors, and matching relevant names

36
Q

What is cosine similarity?

A

when text is represented in vector notation, similarity can be measured as the cosine of the angle between the vectors (see the sketch below)

For non-negative text vectors, cosine similarity ranges from 0 to 1:
closer to 1 = the two vectors have the same orientation (more similar)
closer to 0 = the two vectors have less similarity
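A small numpy sketch (numpy assumed installed):

```python
import numpy as np

def cos_sim(u, v):
    # cosine of the angle between the two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1, 1, 0, 1])  # e.g. bag-of-words count vectors
b = np.array([1, 1, 1, 0])
print(cos_sim(a, b))  # ~0.67: somewhat similar
print(cos_sim(a, a))  # 1.0: identical orientation
```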

37
Q

What is text summarization?

A

given an article, automatically summarize it to produce the most important sentences

38
Q

What is machine translation?

A

translate text from one language to another

39
Q

What is Natural Language Generation and Understanding?

A

Generation:
converting information from computer databases or semantic intents into readable human language

Understanding:
converting chunks of text into logical structures that computer programs can manipulate

40
Q

What is optical character recognition (OCR)?

A

given an image representing text, determine the corresponding text

41
Q

What is document to information?

A

parsing textual data in documents into an analyzable and clean format

42
Q

What is a Naive Bayesian classifier, and what are its input / output?

A

determines the most probable class label for the object, assuming independence among the attributes

Input - variables are discrete
Output - a probability score (proportional to the true probability) and a class label (based on the highest probability score)

43
Q

Use cases of NBC

A

Spam filtering, fraud detection

44
Q

Describe Bayes' law

A

P(C | A) = P(A & C) / P(A) = P(A | C) P(C) / P(A)

  • C is the class label, A is the attribute
45
Q

How does the naive (independence) assumption simplify Bayes' law?

A
Assume the attributes are independent given the class:
P(A|C) = Πj P(aj | C)
so P(C|A) ∝ Πj P(aj | C) * P(C)
46
Q

How is the naive Bayesian classifier built?

A

Get P(Ci) for all class labels, get P(aj | Ci) for all attributes and classes, then assign the class label that maximizes the naive-assumption expression (see the sketch after card 47)

47
Q

List the Naive Bayesian Implementation Considerations

A

Numerical underflow

  • results from multiplying many probabilities near 0
  • preventable by computing log-probabilities and summing them

Zero probabilities

  • unobserved attribute/class pairs would zero out the product
  • handled by smoothing (e.g., Laplace); see the sketch below
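A minimal sketch tying cards 46-47 together: class priors, Laplace-smoothed likelihoods, and log-probabilities to avoid underflow. The toy spam/ham data is illustrative.

```python
import math
from collections import Counter, defaultdict

train = [("win money now", "spam"), ("meeting at noon", "ham"),
         ("win a prize", "spam"), ("lunch at noon", "ham")]

priors = Counter(label for _, label in train)     # P(Ci) counts
word_counts = defaultdict(Counter)                # P(aj | Ci) counts
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_score(text, label):
    # log P(C) + sum_j log P(a_j | C), with add-one (Laplace) smoothing;
    # summing logs instead of multiplying avoids numerical underflow
    total = sum(word_counts[label].values())
    score = math.log(priors[label] / len(train))
    for w in text.split():
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

text = "win a prize now"
print(max(priors, key=lambda c: log_score(text, c)))  # -> 'spam'
```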
48
Q

List Precision and recall

A
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
49
Q

What are the two problems with using VSM (vector space model)?

A
  • synonymy - many ways to refer to the same object (car, automobile) - causes poor recall - small cosine even though related
  • polysemy - most words have more than one meaning (model) - causes poor precision - large cosine even though not related
50
Q

What is the solution to the VSM problems?

A

Latent Semantic Indexing

51
Q

List the four steps of Latent Semantic Analysis

A
  • build a term-by-document matrix
  • convert matrix entries to weights
  • rank-reduced singular value decomposition (SVD)
  • compute similarities between entities in the semantic space with the cosine (see the sketch below)
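A hedged sketch of this pipeline with scikit-learn (assumed installed); the toy documents are illustrative:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is fast", "the automobile is quick", "cats drink milk"]
X = TfidfVectorizer().fit_transform(docs)          # weighted term-document matrix
Z = TruncatedSVD(n_components=2).fit_transform(X)  # rank-reduced semantic space
print(cosine_similarity(Z))  # the two car/automobile docs should score higher
```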
52
Q

What is SVD (singular value decomposition)?

A
  • tool for dimension reduction
  • similarity measure based on co-occurrence
  • finds optimal projection into low dimensional space
  • generalized least squares method