Module 9 Flashcards
What is NLP?
produces machine-driven analyses of text
Why is NLP a hard problem?
Language is ambiguous; multiple people may interpret the same text differently
Applications of NLP (amn)
- automatic summarization
- machine translation
- named entity recognition
What is a corpus
a collection of written texts that serves as a dataset
What are token and tokenization
a token is a string of contiguous characters between two spaces; it can be an integer, a real number, or a number with a colon (e.g., a time such as 2:30)
tokenization is the process of converting text into tokens
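Tokenization as described above can be sketched in Python; the regex pattern below is a simplified illustration (real tokenizers handle punctuation, hyphens, and more):

```python
import re

def tokenize(text):
    # Try "number:number" forms (e.g., times) first, then runs of word characters
    return re.findall(r"\d+:\d+|\w+", text)

print(tokenize("The meeting is at 2:30 in room 101"))
# -> ['The', 'meeting', 'is', 'at', '2:30', 'in', 'room', '101']
```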
What is text preprocessing + 3 steps
data is not analyzable without pre-processing steps
- Noise removal
- Lexicon normalization
- Object standardization
what is noise removal?
removal of all noisy entities in text, not relevant to data
what are stopwords
common words such as "is" and "am" that carry little meaning
What is a general approach to noise removal?
- prepare a dictionary of noisy entities and iterate over the text word by word, eliminating the words that appear in the dictionary
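The dictionary approach above can be sketched as follows (the noise list is a hypothetical example):

```python
# Hypothetical dictionary of noisy entities (here, a few stopwords)
noise_dict = {"is", "am", "the", "a", "this"}

def remove_noise(text):
    # Iterate over the text word by word, dropping words found in the dictionary
    return " ".join(w for w in text.split() if w.lower() not in noise_dict)

print(remove_noise("this is a sample text"))  # -> "sample text"
```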
What is lexicon normalization
- converts all disparities (variant forms) of a word to its normal form
- reduces high dimensionality to low dimensionality
- player, played -> play
what are the most common normalization practices
Stemming and lemmatization
what is lemmatization
gets the root of the word -> its dictionary headword (lemma) form
am, are, is -> be
car, cars, car’s -> car
what are morphemes
small meaningful units that make up words
what is stemming
stemming is a rudimentary rule-based process that strips suffixes from words
- automate(s), automatic, automation reduced to automat
other text preprocessing steps (egs)
encoding-decoding noise
grammar checker
spelling correction
What are text-to-features used for, and what are the techniques? (SESW)
- To analyze pre-processed data
- techniques
1. Syntactical Parsing
2. Entities / N-gram / word-based features
3. Statistical features
4. Word embeddings
What is syntactical parsing, what does it involve, and what important attributes
- involves the analysis of the words in a sentence for grammar and their arrangement in a manner that shows the relationships among the words
- Dependency Grammar and Part of Speech (POS) tags are important attributes
what is dependency grammar?
- class of syntactic text analysis that deals with binary relations between two words
- every relation can be represented in the form of a triplet
What is POS tagging
- defines the usage and function of a word in a sentence
Describe the POS tagging problem
- to determine the POS tag for a particular instance of a word
- words often have more than one POS
where can POS tagging be used? (WINE)
Word sense disambiguation (e.g., "book" as a noun vs. a verb)
Improving word-based features
Normalization and lemmatization
Efficient stopword removal
What are the most important chunks of a sentence?
- Entities
Which algorithms are generally ensemble models of rule-based parsing, etc.?
- Entity detection algorithms
What is Named Entity Recognition (NER)
- Process of detecting named entities such as persons, locations, etc. from text
example — {“person”: “Ben”}
What are the three blocks NER has (NPE)
- Noun phrase identification - extracts all noun phrases using dependency parsing and POS
- Phrase classification - all extracted nouns are classified ( location, name etc)
- Entity disambiguation - validation layer on top of results
What is topic modeling, and what does it derive
- the process of automatically identifying topics in a text corpus
- derives the hidden patterns among words in an unsupervised manner
Describe N-grams as features, which ones are more informative, which is most important
- a combination of N words together is called an N-gram
- N-grams with N > 1 are more informative than unigrams
- bigrams (N = 2) are considered the most important
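Extracting N-grams from a token list can be sketched as:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("this is a sample".split(), 2))
# -> [('this', 'is'), ('is', 'a'), ('a', 'sample')]
```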
What operations does Bag Of Words involve
- Tokenization: all words tokenized
- Vocabulary creation: unique words create vocabulary
- Vector creation: each row is a sentence vector; the number of columns equals the size of the vocabulary
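The three operations above can be sketched in one function (`bag_of_words` is a hypothetical name; sorting the vocabulary just fixes a stable column order):

```python
def bag_of_words(sentences):
    # Tokenization: all words tokenized (here, simple lowercase + split)
    tokenized = [s.lower().split() for s in sentences]
    # Vocabulary creation: unique words form the vocabulary
    vocab = sorted({w for toks in tokenized for w in toks})
    # Vector creation: one row per sentence, one column per vocabulary word
    vectors = [[toks.count(w) for w in vocab] for toks in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the dog sat sat"])
print(vocab)    # -> ['cat', 'dog', 'sat', 'the']
print(vectors)  # -> [[1, 0, 1, 1], [0, 1, 2, 1]]
```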
What is TF-IDF, what does it convert?
- a weighted model used for Information retrieval
- converts text documents into vector models
what is TF
- Term frequency = frequency of word in doc / total number of words in doc
what is IDF
Inverse document frequency = log (total number of documents / documents containing word W)
What is significant about TF-IDF
gives relative importance to a term in corpus
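The TF and IDF formulas above combine directly; a sketch over documents given as token lists (`tf`, `idf`, `tf_idf` are hypothetical helper names):

```python
import math

def tf(word, doc):
    # frequency of word in doc / total number of words in doc
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(total number of documents / documents containing the word)
    # NOTE: raises ZeroDivisionError if the word appears in no document
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [["a", "b"], ["a", "c"]]
# "a" appears everywhere -> IDF is 0, so its relative importance is 0
print(tf_idf("a", docs[0], docs))  # -> 0.0
```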
What is text classification
a technique to systematically classify a text object
what is text matching/similarity
matching text objects to find similarities
what is Levenshtein distance, list edit operations
minimum number of edits to transform one string into another
insertion, deletion, substitution of single character
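The distance can be computed with the standard dynamic-programming recurrence (a minimal sketch):

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```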
what is Phonetic matching
takes a keyword as input and produces a character string that identifies phonetically similar words
helps in searching large text corpora, correcting spelling errors, and matching relevant names
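One classic phonetic-matching scheme is Soundex; a simplified sketch (the full standard has extra rules for 'h' and 'w' that are omitted here):

```python
# Map consonants to Soundex digit groups; vowels and h/w/y get no code
CODES = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(word):
    word = word.lower()
    result = word[0].upper()          # keep the first letter
    prev = CODES.get(word[0], 0)
    for c in word[1:]:
        code = CODES.get(c, 0)
        if code and code != prev:     # skip repeats of the same code
            result += str(code)
        prev = code
    return (result + "000")[:4]       # pad/truncate to 4 characters

print(soundex("Smith"), soundex("Smyth"))  # -> S530 S530
```

Phonetically similar names map to the same string, so "Smith" matches "Smyth".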
What is cosine similarity
when text is represented in vector notation, similarity between the vectors can be measured
cosine similarity ranges from 0 to 1 for non-negative text vectors
closer to 1 = the two vectors have the same orientation
closer to 0 = the two vectors have little similarity
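A minimal cosine similarity over two equal-length vectors:

```python
import math

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|); undefined for zero vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(cosine_similarity([1, 0], [1, 0]))  # -> 1.0 (same orientation)
print(cosine_similarity([1, 0], [0, 1]))  # -> 0.0 (no similarity)
```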
What is text summarization
given an article, automatically summarize it to produce the most important sentences
what is machine translation
translate text from one language to another
What is Natural Language Generation and Understanding
Generation
Converting info from computer databases or semantic intents into readable human language
Understanding
converting chunks of text into logical structures for computer programs
What is optical character recognition (OCR)
given image representing text, determine corresponding text
what is document-to-information conversion
parsing textual data in documents into an analyzable and clean format
What is a Naive Bayesian classifier and input / output
determines the most probable class label for an object, assuming independence among attributes
Input - variables are discrete
Output - Probability score (proportional to true probability) and Class label (based on highest probability score)
Use cases of NBC
Spam filtering, fraud detection
Describe the Bayes law
P(C|A) = P(A & C) / P(A) = P(A|C) P(C) / P(A)
- C is the class label, A is the attribute vector
How to simplify with the Naive assumption
Assume attribute independence: P(A|C) = product over j of P(aj | C), so P(C|A) is proportional to P(C) * product over j of P(aj | C)
How to build the naive classifier
- Get P(Ci) for all class labels
- Get P(aj | Ci) for all attributes and classes
- Assign the class label that maximizes the value under the naive assumption
List the Naive Bayesian Implementation Considerations
Numerical Underflow
- resulting from multiplying probabilities near 0
- preventable by summing log probabilities instead of multiplying raw probabilities
Zero probabilities
- caused by unobserved attribute/class pairs
- handled by smoothing
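The classifier construction above, together with both implementation considerations (log probabilities against underflow, add-one smoothing against zero probabilities), can be sketched as follows; `train`, `classify`, and `n_values` are hypothetical names, and using a single `n_values` for every attribute is a simplification:

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)  # counts for P(Ci)
    attr_counts = defaultdict(Counter)              # counts for P(aj | Ci)
    for attrs, c in examples:
        for j, a in enumerate(attrs):
            attr_counts[c][(j, a)] += 1
    return class_counts, attr_counts

def classify(attrs, class_counts, attr_counts, n_values):
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n in class_counts.items():
        # Sum logs instead of multiplying probabilities -> no numerical underflow
        score = math.log(n / total)
        for j, a in enumerate(attrs):
            # Add-one (Laplace) smoothing -> no zero probability for unseen pairs
            score += math.log((attr_counts[c][(j, a)] + 1) / (n + n_values))
        if score > best_score:
            best, best_score = c, score
    return best
```

Real implementations typically smooth with a per-attribute value count rather than one global `n_values`.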
List Precision and recall
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
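The two formulas translate directly (the counts in the example are made up):

```python
def precision(tp, fp):
    # Fraction of predicted positives that are truly positive
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that were found
    return tp / (tp + fn)

# e.g., 8 true positives, 2 false positives, 8 false negatives
print(precision(8, 2))  # -> 0.8
print(recall(8, 8))     # -> 0.5
```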
What are the two problems with using VSM
- synonymy: many ways to refer to the same object (car, automobile) -> poor recall (small cosine similarity even though the documents are related)
- polysemy: most words have more than one meaning (model) -> poor precision (large cosine similarity even though the documents are not related)
Solution to the VSM problems
Latent Semantic Indexing
List four steps of Latent Semantic analysis
- term by document matrix
- convert matrix entries to weights
- rank reduced singular value decomposition
- Compute similarities between entities in semantic space with cosine
what is SVD
- tool for dimension reduction
- similarity measure based on co-occurrence
- finds optimal projection into low dimensional space
- generalized least squares method
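The four LSA steps can be sketched with NumPy's SVD (assuming NumPy is available; the term-by-document matrix is a tiny hypothetical example, and weighting is skipped):

```python
import numpy as np

# Term-by-document matrix (rows: terms, cols: documents); raw counts, no weighting
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Rank-reduced singular value decomposition: keep the top k singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the k-dim semantic space

def cosine(u, v):
    # Compute similarities between entities in the semantic space with cosine
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 1 share a term, documents 0 and 2 share none
print(cosine(doc_vectors[0], doc_vectors[1]) > cosine(doc_vectors[0], doc_vectors[2]))
# -> True
```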