Module 9 Flashcards
What is NLP?
produces machine-driven analyses of text
Why is NLP a hard problem?
Language is ambiguous: different people may interpret the same text differently
Applications of NLP (amn)
- automatic summarization
- machine translation
- named entity recognition
What is a corpus?
a collection of written texts that serves as a dataset
What are a token and tokenization?
Token: a string of contiguous characters between two spaces; a token can also be an integer, a real number, or a number with a colon (e.g. a time such as 2:00)
Tokenization: converting text to tokens
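Sketch (not from the module): one simple way to tokenize is to split on whitespace and strip surrounding punctuation. The function name `tokenize` and the punctuation set are assumptions chosen for illustration; real tokenizers handle many more cases.

```python
def tokenize(text):
    """Split text on whitespace, then strip punctuation around each piece."""
    tokens = []
    for piece in text.split():
        # Only leading/trailing punctuation is removed, so inner characters
        # such as the colon in "2:30" or the dot in "3.5" are kept.
        cleaned = piece.strip(".,;!?\"'()")
        if cleaned:
            tokens.append(cleaned)
    return tokens


print(tokenize("The meeting starts at 2:30, and costs 3.5 dollars."))
# ['The', 'meeting', 'starts', 'at', '2:30', 'and', 'costs', '3.5', 'dollars']
```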
What is text preprocessing, and what are its 3 steps?
raw text data is not analyzable without pre-processing; the 3 steps are:
- Noise removal
- Lexicon normalization
- Object standardization
what is noise removal?
removal of all noisy entities in the text, i.e. pieces of text that are not relevant to the data being analyzed
what are stopwords?
commonly used words of a language, such as "is" and "am"
What is a general approach to noise removal?
- prepare a dictionary of noisy entities, then iterate over the text word by word and eliminate every word that also appears in the dictionary
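Sketch of the dictionary approach, assuming whitespace-separated words. `NOISE_WORDS` is a made-up, illustrative list and `remove_noise` is a hypothetical helper name, not something defined in the module.

```python
# Illustrative noise dictionary (assumed); a real one would be much larger.
NOISE_WORDS = {"is", "am", "a", "the", "this", "of"}


def remove_noise(text, noise_words=NOISE_WORDS):
    """Drop every word that also appears in the noise dictionary."""
    words = text.split()
    kept = [w for w in words if w.lower() not in noise_words]
    return " ".join(kept)


print(remove_noise("This is a sample of the text"))  # sample text
```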
What is lexicon normalization?
converts all variants of a word to its normalized form
reduces high dimensionality (many surface forms of one word) to low dimensionality (a single feature)
e.g. player, played -> play
what are the most common lexicon normalization practices?
Stemming and lemmatization
what is lemmatization?
reduces a word to its root, i.e. its dictionary headword form (the lemma)
am are is -> be
car cars car’s -> car
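Sketch: lemmatization can be pictured as a lookup from word form to headword. The table below is a tiny hand-written assumption covering only the examples above; real lemmatizers rely on a full vocabulary and morphological analysis.

```python
# Hand-written lookup table (assumed), covering only the flashcard examples.
LEMMA_TABLE = {
    "am": "be",
    "are": "be",
    "is": "be",
    "cars": "car",
    "car's": "car",
}


def lemmatize(word, table=LEMMA_TABLE):
    """Return the headword for a known form, else the lowercased word itself."""
    # Normalize a typographic apostrophe so that "car’s" matches "car's".
    key = word.lower().replace("\u2019", "'")
    return table.get(key, key)


print([lemmatize(w) for w in ["am", "are", "is"]])  # ['be', 'be', 'be']
print([lemmatize(w) for w in ["car", "cars", "car's"]])  # ['car', 'car', 'car']
```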
what are morphemes?
the smallest meaningful units that make up words
what is stemming?
a rudimentary rule-based process that strips suffixes (such as "ing", "ly", "es", "s") from a word
- automate(s), automatic, automation are all reduced to automat
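Sketch of rule-based suffix stripping. The suffix list and the minimum stem length are assumptions picked so the "automat" example works; this is a toy rule set, not the Porter stemmer or any library's algorithm.

```python
# Toy suffix rules (assumed); the first matching suffix in this order is removed.
SUFFIXES = ["ion", "ic", "es", "e", "s"]
MIN_STEM_LENGTH = 3  # assumed guard so short words are not stripped to nothing


def stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for w in ["automate", "automates", "automatic", "automation"]:
    print(w, "->", stem(w))  # each prints "... -> automat"
```

Unlike a lemmatizer, the result ("automat") does not have to be a dictionary word.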
other text preprocessing steps (egs)
encoding-decoding noise
grammar checker
spelling correction
What are text-to-features techniques used for, and what are the techniques? (SESW)
- To analyze pre-processed data, which must first be converted into features
- techniques
1. Syntactical Parsing
2. Entities / N-gram / word-based features
3. Statistical features
4. Word embeddings
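Sketch for techniques 2 and 3: building n-grams from a token list and counting word frequencies as a simple statistical feature. The helper name `ngrams` and the sample sentence are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()

bigrams = ngrams(tokens, 2)
print(bigrams[:2])  # [('the', 'cat'), ('cat', 'sat')]

# Raw term counts: a very simple statistical feature.
counts = Counter(tokens)
print(counts["the"])  # 2
```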
What is syntactical parsing, what does it involve, and what are its important attributes?
- involves analyzing the words in a sentence for grammar, and their arrangement, in a way that shows the relationships among the words
- Dependency grammar and Part of Speech (POS) tags are the important attributes
what is dependency grammar?
- a class of syntactic text analysis that deals with binary (asymmetrical) relations between two words
- every relation can be represented in the form of a triplet: (relation, governor, dependent)
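Sketch: one way to hold such triplets in code. The relations below are hand-written for the sentence "She reads books" using Universal Dependencies style labels (`nsubj`, `obj`); they are an assumed example, not the output of a parser.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    relation: str   # type of the relation, e.g. "nsubj"
    governor: str   # the head word
    dependent: str  # the word that depends on the head


# Hand-annotated relations for "She reads books" (assumed example).
RELATIONS = [
    Dependency("nsubj", "reads", "She"),
    Dependency("obj", "reads", "books"),
]


def dependents_of(governor, relations=RELATIONS):
    """List the words that depend on the given governor."""
    return [r.dependent for r in relations if r.governor == governor]


print(dependents_of("reads"))  # ['She', 'books']
```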
What is POS tagging?
- assigning each word a part-of-speech tag, which defines the usage and function of the word in the sentence
Describe the POS tagging problem
- to determine the POS tag for a particular instance of a word
- words often have more than one POS
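Sketch of why this is a problem: a lexicon can list several possible tags for the same word, so the right tag depends on the sentence. `POS_LEXICON` is a tiny made-up lexicon and `is_ambiguous` a hypothetical helper; neither comes from the module.

```python
# Tiny illustrative lexicon (assumed): possible POS tags per word.
POS_LEXICON = {
    "book": {"NOUN", "VERB"},  # "read a book" vs. "book a flight"
    "play": {"NOUN", "VERB"},  # "watch a play" vs. "play a game"
    "the": {"DET"},
}


def is_ambiguous(word, lexicon=POS_LEXICON):
    """True when the lexicon lists more than one possible tag for the word."""
    return len(lexicon.get(word.lower(), set())) > 1


print(is_ambiguous("book"))  # True
print(is_ambiguous("the"))   # False
```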