1. Introduction Flashcards
Give some text pre-processing methods.
- Tokenization (splitting a sentence into word tokens)
- Decompounding (breaking compound words into their component words)
- Lemmatization (reducing a word to its base form, e.g. ate -> eat)
- Morphological analysis: POS tagging, number, gender, tense, suffixes
- Truecasing (e.g. normalizing all words to lower case)
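The steps above can be sketched as a toy pipeline. This is a minimal illustration, not a real NLP library: the whitespace tokenizer, the punctuation stripping, and the tiny `LEMMAS` lookup table are all hypothetical stand-ins for proper tokenizers and lemmatizers.

```python
# Toy lemma lookup table (a real lemmatizer uses morphological rules
# and a dictionary; these three entries are illustrative only)
LEMMAS = {"ate": "eat", "mice": "mouse", "better": "good"}

def preprocess(sentence):
    # Tokenization: naive whitespace split
    # (real tokenizers also handle punctuation and clitics)
    tokens = sentence.split()
    # Truecasing approximated here by simply lowercasing,
    # plus stripping trailing punctuation
    tokens = [t.lower().strip(".,!?") for t in tokens]
    # Lemmatization: map each token to its base form if known
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The mice ate the cheese."))
# ['the', 'mouse', 'eat', 'the', 'cheese']
```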
What are sparse and dense feature vectors?
One hot-vectors (sparse):
Each feature is its own dimension; the dimensionality of the vector equals the number of features, and features are treated as independent of one another.
Dense vectors:
Each feature is a d-dimensional vector; similar features have similar vectors, which provides better generalization.
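The contrast can be shown with a tiny hand-made vocabulary. This is only a sketch: the vocabulary and the dense values are invented for illustration, not learned from data.

```python
vocab = ["cat", "dog", "car"]

def one_hot(word):
    # Sparse: one dimension per vocabulary word, exactly one 1,
    # so the vector dimension equals the vocabulary size
    return [1 if w == word else 0 for w in vocab]

# Dense: each word is a low-dimensional real vector; the similar
# words (cat, dog) are given nearby vectors, the dissimilar one
# (car) is not. Values here are hand-picked, not trained.
dense = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.0, 0.9],
}

print(one_hot("dog"))  # [0, 1, 0]
```

Note that in the one-hot encoding "cat" and "dog" are just as far apart as "cat" and "car"; only the dense vectors capture that two features are related.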
What is a word embedding?
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.
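As a concrete sketch of such a mapping, the toy table below assigns each word a 3-dimensional real vector (far lower-dimensional than one-dimension-per-word) and compares words by cosine similarity. The embedding values are invented for illustration; trained embeddings would come from a model such as word2vec.

```python
import math

# Toy embedding table: words mapped to 3-dimensional real vectors.
# Values are hand-picked so that king and queen are nearby.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    # Cosine similarity: dot product over the product of norms
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Related words end up with higher similarity than unrelated ones
print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much lower
```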
Describe three types of ambiguities that make NLP complex.
- Lexical ambiguity:
  I grabbed the mouse. (Could be a computer mouse or the animal.)
- Grammatical ambiguity:
  The professor said on Monday we will have a test. (Is the test on Monday, or did he say this on Monday?)
- Referential ambiguity:
  I gave them my keys because they were outside. (Is "they" referring to the keys or to "them"?)