1. Introduction Flashcards
Give some text pre-processing methods.
- Tokenization (splitting a sentence into word tokens)
- Decompounding (breaking compound words into their component words)
- Lemmatization (reducing a word to its base form, e.g. ate -> eat)
- Morphological analysis: POS tagging, number, gender, tense, suffixes
- Truecasing (e.g. normalizing all words to lower case)
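The steps above can be sketched as a toy pipeline. This is a minimal illustration, not a real NLP library: the whitespace tokenizer, the punctuation stripping, and the tiny `LEMMAS` lookup table are all hypothetical stand-ins for proper tokenizers and lemmatizers.

```python
# Toy lemma lookup table (a real lemmatizer uses morphological rules
# and a dictionary; these three entries are illustrative only)
LEMMAS = {"ate": "eat", "mice": "mouse", "better": "good"}

def preprocess(sentence):
    # Tokenization: naive whitespace split
    # (real tokenizers also handle punctuation and clitics)
    tokens = sentence.split()
    # Truecasing approximated here by simply lowercasing,
    # plus stripping trailing punctuation
    tokens = [t.lower().strip(".,!?") for t in tokens]
    # Lemmatization: map each token to its base form if known
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The mice ate the cheese."))
# ['the', 'mouse', 'eat', 'the', 'cheese']
```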
What are sparse and dense feature vectors?
One hot-vectors (sparse):
Each feature is its own dimension; the dimensionality of the vector equals the number of features, and features are treated as independent of one another.
Dense vectors:
Each feature is a d-dimensional vector; similar features have similar vectors, which provides better generalization.
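The contrast can be shown with a tiny hand-made vocabulary. This is only a sketch: the vocabulary and the dense values are invented for illustration, not learned from data.

```python
vocab = ["cat", "dog", "car"]

def one_hot(word):
    # Sparse: one dimension per vocabulary word, exactly one 1,
    # so the vector dimension equals the vocabulary size
    return [1 if w == word else 0 for w in vocab]

# Dense: each word is a low-dimensional real vector; the similar
# words (cat, dog) are given nearby vectors, the dissimilar one
# (car) is not. Values here are hand-picked, not trained.
dense = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.0, 0.9],
}

print(one_hot("dog"))  # [0, 1, 0]
```

Note that in the one-hot encoding "cat" and "dog" are just as far apart as "cat" and "car"; only the dense vectors capture that two features are related.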
What is a word embedding?
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.
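As a concrete sketch of such a mapping, the toy table below assigns each word a 3-dimensional real vector (far lower-dimensional than one-dimension-per-word) and compares words by cosine similarity. The embedding values are invented for illustration; trained embeddings would come from a model such as word2vec.

```python
import math

# Toy embedding table: words mapped to 3-dimensional real vectors.
# Values are hand-picked so that king and queen are nearby.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    # Cosine similarity: dot product over the product of norms
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Related words end up with higher similarity than unrelated ones
print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much lower
```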
Describe three types of ambiguities that make NLP complex.
- Lexical ambiguity:
  I grabbed the mouse. (Could be a computer mouse or the animal.)
- Grammatical ambiguity:
  The professor said on Monday we will have a test. (Is the test on Monday, or did he say this on Monday?)
- Referential ambiguity:
  I gave them my keys because they were outside. (Is "they" referring to the keys or to "them"?)