1. Introduction Flashcards

1
Q

Give some text pre-processing methods.

A
  • Tokenization (splitting a sentence into word tokens)
  • Decompounding (breaking compound words into their individual words)
  • Lemmatization (reducing a word to its base form, e.g. ate -> eat)
  • Morphological analysis: POS tagging, number, gender, tense, suffixes
  • Truecasing (e.g. converting all words to lower case)
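A minimal sketch of some of these steps in plain Python; the lemma table is a toy illustrative dictionary, not a real lexicon, and whitespace splitting stands in for a proper tokenizer:

```python
# Toy pre-processing pipeline: tokenize, truecase (lowercase), lemmatize.
# The LEMMAS table is a tiny made-up dictionary for illustration only.
LEMMAS = {"ate": "eat", "mice": "mouse", "went": "go"}

def preprocess(sentence):
    tokens = sentence.split()                            # tokenization (whitespace-based)
    tokens = [t.strip(".,!?").lower() for t in tokens]   # truecasing + punctuation strip
    return [LEMMAS.get(t, t) for t in tokens]            # lemmatization via lookup

print(preprocess("The cat ate the mice."))
# -> ['the', 'cat', 'eat', 'the', 'mouse']
```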
2
Q

What are sparse and dense feature vectors?

A

One-hot vectors (sparse):
Each feature is its own dimension, so the dimensionality of the vector equals the number of features; features are treated as independent.

Dense vectors:
Each feature is a d-dimensional vector; similar features have similar vectors, which gives better generalization.
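The contrast can be sketched with a toy vocabulary; the dense values below are invented for illustration:

```python
vocab = ["cat", "dog", "car"]

def one_hot(word):
    # Sparse: one dimension per vocabulary word; all features independent.
    return [1.0 if w == word else 0.0 for w in vocab]

# Dense: each word is a small d-dimensional vector; similar words
# (cat, dog) get similar vectors. Values here are made up.
dense = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}

print(one_hot("dog"))   # dimension == |vocab|
print(dense["dog"])     # 2-dimensional regardless of vocabulary size
```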

3
Q

What is a word embedding?

A

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.
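Conceptually, an embedding is a lookup table that maps a word's index in the one-dimension-per-word space to a low-dimensional real-valued row. A sketch with arbitrary placeholder numbers:

```python
# Embedding = mapping from a space with one dimension per word (a word
# index / one-hot vector) to a low-dimensional continuous vector space.
# The matrix values below are arbitrary placeholders, not trained weights.
word_to_index = {"king": 0, "queen": 1, "apple": 2}   # |V| = 3
embedding_matrix = [
    [0.5, -0.2, 0.8],   # king
    [0.6, -0.1, 0.7],   # queen
    [-0.4, 0.9, 0.0],   # apple
]

def embed(word):
    # Look up the d-dimensional row for this word.
    return embedding_matrix[word_to_index[word]]

print(embed("queen"))   # a d-dim real vector instead of a |V|-dim one-hot
```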

4
Q

Describe three types of ambiguities that make NLP complex.

A
  • Lexical Ambiguity:
    I grabbed the mouse. (Could be a computer mouse or the animal.)
  • Grammatical Ambiguity:
    The professor said on Monday we will have a test. (Is the test on Monday, or did he say this on Monday?)
  • Referential Ambiguity:
    I gave them my keys because they were outside. (Is "they" referring to the keys or to 'them'?)