Natural language processing: An unexpected journey Flashcards
What is natural language processing?
A field of artificial intelligence that enables machines to read, understand and derive meaning from human languages.
(The terms ‘natural language processing’, ‘computational linguistics’ and ‘human language technologies’ may be treated as essentially synonymous.)
Why is natural language processing tricky?
- Text data is fundamentally discrete, yet new words can always be created
- Zipf’s law and Herdan’s / Heaps’ law
- Language is ambiguous, compositional and recursive, and it unveils hidden structure
Zipf’s law and Herdan’s law
- Few words are very frequent, and there is a long tail of rare words (Zipf’s law)
- Out-of-vocabulary words are always being discovered (Herdan’s / Heaps’ law): the vocabulary keeps growing as more text is read, so there is no way to discover the full vocabulary from which the text documents are drawn
(draw the graphs)
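The data behind both graphs can be computed in a few lines. The following is a minimal sketch, assuming whitespace tokenization and a toy corpus invented purely for illustration; the function names are hypothetical. Plotting rank against count on log-log axes gives the Zipf curve, and plotting tokens read against distinct words seen gives the Herdan / Heaps curve.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first (Zipf curve)."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


def vocabulary_growth(tokens):
    """Return the number of distinct words seen after each token (Heaps curve)."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy corpus, invented for illustration only (an assumption, not real data).
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.split()

zipf_points = rank_frequency(tokens)      # x = rank, y = count
heaps_points = vocabulary_growth(tokens)  # x = tokens read, y = vocabulary size
```

On a corpus this small the curves are only suggestive; the long tail and the never-flattening vocabulary curve show up on real text collections.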
Ambiguity, composition, recursion and hidden structure (units)
- Language is ambiguous: units can have different meanings (e.g. “fans”)
- Language is compositional: the meaning of a unit is defined as a function of the meanings of its components (think about the phrase structure levels)
- Language is recursive: phrases can be repeatedly combined (think about this video https://youtu.be/MPWuI9whbEY)
- Language unveils hidden structure: local changes in a sentence might have global effects (e.g. in “The trophy doesn’t fit into the brown suitcase because it is too {small, large}.”, swapping the adjective changes whether “it” refers to the suitcase or the trophy)
(think of some examples)
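As one possible example for the recursion card, the sketch below builds a noun phrase by repeatedly embedding a relative clause inside another. The function name and the word lists are hypothetical, chosen only to illustrate the idea.

```python
def nest_clauses(nouns, verbs):
    """Build a noun phrase by recursively attaching relative clauses.

    `nouns` must have exactly one more element than `verbs`: each verb links
    a noun to the noun phrase built from the remaining nouns.
    """
    if not verbs:
        # Base case: a plain noun phrase with nothing embedded.
        return f"the {nouns[0]}"
    # Recursive case: embed the phrase built from the rest of the lists.
    inner = nest_clauses(nouns[1:], verbs[1:])
    return f"the {nouns[0]} that {verbs[0]} {inner}"


phrase = nest_clauses(["dog", "cat", "rat"], ["chased", "ate"])
print(phrase)  # the dog that chased the cat that ate the rat
```

Longer lists give deeper embeddings, which is the sense in which phrases can be combined repeatedly.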
Generative Linguistics
A linguistic theory that argues for the existence of a language faculty in all human beings, which encodes a set of abstractions specially designed to facilitate the understanding and production of language.
NLP: Empiricism vs Rationalism
- Empiricism: the view that there is no such thing as innate knowledge, and that knowledge is instead derived from experience (machine learning)
- Rationalism: the view that a significant part of the knowledge in the human mind is not derived from the senses but is fixed in advance, presumably by genetic inheritance (linguistic knowledge)
NLP: optimization problem, search module and learning module
Many natural language processing problems can be written mathematically as optimization problems:
write the formula…
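One common way to write this optimization uses the symbols x, ŷ and θ from this card. The scoring function Ψ and the candidate set 𝒴(x) are assumed names, since the card does not give them:

```latex
\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} \Psi(x, y; \theta)
```

Here x is the input, 𝒴(x) is the set of candidate outputs for x, Ψ assigns a score to each input-output pair, and θ holds the model parameters.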
- The search module is responsible for finding the candidate output ŷ with the highest score relative to the input x.
- The learning module is responsible for finding the model parameters θ that maximize the predictive performance.
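The two modules can be sketched for a toy classification task. This is a minimal illustration under stated assumptions: the feature function, the linear score, the label set and the training sentences are all hypothetical. The perceptron update is swapped in as one simple way to fit θ; the card does not prescribe a learning algorithm.

```python
def features(x, y):
    """Hypothetical feature function: counts of (word, label) pairs."""
    feats = {}
    for word in x.split():
        key = (word, y)
        feats[key] = feats.get(key, 0) + 1
    return feats


def score(x, y, theta):
    """Linear score of candidate y for input x under parameters theta."""
    return sum(theta.get(k, 0.0) * v for k, v in features(x, y).items())


def search(x, candidates, theta):
    """Search module: return the candidate with the highest score."""
    return max(candidates, key=lambda y: score(x, y, theta))


def learn(examples, candidates, epochs=5):
    """Learning module: fit theta with perceptron updates (an assumed choice)."""
    theta = {}
    for _ in range(epochs):
        for x, gold in examples:
            guess = search(x, candidates, theta)
            if guess != gold:
                # Reward the features of the correct output, penalize the guess.
                for k, v in features(x, gold).items():
                    theta[k] = theta.get(k, 0.0) + v
                for k, v in features(x, guess).items():
                    theta[k] = theta.get(k, 0.0) - v
    return theta


# Toy data, invented for illustration only.
LABELS = ["positive", "negative"]
train = [("great fun movie", "positive"), ("dull boring movie", "negative")]

theta = learn(train, LABELS)
prediction = search("great movie", LABELS, theta)
print(prediction)  # positive
```

Enumerating every candidate works here only because there are two labels; for structured outputs such as parse trees the search module needs a smarter algorithm.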