Lecture 3 Flashcards
N-gram Models, Morphology, Part-of-Speech, Word Senses
Applications of Language Models
The goal of a Language Model is to assign a probability that a sentence (or phrase) will occur in natural uses of the language
Applications:
* Machine Translation: P(high winds tonight) > P(large winds tonight)
* Spell Correction: "The office is about fifteen minuets from my house": P(about fifteen minutes from) > P(about fifteen minuets from)
* Speech Recognition: P(I saw a van) >> P(eyes awe of an)
* Summarization, question-answering, and many other NLP applications
Chain Rule:
compute the joint probability of all the words in a sequence by conditioning each word on the words that precede it
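In symbols, for a word sequence w_1 ... w_n (a standard rendering of the rule, not specific to these slides):

```latex
P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})
```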
Markov Assumption
we can predict the next word based only on the previous word, rather than on the entire preceding history
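For the bigram case this amounts to (standard notation, added here for reference):

```latex
P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})
```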
N-gram Probabilities
the observed frequency (count) of the whole sequence divided by the observed frequency of the preceding (initial) subsequence; this is sometimes called the maximum likelihood estimate (MLE):
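For bigrams, with C(·) denoting corpus counts (standard MLE formula):

```latex
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}
```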
Unigram Model: (word frequencies)
The simplest case: predict a sentence probability from the individual word probabilities alone, with no preceding words
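That is (standard notation):

```latex
P(w_1 w_2 \ldots w_n) \approx \prod_{i=1}^{n} P(w_i)
```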
Bigram Model:
(two-word frequencies): prediction based on one previous word
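A minimal sketch of MLE bigram estimation in Python, assuming a tiny whitespace-tokenized toy corpus with <s>/</s> sentence markers (the corpus and function names are illustrative, not from the slides):

```python
from collections import Counter

# Toy corpus; in practice this would be a large collection of sentences.
corpus = [
    "<s> I saw a van </s>",
    "<s> I saw a cat </s>",
    "<s> a van passed </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate: C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("saw", "a"))   # 1.0  ("a" always follows "saw" here)
print(bigram_prob("a", "van"))   # ~0.67 (2 of the 3 occurrences of "a")
```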
N-gram Models
We can extend to trigrams, 4-grams, 5-grams
* Each larger N gives a more accurate model, but it becomes harder to find examples of the longer word sequences in the corpus
long-distance dependencies:
An N-gram model is an insufficient model of language because language has long-distance dependencies, e.g., "The computer which I had just put into the machine room crashed" ("computer" and "crashed" are linked across many intervening words)
Bigrams:
any two adjacent words that occur together in a text
Bigram language models:
use the bigram probability, i.e., a conditional probability, to predict how likely it is that the second word follows the first
Smoothing
Every N-gram training matrix is sparse, even for very large corpora (remember Zipf's law)
* There are words that don't occur in the training corpus but may occur in future text, known as unseen words
* Whenever a probability is 0, it multiplies the entire sequence probability to 0
* Solution: estimate the likelihood of unseen N-grams and include a small probability for unseen words
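The slides don't name a specific method; the simplest example is add-one (Laplace) smoothing, which for bigrams (with vocabulary size V) is:

```latex
P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
```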
Levels of Language Analysis
1. Phonetic
2. Morphological
3. Lexical
4. Syntactic
5. Semantic
6. Discourse
7. Pragmatic
Speech Processing
- Interpretation of speech sounds within & across words
- Sound waves are analyzed and encoded into a digitized signal
Rules used in Phonological Analysis
- Phonetic rules – sounds within words
  e.g., when a vowel stands alone, the vowel is usually long
- Phonemic rules – variations of pronunciation when words are spoken together
  e.g., "r" in "part" vs. in "rose"
- Prosodic rules – fluctuation in stress and intonation across a sentence: rhythm, volume, pitch, tempo, and stress
  e.g., high pitch vs. low pitch
Morphology: The Structure of Words
Morphology is the level of language that deals with the internal structure of words
* General morphological theory applies to all languages, as all natural human languages have systematic ways of structuring words (even sign language)
* It must be distinguished from the morphology of a specific language
* English words are structured differently from German words, although both languages are historically related
* Both are vastly different from Arabic
Morpheme
A morpheme is a minimal subunit of meaning in a word. We can usefully divide morphemes into two classes:
Stems
Affixes
Stems:
The core meaning-bearing units (e.g., happy)
Affixes:
Bits and pieces that adhere to stems to change their meanings and grammatical functions: prefixes, infixes, suffixes, circumfixes (e.g., unhappy)
Inflection:
the combination of stems and affixes where the resulting word has the same word type (e.g., noun, verb, etc.) as the original. It serves a grammatical purpose that is different from the original but is nevertheless transparently related to it.
Examples: apple – noun; apples – still a noun
Derivation
creates a new word by changing the category and/or meaning of the base to which it applies. It can change the grammatical category (part of speech):
sing (verb) > singer (noun)
Derivation can change the meaning:
act of singing > one who sings
Derivation is often limited to a certain group of words:
You can Clintonize the government, but you can't Bushize the government (a phonological restriction)
Use of Morphology in NLP Tasks -1
Stemming
Strip prefixes and/or suffixes to find the base root, which may or may not be an actual word
* Misspellings are inconsequential
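A short illustration using NLTK's PorterStemmer (one common stemmer; the slides don't name a specific algorithm):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "happiness", "ponies"]:
    print(word, "->", stemmer.stem(word))
# running -> run
# happiness -> happi   (not an actual English word)
# ponies -> poni       (not an actual English word)
```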
Use of Morphology in NLP Tasks -2
Lemmatization
Strip prefixes and/or suffixes to find the base root, which will always be an actual word
* Often based on a word list, such as the one available in WordNet
* Correct spelling is crucial
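For contrast, NLTK's WordNet-based lemmatizer always returns an actual word (requires the WordNet data via nltk.download("wordnet")):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("apples"))            # apple
print(lemmatizer.lemmatize("singing", pos="v"))  # sing
```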
Use of Morphology in NLP Tasks -3
Part of speech prediction
Knowledge of the morphemes of a particular language can be a powerful aid in guessing the part of speech of an unknown term
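A toy sketch of this idea (the suffix table and function are hypothetical, for illustration only):

```python
# Map word suffixes to likely Penn Treebank tags (illustrative, not exhaustive).
SUFFIX_TAGS = {"ness": "NN", "able": "JJ", "ize": "VB", "ly": "RB"}

def guess_pos(word, default="NN"):
    """Guess a POS tag for an unknown word from its suffix."""
    for suffix, tag in SUFFIX_TAGS.items():
        if word.lower().endswith(suffix):
            return tag
    return default

print(guess_pos("Clintonize"))  # VB
print(guess_pos("quickly"))     # RB
```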
To Stem (Lemma) or Not to Stem (Lemma)
The decisions to stem, lemmatize, remove stop words, and normalize (lowercase) depend on the number of input documents and the analytic approach. More documents = less need for data reduction.
Part-of-speech Tagging:
Assigning Correct Word Types to Words in the Text
The general purpose of a part-of-speech tagger is to associate each word in a text with its correct lexical-syntactic category (represented by a tag)
Varying terminology:
Parts-of-speech (POS), lexical categories, word classes, morphological classes, lexical tags… Lots of debate within linguistics about the number, nature, and universality of these – AND we'll completely ignore this debate
Penn Treebank Tag Set (selected tags)
NNS – noun, plural
NNP – proper noun, singular
NNPS – proper noun, plural
PDT – predeterminer
POS – possessive ending
PRP – personal pronoun
PRP$ – possessive pronoun
RB – adverb
RBR – adverb, comparative
RBS – adverb, superlative
RP – particle
WRB – wh-adverb
POS Tagging Approaches:
Rule-based Approach
Simple and doesn't require a tagged corpus, but not as accurate as other approaches.
POS Tagging Approaches:
Stochastic Approaches
- Refers to any approach that incorporates frequencies or probabilities
- Requires a tagged corpus to learn frequencies of words with POS tags
- N-gram taggers: use the context of (a few) previous tags
- Hidden Markov Model (HMM) taggers: use the context of the entire sequence of words and previous tags
- This technique has been the most widely used in modern taggers, but has the problem of unknown words
POS Tagging Approaches:
Classification Taggers
- Use the morphology of the word and of (a few) surrounding words
- Helps solve the problem of unknown words
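For example, NLTK ships an off-the-shelf statistical tagger (a classification-based averaged perceptron; the output shown is what I'd expect and may vary by model version):

```python
import nltk
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The yellow hat is on that flight")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('yellow', 'JJ'), ('hat', 'NN'), ('is', 'VBZ'),
#  ('on', 'IN'), ('that', 'DT'), ('flight', 'NN')]
```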
Computing the Two Probabilities:
Word likelihood probabilities
- VBZ (third-person singular present verb): likely to be "is"
- Compute P(is|VBZ) by counting in a labeled corpus
Computing the Two Probabilities:
Tag transition (prior) probabilities
- Determiners are likely to precede adjectives and nouns
- That/DT flight/NN
- The/DT yellow/JJ hat/NN
- So we expect P(NN|DT) and P(JJ|DT) to be high
- Compute P(NN|DT) by counting in a labeled corpus
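Both estimates reduce to simple corpus counts (standard HMM-tagger notation, added for reference):

```latex
P(\text{is} \mid \text{VBZ}) = \frac{C(\text{VBZ}, \text{is})}{C(\text{VBZ})}
\qquad
P(\text{NN} \mid \text{DT}) = \frac{C(\text{DT}, \text{NN})}{C(\text{DT})}
```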
Word Sense:
We say that a word has more than one word sense (meaning) if there is more than one definition.
Word senses may be:
- Coarse-grained, if not many distinctions are made
- Fine-grained, if there are many distinctions of meanings
Lexical Semantics
Lexicons –
list of words (or lexemes or stems) with basic info
Lexical Semantics
Dictionaries –
a lexicon with definitions for each word sense
* Most are now available online
Lexical Semantics
Thesauruses –
add synonyms/antonyms for each word sense
* WordNet
Lexical Semantics
Semantic networks –
add more semantic relations, including semantic categories
* WordNet, EuroWordNet
Lexical Semantics
Ontologies –
add rules about entities, concepts, and relations, as well as semantic categories
* UMLS
Lexical Semantics
Semantic Lexicon –
Lexicon where each word is assigned to a semantic class
* LIWC, ANEW, Subjectivity Lexicon
WordNet – A Hand-Curated Word Database
WordNet is a database of facts about words
* Meanings and the relations among them
* Words are organized into clusters of synonyms, called synsets
* Organized into nouns, verbs, adjectives, and adverbs
* Currently 170,000 synsets
* Available for download, arranged in separate files (DBs)
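A quick look at synsets via NLTK's WordNet interface (requires nltk.download("wordnet"); exact output depends on the WordNet version):

```python
from nltk.corpus import wordnet as wn

# Print the first few senses of an ambiguous word with their glosses.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())
# e.g. bank.n.01, depository_financial_institution.n.01, ...
```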
Hierarchical Semantic Representations
A semantic network provides relations for each word sense:
* hypernymy/hyponymy (IS-A): hypernyms are more general, hyponyms are more specific
* meronymy/holonymy (PART-OF)
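These relations are directly queryable in NLTK's WordNet interface (a small sketch; exact output depends on the WordNet version):

```python
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
print(dog.hypernyms())       # more general synsets, e.g. canine.n.02
print(dog.hyponyms()[:3])    # more specific synsets, e.g. puppy.n.01
print(dog.part_meronyms())   # PART-OF relations
```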