Words and Morphology Flashcards
What is lemmatization?
-
What does tokenization accomplish?
-
What is a lemma?
A lemma is the dictionary form of a word. “fly” is a lemma: it is what you would see in a dictionary. “flying”, “flies”, “flew” and “flied” all have the lemma “fly”. They are word-forms of the lemma “fly” and they all belong to the same “lexeme”.
What is a word-form?
The word-form is the full inflected or derived form of a word.
What is the relationship between a lexeme and a word-form?
Word-forms are members of a lexeme. A lexeme is a set of word-forms.
What is tokenization?
Segmenting running text into individual “words” (tokens).
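A minimal sketch of one way to tokenize, assuming that a "word" is a run of letters or digits, optionally with internal apostrophes. The function name `tokenize` and the regular expression are illustrative choices, not a standard definition; real tokenizers handle many more cases (hyphens, URLs, numbers with separators, languages without spaces).

```python
import re

def tokenize(text):
    """Split text into lowercase tokens.

    Assumption for this sketch: a token is a run of ASCII letters or
    digits, optionally joined by internal apostrophes (as in "don't").
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

tokens = tokenize("Don't panic: it's only text.")
print(tokens)
```

Punctuation is dropped and contractions stay in one piece. Whether that is the right decision depends on the task, which is why tokenization is described as deciding what counts as a word.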
In terms of corpora: what is a “type” and what is a “token”, what is the difference between them?
Types are the distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|. Tokens are the running words, so the number of tokens is the total word count. In the sentence “eye for an eye”, there are 4 tokens but only 3 types, because “eye” is repeated.
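The count for “eye for an eye” can be checked with a short sketch. It assumes whitespace splitting is a good enough tokenizer for this sentence; the function name `count_tokens_and_types` is made up for illustration.

```python
def count_tokens_and_types(sentence):
    """Return (number of tokens, number of types) for a sentence.

    Assumption for this sketch: tokens are whitespace-separated and
    case is ignored.
    """
    tokens = sentence.lower().split()   # every running word
    types = set(tokens)                 # distinct words only
    return len(tokens), len(types)

n_tokens, n_types = count_tokens_and_types("eye for an eye")
print(n_tokens, n_types)  # 4 3
```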
What does a “word” usually refer to? Types or Tokens?
Types.
What law does the frequency of words in a language follow?
Zipf’s law
f×r≈k
where f is the frequency of a word, r is its rank when words are ordered by decreasing frequency, and k is some constant
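To see what f×r≈k means in practice, the sketch below ranks words by frequency and multiplies each frequency by its rank. The toy token list is constructed by hand so that the product comes out exactly constant; in a real corpus the relationship holds only approximately. The function name `rank_frequency_products` is an illustrative choice.

```python
from collections import Counter

def rank_frequency_products(tokens):
    """Return (word, rank, frequency, frequency * rank) per word type,
    ordered from most to least frequent (rank 1 = most frequent)."""
    ranked = Counter(tokens).most_common()
    return [
        (word, rank, freq, freq * rank)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]

# Hand-made toy corpus: frequencies 6, 3, 2 at ranks 1, 2, 3.
toy_tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2

for word, rank, freq, product in rank_frequency_products(toy_tokens):
    print(word, rank, freq, product)
```

Each row has the same product (6), which plays the role of the constant k for this toy example.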
What law does the relationship between the number of types and the number of tokens in a corpus follow?
“Herdan’s Law” or “Heaps’ Law”.
What is a morpheme?
The elementary unit of morphosyntax. Morphemes combine to make up word types (the rules for how they combine are called morphotactics). “desalination” is made up of the morphemes de+salin+ate+ion.
Give two examples of categories of morphemes
stems, bound morphemes, roots, patterns, reduplication and affixes
Give examples of types of affixes
suffixes, prefixes, circumfixes, infixes
What is a “bound morpheme”?
A morpheme that is used to construct a word type but does not have meaning on its own. In fact+u+al, “u” and “al” are bound morphemes.
What is the stem of a set of word-forms?
The part of the word-forms that they all share. For example, the stem of walk, walks, walked, walking, walker, walkers is “walk”.
This can go as far as the stem of “produce” and “production” being “produc”, which isn’t a proper morpheme or word.
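For the examples above, the shared part happens to be the longest common prefix of the word-forms, which can be computed directly. This is only a sketch under that assumption: it is not a general stemming algorithm, and it would not work for word-forms that differ at the start (for instance, forms with prefixes). The function name `shared_prefix` is made up for illustration.

```python
def shared_prefix(word_forms):
    """Return the longest prefix shared by every string in word_forms."""
    if not word_forms:
        return ""
    shortest = min(word_forms, key=len)
    for i, ch in enumerate(shortest):
        # Stop at the first position where any word-form disagrees.
        if any(form[i] != ch for form in word_forms):
            return shortest[:i]
    return shortest

print(shared_prefix(["walk", "walks", "walked", "walking", "walker", "walkers"]))  # walk
print(shared_prefix(["produce", "production"]))  # produc
```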