Lecture 2 Flashcards
Corpus Linguistics, Tokenization, Regular Expressions, Zipf's Law, Word Frequencies
Text Foundations:
Symbolic Encoding Systems
From a sociolinguistic perspective, any
symbolically encoded language is considered to be text; the symbols come in many types
Text Foundations:
Alphabetic
symbols (letters) to represent vowels and
consonants in words
(e.g., Latin)
Text Foundations:
Abjads
symbols to represent consonants, with
diacritics (or reused consonants) to represent vowels
(e.g., Arabic, Hebrew)
Text Foundations:
Abugida/Syllabary
Symbol systems representing
consonants plus inherent vowels
(e.g., Devanagari)
Text Foundations:
Semanto-phonetic
Symbols that carry packets of
meaning and/or sounds
(e.g., Chinese)
Tokenization (AKA word segmentation)
Separate the characters into individual words
Tokenization:
Recognize and deal with punctuation
*Apostrophes (one word it's vs. two words it 's)
*Hyphenation (snow-laden)
*Periods (keep with abbrev. or separate as sentence markers)
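The three punctuation decisions above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the `ABBREVIATIONS` set, the `tokenize` function, and its `split_apostrophes` flag are all hypothetical names introduced here, and the pattern only handles unaccented Latin letters.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc."}

# A word may contain internal apostrophes or hyphens (it's, snow-laden);
# any other non-space character becomes its own one-character token.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|[^\sA-Za-z]")

def tokenize(text, split_apostrophes=False):
    """Split text into tokens, making the punctuation choices explicit."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            # Periods: keep the period attached to a known abbreviation.
            tokens.append(chunk)
            continue
        for tok in WORD_RE.findall(chunk):
            if split_apostrophes and "'" in tok and len(tok) > 1:
                # Apostrophes: treat "it's" as two tokens, "it" and "'s".
                head, _, tail = tok.partition("'")
                tokens.extend([head, "'" + tail])
            else:
                tokens.append(tok)
    return tokens

print(tokenize("Dr. Smith said it's snow-laden."))
print(tokenize("Dr. Smith said it's snow-laden.", split_apostrophes=True))
```

Here hyphenated forms stay together as one token and a sentence-final period is split off; each of those is a design choice, and the opposite choice is just as defensible depending on the analysis.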
Morphology (To stem or not to stem?)
How words are formed through processes such as affixation (adding prefixes or suffixes)
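To make the stemming question concrete, here is a rough sketch of suffix stripping. The `SUFFIXES` tuple, the `naive_stem` function, and the `min_stem` threshold are assumptions made for illustration; established stemmers use many more ordered rules and exceptions.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no rule applied: leave the word unchanged

for w in ["walking", "walked", "walks", "sing", "bus"]:
    print(w, "->", naive_stem(w))
```

With stemming, "walking", "walked", and "walks" are all counted as one word; without it they are three. The `min_stem` guard is what keeps "sing" and "bus" from being cut down to fragments.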
Corpus Linguistics
Following tokenization of a “body” of text, we can develop a summary analysis: the linguistic equivalent of descriptive statistics.
Corpus Linguistics: Methods 1
The Corpus is an ensemble of text
Corpus Linguistics: Methods 2
Summary and detailed statistical
analysis
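One common summary statistic is a rank-frequency table of the tokens, which is also the starting point for checking Zipf's law (frequency falling off roughly in inverse proportion to rank). The sketch below assumes the text has already been tokenized; `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(tokens):
    """Return (rank, word, count, relative frequency) rows, most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    table = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        table.append((rank, word, count, count / total))
    return table

tokens = "the cat and the dog and the bird".split()
for rank, word, count, rel in frequency_table(tokens):
    print(rank, word, count, round(rel, 3))
```

On a real corpus, plotting count against rank on log-log axes gives a quick visual check of how closely the distribution follows a Zipf-like curve.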
Corpus Linguistics: Methods 3
Before summarizing: Filter out
junk
Corpus Linguistics: Methods 4
Normalization issues -
Ignore capitalization at beginning
of sentence? Is "They" the same
word as "they"?
Corpus Linguistics: Methods 5
Ignore other capitalization? In a
name such as "Unilever Corporation", is "Corporation" the same word as "corporation"?
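The two capitalization policies can be compared side by side. This is a sketch under the assumption that the corpus is already split into sentences, each a list of tokens; `normalize_case` and its `lowercase_all` flag are hypothetical names.

```python
def normalize_case(sentences, lowercase_all=False):
    """Lowercase either every token or only the first token of each sentence."""
    normalized = []
    for sentence in sentences:
        if lowercase_all:
            # Policy 2: ignore all capitalization ("Corporation" == "corporation").
            normalized.append([tok.lower() for tok in sentence])
        elif sentence:
            # Policy 1: ignore only sentence-initial capitalization ("They" == "they").
            normalized.append([sentence[0].lower()] + sentence[1:])
        else:
            normalized.append([])
    return normalized

corpus = [["They", "joined", "Unilever", "Corporation"]]
print(normalize_case(corpus))
print(normalize_case(corpus, lowercase_all=True))
```

Neither policy is free of errors: lowercasing only the first token also lowercases a proper noun that happens to start a sentence, while lowercasing everything merges names with ordinary words.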
Corpus Structures
Isolated (scattered)
Categorized (boxes)
Overlapping (Venn diagram)
Temporal (over time)