Lecture 2 Flashcards
Corpus Linguistics, Tokenization, Regular Expressions, Zipf's Law, Word Frequencies
Text Foundations:
Symbolic Encoding Systems
From a sociolinguistic perspective: any
symbolically encoded language is considered to be text; symbols come in many types
Text Foundations:
Alphabetic
symbols (letters) to represent vowels and
consonants in words
(e.g., Latin)
Text Foundations:
Abjads
symbols to represent consonants, with
diacritics (or reused consonants) to represent vowels
(e.g., Arabic, Hebrew)
Text Foundations:
Abugida/Syllabary
Symbol systems representing
consonants plus inherent vowels
(e.g., Devanagari)
Text Foundations:
Semanto-phonetic
Symbols that carry packets of
meaning and/or sounds
(e.g., Chinese)
Tokenization (AKA word segmentation)
Separate the characters into individual words
Tokenization:
Recognize and deal with punctuation
* Apostrophes (one word “it's” vs. two words “it” + “'s”?)
* Hyphenation (snow-laden)
* Periods (keep with abbrev. or separate as sentence markers)
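A minimal sketch of how these punctuation decisions might be handled with Python's re module; the pattern, the example sentence, and the choice to keep contractions, hyphenated words, and dotted abbreviations as single tokens are illustrative assumptions, not a prescribed tokenizer:

    import re

    # Illustrative tokenizer: keeps contractions ("it's"), hyphenated words
    # ("snow-laden"), and dotted abbreviations ("U.S.") as single tokens;
    # any other punctuation mark becomes its own token.
    TOKEN_PATTERN = re.compile(r"""
        (?:[A-Za-z]\.){2,}     # abbreviations such as U.S. or e.g.
      | \w+(?:[-']\w+)*        # words, optionally hyphenated or with apostrophes
      | [^\w\s]                # any other punctuation mark as its own token
    """, re.VERBOSE)

    def tokenize(text):
        return TOKEN_PATTERN.findall(text)

    print(tokenize("It's a snow-laden day in the U.S., isn't it?"))
    # ["It's", 'a', 'snow-laden', 'day', 'in', 'the', 'U.S.', ',', "isn't", 'it', '?']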
Morphology (To stem or not to stem?)
How words are formed through processes such as affixation (adding prefixes or suffixes)
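As an illustration of the “to stem or not to stem?” question, here is a deliberately naive suffix-stripping sketch; the suffix list and the minimum-stem-length guard are assumptions for illustration only (real systems typically use something like the Porter stemmer):

    # Naive suffix stripper, for illustration only. It conflates "connected",
    # "connecting", and "connection" to one stem, while the min_stem_len guard
    # keeps very short words such as "sing" from being over-stemmed.
    SUFFIXES = ("ation", "ing", "ion", "ness", "ed", "es", "s")

    def naive_stem(word, min_stem_len=3):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                return word[:-len(suffix)]
        return word

    print([naive_stem(w) for w in ["connected", "connecting", "connection", "sing"]])
    # ['connect', 'connect', 'connect', 'sing']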
Corpus Linguistics
Following tokenization of a “body” of text, we can develop a summary analysis, the linguistic equivalent of descriptive statistics.
Corpus Linguistics: Methods 1
The corpus is an ensemble of texts
Corpus Linguistics: Methods 2
Summary and detailed statistical
analysis
Corpus Linguistics: Methods 3
Before summarizing: Filter out
junk
Corpus Linguistics: Methods 4
Normalization issues -
Ignore capitalization at the beginning of a sentence? Is “They” the same word as “they”?
Corpus Linguistics: Methods 5
Ignore other capitalization? In a name such as Unilever Corporation, is “Corporation” the same word as “corporation”?
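One way to approach these normalization questions is a simple heuristic: lowercase a capitalized token only when it begins a sentence. The sketch below is an illustrative assumption, not a rule from the lecture:

    # Lowercase a capitalized token only at sentence start, so "They" becomes
    # "they", while "Corporation" inside the name "Unilever Corporation" stays
    # distinct from the common noun "corporation".
    def normalize(tokens):
        normalized = []
        sentence_start = True
        for tok in tokens:
            if sentence_start and tok[:1].isupper():
                normalized.append(tok.lower())
            else:
                normalized.append(tok)
            sentence_start = tok in {".", "!", "?"}
        return normalized

    print(normalize(["They", "met", "at", "Unilever", "Corporation", "."]))
    # ['they', 'met', 'at', 'Unilever', 'Corporation', '.']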
Corpus Structures
Isolated (scattered)
Categorized (boxes)
Overlapping (Venn diagram)
Temporal (over time)
Terminology for word occurrences:
Tokens
the total number of words
Distinct Tokens (sometimes called word types)
the number of distinct words, not counting repetitions; also called the vocabulary, notated as |V|
Frequency distribution
a list of all tokens with their frequency, usually sorted in the order of decreasing
frequency
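A short sketch of these three quantities using Python's collections.Counter; the sample sentence is invented for illustration:

    from collections import Counter

    tokens = "the cat sat on the mat and the dog sat too".split()

    freq_dist = Counter(tokens)        # frequency distribution
    num_tokens = len(tokens)           # total number of tokens
    vocab_size = len(freq_dist)        # |V|, the number of distinct tokens

    print(num_tokens)                  # 11
    print(vocab_size)                  # 8
    print(freq_dist.most_common(2))    # [('the', 3), ('sat', 2)]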
Zipf’s Law (1949)
In a natural language corpus, the frequency of any word is inversely proportional
to its rank in a frequency table
e.g.,
* Most frequent word (rank = 1) is twice as frequent as the 2nd most frequent
* Most frequent word (rank = 1) is 3 times as frequent as the 3rd most frequent, etc.
Rank (r)
The numerical position of a word in a list sorted by decreasing frequency (f).
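A small sketch of what the law predicts: if frequency f is inversely proportional to rank r, then r × f stays roughly constant down the frequency table. The counts below are invented purely for illustration:

    # Hypothetical frequency table (rank, word, frequency) following Zipf exactly.
    freq_table = [(1, "the", 60000), (2, "of", 30000), (3, "and", 20000), (4, "to", 15000)]

    for rank, word, freq in freq_table:
        # Under Zipf's Law, frequency is roughly C / rank, so rank * frequency stays near C.
        print(word, rank * freq)       # rank * freq is 60000 in every row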
Stopwords
Commonly occurring words such as “the” account for a large fraction of the text, so eliminating them greatly reduces the number of words in a text
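A minimal sketch of stopword filtering; the stopword set below is a tiny illustrative subset, not a standard list:

    # Tiny illustrative stopword set; real lists contain a few hundred entries.
    STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

    tokens = "the cat sat on the mat and the dog sat too".split()
    content_tokens = [t for t in tokens if t.lower() not in STOPWORDS]

    print(content_tokens)   # ['cat', 'sat', 'on', 'mat', 'dog', 'sat', 'too']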
Unigrams, Bigrams, Trigrams, Etc.
An n-gram is simply a token that spans n consecutive words: a bigram spans two, a trigram three, and so on
Frequency of an n-gram
the percentage of times the n-gram occurs among all the n-grams of the corpus; useful in corpus statistics
Bigram frequency
percentage occurrence of the bigram in
the corpus
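A short sketch of counting bigrams and computing their relative frequencies; the sample sentence is invented for illustration:

    from collections import Counter

    tokens = "new york is a big city and new york never sleeps".split()

    bigrams = list(zip(tokens, tokens[1:]))   # consecutive token pairs
    bigram_counts = Counter(bigrams)
    total_bigrams = len(bigrams)

    # Relative frequency of the bigram ("new", "york"): 2 occurrences / 10 bigrams.
    print(bigram_counts[("new", "york")] / total_bigrams)   # 0.2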
Mutual information
a measure of how strongly two words/tokens are associated with each other
Statistic: Pointwise Mutual Information (PMI)
Given a pair of words, PMI compares the probability that the two occur together as a joint event to the probability that they occur individually and that their co-occurrences are simply the result of chance. The more strongly connected two items are, the higher their PMI value.
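A sketch of the PMI statistic itself, computed from unigram and bigram relative frequencies; the corpus is the same invented example as above, and the exact numbers are only illustrative:

    import math
    from collections import Counter

    tokens = "new york is a big city and new york never sleeps".split()

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = n_unigrams - 1

    def pmi(w1, w2):
        # PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
        p_joint = bigram_counts[(w1, w2)] / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        return math.log2(p_joint / (p_w1 * p_w2))

    # Positive PMI: the pair co-occurs more often than chance would predict.
    print(round(pmi("new", "york"), 2))   # 2.6
    print(round(pmi("big", "city"), 2))   # 3.6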
Regular Expressions
Regular expressions are a miniature programming language for pattern matching. Sometimes people refer to the topic as RegEx.
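A few illustrative patterns written with Python's re module; the patterns and example text are assumptions for demonstration, not from the lecture:

    import re

    text = "Email ada@example.org by 2024-01-15, or call 555-0100."

    # Dates in YYYY-MM-DD form.
    print(re.findall(r"\d{4}-\d{2}-\d{2}", text))    # ['2024-01-15']

    # A simple (deliberately loose) email pattern.
    print(re.findall(r"[\w.]+@[\w.]+\.\w+", text))   # ['ada@example.org']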
Chomsky’s Hierarchy of Languages -1
The greatest complexity is represented by the outer circle: no restrictions on how the grammar can operate
Chomsky’s Hierarchy of Languages -2
Next, a context-sensitive grammar covers most human languages; grammar rules are complex and varied, and parsing is difficult but theoretically possible
Chomsky’s Hierarchy of Languages -3
Context-free languages include most programming languages, such as C or Python; these have clear rules and can always be parsed
Chomsky’s Hierarchy of Languages -4
Regular grammars are the simplest of all; regular expression languages fit this category
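A small sketch of the practical consequence of this hierarchy: a regular expression can recognize a regular pattern such as (ab)*, but a structure with arbitrary nesting (balanced parentheses) is context-free rather than regular, so it needs a parser instead of a single regular expression. The examples below are illustrative:

    import re

    # A regular language: zero or more repetitions of "ab".
    print(bool(re.fullmatch(r"(ab)*", "ababab")))   # True
    print(bool(re.fullmatch(r"(ab)*", "abba")))     # False

    # Balanced parentheses are context-free: no regular expression handles
    # arbitrary nesting depth, but a tiny counter-based check can.
    def balanced(s):
        depth = 0
        for ch in s:
            depth += {"(": 1, ")": -1}.get(ch, 0)
            if depth < 0:
                return False
        return depth == 0

    print(balanced("((a)(b))"))   # True
    print(balanced("((a)"))       # False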