Lecture 2 Flashcards
Corpus Linguistics, Tokenization, Regular Expressions, Zipf's Law, Word Frequencies
Text Foundations:
Symbolic Encoding Systems
From a sociolinguistic perspective: any
symbolically encoded language is considered to be text; symbols come in many types
Text Foundations:
Alphabetic
symbols (letters) to represent vowels and
consonants in words
(e.g., Latin)
Text Foundations:
Abjads
symbols to represent consonants, with
diacritics (or reused consonants) to represent vowels
(e.g., Arabic, Hebrew)
Text Foundations:
Abugida/Syllabary
Symbol systems representing
consonants plus inherent vowels
(e.g., Devanagari)
Text Foundations:
Semanto-phonetic
Symbols that carry packets of
meaning and/or sounds
(e.g., Chinese)
Tokenization (AKA word segmentation)
Separate the characters into individual words
Tokenization:
Recognize and deal with punctuation
* Apostrophes (one word “it's” vs. two words “it” + “'s”?)
* Hyphenation (snow-laden)
* Periods (keep with abbrev. or separate as sentence markers)
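A minimal sketch of how these punctuation decisions might be handled with Python's re module; the pattern, the example sentence, and the choice to keep contractions, hyphenated words, and dotted abbreviations as single tokens are illustrative assumptions, not a prescribed tokenizer:

    import re

    # Illustrative tokenizer: keeps contractions ("it's"), hyphenated words
    # ("snow-laden"), and dotted abbreviations ("U.S.") as single tokens;
    # any other punctuation mark becomes its own token.
    TOKEN_PATTERN = re.compile(r"""
        (?:[A-Za-z]\.){2,}     # abbreviations such as U.S. or e.g.
      | \w+(?:[-']\w+)*        # words, optionally hyphenated or with apostrophes
      | [^\w\s]                # any other punctuation mark as its own token
    """, re.VERBOSE)

    def tokenize(text):
        return TOKEN_PATTERN.findall(text)

    print(tokenize("It's a snow-laden day in the U.S., isn't it?"))
    # ["It's", 'a', 'snow-laden', 'day', 'in', 'the', 'U.S.', ',', "isn't", 'it', '?']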
Morphology (To stem or not to stem?)
How words are formed through processes such as affixation (adding prefixes or suffixes)
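As an illustration of the “to stem or not to stem?” question, here is a deliberately naive suffix-stripping sketch; the suffix list and the minimum-stem-length guard are assumptions for illustration only (real systems typically use something like the Porter stemmer):

    # Naive suffix stripper, for illustration only. It conflates "connected",
    # "connecting", and "connection" to one stem, while the min_stem_len guard
    # keeps very short words such as "sing" from being over-stemmed.
    SUFFIXES = ("ation", "ing", "ion", "ness", "ed", "es", "s")

    def naive_stem(word, min_stem_len=3):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                return word[:-len(suffix)]
        return word

    print([naive_stem(w) for w in ["connected", "connecting", "connection", "sing"]])
    # ['connect', 'connect', 'connect', 'sing']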
Corpus Linguistics
Following tokenization of a “body” of text, we can develop a summary analysis, the linguistic equivalent of descriptive statistics.
Corpus Linguistics: Methods 1
The corpus is an ensemble of texts
Corpus Linguistics: Methods 2
Summary and detailed statistical
analysis
Corpus Linguistics: Methods 3
Before summarizing: Filter out
junk
Corpus Linguistics: Methods 4
Normalization issues -
Ignore capitalization at the beginning of a sentence? Is “They” the same word as “they”?
Corpus Linguistics: Methods 5
Ignore other capitalization? In a name such as Unilever Corporation, is “Corporation” the same word as “corporation”?
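One way to approach these normalization questions is a simple heuristic: lowercase a capitalized token only when it begins a sentence. The sketch below is an illustrative assumption, not a rule from the lecture:

    # Lowercase a capitalized token only at sentence start, so "They" becomes
    # "they", while "Corporation" inside the name "Unilever Corporation" stays
    # distinct from the common noun "corporation".
    def normalize(tokens):
        normalized = []
        sentence_start = True
        for tok in tokens:
            if sentence_start and tok[:1].isupper():
                normalized.append(tok.lower())
            else:
                normalized.append(tok)
            sentence_start = tok in {".", "!", "?"}
        return normalized

    print(normalize(["They", "met", "at", "Unilever", "Corporation", "."]))
    # ['they', 'met', 'at', 'Unilever', 'Corporation', '.']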
Corpus Structures
Isolated (scattered)
Categorized (boxes)
Overlapping (Venn diagram)
Temporal (over time)
Terminology for word occurrences:
Tokens
the total number of words
Distinct Tokens (sometimes called word types)
the number of distinct words, not counting repetitions; also called the vocabulary, notated as |V|
Frequency distribution
a list of all tokens with their frequency, usually sorted in the order of decreasing
frequency
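A short sketch of these three quantities using Python's collections.Counter; the sample sentence is invented for illustration:

    from collections import Counter

    tokens = "the cat sat on the mat and the dog sat too".split()

    freq_dist = Counter(tokens)        # frequency distribution
    num_tokens = len(tokens)           # total number of tokens
    vocab_size = len(freq_dist)        # |V|, the number of distinct tokens

    print(num_tokens)                  # 11
    print(vocab_size)                  # 8
    print(freq_dist.most_common(2))    # [('the', 3), ('sat', 2)]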
Zipf’s Law (1949)
In a natural language corpus, the frequency of any word is inversely proportional
to its rank in a frequency table
e.g.,
* Most frequent word (rank = 1) is twice as frequent as the 2nd most frequent
* Most frequent word (rank = 1) is 3 times as frequent as the 3rd most frequent, etc.
Rank (r)
The numerical position of a word in a list sorted by decreasing frequency (f).
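A small sketch of what the law predicts: if frequency f is inversely proportional to rank r, then r × f stays roughly constant down the frequency table. The counts below are invented purely for illustration:

    # Hypothetical frequency table (rank, word, frequency) following Zipf exactly.
    freq_table = [(1, "the", 60000), (2, "of", 30000), (3, "and", 20000), (4, "to", 15000)]

    for rank, word, freq in freq_table:
        # Under Zipf's Law, frequency is roughly C / rank, so rank * frequency stays near C.
        print(word, rank * freq)       # rank * freq is 60000 in every row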
Stopwords
Commonly occurring words such as “the” account for a large fraction of the text, so eliminating them greatly reduces the number of words in a text
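A minimal sketch of stopword filtering; the stopword set below is a tiny illustrative subset, not a standard list:

    # Tiny illustrative stopword set; real lists contain a few hundred entries.
    STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

    tokens = "the cat sat on the mat and the dog sat too".split()
    content_tokens = [t for t in tokens if t.lower() not in STOPWORDS]

    print(content_tokens)   # ['cat', 'sat', 'on', 'mat', 'dog', 'sat', 'too']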
Unigrams, Bigrams, Trigrams, Etc.
An n-gram is simply a token that spans n consecutive words: a bigram spans two, a trigram three, and so on
Frequency of an n-gram
the percentage of times the n-gram occurs among all the n-grams of the corpus; useful in corpus statistics
Bigram frequency
percentage occurrence of the bigram in
the corpus
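A short sketch of counting bigrams and computing their relative frequencies; the sample sentence is invented for illustration:

    from collections import Counter

    tokens = "new york is a big city and new york never sleeps".split()

    bigrams = list(zip(tokens, tokens[1:]))   # consecutive token pairs
    bigram_counts = Counter(bigrams)
    total_bigrams = len(bigrams)

    # Relative frequency of the bigram ("new", "york"): 2 occurrences / 10 bigrams.
    print(bigram_counts[("new", "york")] / total_bigrams)   # 0.2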
Mutual information
a measure of how strongly two words/tokens are associated with each other
Statistic: Pointwise Mutual Information (PMI)
Given a pair of words, PMI compares the probability that the two occur together as a joint event to the probability that they occur individually and that their co-occurrences are simply the result of chance. The more strongly connected two items are, the higher their PMI value.
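A sketch of the PMI statistic itself, computed from unigram and bigram relative frequencies; the corpus is the same invented example as above, and the exact numbers are only illustrative:

    import math
    from collections import Counter

    tokens = "new york is a big city and new york never sleeps".split()

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = n_unigrams - 1

    def pmi(w1, w2):
        # PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
        p_joint = bigram_counts[(w1, w2)] / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        return math.log2(p_joint / (p_w1 * p_w2))

    # Positive PMI: the pair co-occurs more often than chance would predict.
    print(round(pmi("new", "york"), 2))   # 2.6
    print(round(pmi("big", "city"), 2))   # 3.6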
Regular Expressions
Regular expressions are a miniature programming language for pattern matching. Sometimes people refer to the topic as RegEx.
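A few illustrative patterns written with Python's re module; the patterns and example text are assumptions for demonstration, not from the lecture:

    import re

    text = "Email ada@example.org by 2024-01-15, or call 555-0100."

    # Dates in YYYY-MM-DD form.
    print(re.findall(r"\d{4}-\d{2}-\d{2}", text))    # ['2024-01-15']

    # A simple (deliberately loose) email pattern.
    print(re.findall(r"[\w.]+@[\w.]+\.\w+", text))   # ['ada@example.org']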
Chomsky’s Hierarchy of Languages -1
The greatest complexity is represented by the outer circle: no restrictions on how the grammar can operate
Chomsky’s Hierarchy of Languages -2
Next, a context-sensitive grammar covers most human languages; grammar rules are complex and varied, and parsing is difficult but theoretically possible
Chomsky’s Hierarchy of Languages -3
Context-free languages include most programming languages, such as C or Python; these have clear rules and can always be parsed
Chomsky’s Hierarchy of Languages -4
Regular grammars are the simplest of all; regular expression languages fit this category
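A small sketch of the practical consequence of this hierarchy: a regular expression can recognize a regular pattern such as (ab)*, but a structure with arbitrary nesting (balanced parentheses) is context-free rather than regular, so it needs a parser instead of a single regular expression. The examples below are illustrative:

    import re

    # A regular language: zero or more repetitions of "ab".
    print(bool(re.fullmatch(r"(ab)*", "ababab")))   # True
    print(bool(re.fullmatch(r"(ab)*", "abba")))     # False

    # Balanced parentheses are context-free: no regular expression handles
    # arbitrary nesting depth, but a tiny counter-based check can.
    def balanced(s):
        depth = 0
        for ch in s:
            depth += {"(": 1, ")": -1}.get(ch, 0)
            if depth < 0:
                return False
        return depth == 0

    print(balanced("((a)(b))"))   # True
    print(balanced("((a)"))       # False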