Lecture 2 Flashcards

Corpus Linguistics, Tokenization, Regular Expressions, Zipf's Law, Word Frequencies

1
Q

Text Foundations:
Symbolic Encoding Systems

A

From a sociolinguistic perspective, any
symbolically encoded language is considered text; symbols come in many types

2
Q

Text Foundations:
Alphabetic

A

symbols (letters) to represent vowels and
consonants in words
(e.g., Latin)

3
Q

Text Foundations:
Abjads

A

symbols to represent consonants, with
diacritics (or reused consonants) to represent vowels
(e.g., Arabic, Hebrew)

4
Q

Text Foundations:
Abugida/Syllabary

A

Symbol systems representing
consonants plus inherent vowels
(e.g., Devanagari)

5
Q

Text Foundations:
Semanto-phonetic

A

Symbols that carry packets of
meaning and/or sounds
(e.g., Chinese)

6
Q

Tokenization (AKA word segmentation)

A

Separate a stream of characters into individual words (tokens)
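As a minimal sketch, whitespace-and-punctuation tokenization can be done with a regular expression (the pattern below is illustrative, not a production tokenizer):

```python
import re

def tokenize(text):
    # Pull out runs of word characters; punctuation is simply dropped
    return re.findall(r"\w+", text)

print(tokenize("The cat sat on the mat."))
# → ['The', 'cat', 'sat', 'on', 'the', 'mat']
```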

7
Q

Tokenization:
Recognize and deal with punctuation

A

*Apostrophes (one token it's vs. two tokens it + 's)
*Hyphenation (snow-laden: one token or two?)
*Periods (keep with abbrev. or separate as sentence markers)
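A sketch of one possible rule set for these cases, assuming we split clitics off the host word but keep internal hyphens (the pattern is illustrative only):

```python
import re

# Illustrative rules: split clitics such as 's off the host word,
# keep internal hyphens (snow-laden stays one token),
# and treat periods and commas as separate tokens.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|'\w+|[.,!?;]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("It's a snow-laden day."))
# → ['It', "'s", 'a', 'snow-laden', 'day', '.']
```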

8
Q

Morphology (To stem or not to stem?)

A

How words are formed through processes such as affixation (adding prefixes or suffixes)
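To stem is to strip affixes down to a base form. A toy suffix stripper, far cruder than a real stemmer such as Porter's, might look like:

```python
# Illustrative only: strip a few common suffixes, keeping a stem of
# at least three characters
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

print([crude_stem(w) for w in ["walking", "walked", "walks", "walk"]])
# → ['walk', 'walk', 'walk', 'walk']
```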

9
Q

Corpus Linguistics

A

Following tokenization of a “body” of text, we can develop a summary analysis: the linguistic equivalent of descriptive statistics.

10
Q

Corpus Linguistics: Methods 1

A

The corpus is an ensemble of texts

11
Q

Corpus Linguistics: Methods 2

A

Summary and detailed statistical
analysis

12
Q

Corpus Linguistics: Methods 3

A

Before summarizing: Filter out
junk

13
Q

Corpus Linguistics: Methods 4

A

Normalization issues:
ignore capitalization at the beginning
of a sentence? Is “They” the same
word as “they”?

14
Q

Corpus Linguistics: Methods 5

A

Ignore other capitalization? In a
name such as Unilever Corporation, is “Corporation” the same word as “corporation”?

15
Q

Corpus Structures

A

Isolated (scattered)
Categorized (boxes)
Overlapping (Venn diagram)
Temporal (over time)

16
Q

Terminology for word occurrences:

A

Tokens: the total number of words

Distinct tokens (sometimes called word types): the number of distinct words, not counting repetitions; the set of distinct words is sometimes called the
vocabulary, notated |V|
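In Python, both counts fall out directly from a token list (a toy corpus is assumed here):

```python
tokens = "the cat sat on the mat".split()

num_tokens = len(tokens)       # total tokens: 6
vocab = set(tokens)            # distinct tokens (word types)
print(num_tokens, len(vocab))  # → 6 5, i.e. |V| = 5
```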

17
Q

Frequency distribution

A

a list of all distinct tokens with their frequencies, usually sorted in order of decreasing
frequency
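`collections.Counter` gives exactly this (toy corpus assumed):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat".split()
freq = Counter(tokens)
# most_common() sorts by decreasing frequency
print(freq.most_common(2))  # → [('the', 3), ('cat', 2)]
```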

18
Q

Zipf’s Law (1949)

A

In a natural language corpus, the frequency of any word is inversely proportional
to its rank in a frequency table

E.g.:
* The most frequent word (rank = 1) is twice as frequent as the 2nd most frequent
* The most frequent word (rank = 1) is 3 times as frequent as the 3rd most frequent, etc.
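Equivalently, frequency × rank stays roughly constant. With hypothetical counts:

```python
# Hypothetical rank-ordered counts in a corpus where Zipf's law holds:
freqs = [1000, 500, 333, 250, 200]  # ranks 1..5
for rank, f in enumerate(freqs, start=1):
    print(rank, f, rank * f)  # products stay near 1000
```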

19
Q
  • Rank (r)
A

The numerical position of a word in a list sorted by decreasing frequency (f).

20
Q

Stopwords

A

Commonly occurring words such as
“the” account for a large fraction of text, so eliminating them greatly reduces the number of words in a text
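Filtering against a stopword set is a one-liner (the list here is a tiny illustration; real stopword lists run to hundreds of words):

```python
STOPWORDS = {"the", "a", "an", "of", "on", "and", "is"}  # illustrative subset

tokens = "the cat sat on the mat".split()
content = [t for t in tokens if t not in STOPWORDS]
print(content)  # → ['cat', 'sat', 'mat']
```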

21
Q

Unigrams, Bigrams, Trigrams, Etc.

A

An n-gram is a sequence of n consecutive tokens: a unigram is a single token, a bigram spans two, a trigram three, and so on
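A sketch of n-gram extraction by sliding a window over the token list:

```python
def ngrams(tokens, n):
    # Every contiguous run of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# → [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```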

22
Q

Frequency of an n-gram

A

the percentage of times the n-gram occurs among all the n-grams of the corpus; a useful quantity in corpus statistics

23
Q

Bigram frequency

A

percentage occurrence of the bigram in
the corpus
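With a toy corpus, a bigram's frequency is its count divided by the total number of bigrams:

```python
from collections import Counter

tokens = "to be or not to be".split()
bigrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
counts = Counter(bigrams)

# ('to', 'be') occurs 2 times out of 5 bigrams
freq = counts[("to", "be")] / len(bigrams)
print(freq)  # → 0.4
```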

24
Q

Mutual information

A

a measure of how strongly two words/tokens are
associated with each other

25
Q

Statistic: Pointwise Mutual Information (PMI)

A
  • Given a pair of words, compares the
    probability that the two occur together
    as a joint event to the probability that
    they occur individually, i.e., that their
    co-occurrences are simply the result of
    chance
  • The more strongly connected 2
    items are, the higher will be their
    PMI value
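PMI is usually written PMI(x, y) = log2(P(x, y) / (P(x) · P(y))). A direct sketch with hypothetical probabilities:

```python
import math

def pmi(p_xy, p_x, p_y):
    # log2 of the observed joint probability over the probability
    # expected under independence; positive → attraction, negative → avoidance
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical figures for a collocation like "New York":
print(pmi(0.001, 0.01, 0.002))  # ≈ 5.64, strongly associated
```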
26
Q

Regular Expressions

A

Regular expressions are a miniature programming language for pattern matching. The topic is sometimes referred to as RegEx.
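A quick example with Python's `re` module, matching four-digit years:

```python
import re

text = "Zipf stated the law in 1949."
years = re.findall(r"\b\d{4}\b", text)
print(years)  # → ['1949']
```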

27
Q

Chomsky’s Hierarchy of Languages -1

A

The greatest complexity is represented by the outer circle: no restrictions on how the grammar can operate

28
Q

Chomsky’s Hierarchy of Languages -2

A

Next, a context-sensitive grammar
covers most human languages;
grammar rules are complex and varied,
and parsing is difficult but theoretically
possible

29
Q

Chomsky’s Hierarchy of Languages -3

A

Context-free languages include most
programming languages, such as C or
Python; these have clear rules and
can always be parsed

30
Q

Chomsky’s Hierarchy of Languages -4

A

Regular grammars are the simplest
of all; regular expression
languages fit this category