Stream-based Text Processing Flashcards
What is a DFA ?
Deterministic Finite Automaton
Formally defined as a 5-tuple: (Q, Σ, δ, q0, F)
– Q is a set of states
– Σ is an input alphabet
– δ: Q × Σ → Q is a transition function
– q0 ∈ Q is the start state
– F ⊆ Q is a set of final (accepting) states
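The 5-tuple can be written down almost literally as data. The sketch below is only an illustration: the automaton (it accepts binary strings containing an even number of "1"s) and all names such as `DELTA` and `dfa_accepts` are made up for this card, not taken from the course material.

```python
# Hypothetical DFA: accepts strings over {"0", "1"} with an even number of "1"s.
Q = {"even", "odd"}                 # set of states
SIGMA = {"0", "1"}                  # input alphabet
DELTA = {                           # transition function: (state, symbol) -> state
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
Q0 = "even"                         # start state
F = {"even"}                        # accepting states


def dfa_accepts(text):
    """Run the DFA over text; exactly one state is active at any time."""
    state = Q0
    for symbol in text:
        if symbol not in SIGMA:
            return False            # symbol outside the alphabet: reject
        state = DELTA[(state, symbol)]
    return state in F


print(dfa_accepts("1010"))  # True: two "1"s
print(dfa_accepts("1000"))  # False: one "1"
```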
What is an NFA ?
Non-deterministic Finite Automaton
Formally: (Q, Σ, δ, q0, F), where δ maps a state and a symbol to a set of possible next states
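One common way to simulate an NFA is to track the set of all states it could be in. The sketch below assumes a made-up automaton without ε-transitions that accepts strings over {a, b} ending in "ab"; the names `NFA_DELTA` and `nfa_accepts` are illustrative only.

```python
# Hypothetical NFA: accepts strings over {"a", "b"} that end in "ab".
NFA_DELTA = {                       # (state, symbol) -> set of next states
    ("s0", "a"): {"s0", "s1"},      # non-deterministic choice on "a"
    ("s0", "b"): {"s0"},
    ("s1", "b"): {"s2"},
}
NFA_START = "s0"
NFA_FINAL = {"s2"}


def nfa_accepts(text):
    """Track every state the NFA could be in after each symbol."""
    current = {NFA_START}
    for symbol in text:
        following = set()
        for state in current:
            # A missing entry means there is no transition: that branch dies.
            following |= NFA_DELTA.get((state, symbol), set())
        current = following
    # Accept if at least one possible run ends in an accepting state.
    return bool(current & NFA_FINAL)


print(nfa_accepts("aab"))  # True
print(nfa_accepts("aba"))  # False
```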
Regular expression: literal ?
/words/
REX: Character class ?
/./ (any character)
REX: any of the characters ?
/[abc]/ (a or b or c)
REX: range of characters ?
/[0-9]/, /[a-z]/, /[A-Za-z0-9_-]/
REX: case sensitive ?
/[a-z_-]/ (lower case only)
/[A-Z_-]/ (upper case only)
^
start of line
$
end of line
\s
white space
\S
not white space
\d
digit
\D
not digit
\w
word character (letter, digit, or underscore)
\W
not a word character
(a|b)
a or b
[^abc]
not a or b or c
*
0 or more
+
1 or more
?
0 or 1
{3}
exactly 3
{3,}
3 or more
{3,5}
3 or 4 or 5
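The patterns on the cards above can be tried out in Python's `re` module, which supports the same character classes and quantifiers. This is a small sketch; the sample strings are invented for illustration.

```python
import re

text = "Order 66 shipped to area_51 on 2024-01-15"

# \d+ : one or more digits
print(re.findall(r"\d+", text))     # ['66', '51', '2024', '01', '15']

# \d{4} : exactly four digits in a row
print(re.findall(r"\d{4}", text))   # ['2024']

# ? : zero or one occurrence of the preceding item
print(re.findall(r"colou?r", "color colour"))   # ['color', 'colour']

# [^abc] : any character except a, b, or c
print(re.sub(r"[^abc]", "", "a1b2c3"))          # abc

# ^ and $ anchor the match to the start and end of the line;
# {3,5} allows three to five repetitions.
print(bool(re.search(r"^a{3,5}$", "aaaa")))     # True
print(bool(re.search(r"^a{3,5}$", "aa")))       # False
```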
Perl: what do the variable name prefixes $, @, and % mean ?
$a — a scalar variable
@a — an array variable
%a — an associative array (or hash)
Perl: how to read a line from input ?
<>
What are morphemes ?
smallest meaningful units of a word
What is the function of stems and affixes ?
stems provide the “main” meaning,
affixes act as modifiers, e.g. prefixes and suffixes
What is tokenization ?
break plain text into words or tokens (numbers included);
tokens are often normalized, e.g. converted to lower case
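A minimal tokenizer along these lines, assuming that a token is any run of ASCII letters or digits and that normalization means lower-casing (real tokenizers handle punctuation, apostrophes, and Unicode more carefully):

```python
import re


def tokenize(text):
    """Lower-case the text, then split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("The 3 Cats sat, quietly!"))
# ['the', '3', 'cats', 'sat', 'quietly']
```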
What is stemming (stem extraction) ?
map words into stems
What is lemmatization (word-form restoration) ?
convert words to their dictionary form
studying → study
studies → study
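The difference between stemming and lemmatization shows up in a toy comparison. Everything below is a made-up sketch: the suffix list and the tiny lemma dictionary are assumptions for illustration, not a real stemmer (such as Porter's) or a real lexicon.

```python
# Toy suffix list and lemma dictionary, invented for this example.
SUFFIXES = ("ing", "ies", "es", "ed", "s")
LEMMAS = {"studying": "study", "studies": "study", "studied": "study"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned as-is."""
    return LEMMAS.get(word, word)


print(toy_stem("studies"))       # stud  (a stem need not be a real word)
print(toy_lemmatize("studies"))  # study (a dictionary form)
```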
What are the three text/word processing methods ?
tokenization
stemming
lemmatization
What are morphological processes ?
word transformations that occur as regular transformations within a language
What are three main Morphological Processes ?
- inflection,
- derivation,
- compounding.
What is inflection ?
a change in the form of a word within the same lexical class, e.g. work, working, worked.
What is derivation ?
a transformation to a different lexical class, e.g. teach (verb) → teacher (noun).
What is compounding ?
two or more words are combined into one, e.g.
lady + bug → ladybug
What is Zipf’s law ?
when the words in a text are ranked by frequency, the product of rank and frequency is “quite constant”
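To check the rank × frequency product on a text, one can count tokens and rank them by frequency. This is a sketch with an invented helper name, `rank_frequency_products`; the tiny token list is artificial and chosen so the arithmetic is easy to follow, so it says nothing about how well the law holds on real text.

```python
from collections import Counter


def rank_frequency_products(tokens):
    """Return (word, rank, frequency, rank * frequency), most frequent first."""
    ranked = Counter(tokens).most_common()
    return [
        (word, rank, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


tokens = "a a a a a a b b b c c d".split()
for row in rank_frequency_products(tokens):
    print(row)
# ('a', 1, 6, 6)
# ('b', 2, 3, 6)
# ('c', 3, 2, 6)
# ('d', 4, 1, 4)
```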
What are n-grams ?
contiguous sequences of n items from the input text
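Extracting n-grams amounts to sliding a window of length n over the input. A minimal sketch, where the "items" can be words or characters (the function name `ngrams` is chosen for this example):

```python
def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


print(ngrams(["to", "be", "or", "not"], 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not')]
print(ngrams("abc", 2))
# [('a', 'b'), ('b', 'c')]
```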