Stream-based Text Processing Flashcards

1
Q

What is a DFA?

A

Deterministic Finite Automaton
Formally defined as a 5-tuple: (Q, Σ, δ, q0, F)
– Q is a finite set of states
– Σ is an input alphabet
– δ: Q × Σ → Q is a transition function
– q0 ∈ Q is the start state
– F ⊆ Q is a set of final or accepting states
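
The 5-tuple can be written down directly as data. The following is a minimal sketch, assuming a made-up example automaton (binary strings with an even number of 1s); the state names and the helper `dfa_accepts` are illustrative, not part of the card.

```python
# Hypothetical example DFA: accepts binary strings containing an even number of 1s.
STATES = {"even", "odd"}            # Q
ALPHABET = {"0", "1"}               # Σ
DELTA = {                           # δ: Q × Σ → Q
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START = "even"                      # q0
ACCEPTING = {"even"}                # F


def dfa_accepts(text: str) -> bool:
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = START
    for symbol in text:
        if symbol not in ALPHABET:
            return False  # this sketch simply rejects symbols outside Σ
        state = DELTA[(state, symbol)]
    return state in ACCEPTING
```

Because δ returns exactly one state for each (state, symbol) pair, the loop tracks a single current state and does one table lookup per input symbol.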

2
Q

What is an NFA?

A

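Non-deterministic Finite Automaton

Formally: (Q, Σ, δ, q0, F), the same 5-tuple as a DFA, except that δ maps a state and a symbol to a set of possible next states.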
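
One common way to simulate an NFA is to track the set of all states it could be in. The sketch below assumes a made-up automaton over {a, b} that accepts strings ending in "ab" and has no ε-transitions; the names `NFA_DELTA` and `nfa_accepts` are hypothetical.

```python
# Hypothetical example NFA: accepts strings over {a, b} that end in "ab".
NFA_DELTA = {                       # δ maps (state, symbol) to a SET of states
    ("q0", "a"): {"q0", "q1"},      # non-deterministic choice on "a"
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
NFA_START = "q0"
NFA_ACCEPTING = {"q2"}


def nfa_accepts(text: str) -> bool:
    """Track every state the NFA could be in after each symbol."""
    current = {NFA_START}
    for symbol in text:
        # Union of all possible next states; missing entries mean "no move".
        current = set().union(
            *(NFA_DELTA.get((state, symbol), set()) for state in current)
        )
    return bool(current & NFA_ACCEPTING)
```

The input is accepted if at least one of the possible end states is accepting.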

3
Q

Regular expression: literal?

A

/words/

4
Q

REX: character class?

A

/./ (any character)

5
Q

REX: any of the characters?

A

/[abc]/ (a or b or c)

6
Q

REX: range of characters?

A

/[0-9]/, /[a-z]/, /[A-Za-z0-9_-]/
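
The literal, any-character, set, and range patterns from the cards above can be tried out with Python's `re` module. This is a small sketch with an assumed sample string; Python writes patterns as raw strings rather than between slashes, and the `+` in the last pattern means "one or more".

```python
import re

text = "cat_9 dog-7"  # assumed sample input

literal = re.findall(r"dog", text)             # literal characters match themselves
any_char = re.findall(r"c.t", text)            # . matches any one character (not newline by default)
one_of = re.findall(r"[cd]", text)             # any one of the listed characters
digits = re.findall(r"[0-9]", text)            # a range of characters
idents = re.findall(r"[A-Za-z0-9_-]+", text)   # several ranges plus _ and - in one class

print(literal, any_char, one_of, digits, idents)
```

A hyphen placed last inside the brackets, as in `[A-Za-z0-9_-]`, is taken literally instead of marking a range.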

7
Q

REX: case sensitive?

A

/[_-]/

/[A-Z_-]/

8
Q

^

A

start of line

9
Q

$

A

end of line

10
Q

\s

A

whitespace character

11
Q

\S

A

non-whitespace character

12
Q

\d

A

digit

13
Q

\D

A

non-digit character

14
Q

\w

A

word character (letter, digit, or underscore)

15
Q

\W

A

non-word character
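
The anchors and shorthand classes from the last few cards can be checked against a sample line. This is a sketch using Python's `re` module with an assumed input string; note that in Python `^` and `$` match at the start and end of the whole string unless `re.MULTILINE` is set.

```python
import re

line = "Order 66 shipped"  # assumed sample input

starts_with_order = re.search(r"^Order", line) is not None    # ^ anchors at the start
ends_with_shipped = re.search(r"shipped$", line) is not None  # $ anchors at the end
spaces = re.findall(r"\s", line)           # each whitespace character
non_space_runs = re.findall(r"\S+", line)  # runs of non-whitespace characters
digit_chars = re.findall(r"\d", line)      # each digit
without_digits = re.sub(r"\d", "", line)   # delete every digit
only_digits = re.sub(r"\D", "", line)      # delete every non-digit
word_runs = re.findall(r"\w+", line)       # runs of word characters
non_word_chars = re.findall(r"\W", line)   # each non-word character
```

Each uppercase class (`\S`, `\D`, `\W`) matches exactly the characters its lowercase counterpart does not.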

16
Q

(a|b)

A

a or b

17
Q

[^abc]

A

any character except a, b, or c

18
Q

*

A

zero or more repetitions

19
Q

+

A

one or more repetitions

20
Q

?

A

zero or one occurrence (optional)

21
Q

{3}

A

exactly 3 repetitions

22
Q

{3,}

A

3 or more repetitions

23
Q

{3,5}

A

3, 4, or 5 repetitions
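
Alternation, the negated class, and the quantifiers on cards 16 to 23 can be checked with whole-string matches. The helper `full` below is a hypothetical convenience wrapper around Python's `re.fullmatch`, and the test strings are assumed examples.

```python
import re


def full(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


checks = {
    "alternation": full(r"(a|b)", "b"),           # a or b
    "negated class": full(r"[^abc]", "d"),        # any character except a, b, c
    "star on empty": full(r"a*", ""),             # * allows zero repetitions
    "plus on empty": full(r"a+", ""),             # + needs at least one
    "optional u": full(r"colou?r", "color"),      # ? makes the u optional
    "exactly three": full(r"\d{3}", "123"),       # {3}
    "at least three": full(r"\d{3,}", "12345"),   # {3,}
    "three to five": full(r"\d{3,5}", "123456"),  # {3,5} rejects six digits
}
```

A quantifier applies to the single item just before it, so grouping with parentheses is needed to repeat a longer sequence.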

24
Q

What do the Perl variable prefixes $, @, and % mean?

A

$a — a scalar variable
@a — an array variable
%a — an associative array (or hash)

25
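How does Perl read a line from input?
<> (the diamond operator, which reads the next line from the files named on the command line or from standard input)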
26
What are morphemes?
The smallest meaningful units of a word.
27
What are the functions of stems and affixes?
Stems provide the “main” meaning; affixes act as modifiers (e.g. prefixes and suffixes).
28
What is tokenization?
Breaking plain text into words or tokens (including numbers); tokens are often normalized, e.g. by converting them to lower case.
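
A very small tokenizer along these lines might look as follows. This is a sketch that assumes tokens are runs of ASCII letters and digits; the function name `tokenize` is hypothetical, and real tokenizers handle apostrophes, hyphens, and non-ASCII text more carefully.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text, then return runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = tokenize("The 3 Cats sat.")
```

Lowercasing first means "The" and "the" end up as the same token, which matters when counting word frequencies.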
29
What is stemming (stem extraction)?
Mapping words to their stems.
30
What is lemmatization (word-form restoration)?
Converting words to their dictionary form, e.g. studying → study, studies → study.
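
The difference between the two can be illustrated with toy versions of each. This sketch assumes a hand-picked suffix list and a two-entry lookup table, both hypothetical; real stemmers and lemmatizers use much richer rules and dictionaries.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ies", "es", "ed", "s")

# Hypothetical lookup table from inflected form to dictionary form.
LEMMAS = {"studying": "study", "studies": "study"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)
```

With these assumptions, `toy_stem("studies")` yields `"stud"`, which is not a dictionary word, while `toy_lemmatize("studies")` yields `"study"`.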
31
What are the three text/word processing methods?
Tokenization, stemming, and lemmatization.
32
What are morphological processes?
Word transformations that occur as regular transformations in a language.
33
What are the three main morphological processes?
1. inflection, 2. derivation, 3. compounding.
34
What is inflection?
A change in the form of a word that keeps it in the same lexical class, e.g. work, working, worked.
35
What is derivation?
A transformation of a word into a different lexical class, e.g. teach (verb) → teacher (noun).
36
What is compounding?
Two or more words are combined into one, e.g. lady + bug → ladybug.
37
What is Zipf’s law?
The product of a word's frequency rank and its frequency in a text is approximately constant.
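
The rank-times-frequency product can be tabulated for any token list. The sketch below uses a tiny made-up token sequence, and the function name `rank_frequency` is hypothetical; on such a small sample the product is only loosely constant, and the law is an empirical observation about large texts.

```python
from collections import Counter


def rank_frequency(tokens: list[str]) -> list[tuple[int, str, int, int]]:
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    ranked = Counter(tokens).most_common()
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


table = rank_frequency("a a a a b b c".split())
```

Under Zipf's law the last column should stay in the same rough range as the rank increases.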
38
What are n-grams?
Contiguous sequences of n items taken from the input text.
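
Extracting n-grams amounts to sliding a window of length n over the sequence. This is a minimal sketch; the function name `ngrams` is hypothetical, and the items can be words or characters.

```python
def ngrams(items, n: int) -> list[tuple]:
    """Return every contiguous run of n items as a tuple, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


word_bigrams = ngrams(["the", "cat", "sat", "down"], 2)
char_trigrams = ngrams("stream", 3)
```

A sequence of length k yields k - n + 1 n-grams, and none at all when n is larger than k.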