Stream-based Text Processing Flashcards

1
Q

What is a DFA?

A

Deterministic Finite Automaton
Formally defined as a 5-tuple (Q, Σ, δ, q0, F):
– Q is a finite set of states
– Σ is an input alphabet
– δ: Q × Σ → Q is a transition function
– q0 ∈ Q is the start state
– F ⊆ Q is a set of final or accepting states
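
The 5-tuple can be written down almost literally in Python. This is only an illustrative sketch: the example automaton (binary strings with an even number of 1s), the state names, and the function name dfa_accepts are assumptions made for this example, not part of the original card.

```python
# Hypothetical DFA accepting binary strings with an even number of 1s.
STATES = {"even", "odd"}            # Q
ALPHABET = {"0", "1"}               # Σ
DELTA = {                           # δ: Q × Σ → Q
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START = "even"                      # q0
ACCEPTING = {"even"}                # F


def dfa_accepts(text):
    """Run the DFA over text; accept if it halts in a state of F."""
    state = START
    for symbol in text:
        if symbol not in ALPHABET:
            return False            # symbols outside Σ are rejected
        state = DELTA[(state, symbol)]
    return state in ACCEPTING
```

Because δ is a function, there is exactly one current state at every step.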

2
Q

What is an NFA?

A

Non-deterministic Finite Automaton

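Formally: (Q, Σ, δ, q0, F), as for a DFA, except that δ maps a state and an input symbol to a set of possible next states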
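
An NFA can be simulated by tracking the set of states it might be in. The sketch below is an illustration under assumptions: the example automaton (strings over {a, b} ending in "ab"), the state names, and nfa_accepts are hypothetical, and ε-transitions are not handled.

```python
# Hypothetical NFA accepting strings over {a, b} that end in "ab".
NFA_DELTA = {                       # δ maps (state, symbol) to a SET of states
    ("q0", "a"): {"q0", "q1"},      # on "a", stay in q0 or guess the final "ab" starts here
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
NFA_START = "q0"
NFA_ACCEPTING = {"q2"}


def nfa_accepts(text):
    """Accept if at least one possible run ends in an accepting state."""
    current = {NFA_START}
    for symbol in text:
        following = set()
        for state in current:
            # A missing entry means that branch of the computation dies.
            following |= NFA_DELTA.get((state, symbol), set())
        current = following
    return bool(current & NFA_ACCEPTING)
```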

3
Q

Regular expression: literal?

A

/words/

4
Q

REX: character class?

A

/./ (any character)

5
Q

REX: any of the characters?

A

/[abc]/ (a or b or c)

6
Q

REX: range of characters?

A

/[0-9]/, /[a-z]/, /[A-Za-z0-9_-]/
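
The bracket forms can be tried with Python's re module. This is a small sketch; the sample strings and variable names are made up for illustration, and Python writes patterns as raw strings rather than between slashes.

```python
import re

sample = "user_42 logged-in at 09:15"

# [abc] matches any single one of the listed characters.
abc_hits = re.findall(r"[abc]", "cabbage")

# [0-9] matches one digit.
digit_hits = re.findall(r"[0-9]", sample)

# A trailing "-" inside brackets is a literal hyphen; "+" repeats the class.
ident_hits = re.findall(r"[A-Za-z0-9_-]+", sample)
```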

7
Q

REX: case sensitive?

A

/[_-]/

/[A-Z_-]/

8
Q

^

A

start of line

9
Q

$

A

end of line
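
A quick way to see the two anchors at work is to filter lines with them. The helper name lines_matching and the sample log text are assumptions made for this sketch.

```python
import re


def lines_matching(pattern, text):
    """Return the lines of text in which pattern is found."""
    return [line for line in text.splitlines() if re.search(pattern, line)]


log = "error: disk full\nwarning: low memory\nfatal error"

starts_with_error = lines_matching(r"^error", log)   # anchored at start of line
ends_with_error = lines_matching(r"error$", log)     # anchored at end of line
```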

10
Q

\s

A

whitespace character

11
Q

\S

A

non-whitespace character

12
Q

\d

A

digit

13
Q

\D

A

not a digit

14
Q

\w

A

word character (letter, digit, or underscore)

15
Q

\W

A

not a word character
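
The shorthand classes \s, \S, \d, \D, \w, and \W can be compared side by side on one string. This sketch uses Python's re module; the sample string and variable names are illustrative assumptions.

```python
import re

sample = "Room 42, floor 3"

digits = re.findall(r"\d", sample)        # each digit
no_digits = re.sub(r"\d", "", sample)     # drop every digit
only_digits = re.sub(r"\D", "", sample)   # drop every non-digit
words = re.findall(r"\w+", sample)        # runs of word characters
non_words = re.findall(r"\W", sample)     # spaces and punctuation
chunks = re.findall(r"\S+", sample)       # runs of non-whitespace
pieces = re.split(r"\s+", sample)         # split on whitespace
```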

16
Q

(a|b)

A

a or b

17
Q

[^abc]

A

any single character except a, b, or c
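
Alternation and negated classes look alike but do opposite jobs; a short Python sketch makes the contrast visible. The sample strings and variable names here are assumptions for illustration.

```python
import re

ALTERNATION = re.compile(r"(a|b)")   # matches "a" or "b"
NEGATED = re.compile(r"[^abc]")      # matches any one character that is NOT a, b, or c

alt_hits = [m.group(0) for m in ALTERNATION.finditer("cab")]
neg_hits = NEGATED.findall("abcde")
```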

18
Q

*

A

0 or more

19
Q

+

A

1 or more

20
Q

?

A

0 or 1

21
Q

{3}

A

exactly 3

22
Q

{3,}

A

3 or more

23
Q

{3,5}

A

3 or 4 or 5
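
The quantifiers on cards 18 to 23 can be checked with whole-string matches. This is a sketch using Python's re.fullmatch; the helper name whole_match and the test strings are assumptions for the example.

```python
import re


def whole_match(pattern, text):
    """True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


results = {
    "ab* on a": whole_match(r"ab*", "a"),                 # * allows zero b's
    "ab+ on a": whole_match(r"ab+", "a"),                 # + needs at least one b
    "colou?r on color": whole_match(r"colou?r", "color"), # ? makes u optional
    r"\d{3} on 1234": whole_match(r"\d{3}", "1234"),      # exactly 3 digits
    r"\d{3,} on 12345": whole_match(r"\d{3,}", "12345"),  # 3 or more digits
    r"\d{3,5} on 123456": whole_match(r"\d{3,5}", "123456"),  # 3 to 5 digits
}
```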

24
Q

What do the Perl variable prefixes $, @, and % mean?

A

$a — a scalar variable
@a — an array variable
%a — an associative array (or hash)

25
Q

How does Perl read a line from input?

A

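<> (reads the next line from the files named on the command line, or from standard input if none are given)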

26
Q

What are morphemes?

A

the smallest meaningful units of a word

27
Q

What are the functions of stems and affixes?

A

stems provide the “main” meaning;

affixes act as modifiers, e.g., prefixes and suffixes

28
Q

What is tokenization?

A

breaking plain text into words or tokens (numbers included);

tokens are usually normalized, e.g., converted to lower case
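
A very small tokenizer along these lines could look as follows. It is a sketch under assumptions: the function name tokenize is hypothetical, and treating only runs of ASCII letters and digits as tokens is a simplification that real tokenizers refine.

```python
import re


def tokenize(text):
    """Lower-case the text, then return runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())
```

For example, tokenize("The 3 Cats sat, quietly!") yields the, 3, cats, sat, quietly.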

29
Q

What is stemming?

A

map words into stems

30
Q

What is lemmatization?

A

convert words to their dictionary form (lemma)
studying → study
studies → study
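
The difference between stemming and lemmatization shows up on the word "studies". The toy code below only illustrates the idea: the suffix list, the tiny lemma table, and both function names are assumptions, not a real stemmer or lemmatizer.

```python
# Toy suffix list and lemma table; both are hypothetical and far from complete.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"studying": "study", "studies": "study", "worked": "work"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the dictionary table; unknown words pass through."""
    return LEMMAS.get(word, word)
```

Here naive_stem("studies") gives "studi", which is not a dictionary word, while lemmatize("studies") gives "study".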

31
Q

What are the three text/word processing methods?

A

tokenization
stemming
lemmatization

32
Q

What are morphological processes?

A

word transformations that occur as regular transformations in a language

33
Q

What are the three main morphological processes?

A
  1. inflection,
  2. derivation,
  3. compounding.
34
Q

What is inflection?

A
a change in the form of a word
that keeps the same lexical class,
e.g., work, working, worked.
35
Q

What is derivation?

A
transforming a word into a different lexical class,
e.g., teach (verb) → teacher (noun).
36
Q

What is compounding?

A

two or more words are combined

lady + bug → ladybug

37
Q

What is Zipf’s law?

A

the product of a word's rank and its frequency in a text is “quite constant”
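
The rank-times-frequency product can be tabulated for any token list. This is a minimal sketch; the function name rank_frequency is an assumption, and the tiny input in the usage note is made up, so the product is only roughly constant there. The law is an empirical observation about large texts.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    ranked = Counter(tokens).most_common()
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]
```

For the tokens a a a a b b c, the products are 4, 4, and 3.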

38
Q

What are n-grams?

A

contiguous sequences of n items from the input text
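
Extracting n-grams is a sliding window over the items. The sketch below assumes the items are already available as a sequence (words or characters); the function name ngrams is hypothetical.

```python
def ngrams(items, n):
    """Return every contiguous run of n items as a tuple, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields no n-grams.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]
```

For example, ngrams(["to", "be", "or", "not"], 2) gives (to, be), (be, or), (or, not), and passing a string gives character n-grams.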