Topic 1: Regular Expression Flashcards

1
Q

key components of RE

A

search, string, pattern, corpus

2
Q

what is a regular expression?

A

a language for specifying text search strings.

an expression that specifies a set of strings required for a particular purpose

3
Q

what is a string?

A

a sequence of symbols.

in text-based search, a string is a sequence of alphanumeric characters

4
Q

what is a pattern?

A

a specific sequence of characters/symbols. useful in RE for text searching

5
Q

a regular expression search requires 2 things. what are they?

A

a pattern to search for, and a corpus (the text to search through)

6
Q

what are the applications of regular expressions?

A
  1. test for a pattern within a string.
  2. selecting data in a database
  3. substitution
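
A minimal sketch of the three applications, using Python's standard `re` module. The sample text, the list of rows (a stand-in for a database column), and the simplified e-mail pattern are all assumptions made for illustration.

```python
import re

text = "The woodchuck saw another Woodchuck."

# 1. Test for a pattern within a string.
has_match = re.search(r"[wW]oodchuck", text) is not None

# 2. Select matching items from a collection
#    (a plain list standing in for a database column).
rows = ["alice@example.com", "not an email", "bob@example.org"]
selected = [r for r in rows if re.fullmatch(r"[^@\s]+@[^@\s]+\.\w+", r)]

# 3. Substitution: replace every match with another string.
replaced = re.sub(r"[wW]oodchuck", "groundhog", text)

print(has_match)
print(selected)
print(replaced)
```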
7
Q

what are the basic patterns in RE?

A
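  1. case sensitivity / disjunction of characters with brackets, e.g. [wW]oodchuck
  2. negation with a caret inside brackets, e.g. [^A-Z] (any character that is not an uppercase letter)
  3. range, e.g. [0-9] or [a-z]
  4. RE symbols: ? (optional), * (zero or more), + (one or more)
  5. RE: disjunction with |, precedence with parentheses, e.g. gupp(y|ies)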
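
The basic patterns can be tried out with a short sketch like the one below, using Python's `re` module. The sample strings are made up for illustration.

```python
import re

# Disjunction of characters: brackets match either case of the first letter.
disjunction = re.findall(r"[wW]oodchuck", "Woodchuck and woodchuck")

# Negation: a caret as the first character inside brackets.
negation = re.findall(r"[^A-Z]", "Oh NO")

# Range: any single digit.
digits = re.findall(r"[0-9]", "Chapter 12")

# ? makes the previous character optional.
optional = re.findall(r"colou?r", "color colour")

# * means zero or more of the previous character.
star = re.findall(r"oo*h!", "oh! ooh! oooh!")

# + means one or more of the previous character (or bracket set).
plus = re.findall(r"[0-9]+", "room 101, floor 3")

# | is disjunction; parentheses limit its scope (precedence).
pipe = re.findall(r"gupp(?:y|ies)", "guppy guppies")

print(disjunction, negation, digits, optional, star, plus, pipe)
```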
8
Q

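types of errors and their definitions

A
  1. false positive: matching a string that should not have been matched (e.g. "other" or "there" when searching for "the")
  2. false negative: failing to match a string that should have been matched (e.g. "The" when searching for "the")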

9
Q

what are the two efforts to reduce the error rate?

A
  1. increase accuracy / precision (minimize false positives)
  2. increase coverage / recall (minimize false negatives)
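
To make the two error types concrete, here is a small sketch that scores two patterns for finding the word "the". The sample sentence and the helper `precision_recall` are assumptions for illustration; the gold standard is built by checking each word directly.

```python
import re

def precision_recall(pattern, text, gold_spans):
    """Compare the spans matched by `pattern` with the gold spans."""
    predicted = {m.span() for m in re.finditer(pattern, text)}
    true_positives = len(predicted & gold_spans)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    return precision, recall

text = "The other day the cat sat there with the dog."

# Gold standard: spans of whole words equal to "the", ignoring case.
gold = {m.span() for m in re.finditer(r"[A-Za-z]+", text)
        if m.group().lower() == "the"}

# Naive pattern: false positives in "other" and "there", false negative on "The".
naive = precision_recall(r"the", text, gold)

# Refined pattern: word boundaries remove the false positives,
# [tT] removes the false negative.
refined = precision_recall(r"\b[tT]he\b", text, gold)

print(naive, refined)
```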

10
Q

what is a capture group?

A

the use of parentheses to store the text matched by a pattern in memory, so it can be referred to later (e.g. with \1).
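
A short sketch of a capture group and its back-reference in Python's `re` module. The sentences and the `user@host` string are illustrative assumptions.

```python
import re

# (.*) captures some text; \1 requires the same text to appear again.
pattern = r"the (.*)er they were, the \1er they will be"

same = re.search(pattern, "the bigger they were, the bigger they will be")
different = re.search(pattern, "the bigger they were, the faster they will be")

# The captured text can be read back from the match object.
captured = same.group(1)

# Capture groups can also be reused in a substitution.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

print(captured, different, swapped)
```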

11
Q

what is a corpus?

A

a computer-readable collection of text or speech
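example: the Brown corpus, a million-word collection of samples from 500 written English texts of different genres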
brown sentence?

12
Q

what is an utterance?

A

a unit of speech bounded by silence

13
Q

what are the types of disfluencies?

A

fragments, filled pauses

14
Q

give examples of fragments and filled pauses

A
  1. fragment: a broken-off word, e.g. "main-" in "main- mainly"
  2. filled pauses: uh, um

15
Q

definition of word types and tokens

A

word types are the number of distinct words in a corpus; tokens are the total number of running words
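
A minimal sketch of counting tokens and types in Python. The sentence is a sample chosen for illustration, and the tokenizer is a simplifying assumption: it lowercases the text and ignores punctuation.

```python
import re

text = ("They picnicked by the pool, then lay back on the grass "
        "and looked at the stars.")

# Tokens: every running word (punctuation ignored, case folded).
tokens = re.findall(r"[a-z']+", text.lower())

# Types: the distinct words among those tokens.
types = set(tokens)

print(len(tokens), len(types))  # 16 tokens, 14 types ("the" occurs 3 times)
```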

16
Q

why does code switching matter?

A

speakers often use multiple languages in a single communicative act. give example

17
Q

list three tasks commonly applied as part of any normalization process

A
  1. segmenting words (tokenizing)
  2. normalizing word formats
  3. segmenting sentences in a running text
18
Q

example of tokenization in UNIX

A
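  1. tokenization: put each word on its own line (tr)
  2. sorting (sort) and counting duplicates (uniq -c)
  3. merging upper & lower case (tr A-Z a-z)
  4. sorting by count (sort -n -r)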
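
The same steps can be sketched in Python. This is a rough equivalent of such a UNIX pipeline, not the pipeline itself; the function name `word_counts`, the sample text, and the alphabetical tie-breaking are assumptions.

```python
import re
from collections import Counter

def word_counts(text):
    # Tokenization: keep only runs of letters (roughly tr -sc 'A-Za-z' '\n').
    tokens = re.findall(r"[A-Za-z]+", text)
    # Merge upper and lower case (roughly tr A-Z a-z).
    lowered = [t.lower() for t in tokens]
    # Count duplicates (roughly sort | uniq -c).
    counts = Counter(lowered)
    # Sort by count, highest first; ties are broken alphabetically here.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

result = word_counts("The cat and the hat. THE END")
print(result)
```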
19
Q

what are tokenization and normalization?

A
  1. tokenization: the process of segmenting text into words
  2. normalization: the process of putting words into a standard format

20
Q

what are the issues in tokenization?

A

punctuation and symbols that occur inside tokens, e.g. Ph.D., AT&T, $45.55, 01/02/06

21
Q

goals of a tokenizer

A
  1. expand clitic contractions
  2. tokenize multiword expressions
  3. normalize tokens
22
Q

case folding

A

reducing all letters to lower case. however, in sentiment analysis case can carry information: US vs. us is an important distinction.

23
Q

word segmentation with max match algorithm

A

some languages don’t use spaces to mark word boundaries, e.g. Chinese, Thai, Japanese.
a standard baseline algorithm is the MaxMatch algorithm (greedy longest match)

24
Q

procedure in max match algorithm

A
  1. start pointer at the beginning of string.
  2. find the longest word in dictionary that matches the string starting at pointer.
  3. move pointer over the word
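  4. repeat from step 2 until the end of the string. if no dictionary word matches, advance the pointer by one character (a one-character word).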
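
The procedure above can be sketched as follows. The function name `max_match` and the toy dictionary are assumptions; an English string with the spaces removed is used only to show how the greedy choice can go wrong.

```python
def max_match(text, dictionary):
    """Greedy longest-match segmentation of `text` using `dictionary`."""
    words = []
    i = 0  # pointer into the string
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j  # move the pointer over the word
                break
        else:
            # No dictionary word starts here: emit a one-character word.
            words.append(text[i])
            i += 1
    return words

toy_dictionary = {"we", "can", "canon", "only", "see", "a", "ash",
                  "short", "ort", "distance", "ahead"}
segmented = max_match("wecanonlyseeashortdistanceahead", toy_dictionary)
print(segmented)
```

The greedy choice picks "canon" over "can", so the intended reading "we can only see a short distance ahead" is not recovered.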
25
Q

collapsing words: lemmatization

A

lemmatization is the task of determining that two words have the same root despite their surface differences.
it is carried out using morphological parsing.
morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes.

26
Q

what are the 2 broad classes of morphemes?

A

stems: the central morpheme of the word, supplying the main meaning
affixes: add additional meanings of various kinds

27
Q

what is stemming?

A

reducing terms to their stems. it is a crude chopping of affixes

28
Q

Porter stemmer algorithm

A

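a series of rewrite rules applied in steps (simplified):

  1. reduce plural suffixes: sses → ss, ies → i, s → (removed), e.g. caresses → caress, ponies → poni, cats → cat
  2. remove “ing” or “ed” when the stem contains a vowel, e.g. walking → walk, plastered → plaster (but sing → sing)
  3. rewrite longer suffixes: ational → ate, izer → ize, ator → ate, e.g. relational → relate, digitizer → digitize, operator → operate
  4. remove al, able, ate, e.g. revival → reviv, adjustable → adjust, activate → activ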
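
A heavily simplified sketch of these four steps in Python. It is not the full Porter stemmer: the real algorithm has more rules and checks a "measure" condition on the stem, which is omitted here. The function names are assumptions, and steps 2 and 3 rewrite suffixes (for example ational becomes ate) rather than deleting them.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def step_1a(word):
    # Plural suffixes: sses -> ss, ies -> i, ss stays, s is removed.
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def step_1b(word):
    # Remove "ing" or "ed" only if the remaining stem contains a vowel.
    for suffix in ("ing", "ed"):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and VOWEL.search(stem):
            return stem
    return word

def step_2(word):
    # Rewrite longer suffixes to a shorter form.
    for suffix, replacement in (("ational", "ate"), ("izer", "ize"), ("ator", "ate")):
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

def step_3(word):
    # Remove al, able, ate (no stem-length check in this sketch).
    for suffix in ("able", "ate", "al"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(step_1a("caresses"), step_1b("walking"), step_2("relational"), step_3("revival"))
```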
29
Q

sentence segmentation

A

the period “.” is quite ambiguous: it can end a sentence, but it also appears in abbreviations and numbers, e.g. Dr., .02%, 4.3
build a binary classifier that looks at each “.” and decides end of sentence / not end of sentence
the classifier can use hand-written rules, regular expressions, or machine learning (e.g. a decision tree)
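
A minimal rule-based sketch of such a classifier, combining a regular expression with a hand-written abbreviation list. The abbreviation list, the function name `split_sentences`, and the rule "punctuation followed by whitespace and an uppercase letter" are assumptions; a real system would need many more rules or a trained model.

```python
import re

# Hypothetical, very short abbreviation list for illustration.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "inc."}

def split_sentences(text):
    sentences = []
    start = 0
    # Candidate boundary: ., ! or ? followed by whitespace and an uppercase letter.
    # Periods inside numbers such as 4.3 or .02% never qualify.
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # classified as "not end of sentence"
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = "Dr. Smith paid 4.3 dollars. The fee rose .02% today. It was fine."
sentences = split_sentences(sample)
print(sentences)
```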

30
Q

what is string distance

A

a measure of how alike two strings are

31
Q

edit distance

A

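lets us quantify how similar two strings are: the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other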

32
Q

minimum edit distance path

A

example: the shortest edit path between the strings intention and execution
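
A sketch of the standard dynamic-programming computation of minimum edit distance. The function name and the `sub_cost` parameter are assumptions: insertions and deletions cost 1, and a substitution costs 2 by default (the same as a deletion plus an insertion) or 1 for plain Levenshtein distance.

```python
def min_edit_distance(source, target, sub_cost=2):
    n, m = len(source), len(target)
    # d[i][j] = cost of turning source[:i] into target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all of source[:i]
    for j in range(1, m + 1):
        d[0][j] = j  # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution or match
            )
    return d[n][m]

print(min_edit_distance("intention", "execution"))              # 8
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
```

One shortest path: delete i, substitute n with e, substitute t with x, keep e, insert c, substitute n with u, keep t, i, o, n.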