Topic 1: Regular Expression Flashcards by Y TH

key components of RE

search, string, pattern, corpus

what is a regular expression?

a language for specifying text search strings.

an expression that specifies a set of strings required for a particular purpose.

what is a string?

a sequence of symbols.

in text-based search, a string is a sequence of alphanumeric characters.

what is a pattern?

a specific sequence of characters/symbols. used in RE for text searching.

a regular expression search requires two things. what are they?

a pattern to search for, and a corpus (the text to search through).
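
A minimal sketch of the two ingredients, using Python's standard `re` module. The corpus sentence and variable names are made up for illustration.

```python
import re

# The corpus: the text to search through (a made-up sentence).
corpus = "The woodchuck saw another woodchuck near the woods."
# The pattern: what we are searching for.
pattern = r"woodchuck"

first = re.search(pattern, corpus)         # first match object, or None
all_matches = re.findall(pattern, corpus)  # every non-overlapping match

print(first.start(), all_matches)
```

`re.search` stops at the first match, while `re.findall` scans the whole corpus.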

what are the applications of regular expressions?

testing for a pattern within a string.
selecting data in a database.
substitution.
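
The three applications above might look like this in Python. This is only a sketch: the sample strings are invented, the email-like pattern is deliberately simplistic, and a plain list stands in for a database table.

```python
import re

# 1. Test for a pattern within a string: does the title contain a digit?
has_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole") is not None

# 2. Select data: keep only rows that look like an email address.
#    (A list stands in for a database table; the pattern is a rough assumption.)
rows = ["alice@example.com", "not an address", "bob@example.org"]
selected = [row for row in rows if re.fullmatch(r"[^@\s]+@[^@\s]+\.[a-z]+", row)]

# 3. Substitution: replace one spelling with another.
fixed = re.sub(r"colour", "color", "my favourite colour")

print(has_digit, selected, fixed)
```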

what are the basic patterns in RE?

case sensitivity / disjunction of characters (with example)
negation (with example)
range
RE symbols: ? * +
RE: disjunction with |, precedence
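
One possible example for each pattern in the list above, written with Python's `re` module. The sample strings are invented for illustration.

```python
import re

examples = {
    # disjunction of characters: brackets match either case of the first letter
    "disjunction": re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck"),
    # negation: a caret inside brackets means "not an upper-case letter"
    "negation": re.findall(r"[^A-Z]", "Oyfn"),
    # range: any single digit
    "range": re.findall(r"[0-9]", "Chapter 12"),
    # ? makes the previous character optional
    "optional": re.findall(r"colou?r", "color colour"),
    # * means zero or more of the previous character
    "star": re.findall(r"baa*", "ba baa baaa"),
    # + means one or more of the previous character
    "plus": re.findall(r"[0-9]+", "room 101, floor 3"),
    # | binds loosely: this reads as "guppy" OR "ies"
    "pipe_no_parens": re.findall(r"guppy|ies", "guppy guppies"),
    # parentheses raise precedence: "gupp" followed by "y" or "ies"
    "pipe_with_parens": re.findall(r"gupp(?:y|ies)", "guppy guppies"),
}

for name, matches in examples.items():
    print(name, matches)
```

The last two entries show why precedence matters: without grouping, the alternation applies to the whole pattern on each side of the bar.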

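types of errors and their definitions

1. false positive: matching strings that should not have been matched

2. false negative: failing to match strings that should have been matched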

what are the two ways to reduce the error rate?

1. increase accuracy / precision (minimize false positives)

2. increase coverage / recall (minimize false negatives)
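
A small sketch of how refining a pattern trades off the two error types, using the search for the word "the" as the running case. The sentence, the token-level gold labels, and the helper `precision_recall` are assumptions made for this illustration.

```python
import re

text = "The other day the cat sat there; then the dog bathed."
tokens = text.split()

# Gold labels: a token counts as the word "the" if, ignoring case and
# surrounding punctuation, it is exactly "the".
gold = [tok.strip(".,;").lower() == "the" for tok in tokens]

def precision_recall(pattern):
    """Token-level precision and recall of a regex against the gold labels."""
    predicted = [re.search(pattern, tok) is not None for tok in tokens]
    tp = sum(p and g for p, g in zip(predicted, gold))      # correct matches
    fp = sum(p and not g for p, g in zip(predicted, gold))  # false positives
    fn = sum(g and not p for p, g in zip(predicted, gold))  # false negatives
    # Note: assumes at least one prediction and one gold item (no zero division).
    return tp / (tp + fp), tp / (tp + fn)

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    precision, recall = precision_recall(pattern)
    print(f"{pattern:12} precision={precision:.2f} recall={recall:.2f}")
```

Adding `[tT]` removes the false negative on "The" (higher recall), and the word boundaries remove false positives such as "other" and "there" (higher precision).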

what is a capture group?

the use of parentheses to store a matched pattern in memory.
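
A short sketch of a capture group in Python's `re` module. The stored text can be reused inside the pattern with `\1` or in a replacement string. The sentences are examples chosen for illustration.

```python
import re

# The parentheses store whatever matched; \1 requires the same text again.
pattern = r"the (.*)er they were, the \1er they will be"

same = re.search(pattern, "the bigger they were, the bigger they will be")
different = re.search(pattern, "the bigger they were, the faster they will be")

# In a substitution, \1 refers back to the captured digits.
wrapped = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

print(same.group(1), different, wrapped)
```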

what is a corpus?

a computer-readable collection of text or speech
Brown corpus?
Brown sentence?

what is an utterance?

a unit of speech bounded by silence

what are the components of disfluencies?

fragments, filled pauses

give examples of fragments and filled pauses

1. fragment: main- (as in "main- mainly")

2. filled pauses: uh, um

definition of word types and tokens

types are the number of distinct words in a corpus. tokens are the total number of running words.
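
A quick way to count both, as a sketch. It assumes that punctuation is ignored and that case is folded before counting, which is one of several reasonable conventions.

```python
import re

text = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# Tokens: every running word (lower-cased, punctuation dropped).
tokens = re.findall(r"[a-z]+", text.lower())
# Types: the distinct words among those tokens.
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

Here "the" occurs three times, so it contributes three tokens but only one type.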

what is code switching?

speakers often use multiple languages in a single communicative act. give example

list three tasks commonly applied as part of any normalization process

segmenting words (tokenizing)
normalizing word formats
segmenting sentences in running text

example of tokenization in UNIX

tokenization
sorting
merging upper & lower case
sorting counts
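
The four steps above are usually run as a UNIX pipeline built from `tr`, `sort` and `uniq -c`. Below is a rough Python analogue of the same steps, offered as a sketch; the function name `word_frequencies` is made up for this example.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Tokenize, merge case, count, and sort by descending count."""
    # Tokenization: keep runs of letters (similar in spirit to tr -sc 'A-Za-z' '\n').
    tokens = re.findall(r"[A-Za-z]+", text)
    # Merging upper and lower case (similar to tr A-Z a-z).
    tokens = [tok.lower() for tok in tokens]
    # Sorting and counting duplicates (similar to sort | uniq -c).
    counts = Counter(tokens)
    # Sorting counts, most frequent first (similar to sort -n -r);
    # ties are broken alphabetically here, which is an arbitrary choice.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(word_frequencies("The cat and the hat. THE END"))
```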

what are tokenization and normalization?

1. tokenization: the process of segmenting text into words.

2. normalization: the process of putting words into a standard format.

what are the issues in tokenization?

how to handle symbols such as punctuation and apostrophes. give example

goals of a tokenizer

expand clitic contractions
tokenize multiword expressions
normalize tokens
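
A toy tokenizer that touches the first two goals, as a sketch only. The `CLITICS` and `MULTIWORD` tables are tiny hypothetical stand-ins for the large resources a real tokenizer would use, and expanded clitics come out lower-cased as a simplification.

```python
import re

# Hypothetical lookup tables, far smaller than anything used in practice.
CLITICS = {"what're": "what are", "we're": "we are", "don't": "do not"}
MULTIWORD = ["New York"]

def tokenize(text):
    # Protect multiword expressions by joining their parts with underscores.
    for expression in MULTIWORD:
        text = text.replace(expression, expression.replace(" ", "_"))
    # A token is a run of letters/apostrophes/underscores, or one other symbol.
    raw_tokens = re.findall(r"[A-Za-z_']+|[^\sA-Za-z_']", text)
    tokens = []
    for tok in raw_tokens:
        expansion = CLITICS.get(tok.lower())
        if expansion:
            tokens.extend(expansion.split())      # expand clitic contraction
        else:
            tokens.append(tok.replace("_", " "))  # restore multiword expression
    return tokens

print(tokenize("What're we doing in New York?"))
```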

case folding

reducing all letters to lower case. however, in sentiment analysis case can carry sentiment, and the distinction between US and us is important.

word segmentation with max match algorithm

some languages don’t use spaces to mark word boundaries, e.g. Chinese, Thai, Japanese.
the standard algorithm is the MaxMatch (maximum matching) algorithm.

procedure in max match algorithm

1. start a pointer at the beginning of the string.
2. find the longest word in the dictionary that matches the string starting at the pointer.
3. move the pointer over the word.
4. repeat from step 2 until the end of the string.
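
The procedure above can be sketched in a few lines of Python. One detail is an assumption not stated on the card: when no dictionary word matches at the pointer, this version emits a single character and moves on. The toy dictionary is invented for the example.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    words = []
    pointer = 0
    while pointer < len(text):
        # Try the longest candidate first, shrinking until a dictionary word is found.
        for end in range(len(text), pointer, -1):
            if text[pointer:end] in dictionary:
                break
        else:
            # Assumption: nothing matched, so fall back to a one-character word.
            end = pointer + 1
        words.append(text[pointer:end])
        pointer = end  # move the pointer over the word just taken
    return words

toy_dictionary = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", toy_dictionary))
```

The output shows the weakness of greedy matching on English: "theta bled own there" instead of "the table down there". The method tends to work better for languages with shorter words.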

collapsing words: lemmatization

lemmatization is the task of determining that two words have the same root despite surface differences. it is carried out using morphological parsing. morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes.

what are the 2 broad classes of morphemes?

stems: the central morpheme of the word, supplying the main meaning.
affixes: add additional meanings of various kinds.

what is stemming?

reducing terms to their stems. it is a crude chopping of affixes.

Porter stemmer algorithm

give example also
1. reduce plural endings: sses → ss, s → (nothing)
2. remove "ing" or "ed" when the remaining stem contains a vowel
3. rewrite ational → ate, izer → ize, ator → ate
4. remove al, able, ate
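
A heavily simplified sketch of the four rule groups, one function per group. This is not the full Porter stemmer: the real algorithm has more rules and extra conditions on the stem that are left out here, and the function names are invented for this illustration. Each function is shown on its own rather than chained.

```python
VOWELS = "aeiou"

def step1_plurals(word):
    """sses -> ss, ies -> i, ss -> ss, s -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def step2_ing_ed(word):
    """Strip ing/ed only when the remaining stem contains a vowel."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if any(ch in VOWELS for ch in stem) else word
    return word

def step3_rewrite(word):
    """ational -> ate, izer -> ize, ator -> ate."""
    for suffix, replacement in (("ational", "ate"), ("izer", "ize"), ("ator", "ate")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

def step4_remove(word):
    """Remove al, able, ate (stem-length conditions omitted in this sketch)."""
    for suffix in ("able", "ate", "al"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(step1_plurals("caresses"), step2_ing_ed("walking"),
      step3_rewrite("relational"), step4_remove("revival"))
```

The vowel check in the second function is what keeps "sing" from being reduced to "s".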

sentence segmentation

the period "." is quite ambiguous: it may mark a sentence boundary, an abbreviation, or part of a number. give example: Dr., .02%, 4.3
build a binary classifier that looks at each "." and decides end of sentence / not end of sentence, using hand-written rules, regular expressions, or machine learning (e.g. a decision tree).
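
A hand-written-rules version of such a classifier, as a minimal sketch. The two rules and the tiny `ABBREVIATIONS` list are assumptions chosen to cover the examples on the card; a practical system would need many more cases.

```python
import re

# Hypothetical, very short abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof"}

def is_sentence_end(text, i):
    """Classify the period at index i: True means end of sentence."""
    # Rule 1: a period followed directly by a digit belongs to a number (4.3, .02%).
    if i + 1 < len(text) and text[i + 1].isdigit():
        return False
    # Rule 2: a period right after a known abbreviation (Dr.) is not a boundary.
    previous_word = re.search(r"([A-Za-z]+)$", text[:i])
    if previous_word and previous_word.group(1).lower() in ABBREVIATIONS:
        return False
    return True

def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch == "." and is_sentence_end(text, i):
            sentences.append(text[start : i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without a final period
    return sentences

print(split_sentences("Dr. Smith paid 4.3 dollars. Rates rose .02% today."))
```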

what is string distance?

a measure of how alike two strings are.

edit distance

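lets us quantify how similar two strings are: the minimum number of editing operations (insertion, deletion, substitution) needed to turn one string into the other.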

minimum edit distance path

example of shortest path between string intention and execution
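
A dynamic-programming sketch for the intention/execution example. The `sub_cost` parameter is an assumption added here so that both common conventions can be tried: substitution costing 1, or substitution costing 2 (a deletion plus an insertion).

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum cost of insertions, deletions and substitutions turning source into target."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of editing the first i chars of source into the first j of target.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete everything
    for j in range(1, m + 1):
        dist[0][j] = j  # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                               # deletion
                dist[i][j - 1] + 1,                               # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),   # match or substitution
            )
    return dist[n][m]

print(min_edit_distance("intention", "execution"))              # substitution cost 1
print(min_edit_distance("intention", "execution", sub_cost=2))  # substitution cost 2
```

The table only gives the cost. The path itself (which operations were used) can be recovered by storing, for each cell, which of the three options produced the minimum and tracing back from the last cell.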

(32 cards)