Topic 1: Regular Expression Flashcards
key components of RE
search, string, pattern, corpus
what is a regular expression?
a language for specifying text search strings.
an expression that specifies a set of strings required for a particular purpose.
what is a string?
a sequence of symbols.
in text-based search, a string is a sequence of alphanumeric characters.
what is a pattern?
a specific sequence of characters/symbols to look for. it is what an RE uses for text searching.
a regular expression search requires 2 things. what are they?
a pattern to search for, and a corpus (the text to search through).
what are the applications of regular expressions?
- testing for a pattern within a string
- selecting data in a database
- substitution (replacing matched text)
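A minimal sketch of the three uses listed above, using Python's standard `re` module. The corpus sentence and the replacement word are made up for illustration.

```python
import re

# Hypothetical corpus and pattern, chosen only for illustration.
corpus = "The woodchuck saw another Woodchuck near the wood pile."
pattern = r"[wW]oodchuck"

# 1. Test for a pattern within a string.
has_match = re.search(pattern, corpus) is not None

# 2. Select every matching substring (similar in spirit to selecting data).
matches = re.findall(pattern, corpus)

# 3. Substitution: replace each match with another string.
replaced = re.sub(pattern, "groundhog", corpus)

print(has_match)
print(matches)
print(replaced)
```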
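what are the basic patterns in RE?
- case sensitivity / disjunction of characters, e.g. [wW]oodchuck
- negation, e.g. [^A-Z] (any character that is not an upper-case letter)
- range, e.g. [0-9]
- RE symbols: ? (optional), * (zero or more), + (one or more)
- RE: disjunction with |, precedence with parentheses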
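The basic patterns can be tried out one by one. The sketch below runs each pattern over a small made-up sample string with `re.findall`; the sample strings are assumptions chosen so that every pattern has something to match.

```python
import re

# Each entry: (name, pattern, sample text). All sample texts are invented.
examples = [
    ("char disjunction", r"[wW]oodchuck", "Woodchuck and woodchuck"),
    ("negation", r"[^A-Z]", "Oyfn"),  # characters that are NOT upper case
    ("range", r"[0-9]", "Chapter 1: Down the Rabbit Hole"),
    ("optional ?", r"colou?r", "color colour"),
    ("zero or more *", r"oo*h!", "oh! ooh! oooh!"),
    ("one or more +", r"o+h!", "oh! ooh! oooh!"),
    ("disjunction |", r"groundhog|woodchuck", "a woodchuck is a groundhog"),
    # (?:...) groups without capturing, so findall returns the whole match.
    ("precedence ()", r"gupp(?:y|ies)", "guppy guppies"),
]

results = {name: re.findall(pattern, text) for name, pattern, text in examples}

for name, found in results.items():
    print(f"{name:18} {found}")
```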
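types of errors and definition
- false positive: matching strings that should not have been matched
- false negative: failing to match strings that should have been matched
what are the efforts to reduce error rate?
- increase accuracy / precision (minimize false positives)
- increase coverage / recall (minimize false negatives)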
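A small sketch of how the two error types relate to precision and recall, assuming the task is to find the word "the" in a made-up sentence. The helper `precision_recall` and the gold labels are hypothetical and exist only for this illustration.

```python
import re

text = "The other one saw the cat there"

# Gold standard (assumed): start offsets of whitespace-separated tokens
# that are exactly the word "the", ignoring case.
gold = {m.start() for m in re.finditer(r"\S+", text) if m.group().lower() == "the"}

def precision_recall(pattern, text, gold_starts):
    """Score a pattern by comparing its match offsets with the gold offsets."""
    found = {m.start() for m in re.finditer(pattern, text)}
    tp = len(found & gold_starts)   # correct matches
    fp = len(found - gold_starts)   # false positives: matched but should not
    fn = len(gold_starts - found)   # false negatives: should match but missed
    precision = tp / (tp + fp) if found else 0.0
    recall = tp / (tp + fn) if gold_starts else 0.0
    return precision, recall

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    print(pattern, precision_recall(pattern, text, gold))
```

The first pattern misses "The" (a false negative) and also fires inside "other" and "there" (false positives). Adding [tT] fixes recall, and the word boundaries \b fix precision.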
what is a capture group?
the use of parentheses to store a matched pattern in memory so it can be referred to later (e.g. with \1).
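A short sketch of a capture group in Python's `re` module. The parentheses store what they matched, and `\1` refers back to it, both inside the pattern and inside a replacement string. The example sentences are made up.

```python
import re

# \1 must repeat exactly what the first parenthesized group matched.
pattern = r"the (.*)er they were, the \1er they will be"

same = re.search(pattern, "the bigger they were, the bigger they will be")
different = re.search(pattern, "the bigger they were, the faster they will be")

# In a replacement string, \1 inserts the captured text.
rewritten = re.sub(r"(\d+) dollars", r"$\1", "it costs 35 dollars")

print(same.group(1))
print(different)
print(rewritten)
```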
what is a corpus?
a computer-readable collection of text or speech
brown corpus?
brown sentence?
what is an utterance?
a unit of speech bounded by silence
what are the components of disfluencies?
fragments, filled pauses
give examples of fragments and filled pauses
- fragment: main- mainly
- filled pauses: uh, um
definition of word types and tokens
the number of word types is the number of distinct words in a corpus; the number of tokens is the total number of running words.
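A minimal sketch of counting types and tokens, assuming whitespace splitting is a good enough tokenizer for this punctuation-free example sentence.

```python
sentence = "they picnicked by the pool then lay back on the grass and looked at the stars"

tokens = sentence.split()   # running words
types = set(tokens)         # distinct words

print(len(tokens), "tokens")
print(len(types), "types")
```

The word "the" occurs three times, so it contributes three tokens but only one type.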
why is handling code switching required?
speakers often use multiple languages in a single communicative act. give an example.
List three tasks commonly applied as part of any normalization process
- segmenting words..or tokenizing
- normalizing word formats
- segmenting sentences in a running text
example of tokenization in UNIX
- tokenization
- sorting
- merging upper & lower case
- sorting counts
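On UNIX these steps are usually chained with tr, sort and uniq. The sketch below mirrors the same steps in Python instead; the function name `word_counts` is hypothetical, and treating a word as a run of ASCII letters is a simplifying assumption.

```python
import re
from collections import Counter

def word_counts(text):
    """Return (word, count) pairs, most frequent first."""
    # Tokenization: keep only runs of letters, drop everything else.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Merging upper and lower case.
    tokens = [t.lower() for t in tokens]
    # Counting identical words (the job done by sorting plus uniq -c).
    counts = Counter(tokens)
    # Sorting counts: highest count first, ties in alphabetical order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(word_counts("The cat and the hat. THE end"))
```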
what is tokenization, normalization?
- tokenization: the process of segmenting text into words
- normalization: the process of putting words into a standard format
what are the issues in tokenization?
the use of symbols (punctuation inside or attached to words). give an example.
goal of tokenizer
- expand clitic contractions
- tokenize multiword expressions
- normalize tokens
case folding
reducing all letters to lower case. however, in tasks such as sentiment analysis, case carries information: US vs us is an important distinction.
word segmentation with max match algorithm
some languages don't use spaces to mark word boundaries, e.g. Chinese, Thai, Japanese.
the standard baseline algorithm is the MaxMatch algorithm.
procedure in max match algorithm
- start a pointer at the beginning of the string.
- find the longest word in the dictionary that matches the string starting at the pointer.
- move the pointer over that word.
- repeat from step 2 until the end of the string.
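The procedure above can be sketched as follows. The dictionary is a toy set invented for the example, and emitting a single character when no dictionary word matches is a common fallback assumed here, not something stated in the steps.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    words = []
    pointer = 0
    max_len = max((len(w) for w in dictionary), default=1)
    while pointer < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(min(len(text), pointer + max_len), pointer, -1):
            candidate = text[pointer:end]
            if candidate in dictionary:
                break
        else:
            # Assumed fallback: no dictionary word starts here, emit one character.
            candidate = text[pointer]
        words.append(candidate)
        pointer += len(candidate)  # move the pointer over the word
    return words

toy_dictionary = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", toy_dictionary))
```

On unspaced English the greedy choice goes wrong ("theta bled own there" instead of "the table down there"), which shows why the algorithm is only a baseline.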
collapsing words: lemmatization
lemmatization is the task of determining that two words have the same root despite their surface differences.
carried out using morphological parsing.
morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes.
what are the 2 broad classes of morphemes?
stems: the central morpheme of the word, supplying the main meaning
affixes: add additional meanings of various kinds
what is stemming?
reducing terms to their stems. it is a crude chopping of affixes.
Porter stemmer algorithm
give example also
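- reduce sses to ss and drop a final s (caresses -> caress, cats -> cat)
- remove “ing” or “ed” when the remaining stem contains a vowel (walking -> walk, but sing stays sing)
- rewrite ational -> ate, izer -> ize, ator -> ate (relational -> relate)
- remove al, able, ate (revival -> reviv, adjustable -> adjust)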
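A simplified sketch of the four rule groups, one function per group. This is not the full Porter stemmer: the real algorithm has more rules and extra conditions on the stem that are left out here. The function names are hypothetical, and the later groups rewrite some suffixes rather than simply deleting them.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def step_1a(word):
    # Plural-like endings: sses -> ss, ies -> i, ss stays, final s is dropped.
    for suffix, repl in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

def step_1b(word):
    # Drop "ing" or "ed" only when the remaining stem contains a vowel.
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and VOWEL.search(stem):
            return stem
    return word

def step_2(word):
    # Rewrite longer derivational suffixes.
    for suffix, repl in (("ational", "ate"), ("izer", "ize"), ("ator", "ate")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

def step_3(word):
    # Strip al, able, ate (the real stemmer also checks the stem length).
    for suffix in ("al", "able", "ate"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

for fn, w in ((step_1a, "caresses"), (step_1b, "walking"),
              (step_2, "relational"), (step_3, "adjustable")):
    print(w, "->", fn(w))
```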
sentence segmentation
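the period “.” is quite ambiguous: besides ending a sentence it can mark an abbreviation (Dr.) or sit inside a number (.02%, 4.3).
build a binary classifier that looks at each “.” and decides end of sentence / not end of sentence.
the classifier can use hand-written rules, regular expressions, or machine learning (e.g. a decision tree).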
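A sketch of the hand-written-rules option. The function `is_sentence_end` is the binary classifier for a single period; the abbreviation list and the three rules are assumptions picked to cover the examples above, not a complete rule set.

```python
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "inc"}  # assumed, far from complete

def is_sentence_end(text, i):
    """Decide whether the period at index i ends a sentence."""
    # Rule 1: a period followed by a digit is part of a number (4.3, .02%).
    if i + 1 < len(text) and text[i + 1].isdigit():
        return False
    # Rule 2: a period right after a known abbreviation (Dr.) is not an end.
    before = re.search(r"[A-Za-z]+$", text[:i])
    if before and before.group().lower() in ABBREVIATIONS:
        return False
    # Rule 3: otherwise it ends a sentence if whitespace or end of text follows.
    return i + 1 == len(text) or text[i + 1].isspace()

def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch == "." and is_sentence_end(text, i):
            sentences.append(text[start : i + 1].strip())
            start = i + 1
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(split_sentences("Dr. Smith paid 4.3 dollars. Growth was .02% last year."))
```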
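what is string distance?
a measure of how alike 2 strings are
edit distance
lets us quantify how similar 2 strings are: the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other
minimum edit distance path
example: the shortest path between the strings intention and execution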
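A sketch of minimum edit distance by dynamic programming, assuming insertion and deletion cost 1 each. The substitution cost is a parameter because both conventions are in use: 2 (a substitution counts as a deletion plus an insertion) and 1.

```python
def min_edit_distance(source, target, sub_cost=2):
    """Minimum cost of turning source into target with insert/delete/substitute."""
    n, m = len(source), len(target)
    # d[i][j] = distance between the first i chars of source and first j of target.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(1, m + 1):
        d[0][j] = j                      # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                            # deletion
                d[i][j - 1] + 1,                            # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost) # substitution or match
            )
    return d[n][m]

print(min_edit_distance("intention", "execution"))              # substitution cost 2
print(min_edit_distance("intention", "execution", sub_cost=1))  # substitution cost 1
```

For intention and execution the distance is 8 with substitution cost 2 and 5 with substitution cost 1.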