week 4: big data Flashcards
what is big data
a large amount of user-generated content (UGC)
challenges of big data
storing, processing, and transferring data from storage to computing nodes
5 big data characteristics
volume (data size)
variety: different formats
velocity: speed of change
veracity: uncertainty of data
value: turn data into value
let the data…
speak
don't be fixed on… but discover…
causality … patterns & correlations
in big data, no need for..
sampling
info extraction:
identify specific pieces of info (data) in an unstructured or semi structured textual document
transform unstructured info in a corpus of different documents or web pages into…
a structured database
many web pages are generated automatically from…
an underlying database
HTML structure of pages is..
fairly specific and regular (semi-structured)
wrapper =
extractor for a semi structured website
screen scraping =
process of extracting data from HTML pages
a way other than screen scraping to extract info from HTML pages
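A minimal sketch of a wrapper for a semi-structured page, using Python's standard `html.parser` module. The page fragment, the class names `item`, `title` and `price`, and the `ListingWrapper` class are all hypothetical, made up for illustration; a real wrapper would be written against the regular markup of one specific site.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, as a database-backed site might generate it.
PAGE = """
<div class="item"><span class="title">Blue Kettle</span><span class="price">19.99</span></div>
<div class="item"><span class="title">Red Toaster</span><span class="price">34.50</span></div>
"""

class ListingWrapper(HTMLParser):
    """Collects one record per item by relying on the page's regular markup."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # name of the slot currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "item":
            self.records.append({})      # a new record starts here
        elif tag == "span" and cls in ("title", "price"):
            self._field = cls            # the next text chunk fills this slot

    def handle_data(self, data):
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

def scrape(page):
    wrapper = ListingWrapper()
    wrapper.feed(page)
    return wrapper.records

records = scrape(PAGE)
```

Because the wrapper depends on the exact tag and class layout, it only works as long as the site keeps generating pages from the same template.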
API
slots in template are typically filled by…
substring from the document
some slots may have…
a fixed set of prespecified possible fillers that may not occur in the text itself
some slots may allow…
multiple fillers
some domains may allow…
multiple extracted templates per document
specifying an item to extract for a slot using a regular expression pattern may require
a preceding (prefiller) pattern to identify the proper context and a succeeding (postfiller) pattern to identify the end of the filler
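A small sketch of the prefiller / filler / postfiller idea with Python's `re` module. The posting text, the salary slot, and the three pattern strings are assumptions chosen for illustration.

```python
import re

# Hypothetical posting text; the slot to fill is the salary figure.
TEXT = "Position: Analyst. Salary: $55,000 per year. Location: Austin."

PREFILLER = r"Salary:\s*\$"      # context that must come before the filler
FILLER = r"[\d,]+"               # the substring we want to extract
POSTFILLER = r"\s+per\s+year"    # context that marks where the filler ends

# Only the filler is captured; the surrounding patterns just anchor it.
pattern = re.compile(PREFILLER + "(" + FILLER + ")" + POSTFILLER)

def extract_salary(text):
    match = pattern.search(text)
    return match.group(1) if match else None

salary = extract_salary(TEXT)
```

Without the prefiller, `[\d,]+` could match any number in the document; without the postfiller, there would be no check that the number is a yearly salary.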
regular expression operators:
or: |
grouping: ()
repetition zero or more: *
repetition one or more: +
repetition zero or one (optional): ?
sequencing: order of elements
cardinality {m,n}: m is the minimum number of repetitions and n is the maximum (a single number, {m}, means exactly m repetitions)
regular expression operators for characters
any character: .
word boundary: \b
any digit: \d
escape sequence: e.g. \
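The operators listed above can be tried out one by one in Python's `re` module. The patterns and sample strings below are made-up illustrations, and `first_match` is a hypothetical helper, not part of `re`.

```python
import re

# Each entry: (operator, pattern, sample text, expected first match).
EXAMPLES = [
    ("alternation |", r"cat|dog", "hot dog", "dog"),
    ("grouping ()", r"(ab)+", "xababy", "abab"),
    ("zero or more *", r"go*gle", "ggle", "ggle"),
    ("one or more +", r"go+gle", "gooogle", "gooogle"),
    ("optional ?", r"colou?r", "color", "color"),
    ("cardinality {m,n}", r"\d{2,4}", "id 123456", "1234"),
    ("any character .", r"b.t", "a bat", "bat"),
    ("word boundary \\b", r"\bcat\b", "concat cat", "cat"),
    ("any digit \\d", r"\d+", "room 42", "42"),
    ("escape \\", r"3\.14", "pi is 3.14", "3.14"),
]

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

results = {name: first_match(p, t) for name, p, t, _ in EXAMPLES}
```

Sequencing needs no operator of its own: writing `g`, `o+`, `gle` next to each other means they must appear in that order.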
extract slots in order:
start the search for the filler of slot n+1 where the filler for slot n ended. This assumes slots occur in a fixed order.
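A minimal sketch of in-order slot extraction, assuming three hypothetical slots (`title`, `company`, `city`) with made-up patterns. It relies on the `pos` argument of a compiled pattern's `search` method to resume where the previous filler ended.

```python
import re

# Hypothetical slot patterns, listed in the order the slots appear in the text.
SLOT_PATTERNS = [
    ("title", re.compile(r"Title:\s*(.+)")),
    ("company", re.compile(r"Company:\s*(.+)")),
    ("city", re.compile(r"City:\s*(.+)")),
]

def extract_in_order(text):
    filled = {}
    pos = 0  # where the previous filler ended
    for slot, pattern in SLOT_PATTERNS:
        match = pattern.search(text, pos)
        if match is None:
            filled[slot] = None   # slot not found; keep position unchanged
            continue
        filled[slot] = match.group(1).strip()
        pos = match.end(1)        # next search starts after this filler
    return filled

DOC = "Title: Data Engineer\nCompany: Acme\nCity: Leeds\n"
slots = extract_in_order(DOC)
```

Because each search resumes after the previous filler, a `City:` line that appears before the title is ignored, which is the point of assuming a fixed slot order.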
Make patterns specific enough to…
identify each filler always starting from the beginning of the document
if a slot has a fixed set of pre specified possible fillers…
text categorisation can be used to fill the slot (e.g. job category, company type). Treat each possible value of the slot as a category and classify the entire document to determine the correct filler.
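A rough sketch of filling a slot by classifying the whole document. The keyword-count scorer below is a deliberately simple stand-in for a trained text categoriser, and the categories and word lists are hypothetical.

```python
import re
from collections import Counter

# Hypothetical categories and indicative words; a real system would learn
# these from labelled documents rather than list them by hand.
CATEGORY_WORDS = {
    "software": {"python", "developer", "code", "backend"},
    "finance": {"audit", "accounting", "ledger", "tax"},
    "healthcare": {"nurse", "clinic", "patient", "ward"},
}

def classify(document):
    """Return the category whose indicative words occur most often."""
    counts = Counter(re.findall(r"[a-z]+", document.lower()))
    scores = {
        category: sum(counts[w] for w in words)
        for category, words in CATEGORY_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

DOC = "We need a backend developer who writes clean Python code."
job_category = classify(DOC)
```

Note that the chosen filler, `software`, never appears in the document itself, which is why a substring-extracting pattern could not have filled this slot.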
if extracting from automatically generated web pages … usually work
simple regex patterns
if extracting from more natural, unstructured, human-written text, …. may help
some natural language processing
Sorts of NLP:
POS tagging, syntactic parsing, semantic word categories
extraction patterns can use …
POS or phrase tags (in prefillers/fillers)
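A toy sketch of an extraction pattern that uses POS tags. The tokens are tagged by hand with Penn Treebank style tags (in practice they would come from a POS tagger), and `extract_after` is a hypothetical helper: the prefiller is a specific word and the filler is a run of tokens with a given tag.

```python
# Hand-tagged tokens (word, POS tag); the tags are assumed to come from a tagger.
TAGGED = [("She", "PRP"), ("works", "VBZ"), ("at", "IN"),
          ("Acme", "NNP"), ("Robotics", "NNP"), ("in", "IN"), ("Leeds", "NNP")]

def extract_after(tagged, prefiller_word, filler_tag):
    """Return the run of tokens tagged filler_tag that follows prefiller_word."""
    for i, (word, _) in enumerate(tagged):
        if word.lower() == prefiller_word:
            filler = []
            for next_word, tag in tagged[i + 1:]:
                if tag != filler_tag:
                    break          # the filler ends at the first other tag
                filler.append(next_word)
            if filler:
                return " ".join(filler)
    return None

employer = extract_after(TAGGED, "at", "NNP")
```

Matching on the tag `NNP` rather than on specific words lets the same pattern pick up company names it has never seen.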
the RAPIER system learns three regex-style patterns for each slot:
prefiller pattern, filler pattern and postfiller pattern