week 4: big data Flashcards
what is big data
large amount of user-generated content (UGC)
challenges of big data
storing, processing, and transferring data from storage to computing nodes
5 big data characteristics
volume (data size)
variety: different formats
velocity: speed of change
veracity: uncertainty of data
value: turn data into value
let the data…
speak
don't be fixed on… but discover…
causality … patterns & correlations
in big data, no need for..
sampling
info extraction:
identify specific pieces of info (data) in an unstructured or semi structured textual document
transform unstructured info in a corpus of different documents or web pages into…
a structured database
many web pages are generated automatically from…
an underlying database
HTML structure of pages is..
fairly specific and regular (semi structured)
wrapper =
extractor for a semi structured website
screen scraping =
process of extracting data from HTML pages
way to get info from a website other than screen scraping its HTML pages
API
slots in template are typically filled by…
substring from the document
some slots may have…
a fixed set of prespecified possible fillers that may not occur in the text itself
some slots may allow…
multiple fillers
some domains may allow…
multiple extracted templates per document
specifying an item to extract for a slot using a regular expression pattern may require
prefiller/preceding pattern to identify proper context & succeeding/postfiller pattern to identify the end of the filler
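A minimal sketch of the prefiller / filler / postfiller idea, using Python's `re` module. The sample text, the "salary" slot, and the three pattern strings are made-up assumptions for illustration, not patterns from the course material.

```python
import re

# Hypothetical job-posting snippet (invented for this example).
text = "Title: Data Engineer | Salary: 55000 EUR per year | City: Utrecht"

PREFILLER = r"Salary:\s*"   # context that must come before the filler
FILLER = r"(\d+)"           # the part we actually want to extract
POSTFILLER = r"\s*EUR"      # context that marks the end of the filler

pattern = re.compile(PREFILLER + FILLER + POSTFILLER)

match = pattern.search(text)
salary = match.group(1) if match else None
print(salary)  # 55000
```

Only the captured group is stored in the slot; the prefiller and postfiller just pin down where the filler sits.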
regular expression operators:
or: |
grouping: ()
repetition zero or more: *
repetition one or more: +
repetition zero or one (optional): ?
sequencing: order of elements
cardinality {m,n}: m is the minimum number of repetitions and n is the maximum (a single number, {n}, is an exact count)
regular expression operators for characters
any character: .
word boundary: \b
any digit: \d
escape sequence: \ followed by the special character to be matched literally
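The operators above can be tried out with Python's `re` module. This is only a small sketch; the test strings are arbitrary examples chosen for illustration.

```python
import re

# or (|) with grouping (): "cats" or "dogs"
either = bool(re.fullmatch(r"(cat|dog)s", "dogs"))

# zero or more (*), one or more (+), optional (?)
star = bool(re.fullmatch(r"ab*", "a"))            # "b" may be absent
plus = bool(re.fullmatch(r"ab+", "a"))            # at least one "b" is required
optional = bool(re.fullmatch(r"colou?r", "color"))

# cardinality {m,n}: between 2 and 4 digits
in_range = bool(re.fullmatch(r"\d{2,4}", "123"))
too_long = bool(re.fullmatch(r"\d{2,4}", "12345"))

# any character (.)
any_char = bool(re.fullmatch(r"c.t", "cut"))

# \b word boundary, \d digit, \. escaped (literal) period
prices = re.findall(r"\b\d+\.\d{2}\b", "price 12.50 or 1250 or 3.999")

print(either, star, plus, optional, in_range, too_long, any_char)
print(prices)  # ['12.50']
```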
extract slots in order:
starting search for the filler of the n+1 slot where the filler for the nth slot ended. Assume slots are in fixed order.
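A rough sketch of in-order slot extraction: each search starts where the previous filler's match ended. The document, the slot names, and the helper `extract_in_order` are hypothetical, and the sketch assumes one filler per slot.

```python
import re

# Hypothetical document and slot patterns, listed in their fixed order.
document = "Company: Acme BV. Role: analyst. Location: Delft."
slot_patterns = [
    ("company", re.compile(r"Company:\s*([^.]+)\.")),
    ("role", re.compile(r"Role:\s*([^.]+)\.")),
    ("location", re.compile(r"Location:\s*([^.]+)\.")),
]

def extract_in_order(text, patterns):
    """Fill slots one by one, resuming the search after the previous match."""
    template = {}
    pos = 0
    for slot, pattern in patterns:
        match = pattern.search(text, pos)
        if match is None:
            template[slot] = None      # slot stays empty, position unchanged
            continue
        template[slot] = match.group(1)
        pos = match.end()              # next slot is searched from here
    return template

template = extract_in_order(document, slot_patterns)
print(template)
```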
Make patterns specific enough to…
identify each filler always starting from the beginning of the document
if a slot has a fixed set of pre specified possible fillers…
text categorisation can be used to fill the slot (job category, company type). Treat each of possible values of slot as a category and classify the entire document to determine the correct filler.
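A deliberately naive sketch of filling such a slot by classifying the whole document. The categories, keyword lists, and the function `fill_category_slot` are invented for illustration; a real system would more likely use a trained text classifier than keyword counts.

```python
# Hypothetical categories with hand-picked keywords (assumption, not course data).
CATEGORY_KEYWORDS = {
    "software": {"python", "developer", "backend", "code"},
    "finance": {"accounting", "audit", "ledger", "tax"},
}

def fill_category_slot(document):
    """Pick the category whose keywords overlap most with the document's words."""
    words = set(document.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)

doc = "We need a backend developer who writes Python code"
filler = fill_category_slot(doc)
print(filler)  # software
```

Note that the filler "software" never appears in the document itself, which is exactly why substring extraction would not work for this kind of slot.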
if extracting from automatically generated web pages … usually work
simple regex patterns
if extracting from more natural, unstructured, human-written text, …. may help
some natural language processing
Sorts of NLP:
POS tagging, syntactic parsing, semantic word categories
extraction patterns can use …
POS or phrase tags (in prefillers/fillers)
rapier system learns 3 regex style patterns for each slot:
prefiller pattern, filler pattern and postfiller pattern
always evaluate IE performance on…
independent manually-annotated test data not used during system development
measure each test document with:
N (total # of slot/value pairs in the correct solution template)
E (total # of slot/value pairs extracted by the system)
C (# of extracted slot/value pairs that are correct)
recall
C/N
Precision
C/E
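A small worked example of recall = C/N and precision = C/E. The two templates and the helper `score` are hypothetical, and the sketch assumes at most one filler per slot so a template can be a plain dict.

```python
def score(solution, extracted):
    """Return (recall, precision) for one test document."""
    n = len(solution)                  # N: pairs in the solution template
    e = len(extracted)                 # E: pairs the system extracted
    c = sum(1 for slot, value in extracted.items()
            if solution.get(slot) == value)   # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    return recall, precision

# Hypothetical solution template and system output.
solution = {"company": "Acme BV", "role": "analyst",
            "location": "Delft", "salary": "55000"}
extracted = {"company": "Acme BV", "role": "engineer", "location": "Delft"}

recall, precision = score(solution, extracted)
print(recall, precision)  # N=4, E=3, C=2, so recall is 2/4 and precision is 2/3
```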
if relevant docs were all available in standardized xml format…
IE would be unnecessary
hard to get standardized xml format because…
difficult to format, difficult to manually annotate docs with good XML tags, commercial industry might be reluctant to provide it
IE provides a way of…
automatically transforming semi structured or unstructured data into xml compatible format
web extraction may be aided by…
first parsing web pages into dom trees
after parsing web pages into dom trees…
extraction patterns can be specified as paths from the root of the dom tree to the node containing the text to extract
even when using dom trees, may still need…
regex patterns to identify the proper portion of the final character data node
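A sketch of combining a DOM path with a regex, using the standard library's `xml.etree.ElementTree`. The page below is a made-up, well-formed XHTML fragment; real-world HTML is often not well-formed XML and would need a proper HTML parser, which this sketch does not cover.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page (assumption: it parses as XML).
page = """<html><body>
<div class="job"><h1>Data Analyst</h1><p>Salary: 55000 EUR per year</p></div>
</body></html>"""

root = ET.fromstring(page)              # root is the <html> element

# Extraction pattern = path from the root to the node holding the text.
title = root.find("./body/div/h1").text
salary_node = root.find("./body/div/p")

# The node text still contains extra words, so a regex picks out the filler.
match = re.search(r"Salary:\s*(\d+)", salary_node.text)
salary = match.group(1) if match else None

print(title, salary)  # Data Analyst 55000
```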
rest =
representational state transfer
rest means
collection of network architecture principles which outline how resources are defined and addressed. It's not a standard, but it uses several standards.
HTTP is a
communications protocol that allows retrieving interlinked text documents (the WWW)
motivation for rest was…
to capture the characteristics of the web that made it successful (making requests, the HTTP protocol, URIs)
main concepts of rest
nouns (resources) -> unconstrained (full website)
verbs -> constrained (e.g. GET)
representations -> constrained (e.g. XML)
a resource is
a conceptual mapping to a set of entities, represented with a global identifier (URI)
verbs:
represent actions to be performed on resources
http get
how clients ask for the info they seek
http put
updates a resource
http post
creates resource
http delete
removes resource identified by uri
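The four verbs can be sketched with `urllib.request.Request` from Python's standard library. The requests are only constructed, not sent; the base URL, the `/42` id, and the JSON payloads are hypothetical placeholders.

```python
import json
from urllib.request import Request

BASE = "https://api.example.com/books"   # hypothetical resource collection
JSON_HEADERS = {"Content-Type": "application/json"}

def body(data):
    """Encode a dict as a JSON request body."""
    return json.dumps(data).encode("utf-8")

# Same noun (the URI), different verb = different action on the resource.
requests_by_verb = {
    "GET": Request(f"{BASE}/42", method="GET"),                    # ask for info
    "POST": Request(BASE, data=body({"title": "IE basics"}),
                    headers=JSON_HEADERS, method="POST"),          # create
    "PUT": Request(f"{BASE}/42", data=body({"title": "IE, 2nd ed."}),
                   headers=JSON_HEADERS, method="PUT"),            # update
    "DELETE": Request(f"{BASE}/42", method="DELETE"),              # remove
}

for verb, req in requests_by_verb.items():
    print(req.get_method(), req.full_url)
```

Actually sending one of these would be done with `urllib.request.urlopen(req)`, which is left out here because the endpoint is fictional.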
representations:
how data is represented/returned to the client for presentation (e.g. JSON or XML; there can be multiple representations)
REST name because
client application changes/transfers state with each resource representation
javascript
html to define content of web pages
css to specify layout of web pages
javascript to program behavior of web pages
common uses of javascript
form validation, page embellishments and special effects, dynamic content manipulation, emerging web 2.0
ajax characteristics
increased responsiveness and interactivity of web pages.
exchanging small amounts of data with server
entire web page doesn't have to be reloaded each time user performs action
ajax is not
a technology itself but a term to refer to use of group of technologies
core and defining element of ajax is
xmlhttprequest object (page doesn't need to refresh)
elaborate characteristics of ajax:
user driven,
views defined by urls,
simple user interaction model,
synchronous interaction
steps in an ajax interaction
client event occurs
an xmlhttprequest is created and configured
asynchronous request made to server via xmlhttprequest object
server processes request and returns data, client executes a callback in the xmlhttprequest object
html dom updated based on response data
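Real AJAX runs in the browser with JavaScript's XMLHttpRequest object. The Python sketch below is only an analogy for the same sequence of steps (event, asynchronous request, callback, DOM update); the `dom` dict, the fake `server` function, and the callback name are all stand-ins invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

dom = {"status": "idle"}            # stand-in for the HTML DOM

def server(query):
    """Stand-in for the server processing the request and returning data."""
    return {"result": query.upper()}

def on_response(future):
    """Callback: update the 'DOM' based on the response data."""
    dom["status"] = future.result()["result"]

# 1. client event occurs (here: we simply decide to send "hello")
dom["status"] = "loading"
with ThreadPoolExecutor(max_workers=1) as pool:
    # 2-3. request is created and sent without blocking the caller
    future = pool.submit(server, "hello")
    # 4. callback to run once the response is available
    future.add_done_callback(on_response)
# Leaving the with-block waits for the worker, so the callback has run by now.

# 5. the 'DOM' reflects the response; no full page reload was involved
print(dom)  # {'status': 'HELLO'}
```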
dom:
document object model: a platform- and language-independent way to represent xml (and html) documents
ajax dangers
hype
application development/maintenance cost
behavior not weblike
security issues
parallel/distributed processing of data
sisd, simd, misd, mimd (classified by the instruction and data streams between control unit and processor unit)
sisd
single instruction stream, single data stream -> serial processor
simd
single instruction stream, multiple data stream -> array processor
mimd
multiple instruction stream multiple data stream -> multiprocessor or multicomputer
misd
multiple instruction stream, single data stream -> no examples
a distributed algo is an algo in which…
processors can't determine the state of the other processors (they need messages)
parallel algos are a subset of
distributed algos, but they can determine state of other processors
efficiency is a measure of..
fraction of time that a processor spends on performing useful work
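One common textbook way to put a number on this is efficiency = speedup / number of processors, with speedup = serial time / parallel time. Whether the course uses exactly this formula is an assumption, and the timings below are invented.

```python
def efficiency(serial_time, parallel_time, processors):
    """Efficiency = speedup / p, where speedup = T_serial / T_parallel."""
    speedup = serial_time / parallel_time
    return speedup / processors

# Hypothetical timings: 100 s on one processor, 30 s on four processors.
e = efficiency(serial_time=100.0, parallel_time=30.0, processors=4)
print(round(e, 3))  # 0.833
```

An efficiency of 1.0 would mean every processor spends all of its time on useful work; communication and idle time push the value below 1.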
what is a regular expression
a regular expression is a special string used to define string patterns
role of prefiller and post filler patterns in info extraction
they are responsible for defining the context in which the filler pattern operates
how do you fill slots that take values only from a fixed prespecified set of fillers (which don't necessarily appear in the text)?
text categorisation is usually used. You need to treat each of the possible values of the slot as a category and classify the entire document to determine the correct filler