week 4: big data Flashcards

1
Q

what is big data

A

large amount of user-generated content (UGC)

2
Q

challenges big data

A

storing, processing, and transferring data from storage to computing nodes

3
Q

5 big data characteristics

A

volume: data size
variety: different formats
velocity: speed of change
veracity: uncertainty of data
value: turn data into value

4
Q

let the data…

A

speak

5
Q

don't be fixed on… but discover…

A

causality … patterns & correlations

6
Q

in big data, no need for..

A

sampling

7
Q

info extraction:

A

identify specific pieces of information (data) in an unstructured or semi-structured text document

8
Q

transform unstructured info in a corpus of different documents or web pages into…

A

a structured database

9
Q

many web pages are generated automatically from…

A

an underlying database

10
Q

HTML structure of pages is..

A

fairly specific and regular (semi-structured)

11
Q

wrapper =

A

extractor for a semi-structured website

12
Q

screen scraping =

A

process of extracting information from HTML pages

13
Q

way to get information from a website other than screen scraping its HTML pages

A

API

14
Q

slots in template are typically filled by…

A

a substring from the document

15
Q

some slots may have…

A

a fixed set of prespecified possible fillers that may not occur in the text itself

16
Q

some slots may allow…

A

multiple fillers

17
Q

some domains may allow…

A

multiple extracted templates per document

18
Q

specifying an item to extract for a slot using a regular expression pattern may require

A

a prefiller (preceding) pattern to identify the proper context and a postfiller (succeeding) pattern to identify the end of the filler
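
A minimal Python sketch of this idea, assuming a made-up job posting in which a "Salary: " label serves as the prefiller and " EUR" as the postfiller. The text, labels, and values are invented for illustration. Lookbehind and lookahead keep the context out of the extracted filler.

```python
import re

# Hypothetical sample text; labels and values are invented for illustration.
doc = "Title: Data Engineer\nSalary: 55000 EUR per year\nLocation: Tilburg"

# prefiller:  the label "Salary: " (lookbehind, identifies the context)
# filler:     one or more digits
# postfiller: " EUR" (lookahead, marks where the filler ends)
pattern = re.compile(r"(?<=Salary: )\d+(?= EUR)")

match = pattern.search(doc)
salary = match.group(0) if match else None
print(salary)
```

Without the prefiller, `\d+` alone would match the first number anywhere in the document, which is why the context patterns matter.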

19
Q

regular expression operators:

A

or: |
grouping: ()
repetition zero or more: *
repetition one or more: +
repetition zero or one (optional): ?
sequencing: order of elements
cardinality {m,n}: m is the minimum and n the maximum number of repetitions (a single number, {n}, means exactly n)

20
Q

regular expression operators for characters

A

any character: .
word boundary: \b
any digit: \d
escape sequence: \ before a special character, e.g. \. for a literal dot
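
The operators from the two cards above can be tried out with Python's `re` module. This is only a sketch: each check pairs one operator with a tiny made-up string, and the dictionary keys are illustrative names, not standard terminology.

```python
import re

# Each check pairs one operator from the cards with a tiny made-up string.
checks = {
    "or": bool(re.fullmatch(r"cat|dog", "dog")),
    "grouping": bool(re.fullmatch(r"(ab)+", "ababab")),
    "zero_or_more": bool(re.fullmatch(r"ab*", "a")),
    "one_or_more": bool(re.fullmatch(r"ab+", "a")),  # False: needs at least one b
    "optional": bool(re.fullmatch(r"colou?r", "color")),
    "cardinality": bool(re.fullmatch(r"\d{2,4}", "123")),
    "any_char": bool(re.fullmatch(r"c.t", "cut")),
    "escape": bool(re.fullmatch(r"3\.14", "3.14")),
}

# \b matches only at word boundaries, so "cat" inside "concatenate" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concatenate cat")
```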

21
Q

extract slots in order:

A

start the search for the filler of slot n+1 where the filler of slot n ended. This assumes the slots always appear in a fixed order.
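
A rough sketch of in-order extraction, assuming a hypothetical record with three slots. The slot names, the patterns, and the helper `extract_in_order` are all made up for illustration. Each search resumes at the position where the previous filler ended.

```python
import re

# Hypothetical semi-structured record; slot names and patterns are assumptions.
doc = "Name: Ada Lovelace | Born: 1815 | Field: Mathematics"

# Slots listed in the fixed order in which they are assumed to appear.
slot_patterns = [
    ("name", re.compile(r"Name: ([^|]+?) \|")),
    ("born", re.compile(r"Born: (\d{4})")),
    ("field", re.compile(r"Field: (\w+)")),
]

def extract_in_order(text):
    template, pos = {}, 0
    for slot, pattern in slot_patterns:
        match = pattern.search(text, pos)  # resume where the previous filler ended
        if match is None:
            template[slot] = None          # leave the slot empty, keep the position
            continue
        template[slot] = match.group(1)
        pos = match.end(1)
    return template

template = extract_in_order(doc)
```

Because the search position only moves forward, each pattern can stay fairly simple, but a record with its slots in a different order would be extracted incorrectly.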

22
Q

Make patterns specific enough to…

A

identify each filler always starting from the beginning of the document

23
Q

if a slot has a fixed set of pre specified possible fillers…

A

text categorisation can be used to fill the slot (e.g. job category, company type). Treat each possible value of the slot as a category and classify the entire document to determine the correct filler.
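
A toy illustration of this approach. A simple word-overlap score stands in for a properly trained text categoriser, and the categories, training snippets, and function names are invented for the example.

```python
from collections import Counter

# Hypothetical training snippets per slot value (job category).
training = {
    "engineering": ["python developer backend software", "software engineer java cloud"],
    "sales": ["account manager sales targets clients", "sales representative retail customers"],
}

def tokenize(text):
    return text.lower().split()

# One bag of words per category, built from its training snippets.
profiles = {cat: Counter(w for doc in docs for w in tokenize(doc))
            for cat, docs in training.items()}

def categorise(document):
    """Score the whole document against each category and return the best one."""
    words = tokenize(document)
    scores = {cat: sum(profile[w] for w in words) for cat, profile in profiles.items()}
    return max(scores, key=scores.get)

filler = categorise("We are hiring a software developer with Python experience")
```

Note that the filler "engineering" never appears in the classified document. It comes from the fixed set of categories, which is the point of the card.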

24
Q

if extracting from automatically generated web pages … usually work

A

simple regex patterns
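
A small sketch of why this tends to work, assuming a hypothetical page whose rows were all generated from the same template. The HTML, class names, and values are invented. Since every row has the same shape, one pattern recovers the underlying table.

```python
import re

# Hypothetical page generated from a database: every row follows the same template.
html = """
<tr><td class="name">Kettle</td><td class="price">24.95</td></tr>
<tr><td class="name">Toaster</td><td class="price">39.00</td></tr>
"""

row_pattern = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td><td class="price">(?P<price>\d+\.\d{2})</td>'
)

# Each match becomes one structured record, recovering the underlying table.
records = [m.groupdict() for m in row_pattern.finditer(html)]
```

A pattern like this acts as a simple wrapper for that one site. It is likely to break as soon as the site's template changes.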

25
Q

if extracting from more natural, unstructured, human-written text, …. may help

A

some natural language processing

26
Q

Sorts of NLP:

A

POS tagging, syntactic parsing, semantic word categories

27
Q

extraction patterns can use …

A

POS or phrase tags (in prefillers/fillers)

28
Q

the RAPIER system learns 3 regex-style patterns for each slot:

A

prefiller pattern, filler pattern and postfiller pattern

29
Q

always evaluate IE performance on…

A

independent, manually annotated test data not used during system development

30
Q

measure each test document with:

A

N (total # of correct slot/value pairs in the solution template)
E (total # of slot/value pairs extracted by the system)
C (# of extracted slot/value pairs that are correct)

31
Q

recall

A

C/N

32
Q

Precision

A

C/E
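
A worked example of N, E, C, recall, and precision for one test document. The solution template and the system output below are made up; representing slot/value pairs as sets of tuples is just one convenient choice.

```python
# Hypothetical gold-standard template and system output for one test document.
solution = {
    ("title", "Data Engineer"),
    ("city", "Tilburg"),
    ("salary", "55000"),
    ("language", "Python"),
}
extracted = {
    ("title", "Data Engineer"),
    ("city", "Utrecht"),   # wrong value, so this pair does not count as correct
    ("salary", "55000"),
}

N = len(solution)              # slot/value pairs in the solution template
E = len(extracted)             # slot/value pairs the system extracted
C = len(solution & extracted)  # extracted pairs that are correct

recall = C / N
precision = C / E
```

Here the system found 2 of the 4 expected pairs (recall 0.5), and 2 of its 3 extractions were correct (precision about 0.67).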

33
Q

if relevant docs were all available in standardized xml format…

A

IE would be unnecessary

34
Q

hard to get standardized xml format because of

A

difficult to format, difficult to manually annotate docs with good XML tags, and commercial industry might be reluctant to provide it

35
Q

IE provides a way of…

A

automatically transforming semi-structured or unstructured data into an XML-compatible format

36
Q

web extraction may be aided by…

A

first parsing web pages into DOM trees

37
Q

after parsing web pages into DOM trees…

A

extraction patterns can be specified as paths from the root of the DOM tree to the node containing the text to extract

38
Q

even when using DOM trees, you may still need…

A

regex patterns to identify the proper portion of the final character data node
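
A sketch combining the last three cards: parse the page into a tree, follow a path from the root to a node, then use a regex on that node's text. The page below is hypothetical and deliberately well-formed so that the standard-library XML parser accepts it. Real-world HTML often needs a more forgiving parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; real HTML often needs a more forgiving parser.
page = """<html><body>
  <div id="product">
    <h1>Blue Kettle</h1>
    <p class="price">Price: EUR 24.95 incl. VAT</p>
  </div>
</body></html>"""

root = ET.fromstring(page)

# Extraction pattern as a path from the root to the node holding the text.
node = root.find("./body/div/p[@class='price']")

# A regex then picks the wanted part out of the final character data.
match = re.search(r"\d+\.\d{2}", node.text)
price = match.group(0) if match else None
```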

39
Q

REST =

A

representational state transfer

40
Q

REST means

A

collection of network architecture principles that outline how resources are defined and addressed. It's not a standard, but it uses several standards.

41
Q

HTTP is a

A

communications protocol that allows retrieving interlinked text documents (the WWW)

42
Q

motivation for rest was…

A

to capture the characteristics of the web that made it successful (making requests, the HTTP protocol, URIs)

43
Q

main concepts of REST

A

nouns (resources) -> unconstrained (e.g. a full website)
verbs -> constrained (e.g. GET)
representations -> constrained (e.g. XML)

44
Q

a resource is

A

conceptual mapping to a set of entities, identified by a global identifier (URI)

45
Q

verbs:

A

represent actions to be performed on resources

46
Q

http get

A

how clients ask for the information they seek

47
Q

http put

A

updates a resource

48
Q

http post

A

creates resource

49
Q

http delete

A

removes resource identified by uri
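
To make the four verbs concrete, here is a sketch that builds one request per verb with Python's standard `urllib.request`, without sending anything. The base URL, the resource id 42, and the helper `build` are hypothetical. The verb-to-action mapping follows the cards above.

```python
import json
import urllib.request

BASE = "https://api.example.com/books"  # hypothetical resource collection

def build(method, url, payload=None):
    """Build (but do not send) an HTTP request for a REST-style call."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)

requests_by_verb = {
    "read":   build("GET", f"{BASE}/42"),                       # ask for a representation
    "create": build("POST", BASE, {"title": "Big Data"}),       # create a new resource
    "update": build("PUT", f"{BASE}/42", {"title": "Bigger"}),  # update the resource at the URI
    "delete": build("DELETE", f"{BASE}/42"),                    # remove the resource at the URI
}
```

The noun (the URI) stays the same across read, update, and delete. Only the verb changes. Passing one of these objects to `urllib.request.urlopen` would actually send it, which is left out here because the endpoint is made up.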

50
Q

representations:

A

how data is represented/returned to the client for presentation (e.g. JSON (JavaScript Object Notation) or XML; there can be multiple)

51
Q

REST name because

A

client application changes/transfers state with each resource representation

52
Q

javascript

A

HTML to define the content of web pages
CSS to specify the layout of web pages
JavaScript to program the behavior of web pages

53
Q

common uses of JavaScript

A

form validation, page embellishments and special effects, dynamic content manipulation, emerging web 2.0

54
Q

AJAX characteristics

A

increased responsiveness and interactivity of web pages
exchanging small amounts of data with the server
the entire web page doesn't have to be reloaded each time the user performs an action

55
Q

ajax is not

A

a technology itself but a term to refer to use of group of technologies

56
Q

core and defining element of ajax is

A

the XMLHttpRequest object (the page doesn't need to refresh)

57
Q

elaborate AJAX characteristics:

A

user driven,
views defined by URLs,
simple user interaction model,
asynchronous interaction

58
Q

components ajax interaction

A

a client event occurs
an XMLHttpRequest object is created and configured
an asynchronous request is made to the server via the XMLHttpRequest object
the server processes the request and returns data; the client executes a callback registered on the XMLHttpRequest object
the HTML DOM is updated based on the response data
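
In a browser these steps would be written in JavaScript around an XMLHttpRequest. The following is only a Python analogy of the same request-with-callback flow: a background thread plays the asynchronous request, a dictionary stands in for the DOM, and `fake_server`, `send_async`, and `on_response` are made-up names.

```python
import threading

dom = {"status": "idle"}  # stand-in for the page's DOM

def fake_server(query):
    """Hypothetical server: processes the request and returns a small payload."""
    return {"result": query.upper()}

def send_async(query, callback):
    """Run the request on a background thread, then hand the response to the callback."""
    def worker():
        callback(fake_server(query))
    thread = threading.Thread(target=worker)
    thread.start()
    return thread

def on_response(data):
    dom["status"] = data["result"]  # update only the affected part of the "page"

# A client event (e.g. a click) triggers the request; the page is not reloaded.
pending = send_async("hello", on_response)
pending.join()  # only so this example can inspect the result deterministically
```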

59
Q

dom:

A

Document Object Model: a platform- and language-independent way to represent XML (and HTML) documents

60
Q

ajax dangers

A

hype
application development/maintenance cost
behavior not weblike
security issues

61
Q

parallel/distributed processing of data

A

Flynn's taxonomy: SISD, SIMD, MISD, MIMD (streams between control unit and processor unit)

62
Q

sisd

A

single instruction stream, single data stream -> serial processor

63
Q

simd

A

single instruction stream, multiple data stream -> array processor

64
Q

mimd

A

multiple instruction stream multiple data stream -> multiprocessor or multicomputer

65
Q

misd

A

multiple instruction stream, single data stream -> no examples

66
Q

a distributed algo is an algo in which…

A

processors can't determine the state of the other processors (they need to exchange messages)
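
A small simulation of this message-passing idea. Threads stand in for separate processors here, and the shared `results` dictionary exists only so the example can report its outcome. The worker names and values are invented. Each worker knows only its own value and learns the other's solely through a message.

```python
import queue
import threading

def worker(name, local_value, inbox, outbox, results):
    """Each worker knows only its own value; it learns the other's via a message."""
    outbox.put((name, local_value))      # send own state
    sender, value = inbox.get()          # block until the other's state arrives
    results[name] = local_value + value  # combine local and received state

a_to_b, b_to_a = queue.Queue(), queue.Queue()
results = {}
threads = [
    threading.Thread(target=worker, args=("A", 3, b_to_a, a_to_b, results)),
    threading.Thread(target=worker, args=("B", 4, a_to_b, b_to_a, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both workers send before they receive, so neither blocks forever waiting for the other.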

67
Q

parallel algos are a subset of

A

distributed algos, but they can determine state of other processors

68
Q

efficiency is a measure of..

A

the fraction of time that a processor spends on performing useful work
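
One common way to compute this, assumed here and possibly different from the course's own definition, is speedup S = T1 / Tp and efficiency E = S / p, where T1 is the serial running time, Tp the running time on p processors, and the serial time is taken as the useful work. The timings below are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, processors):
    """Fraction of total processor time spent on useful work."""
    return speedup(t_serial, t_parallel) / processors

# Hypothetical timings: 120 s on one processor, 20 s on 8 processors.
s = speedup(120.0, 20.0)
e = efficiency(120.0, 20.0, 8)
```

With these numbers the speedup is 6 and the efficiency 0.75, so a quarter of the processor time goes to overhead such as communication or idling.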

69
Q

what is a regular expression

A

a regular expression is a special string used to define string patterns

70
Q

role of prefiller and post filler patterns in info extraction

A

they define the context in which the filler pattern operates

71
Q

how do you fill slots that take values only from a fixed prespecified set of fillers (which don't necessarily appear in the text)?

A

text categorisation is usually used. You need to treat each of the possible values of the slot as a category and classify the entire document to determine the correct filler