Exam Flashcards

1
Q

SQL in MR: SELECT name, salary FROM employees WHERE age < 40

A

Map filters on age and emits (name, salary); Reduce doesn’t have to do anything (identity, or a map-only job).
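
A minimal sketch of this query in Python-style MR, assuming each input record is a dict with name, salary and age fields (names and data are illustrative):

    def map_fn(record):
        # WHERE age < 40: filter, then project (name, salary).
        if record["age"] < 40:
            yield record["name"], record["salary"]

    # No reduce needed: this can run as a map-only job.
    employees = [{"name": "Ada", "salary": 90, "age": 36},
                 {"name": "Bob", "salary": 70, "age": 52}]
    for rec in employees:
        for name, salary in map_fn(rec):
            print(name, salary)   # -> Ada 90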

1
Q

Relation –> Stream

A

Istream(R): contains all tuples that are new in R within the last time period (insert stream). Dstream(R): contains all tuples that were in R before the last period and are not anymore (delete stream). Rstream(R): contains all tuples in R.

2
Q

SQL in MR: SELECT name, AVG(contacts) FROM facebookTable GROUP BY name

A

Map emits each record with key = name and value = contacts; Reduce computes the average per key.
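
A minimal sketch, assuming the input rows are (name, contacts) pairs and simulating the shuffle with a dict (illustrative):

    from collections import defaultdict

    def map_fn(row):
        name, contacts = row
        yield name, contacts                     # key = name

    def reduce_fn(name, values):
        yield name, sum(values) / len(values)    # AVG per group

    rows = [("ann", 10), ("bob", 4), ("ann", 20)]
    groups = defaultdict(list)
    for row in rows:
        for k, v in map_fn(row):
            groups[k].append(v)
    for k, vs in groups.items():
        print(next(reduce_fn(k, vs)))   # ('ann', 15.0), ('bob', 4.0)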

2
Q

What are n-grams and what are they used for?

A

N-grams are variable-length word sequences with many applications in fields including information retrieval, natural language processing and the digital humanities.

2
Q

What is eventual consistency?

A

If no new updates happen, eventually all accesses to that item will return the last updated value. A system that has achieved eventual consistency is said to have reached REPLICA CONVERGENCE. To resolve conflicts, the system must reconcile differences between multiple copies of distributed data (broadcast, anti-entropy or rumor spreading). Timestamps and vector clocks are used to detect concurrency between updates.

3
Q

Four architecture issues related to MR.

A

1) Distributed file system 2) Data is processed in big chunks (64 MB, 128 MB) 3) Chunks are replicated and distributed 4) Data processing is moved to the data (when possible)

3
Q

MR example: Inverted Index

A

Similar to WordCount; Reduce builds the list of all doc IDs where the key (word) occurs:

    map(str key, str file)
        for each word w in file
            emit(w, file.id)

    reduce(str key, list values)
        str result = ""
        for each value v in values
            result += v + " "
        emit(key, result)

3
Q

What are data stream synopses?

A

Synopses (or estimators) concisely represent the stream content. They are tailored to specific tasks (e.g., counting distinct elements). They are usually not exact, but approximations of the true values.

3
Q

Relation –> Relation?

A

With the window obtained from Stream –> Relation, we get a relation to which we can apply any query expressed in SQL.

3
Q

Graph DBs

A

Data model: vertices and edges. Queries: shortest path, connected components, followers. Examples: Neo4j, GraphBase. Higher performance than self-joins in a relational DB.

4
Q

What is the CAP theorem?

A

The CAP theorem says that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: - Consistency: all nodes see the same data at the same time; - Availability: a guarantee that every request receives a response about whether it was successful or failed; - Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.

4
Q

Read/Write and Write/Write conflicts:

A

Conflicting operations on the same record. Two kinds: - read/write: one TA wants to read a record and a second TA wants to write to it; - write/write: two TAs want to write to the same record.

5
Q

How can we compute DB-style queries in a data stream?

A

Main idea: keep a window that renders a finite view of the continuous (infinite) stream. Sliding windows focus attention on the latest values of the stream and allow the computation of aggregates. Joins are computed across windows overlaid on other (or the same) streams.

7
Q

What are stragglers and how can they be solved?

A

Stragglers are slow nodes. Imagine that out of 1000 nodes, only one is slow and delays the overall response time. We can counter stragglers with speculative execution: run the same task on more than one node, and the node that finishes first wins. This performance improvement comes at the cost of higher cluster utilization (the work of the nodes that “lost” is wasted).

9
Q

MR example: co-occurrences - Given a text file, how many times do a,b occur together? SOLUTION 2

A

Solution 2 (stripes): group pairs together into an associative array: a -> {b:1, c:2, d:5, e:3, f:2}. The mapper takes a sentence, generates all co-occurring term pairs and, for each term, emits a -> {…}. The reducer performs an element-wise sum of the associative arrays.

    map(docid a, doc d)
        for all term w in d do
            H = new AssociativeArray()
            for all term u in Neighbors(w) do
                H(u) = H(u) + 1
            emit(w, H)   // term, stripe

    reduce(term w, stripes [H1, H2, H3, …])
        Hf = new AssociativeArray()
        for all stripe H in stripes do
            sum(Hf, H)   // element-wise sum
        emit(term w, stripe Hf)

+ far less sorting and shuffling of key-value pairs + makes better use of combiners - more difficult to implement - underlying object is more heavyweight

10
Q

What is the main idea behind FM-Sketch?

A

Hash values from a collection into a binary string and use patterns in those strings as an indicator for the number of distinct values in that collection (bit-pattern observables); then use stochastic averaging to combine m trials into a better estimate. B[0] is set approximately n/2 times, B[1] approximately n/4 times; B[i] = 0 if i >> log2(n), B[i] = 1 if i << log2(n), and there is a mix of 1s and 0s around i ≈ log2(n). To combine the sketches of two streams: B = B_S OR B_T, where B_S and B_T are the bitmaps corresponding to the respective streams.
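
A minimal single-trial sketch of the idea in Python (no stochastic averaging, so high variance; md5 stands in for a suitable hash function, and 0.77351 is the standard FM correction factor):

    import hashlib

    PHI = 0.77351

    def rho(x):
        # Position of the least-significant 1-bit of x (0-based).
        r = 0
        while x & 1 == 0:
            x >>= 1
            r += 1
        return r

    def fm_estimate(stream):
        B = 0   # bitmap kept as an integer
        for item in stream:
            h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
            B |= 1 << rho(h)
        R = 0   # index of the lowest 0-bit in B
        while B & (1 << R):
            R += 1
        return (2 ** R) / PHI

    print(round(fm_estimate(range(1000))))   # rough (high-variance) estimate of 1000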

11
Q

Explain the Map-Side Join and give pseudocode for both map and reduce functions.

A

In the map-side join, one relation has to fit in memory. The smaller table is replicated to each node and loaded into memory. The join happens on the map side without reduce involvement, which significantly speeds up the process since it avoids shuffling all data across the network. The smaller table can be loaded into a hash table so look-ups by ID can be done.

    map(K table, V rec)
        // Get the smaller-table records having this ID
        list recs = lookup(rec.ID)
        for each small_table_rec in recs
            joined_rec = join(small_table_rec, rec)
            emit(rec.ID, joined_rec)

If the smaller table doesn’t fit in memory, an even smaller data set can be derived if a filtering expression has been specified in the query.

12
Q

NoSQL Five Points

A
  • No one-size-fits-all database - Non-relational model; CRUD - Designed for distributed scale-out architecture - No or little schema - Mainly not full ACID support; instead: BASE
13
Q

How do relations and streams work together?

A

We have a discrete, ordered time domain T. A relation R is a mapping from time T to a bag of tuples belonging to the schema of R. A stream is a set of elements ⟨s, τ⟩, where s is a tuple and τ ∈ T is a timestamp.

14
Q

Data Replication

A

Replicate each object N times. Store the copies at the successor nodes - physically distinct nodes (not virtual nodes!). Routing (in order to reach the other nodes): - naive: each node knows its neighbor; send a message to the nearest neighbor, getting closer to the target node with each hop. SLOW: O(n). - logarithmic cost: lookup table with exponentially increasing distances, O(log n).

15
Q

Which are the main types of sliding windows? Are there any others?

A

Time-based: can have arbitrary size (all records that fall within a slide of time range t), e.g., tweets from the last 10 minutes. Count-based: deterministic size; the window contains at any time a fixed number of items (e.g., 100 tweets), and newly arriving items kick out older ones. Others: -> we can move on certain ticks -> we can reset when full.
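
A minimal count-based window sketch in Python (window size 3, with a windowed average as the example aggregate):

    from collections import deque

    class CountWindow:
        def __init__(self, size=3):
            # deque(maxlen=...) kicks out the oldest item automatically.
            self.items = deque(maxlen=size)

        def insert(self, x):
            self.items.append(x)

        def average(self):
            return sum(self.items) / len(self.items)

    w = CountWindow(size=3)
    for x in [10, 20, 30, 40]:
        w.insert(x)
    print(list(w.items), w.average())   # [20, 30, 40] 30.0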

16
Q

MR example: Word Count

A

    map(str key, str value)
        for each word w in value
            emit(w, 1)

    reduce(str key, list values)
        int result = 0
        for each value v in values
            result += v
        emit(key, result)

16
Q

What is a Bloom Filter and how can it be used in the Semi-Join?

A

A Bloom filter is a construct which can be used to test the containment of a given element in a set. We have a bit array (size m), initially all 0, and we encode elements into that array: hash each element to a bucket number and set that bit to 1, using multiple hash functions hi. Test: is x contained in the set (= filter)? Check whether all bits h1(x), h2(x), … are set to 1. For the semi-join, a smaller representation of the filtered IDs can be derived by folding the ID values of the smaller table into a Bloom filter. This Bloom filter can then be replicated to each node. On the map side, for each record fetched from the other table, the Bloom filter is used to check whether the record’s ID is present; only if so do we emit that record to the reduce side. Because a Bloom filter guarantees no false negatives, the result is accurate. A false positive is no big problem - that extra record will simply have no join partner.
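
A minimal Bloom filter sketch in Python (k bucket numbers derived from one md5 digest; m and k are illustrative):

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=2):
            self.m, self.k = m, k
            self.bits = [0] * m

        def _positions(self, x):
            # Derive k bucket numbers from one digest.
            d = hashlib.md5(str(x).encode()).hexdigest()
            return [int(d[i*8:(i+1)*8], 16) % self.m for i in range(self.k)]

        def add(self, x):
            for p in self._positions(x):
                self.bits[p] = 1

        def might_contain(self, x):
            # No false negatives; false positives are possible.
            return all(self.bits[p] for p in self._positions(x))

    bf = BloomFilter()
    for rec_id in [3, 17, 42]:     # IDs of the smaller table
        bf.add(rec_id)
    print(bf.might_contain(17), bf.might_contain(99))   # True False (w.h.p.)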

17
Q

Data Streaming Query processing: Push versus Pull.

A

Pull: the consuming operator actively retrieves the results of the producer - a kind of iterator: open, next, close. Push: the producer pushes results to the consumer. Operators register at other operators; when new tuples are generated, they are actively pushed to the registered operators.

17
Q

Merkle Tree

A

Each parent node is the hash of its children: hierarchical checking of data integrity. Comparison: start at the root; if the hashes are the same, stop (all children will be the same). Otherwise, for the nodes with different hashes, go down to the children; eventually the differing data will be found.
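
A minimal sketch of building and comparing Merkle trees in Python (md5 for illustration; assumes a power-of-two number of blocks):

    import hashlib

    def h(s):
        return hashlib.md5(s.encode()).hexdigest()

    def build(blocks):
        # Tree as a list of levels: leaves first, root last.
        level = [h(b) for b in blocks]
        tree = [level]
        while len(level) > 1:
            level = [h(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
            tree.append(level)
        return tree

    def diff(t1, t2, node=0, level=-1):
        # Walk down from the root; recurse only where hashes differ.
        if t1[level][node] == t2[level][node]:
            return []
        if level == -len(t1):                  # reached the leaves
            return [node]
        return (diff(t1, t2, 2 * node, level - 1) +
                diff(t1, t2, 2 * node + 1, level - 1))

    a = build(["a", "b", "c", "d"])
    b = build(["a", "X", "c", "d"])
    print(diff(a, b))   # [1] -> block 1 differs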

18
Q

Document Store

A

Store JSON or XML docs. Examples: MongoDB, CouchDB.

19
Q

Conflict prevention x detection

A

Prevention: - (distributed) lock algorithms - pessimistic approach - assume things will go wrong and prevent them from happening. Detection: - “timestamps”, keep multiple versions - optimistic approach - resolve conflicts when they actually happen.

20
Q

MR X DB: data size, access, updates, structure, integrity, scale:

A

DB: GB, interactive & batch, read/write many times, static schema, high integrity, non-linear scaling. MR: PB, batch, write once - read many times, dynamic schema, low integrity, linear scaling.

21
Q

Conflict resolution in Eventual Consistency

A

Write to one, read from all: W = 1 / R = N - guarantees to see the last version - write-optimized; strong consistency. Write to all, read from one: W = N / R = 1 - guarantees to see at least one recent version - read-optimized; strong consistency. In general, W + R > N gives strong consistency: a read will see at least one most recent write (e.g., N = 3, W = 2, R = 2).

22
Q

What is HDFS and what is it for?

A

HDFS is the Hadoop Distributed File System. Data in a Hadoop cluster is broken down into smaller pieces (blocks) and distributed over the cluster. This way, map and reduce functions can be executed on smaller subsets of the larger data sets, which provides the scalability needed for big data processing. Replication is pipelined through the data nodes.

23
Q

MR example: GREP - Given a file, return all lines that contain certain pattern.

A

MAP-ONLY TASK.

    map(str key, str value)
        if (value.contains(pattern))
            emit(value, "")

24
Q

Why is Partition tolerance a strict requirement?

A

Because to process large data sets we need (distributed) scale-out architectures, and these architectures tend to suffer partitions between the nodes because of the commodity hardware they run on.

24
Q

Overview of Amazon’s Dynamo

A

Key/value store; CRUD on a per-key basis. High availability - never reject a write. This leads to different versions of the data (partitions/concurrent writes). Vector clocks: reconcile conflicting reads/writes. Consistent hashing: data placement.

25
Q

Consistent Hashing

A
• Use of a hash function to place data on machines - Only local data movement when machines are added/removed - Load balancing: strong machines can get a larger share (multiple virtual servers) - Properties: Balance: with high probability, each bucket gets about the same share of items (10 machines, 100 inserts: ~10 items per machine). Monotonicity: if a new bucket (node) is added, an item might move from an old bucket to the new one, but never from an old bucket to another old one.
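
A minimal hash-ring sketch in Python (md5-based placement; virtual nodes omitted for brevity):

    import bisect
    import hashlib

    def ring_hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes):
            self.ring = sorted((ring_hash(n), n) for n in nodes)

        def node_for(self, key):
            # Walk clockwise to the first node at or after the key's position.
            hashes = [h for h, _ in self.ring]
            i = bisect.bisect(hashes, ring_hash(key)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))
    # Adding a node only remaps the keys between its predecessor and itself.
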
26
Q

What is the basic workflow in MR?

A

A MASTER NODE controls the computation. It receives the submitted job (task), computes the necessary map and reduce tasks, and selects and activates the WORKER NODES. The map workers are, if possible, selected close to the data. The reduce workers consume the intermediate results and create the final output.

26
Q

CAP theorem proof idea

A

Consider a system with multiple partitions. A failure prevents synchronization between node 1 and node 2. What now? Prohibiting reads until synced violates availability; letting clients read violates consistency.

28
Q

DBMS vs DSMS: data, access, queries, storage, order, update rate, time requirements, accuracy.

A

Data: persistent relations | volatile streams
Access: random | sequential
Queries: one-time | continuous
Storage: unlimited secondary storage | limited main memory
Order: only current state | order of input considered
Update rate: relatively low | extremely high
Time requirements: little or none | near real-time
Accuracy: exact data | outdated/inaccurate

29
Q

Column store

A

Logically look like RDB tables, physically organized in a per-column fashion. Good for analytical tasks over subsets of columns. Dynamic schema, sparse data. Examples: HBase, BigTable, Cassandra.

31
Q

MR example: naive n-gram count and the a-priori-based variant.

A

Task: find n-grams that occur at least Y times and consist of at most Z words. Naive n-gram count - one single job, less memory consumption, easy:

    map(did, content)
        for k in 1..Z
            for all k-grams g in content
                emit(g, did)

    reduce(n-gram g, list dids)
        if length(dids) >= Y
            emit(g, length(dids))

A-priori principle: a k-gram can only occur at least Y times if its constituent (k-1)-grams occur at least Y times. Iterative implementation: first the 1-grams that occur Y times, then the 2-grams that occur Y times, … Multiple MapReduce rounds.

33
Q

What is the difference between scale-out and scale-up distributed architectures?

A

Scale-out: have many small, commodity machines (hundreds, thousands). They are cheap but not reliable - failures will happen. Scale-up: replace a machine with a bigger, stronger (more powerful) machine.

34
Q

Pseudocode for WordCount MR with function types:

A

map: (k1, v1) -> list(k2, v2); reduce: (k2, list(v2)) -> list(k3, v3). For WordCount: k1 = doc identifier, v1 = doc content; k2 = term, v2 = count; k3 = term, v3 = final count.

36
Q

What is the key idea behind MapReduce?

A

Spread the task of processing data over multiple machines according to a map and a reduce function. The framework simplifies parallel programming because it deals with node failures, load balancing, etc. In the map phase, data is distributed to a number of machines and the output is partitioned (and sorted) by a key. In the reduce phase, the data of each key group is aggregated.

37
Q

Stream –> Relation?

A

Window specification. Three ways to construct: - Time-based: S [RANGE 30 seconds], S [NOW]; - Tuple-based: S [ROWS N], e.g., S [ROWS 1]; - Partitioned: S [PARTITION BY A1..Ak ROWS N] logically partitions S into substreams - a kind of GROUP BY.

38
Q

What is Storm and how does it work?

A

Storm is a fault-tolerant, distributed stream processing system. It uses custom-created “spouts” and “bolts” to define information sources and manipulations, allowing distributed processing of streaming data. - Spout: data source, e.g., a Twitter stream. - Bolt: operator that consumes the output of spouts or other bolts (e.g., filter stopwords). - Topology: the query plan; by connecting spouts and bolts, it determines the data flow. - Trident: a high-level abstraction on top of Storm.

39
Q

What is Pig? Compare it to RDBMS.

A

High-level tool for expressing data analysis programs. The compiler transforms a query into a sequence of MR jobs. Pig Latin vs SQL: - Pig Latin is a data-flow programming language: user-specified operations are put together to achieve a task. - SQL is declarative: the user specifies what the result should be (not how it is computed). Pig vs RDBMS: - RDBMS: tables with predefined schemas; supports transactions and indices; aims at fast response times. - Pig: schema at runtime (even optional); any source; no loading or indexing of data as pre-processing - data is loaded at execution time; aims at throughput, not at super-fast short queries.

41
Q

What is CQL?

A

It is the declarative query language to phrase continuous queries, SQL-like. It includes streams, windows, new semantics (three relation-to-stream operators: IStream, DStream and RStream) and sampling.

42
Q

Explain the Reduce-Side Join and give pseudocode for both map and reduce functions.

A

Map is responsible for emitting the join predicate value (the id) along with the corresponding record from each table, so that records having the same id in both tables end up at the same reducer. The reducer then joins the records having the same id. It is also necessary to TAG each record to indicate from which table it originated, so that joining happens between records of the two different tables.

    map(K table, V rec)
        id = rec.id
        tagged_rec.tag = table
        tagged_rec.rec = rec
        emit(id, tagged_rec)

    reduce(K id, list tagged_recs)
        for each tg_rec1 in tagged_recs
            if tg_rec1.tag == R
                for each tg_rec2 in tagged_recs
                    if tg_rec2.tag == S
                        emit(id, join(tg_rec1.rec, tg_rec2.rec))

44
Q

Give an overview of query execution in STREAM (the Stanford DSMS).

A

When a continuous query is registered, a query execution plan is generated. New plans can be merged with existing plans; users can also create and manipulate plans directly. Plans are composed of three main components: operators, queues (input and inter-operator) and state (windows, operators requiring history). A global scheduler drives plan execution.

45
Q

How can we process a STREAM query plan distributed?

A

To process it distributed, we just have to take care of the communication! Check the dependencies.

47
Q

What is a custom partitioner and when is it used?

A

When you have a composite key (k1, k2), you can write a custom partitioner that partitions by k1 only, plus a sort comparator for sorting by k2. This leads to a second problem: the reducer still consumes groups formed by the whole key within the correct partition. The solution is to define a custom grouping method that considers only k1 for grouping.
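
A small Python simulation of this secondary-sort pattern (partition by k1, sort by the composite key, group by k1); names and data are illustrative:

    from itertools import groupby

    # Simulated map output: composite keys (k1, k2) with values.
    # A custom partitioner would hash k1 only, so every (k1, *) pair
    # reaches the same reducer.
    pairs = [(("b", 2), "x"), (("a", 3), "y"), (("a", 1), "z")]

    # Within a partition, the sort comparator orders by (k1, k2) ...
    pairs.sort(key=lambda kv: kv[0])

    # ... while the grouping comparator groups by k1 only, so one
    # reduce call sees all its k2 values in sorted order.
    for k1, group in groupby(pairs, key=lambda kv: kv[0][0]):
        print(k1, [(k2, v) for (_, k2), v in group])
    # a [(1, 'z'), (3, 'y')]
    # b [(2, 'x')]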

48
Q

MR example: co-occurrences - Given a text file, how many times do a,b occur together? SOLUTION 1

A

Expected result: ([a,b], count). M = N x N (N = vocabulary size); Mij = number of times i and j co-occur in some context.

Solution 1 (pairs):

    map(str key, str file)
        for each word w1 in file
            for each word w2 in Neighbors(w1)
                emit([w1, w2], 1)

    reduce(str key, list values)
        int result = 0
        for each value v in values
            result += v
        emit(key, result)

+ easy to implement + easy to understand - lots of pairs to sort and shuffle - no use of combiners

49
Q

What are the default shuffle and sort behaviors?

A

The output of map is partitioned by key, and each reducer is guaranteed to get an entire partition. The output of the reducer is also sorted by key. The chosen key therefore affects both the partitioning and the sort order.

50
Q

Eventual Consistency Synchronization Process:

A

Given N nodes (replicas), each of them might or might not have the most recent value of an object. Communication between the nodes has to ensure a consistent view of the data (replicas). Two solutions: Naive: - broadcast; robust but inefficient - too many messages. - Epidemic algorithms (gossip): Anti-entropy: information is constantly exchanged with a randomly selected node; always exchange the currently stored versions of the items, and do that continuously. Rumor spreading: information is exchanged with randomly chosen nodes for multiple rounds, then stop; with high probability, the data is consistently replicated afterwards. Variants: push, pull and push/pull.

51
Q

How does the K-min estimator work?

A

Supposing we have a good hash function, the hash values will be evenly distributed across the hash space (say [0..1]). You could estimate the number of distinct values seen by knowing the average spacing between values in the hash space: for 10 distinct values, the average spacing is 1/10. You could do this cheaply by keeping track of only the smallest value; however, tracking only one value opens you up to a ton of variance, and you become dependent on how “good” your hash function is. To improve it, we keep track of the K smallest values. Estimate = (K - 1) / Kmax, e.g., (3 - 1) / 0.3 ≈ 6.7. Unions are “lossless”: merely take the two sketches, combine their values and keep the K smallest ones.
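
A minimal KMV sketch in Python (md5 mapped into [0, 1); k = 3 matches the example above, and a larger k gives a tighter estimate):

    import hashlib

    def to_unit(x):
        # Hash into [0, 1).
        return int(hashlib.md5(str(x).encode()).hexdigest(), 16) / 16**32

    def kmv_estimate(stream, k=3):
        smallest = sorted({to_unit(x) for x in stream})[:k]
        if len(smallest) < k:
            return len(smallest)        # fewer than k distinct values: exact
        return (k - 1) / smallest[-1]   # (k - 1) / k-th smallest hash

    print(round(kmv_estimate(range(1000), k=64)))   # rough estimate of 1000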

52
Q

What is the combiner and what is it good for?

A

The combiner function is used as an optimization for the MR job. It runs on the output of the map phase and is used as a filtering or aggregating step (the aggregation function must be associative and commutative) to lessen the number of intermediate keys being passed to the reducer. The combiner is not a replacement for the reducer because it sees only local information. Example: tasks that check whether the number of observations of a certain item (term) is beyond a threshold (n-grams).

53
Q

What are the differences between traditional data management and data stream management?

A

Traditionally, data is periodically loaded into a store for deeper analytics. At query time, data is accessed as a whole, and queries are mainly ad hoc. In a data stream, data is continuously moving, i.e., continuously being generated and assumed to be infinite. Queries are “standing”: registered once, observed “forever”. Answers in near real-time are often required. Probabilistic methods are used for efficiency or because we consider only part of the stream.

55
Q

Cite the three possible scenarios for data locality in a Hadoop cluster:

A

1) Data-local: map task and HDFS block on the same node. 2) Rack-local: map task and HDFS block in the same rack. 3) Off-rack: map task in one rack and HDFS block in another.

56
Q

Key-Value Stores

A

Store key/value pairs; the value can be a complex datatype. Examples: Dynamo, Redis, Voldemort. CRUD; range queries.

58
Q

What is BASE? What is the idea?

A

BASE means Basically Available, Soft State, Eventual Consistency. The idea is to sacrifice strong consistency to gain faster response times in a more scalable manner. - High availability for first-tier services; - background clean-up mechanisms; - resolve problems optimistically when an action violates consistency.

59
Q

Examples of data stream scenarios:

A

Distributed sensor networks, mobile ad-hoc networks, social sensors, stock market.

61
Q

Which are the 10 steps to execute an MR job in Hadoop?

A

1 - The MR program starts the job. 2 - The job gets a new JobID from the JobTracker. 3 - The job copies the job resources to HDFS. 4 - The job is submitted to the JobTracker. 5 - The JobTracker initializes the job. 6 - The JobTracker retrieves the input splits from HDFS. 7 - The TaskTrackers send heartbeats. 8 - The TaskTracker retrieves the job resources. 9 - The TaskTracker launches a child JVM. 10 - The child runs the map or reduce task.

62
Q

Rumor spreading: push x pull x push/pull

A
• Push: the holder of new info actively distributes it. PREDICTABLE. Good when few nodes are informed (exponential growth); slow to inform everyone - with high probability some uninformed node won't get called. O(log n) rounds, O(n log n) messages. - Pull: nodes actively call others to obtain news. FAST CONVERGENCE. When few nodes are informed, the startup is unpredictable: an informed node might not get called. If a fraction p is still uninformed in this round, then p² will remain uninformed in the next. O(log log n) rounds, O(n) messages. - Push/Pull: predictability AND fast convergence. A calls B to push a rumor and concurrently tries to pull it from B. O(log n) rounds, O(n) messages // not sure…
63
Q

How to compute the PageRank in MR?

A

See the lecture notes.
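
The card defers to the notes; as a hedged sketch, one PageRank iteration can be phrased in MR style as follows, assuming each input record carries a node, its current rank and its adjacency list (toy 3-node graph, damping factor 0.85):

    from collections import defaultdict

    graph = {"a": (1/3, ["b", "c"]), "b": (1/3, ["c"]), "c": (1/3, ["a"])}
    D = 0.85   # damping factor

    def map_fn(node, rank, links):
        yield node, ("links", links)                # pass the structure through
        for t in links:
            yield t, ("mass", rank / len(links))    # rank contribution to t

    def reduce_fn(node, values):
        links, mass = [], 0.0
        for tag, v in values:
            if tag == "links":
                links = v
            else:
                mass += v
        return node, ((1 - D) / len(graph) + D * mass, links)

    shuffled = defaultdict(list)                    # simulate the shuffle
    for n, (r, ls) in graph.items():
        for k, v in map_fn(n, r, ls):
            shuffled[k].append(v)
    graph = dict(reduce_fn(n, vs) for n, vs in shuffled.items())
    print(graph)   # one iteration; repeat until the ranks converge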

64
Q

MR example: Breadth-First Search

A

Performing computation on a graph data structure requires processing at each node. Each node contains node-specific data as well as links (edges) to other nodes. Computation must traverse the graph and perform the computation step. How do we traverse a graph in MR? BFS is an iterated algorithm over graphs: the frontier advances from the origin by one level with each pass. In MR: iterated passes through MR - map some nodes; the result includes additional nodes which are fed into successive MR passes. How do we represent a graph for this? Sending the entire graph to (thousands of) map tasks involves an enormous amount of memory, so we need to carefully consider the representation: - direct references: objects with references from each node to its neighbors; not easily serializable. - adjacency matrix: Mij = 1 implies a link from node i to j; problem: full of zeros. - sparse matrix: only include the non-zero elements.

65
Q

Vector Clocks

A

Assign each of your processes an ID, then make sure you include that ID and the last vector clock you saw for a given value when storing a modification or sending a message. It is an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations.
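
A minimal vector clock sketch in Python (dicts keyed by process ID; IDs are illustrative):

    def increment(clock, pid):
        # Local event at process pid.
        c = dict(clock)
        c[pid] = c.get(pid, 0) + 1
        return c

    def merge(a, b):
        # On message receipt: element-wise maximum of both clocks.
        return {p: max(a.get(p, 0), b.get(p, 0)) for p in a.keys() | b.keys()}

    def happened_before(a, b):
        # a -> b iff a <= b element-wise and a != b.
        return a != b and all(a.get(p, 0) <= b.get(p, 0)
                              for p in a.keys() | b.keys())

    v1 = increment({}, "p1")              # {'p1': 1}
    v2 = increment(merge(v1, {}), "p2")   # {'p1': 1, 'p2': 1}
    print(happened_before(v1, v2))        # True
    # If neither clock happened-before the other, the updates are
    # concurrent - a conflict that must be reconciled.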

66
Q

What is the Count-Min Sketch and how does it work?

A

The Count-Min sketch is a probabilistic counting algorithm. We keep a two-dimensional array (h rows, r columns) and h hash functions that map to the range 0..(r - 1). For every element, we compute all h hash values and increment the corresponding bucket in each row by 1. To know how often we saw an item, we compute its hashes, retrieve the value of each corresponding bucket, and take the minimum of those values. We take the minimum because more than one item can be mapped to the same bucket - collisions only inflate counts - so each bucket is an over-estimate and the smallest one is the tightest.
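
A minimal Count-Min sketch in Python (h salted md5 hash functions; h and r are illustrative):

    import hashlib

    class CountMin:
        def __init__(self, h=4, r=256):
            self.h, self.r = h, r
            self.table = [[0] * r for _ in range(h)]

        def _buckets(self, x):
            # Derive one bucket index per row from salted digests.
            for i in range(self.h):
                d = hashlib.md5(f"{i}:{x}".encode()).hexdigest()
                yield i, int(d, 16) % self.r

        def add(self, x):
            for i, b in self._buckets(x):
                self.table[i][b] += 1

        def count(self, x):
            # Minimum over the rows: collisions only inflate, never deflate.
            return min(self.table[i][b] for i, b in self._buckets(x))

    cm = CountMin()
    for w in ["a", "b", "a", "a"]:
        cm.add(w)
    print(cm.count("a"), cm.count("b"))   # 3 1 (possibly over-estimates)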