Exam Flashcards
SQL in MR: SELECT name, salary FROM employees WHERE age < 40
The map function filters on age and emits (name, salary); the reduce function is the identity (the job can run map-only).
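A minimal Python sketch of the map side (the record layout is an assumption for illustration):

def map_filter(record):
    name, age, salary = record       # record layout is an assumption
    if age < 40:
        yield (name, salary)         # reduce is the identity / omitted

for rec in [("ann", 35, 50000), ("bob", 52, 60000)]:
    for out in map_filter(rec):
        print(out)                   # ('ann', 50000)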
Relation –> Stream
Istream(R): contains all tuples in R that are new within the last time period (insert stream). Dstream(R): contains all tuples that were in R before the last period and are not anymore (delete stream). Rstream(R): contains all tuples in R at each time instant (relation stream).
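A toy illustration of the three operators over two consecutive snapshots of R (bag semantics simplified to sets; a sketch, not the full CQL definition):

def istream(r_now, r_prev):
    return r_now - r_prev        # tuples newly inserted in the last period

def dstream(r_now, r_prev):
    return r_prev - r_now        # tuples deleted since the last period

def rstream(r_now, r_prev):
    return r_now                 # every tuple currently in R

prev, now = {1, 2, 3}, {2, 3, 4}
print(istream(now, prev))        # {4}
print(dstream(now, prev))        # {1}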
SQL in MR: SELECT name, AVG(contacts) FROM facebookTable GROUP BY name
The map function emits records with key = name; the reduce function computes the average per key.
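A minimal Python sketch, with the shuffle phase simulated by a dict (the record layout is an assumption):

from collections import defaultdict

def map_avg(record):
    name, contacts = record          # record layout is an assumption
    yield (name, contacts)

def reduce_avg(name, values):
    return (name, sum(values) / len(values))

groups = defaultdict(list)           # simulated shuffle: group by key
for rec in [("ann", 10), ("bob", 4), ("ann", 20)]:
    for k, v in map_avg(rec):
        groups[k].append(v)
print([reduce_avg(k, vs) for k, vs in groups.items()])
# [('ann', 15.0), ('bob', 4.0)]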
What are n-grams and what are they used for?
N-grams are variable-length word sequences that have many applications in fields including information retrieval, natural language processing, and digital humanities.
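A minimal generator for word n-grams in Python:

def ngrams(text, n):
    # contiguous sequences of n words
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("to be or not to be", 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]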
What is eventual consistency?
If no new updates happen, eventually all accesses to that item will return the last updated value. A system that has achieved eventual consistency is said to have reached REPLICA CONVERGENCE. To resolve conflicts, the system must reconcile differences between multiple copies of distributed data (broadcast, anti-entropy, or rumor spreading). Timestamps and vector clocks are used to detect concurrency between updates.
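A minimal sketch of concurrency detection with vector clocks (the dict-of-counters representation is an assumption for illustration):

def dominates(a, b):
    # a dominates b if every counter in a is >= the counter in b
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def concurrent(a, b):
    # two updates are concurrent iff neither clock dominates the other
    return not dominates(a, b) and not dominates(b, a)

v1 = {"nodeA": 2, "nodeB": 1}
v2 = {"nodeA": 1, "nodeB": 2}
print(concurrent(v1, v2))   # True -> conflict must be reconciled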
Four architectural issues related to MR.
1) Distributed file system 2) Big chunks of data to process (64 MB, 128 MB) 3) Chunks replicated and distributed 4) Data processing is moved to the data (when possible)
MR example: Inverted Index
Similar to WordCount:

map(str key, str file)
  for each word w in file
    emit(w, file.id)

reduce(str word, list values)
  str result = "" // list of all doc IDs where the word occurs
  for each value v in values
    result += v + " "
  emit(word, result)
What are data stream synopses?
Synopses (or estimators) concisely represent the stream content. They are tailored to specific tasks (e.g., counting distinct elements). Usually not exact, but an approximation of the true values.
Relation –> Relation?
With the window obtained from Stream –> Relation, we get a relation to which we can apply any query expressible in SQL, e.g. a CQL-style query such as SELECT AVG(price) FROM Ticker [Range 10 Minutes] (stream and attribute names are illustrative).
Graph DBs
Data model: vertices and edges. Queries: shortest path, connected components, followers. Examples: Neo4J, GraphBase. Higher performance than self-joins.
What is the CAP theorem?
The CAP theorem says that it is impossible for a distributed computer system to simultaneously provide all three guarantees:
- Consistency: all nodes have the same data at the same time;
- Availability: a guarantee that every request receives a response about whether it was successful or failed;
- Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.
Read/Write and Write/Write conflicts:
Conflicting operations on the same record. Two kinds:
- read/write: one TA wants to read a record and a second TA wants to write it;
- write/write: two TAs want to write to the same record.
How can we compute DB-style queries in a data stream?
Main idea: keep a window that renders the continuous (infinite) stream finite. Sliding windows focus attention on the latest values of the stream and allow computation of aggregates. Joins are computed across windows over another (or the same) stream.
What are stragglers and how can they be solved?
Stragglers are slow nodes. Imagine that out of 1000 nodes, only one is slow and delays the overall response time. We can solve stragglers with speculative execution: run the same task on more than one node, and the node that finishes first wins. This performance improvement comes at the cost of higher cluster utilization (the work of the nodes that "lost" is wasted).
MR example: co-occurrences. Given a text file, how many times do a and b occur together? SOLUTION 2
Solution 2: group pairs together into an associative array: a -> {b:1, c:2, d:5, e:3, f:2}. The mapper takes a sentence, generates all co-occurring term pairs, and for each term emits a -> {...}. The reducer performs an element-wise sum of the associative arrays.

map(docid a, doc d)
  for all term w in d do
    H = new AssociativeArray()
    for all term u in Neighbors(w) do
      H{u} = H{u} + 1
    emit(w, H) // term, stripe

reduce(term w, stripes [H1, H2, H3, ...])
  Hf = new AssociativeArray()
  for all stripe H in stripes do
    sum(Hf, H) // element-wise sum
  emit(term w, stripe Hf)

+ far less sorting and shuffling of key-value pairs
+ makes better use of combiners
- more difficult to implement
- underlying object is more heavyweight
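A minimal Python simulation of the stripes approach (the shuffle is simulated with a dict; tokenization and window size are illustrative assumptions):

from collections import Counter, defaultdict

def map_stripes(doc, window=1):
    # per term, emit an associative array of neighbour counts
    terms = doc.split()
    for i, w in enumerate(terms):
        stripe = Counter()
        for u in terms[max(0, i - window): i] + terms[i + 1: i + 1 + window]:
            stripe[u] += 1
        yield (w, stripe)

def reduce_stripes(term, stripes):
    total = Counter()
    for s in stripes:
        total.update(s)          # element-wise sum
    return (term, dict(total))

groups = defaultdict(list)       # simulated shuffle
for k, v in map_stripes("a b a c"):
    groups[k].append(v)
print([reduce_stripes(k, v) for k, v in groups.items()])
# [('a', {'b': 2, 'c': 1}), ('b', {'a': 2}), ('c', {'a': 1})]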
What is the main idea behind FM-Sketch?
Hash values from a collection into a binary string and use patterns in those strings as an indicator of the number of distinct values in the collection (bit-pattern observables); then use stochastic averaging to combine m trials into a better estimate. B[0] is set approximately n/2 times, B[1] approximately n/4 times. B[i] = 0 if i >> log2(n), B[i] = 1 if i << log2(n), with a mix of 1s and 0s around i ~ log2(n). Sketches can be combined: B = B_S OR B_T, where B_S and B_T are the binary strings of their respective streams, yields the sketch of the union of the two streams.
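A single-trial FM-sketch in Python (md5 and the correction constant are standard choices, but the code is an illustrative sketch; real implementations use stochastic averaging over m bitmaps to reduce variance):

import hashlib

PHI = 0.77351                    # Flajolet-Martin correction factor

def lsb(x):
    # position of the least significant 1-bit
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    bitmap = 0
    for e in stream:
        h = int(hashlib.md5(str(e).encode()).hexdigest(), 16)
        bitmap |= 1 << lsb(h)    # record the observed bit pattern
    r = 0                        # index of the first 0-bit
    while bitmap >> r & 1:
        r += 1
    return 2 ** r / PHI

print(fm_estimate(range(1000)))  # rough estimate of 1000 distinct values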
Explain the map-side join and give pseudocode for the map and reduce functions.
In the map-side join, one relation has to fit in memory. The smaller table is replicated to each node and loaded into memory. The join happens on the map side without reduce involvement, which significantly speeds up the process since it avoids shuffling all data across the network. The smaller table can be loaded into a hash table so look-ups by ID can be done.

map(K table, V rec)
  // gets the smaller table's records having this ID
  list recs = lookup(rec.ID)
  for each small_table_rec in recs
    joined_rec = join(small_table_rec, rec)
    emit(rec.ID, joined_rec)

If the smaller table does not fit in memory, an even smaller data set can be derived if a filtering expression has been specified in the query.
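A Python sketch of the idea (table contents and field names are illustrative assumptions):

small = {1: "sales", 2: "engineering"}        # dept_id -> dept_name (assumed)

def map_join(record):
    emp_name, dept_id = record
    dept = small.get(dept_id)                 # in-memory hash-table lookup
    if dept is not None:
        yield (dept_id, (emp_name, dept))     # joined record, no reduce side

for rec in [("ann", 1), ("bob", 2), ("eve", 9)]:
    for out in map_join(rec):
        print(out)   # (1, ('ann', 'sales')), (2, ('bob', 'engineering'))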
NoSQL Five Points
- No one-size-fits-all database
- Non-relational model; CRUD
- Designed for distributed scale-out architecture
- No or little schema
- Mostly no full ACID support; instead: BASE
How do relations and streams work together?
We have a discrete, ordered time domain T. A relation R is a mapping from time T to a bag of tuples belonging to the schema of R. A stream is an unbounded bag of elements (s, t), where s is a tuple and t is a timestamp.
Data Replication
Replicate each object N times. Store the data at the next nodes on the ring: the successors of a node must be physically distinct machines (not virtual nodes!). Routing (in order to access the other nodes):
- naive: each node knows only its neighbor; send a message to the nearest neighbor, getting closer to the target node with each hop. Slow: O(n).
- logarithmic cost: lookup table with exponentially increasing distances: O(log n).
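A sketch of replica placement on a consistent-hashing ring (md5 as the hash and the node names are illustrative assumptions):

import bisect, hashlib

def h(x):
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16)

nodes = ["node1", "node2", "node3", "node4"]
ring = sorted((h(n), n) for n in nodes)      # positions on the ring

def preference_list(key, n_replicas=3):
    # the key's successor plus the next N-1 nodes store the replicas
    i = bisect.bisect(ring, (h(key),))
    return [ring[(i + k) % len(ring)][1] for k in range(n_replicas)]

print(preference_list("user42"))   # the 3 nodes storing this key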
Which are the main types of sliding windows? Are there any others?
Time-based: can have arbitrary size (all records that fit in a slide with time range t), e.g. tweets from the last 10 minutes. Count-based: deterministic size; the window contains at any time a fixed number of items (e.g. 100 tweets), and newly arriving items kick out older ones. Other variants: windows that advance only on certain ticks, and windows that reset when full.
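A count-based window in Python, using a deque with a maximum length:

from collections import deque

window = deque(maxlen=3)                 # fixed capacity: 3 items
for item in [10, 20, 30, 40, 50]:
    window.append(item)                  # new items evict the oldest
    print(list(window), "avg =", sum(window) / len(window))
# after 40 arrives, 10 has been evicted: [20, 30, 40]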
MR example: Word Count
map(str key, str value)
  for each word w in value
    emit(w, 1)

reduce(str key, list values)
  int result = 0
  for each value v in values
    result += v
  emit(key, result)
What is a Bloom filter and how can it be used in the semi-join?
A Bloom filter is a construct that can be used to test the containment of a given element in a set. We have a bit array (size m), initially all 0, and we encode elements into that array: hash each element to a bucket number and set that bit to 1, using multiple hash functions h_i. Test: is x contained in the set (= filter)? Check whether all bits h_1(x), ..., h_k(x) are set to 1. In the semi-join, a smaller representation of the filtered IDs can be derived by adding the ID values to a Bloom filter, which is then replicated to each node. On the map side, for each record fetched from the smaller table, the Bloom filter is used to check whether the record's ID is present. Only if so do we emit that particular record to the reduce side. Because a Bloom filter guarantees no false negatives, the result is accurate. If the Bloom filter yields a false positive, it is no big problem: that extra record will simply have no join partner.
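A minimal Bloom filter in Python (the bit array is represented as an integer; the seeded-md5 hash functions are an illustrative choice):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, x):
        # derive k hash functions by prefixing a seed
        for seed in range(self.k):
            digest = hashlib.md5(f"{seed}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, x):
        for p in self._positions(x):
            self.bits |= 1 << p

    def contains(self, x):
        # may give false positives, never false negatives
        return all(self.bits >> p & 1 for p in self._positions(x))

bf = BloomFilter()
bf.add(42)
print(bf.contains(42), bf.contains(7))   # True False (7: very likely False)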
Data Streaming Query processing: Push versus Pull.
Pull: the consuming operator actively retrieves results from the producer; a kind of iterator: open, next, close. Push: the producer pushes results to the consumer; operators register at other operators, and when new tuples are generated they are actively pushed to the registered operators.
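A minimal Python sketch contrasting the two models (names are illustrative):

# Pull: the consumer drives evaluation via an iterator protocol.
def pull_source(data):
    yield from data                       # open/next/close ~ generator

for t in pull_source([1, 2, 3]):
    print("pulled", t)

# Push: the producer calls back every registered consumer on new tuples.
class PushSource:
    def __init__(self):
        self.consumers = []
    def register(self, fn):
        self.consumers.append(fn)
    def emit(self, t):
        for fn in self.consumers:
            fn(t)

src = PushSource()
src.register(lambda t: print("pushed", t))
src.emit(1)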
Merkle Tree
Each parent node is the hash of its children, enabling hierarchical checking of data integrity. Comparison: start at the root; if the hashes are the same, stop (all children will be the same). Otherwise, for nodes with different hashes, go down to the children; eventually the differing data is found.
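A minimal Merkle-root computation in Python (sha256 and the odd-node duplication rule are illustrative choices):

import hashlib

def h(x):
    return hashlib.sha256(x).hexdigest()

def merkle_root(blocks):
    level = [h(b) for b in blocks]               # leaf hashes
    while len(level) > 1:
        if len(level) % 2:                       # duplicate last if odd
            level.append(level[-1])
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

a = merkle_root([b"block1", b"block2", b"block3", b"block4"])
b = merkle_root([b"block1", b"blockX", b"block3", b"block4"])
print(a == b)    # False -> descend into children to find the difference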
Document Store
Store JSON or XML documents. Examples: mongoDB, couchDB.