Exam Flashcards

1
Q

SQL in MR: SELECT name, salary FROM employees WHERE age < 40

A

Map filters on age and emits (name, salary); Reduce doesn’t have to do anything (identity, or a map-only job).
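
A minimal sketch of this query in Python-style MR, assuming each input record is a dict with name, salary and age fields (names and data are illustrative):

    def map_fn(record):
        # WHERE age < 40: filter, then project (name, salary).
        if record["age"] < 40:
            yield record["name"], record["salary"]

    # No reduce needed: this can run as a map-only job.
    employees = [{"name": "Ada", "salary": 90, "age": 36},
                 {"name": "Bob", "salary": 70, "age": 52}]
    for rec in employees:
        for name, salary in map_fn(rec):
            print(name, salary)   # -> Ada 90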

1
Q

Relation –> Stream

A

Istream(R): contains all tuples that are new in R within the last time period (insert stream). Dstream(R): contains all tuples that were in R before the last period and are not anymore (delete stream). Rstream(R): contains all tuples in R.

2
Q

SQL in MR: SELECT name, AVG(contacts) FROM facebookTable GROUP BY name

A

Map emits each record with key = name and value = contacts; Reduce computes the average per key.
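
A minimal sketch, assuming the input rows are (name, contacts) pairs and simulating the shuffle with a dict (illustrative):

    from collections import defaultdict

    def map_fn(row):
        name, contacts = row
        yield name, contacts                     # key = name

    def reduce_fn(name, values):
        yield name, sum(values) / len(values)    # AVG per group

    rows = [("ann", 10), ("bob", 4), ("ann", 20)]
    groups = defaultdict(list)
    for row in rows:
        for k, v in map_fn(row):
            groups[k].append(v)
    for k, vs in groups.items():
        print(next(reduce_fn(k, vs)))   # ('ann', 15.0), ('bob', 4.0)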

2
Q

What are n-grams and what are they used for?

A

N-grams are variable-length word sequences with many applications in fields including information retrieval, natural language processing and the digital humanities.

2
Q

What is eventual consistency?

A

If no new updates happen, eventually all accesses to that item will return the last updated value. A system that has achieved eventual consistency is said to have reached REPLICA CONVERGENCE. To resolve conflicts, the system must reconcile differences between multiple copies of distributed data (broadcast, anti-entropy or rumor spreading). Timestamps and vector clocks are used to detect concurrency between updates.

3
Q

Four architecture issues related to MR.

A

1) Distributed file system 2) Data is processed in big chunks (64 MB, 128 MB) 3) Chunks are replicated and distributed 4) Data processing is moved to the data (when possible)

3
Q

MR example: Inverted Index

A

Similar to WordCount; Reduce builds the list of all doc IDs where the key (word) occurs:

    map(str key, str file)
        for each word w in file
            emit(w, file.id)

    reduce(str key, list values)
        str result = ""
        for each value v in values
            result += v + " "
        emit(key, result)

3
Q

What are data stream synopses?

A

Synopses (or estimators) concisely represent the stream content. They are tailored to specific tasks (e.g., counting distinct elements). They are usually not exact, but approximations of the true values.

3
Q

Relation –> Relation?

A

With the window obtained from Stream –> Relation, we get a relation to which we can apply any query expressed in SQL.

3
Q

Graph DBs

A

Data model: vertices and edges. Queries: shortest path, connected components, followers. Examples: Neo4j, GraphBase. Higher performance than self-joins in a relational DB.

4
Q

What is the CAP theorem?

A

The CAP theorem says that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: - Consistency: all nodes see the same data at the same time; - Availability: a guarantee that every request receives a response about whether it was successful or failed; - Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.

4
Q

Read/Write and Write/Write conflicts:

A

Conflicting operations on the same record. Two kinds: - read/write: one TA wants to read a record and a second TA wants to write to it; - write/write: two TAs want to write to the same record.

5
Q

How can we compute DB-style queries in a data stream?

A

Main idea: keep a window that renders a finite view of the continuous (infinite) stream. Sliding windows focus attention on the latest values of the stream and allow the computation of aggregates. Joins are computed across windows overlaid on other (or the same) streams.

7
Q

What are stragglers and how can they be solved?

A

Stragglers are slow nodes. Imagine that out of 1000 nodes, only one is slow and delays the overall response time. We can counter stragglers with speculative execution: run the same task on more than one node, and the node that finishes first wins. This performance improvement comes at the cost of higher cluster utilization (the work of the nodes that “lost” is wasted).

9
Q

MR example: co-occurrences - Given a text file, how many times do a,b occur together? SOLUTION 2

A

Solution 2 (stripes): group pairs together into an associative array: a -> {b:1, c:2, d:5, e:3, f:2}. The mapper takes a sentence, generates all co-occurring term pairs and, for each term, emits a -> {…}. The reducer performs an element-wise sum of the associative arrays.

    map(docid a, doc d)
        for all term w in d do
            H = new AssociativeArray()
            for all term u in Neighbors(w) do
                H(u) = H(u) + 1
            emit(w, H)   // term, stripe

    reduce(term w, stripes [H1, H2, H3, …])
        Hf = new AssociativeArray()
        for all stripe H in stripes do
            sum(Hf, H)   // element-wise sum
        emit(term w, stripe Hf)

+ far less sorting and shuffling of key-value pairs + makes better use of combiners - more difficult to implement - underlying object is more heavyweight

10
Q

What is the main idea behind FM-Sketch?

A

Hash values from a collection into a binary string and use patterns in those strings as an indicator for the number of distinct values in that collection (bit-pattern observables); then use stochastic averaging to combine m trials into a better estimate. B[0] is set approximately n/2 times, B[1] approximately n/4 times; B[i] = 0 if i >> log2(n), B[i] = 1 if i << log2(n), and there is a mix of 1s and 0s around i ≈ log2(n). To combine the sketches of two streams: B = B_S OR B_T, where B_S and B_T are the bitmaps corresponding to the respective streams.
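
A minimal single-trial sketch of the idea in Python (no stochastic averaging, so high variance; md5 stands in for a suitable hash function, and 0.77351 is the standard FM correction factor):

    import hashlib

    PHI = 0.77351

    def rho(x):
        # Position of the least-significant 1-bit of x (0-based).
        r = 0
        while x & 1 == 0:
            x >>= 1
            r += 1
        return r

    def fm_estimate(stream):
        B = 0   # bitmap kept as an integer
        for item in stream:
            h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
            B |= 1 << rho(h)
        R = 0   # index of the lowest 0-bit in B
        while B & (1 << R):
            R += 1
        return (2 ** R) / PHI

    print(round(fm_estimate(range(1000))))   # rough (high-variance) estimate of 1000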

11
Q

Explain the Map-Side Join and give pseudocode for both map and reduce functions.

A

In the map-side join, one relation has to fit in memory. The smaller table is replicated to each node and loaded into memory. The join happens on the map side without reduce involvement, which significantly speeds up the process since it avoids shuffling all data across the network. The smaller table can be loaded into a hash table so look-ups by ID can be done.

    map(K table, V rec)
        // Get the smaller-table records having this ID
        list recs = lookup(rec.ID)
        for each small_table_rec in recs
            joined_rec = join(small_table_rec, rec)
            emit(rec.ID, joined_rec)

If the smaller table doesn’t fit in memory, an even smaller data set can be derived if a filtering expression has been specified in the query.

12
Q

NoSQL Five Points

A
  • No one-size-fits-all database - Non-relational model; CRUD - Designed for distributed scale-out architecture - No or little schema - Mainly not full ACID support; instead: BASE
13
Q

How do relations and streams work together?

A

We have a discrete, ordered time domain T. A relation R is a mapping from time T to a bag of tuples belonging to the schema of R. A stream is a set of elements ⟨s, τ⟩, where s is a tuple and τ ∈ T is a timestamp.

14
Q

Data Replication

A

Replicate each object N times. Store the copies at the successor nodes - physically distinct nodes (not virtual nodes!). Routing (in order to reach the other nodes): - naive: each node knows its neighbor; send a message to the nearest neighbor, getting closer to the target node with each hop. SLOW: O(n). - logarithmic cost: lookup table with exponentially increasing distances, O(log n).

15
Q

Which are the main types of sliding windows? Are there any others?

A

Time-based: can have arbitrary size (all records that fall within a slide of time range t), e.g., tweets from the last 10 minutes. Count-based: deterministic size; the window contains at any time a fixed number of items (e.g., 100 tweets), and newly arriving items kick out older ones. Others: -> we can move on certain ticks -> we can reset when full.
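
A minimal count-based window sketch in Python (window size 3, with a windowed average as the example aggregate):

    from collections import deque

    class CountWindow:
        def __init__(self, size=3):
            # deque(maxlen=...) kicks out the oldest item automatically.
            self.items = deque(maxlen=size)

        def insert(self, x):
            self.items.append(x)

        def average(self):
            return sum(self.items) / len(self.items)

    w = CountWindow(size=3)
    for x in [10, 20, 30, 40]:
        w.insert(x)
    print(list(w.items), w.average())   # [20, 30, 40] 30.0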

16
Q

MR example: Word Count

A

    map(str key, str value)
        for each word w in value
            emit(w, 1)

    reduce(str key, list values)
        int result = 0
        for each value v in values
            result += v
        emit(key, result)

16
Q

What is a Bloom Filter and how can it be used in the Semi-Join?

A

A Bloom filter is a construct which can be used to test the containment of a given element in a set. We have a bit array (size m), initially all 0, and we encode elements into that array: hash each element to a bucket number and set that bit to 1, using multiple hash functions hi. Test: is x contained in the set (= filter)? Check whether all bits h1(x), h2(x), … are set to 1. For the semi-join, a smaller representation of the filtered IDs can be derived by folding the ID values of the smaller table into a Bloom filter. This Bloom filter can then be replicated to each node. On the map side, for each record fetched from the other table, the Bloom filter is used to check whether the record’s ID is present; only if so do we emit that record to the reduce side. Because a Bloom filter guarantees no false negatives, the result is accurate. A false positive is no big problem - that extra record will simply have no join partner.
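
A minimal Bloom filter sketch in Python (k bucket numbers derived from one md5 digest; m and k are illustrative):

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=2):
            self.m, self.k = m, k
            self.bits = [0] * m

        def _positions(self, x):
            # Derive k bucket numbers from one digest.
            d = hashlib.md5(str(x).encode()).hexdigest()
            return [int(d[i*8:(i+1)*8], 16) % self.m for i in range(self.k)]

        def add(self, x):
            for p in self._positions(x):
                self.bits[p] = 1

        def might_contain(self, x):
            # No false negatives; false positives are possible.
            return all(self.bits[p] for p in self._positions(x))

    bf = BloomFilter()
    for rec_id in [3, 17, 42]:     # IDs of the smaller table
        bf.add(rec_id)
    print(bf.might_contain(17), bf.might_contain(99))   # True False (w.h.p.)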

17
Q

Data Streaming Query processing: Push versus Pull.

A

Pull: the consuming operator actively retrieves the results of the producer - a kind of iterator: open, next, close. Push: the producer pushes results to the consumer. Operators register at other operators; when new tuples are generated, they are actively pushed to the registered operators.

17
Q

Merkle Tree

A

Each parent node is the hash of its children: hierarchical checking of data integrity. Comparison: start at the root; if the hashes are the same, stop (all children will be the same). Otherwise, for the nodes with different hashes, go down to the children; eventually the differing data will be found.
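
A minimal sketch of building and comparing Merkle trees in Python (md5 for illustration; assumes a power-of-two number of blocks):

    import hashlib

    def h(s):
        return hashlib.md5(s.encode()).hexdigest()

    def build(blocks):
        # Tree as a list of levels: leaves first, root last.
        level = [h(b) for b in blocks]
        tree = [level]
        while len(level) > 1:
            level = [h(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
            tree.append(level)
        return tree

    def diff(t1, t2, node=0, level=-1):
        # Walk down from the root; recurse only where hashes differ.
        if t1[level][node] == t2[level][node]:
            return []
        if level == -len(t1):                  # reached the leaves
            return [node]
        return (diff(t1, t2, 2 * node, level - 1) +
                diff(t1, t2, 2 * node + 1, level - 1))

    a = build(["a", "b", "c", "d"])
    b = build(["a", "X", "c", "d"])
    print(diff(a, b))   # [1] -> block 1 differs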

18
Q

Document Store

A

Store JSON or XML docs. Examples: MongoDB, CouchDB.

19
Q

Conflict prevention x detection

A

Prevention: - (distributed) lock algorithms - pessimistic approach - assume things will go wrong and prevent them from happening. Detection: - “timestamps”, keep multiple versions - optimistic approach - resolve conflicts when they actually happen.

20
Q

MR X DB: data size, access, updates, structure, integrity, scale:

A

DB: GB, interactive & batch, read/write many times, static schema, high integrity, non-linear scaling. MR: PB, batch, write once - read many times, dynamic schema, low integrity, linear scaling.

21
Q

Conflict resolution in Eventual Consistency

A

Write to one, read from all: W = 1 / R = N - guarantees to see the last version - write-optimized; strong consistency. Write to all, read from one: W = N / R = 1 - guarantees to see at least one recent version - read-optimized; strong consistency. In general, W + R > N gives strong consistency: a read will see at least one most recent write (e.g., N = 3, W = 2, R = 2).

22
Q

What is HDFS and what is it for?

A

HDFS is the Hadoop Distributed File System. Data in a Hadoop cluster is broken down into smaller pieces (blocks) and distributed over the cluster. This way, map and reduce functions can be executed on smaller subsets of the larger data sets, which provides the scalability needed for big data processing. Replication is pipelined through the data nodes.

23
Q

MR example: GREP - Given a file, return all lines that contain certain pattern.

A

MAP-ONLY TASK.

    map(str key, str value)
        if (value.contains(pattern))
            emit(value, "")

24
Q

Why is Partition tolerance a strict requirement?

A

Because to process large data sets we need (distributed) scale-out architectures, and these architectures tend to suffer partitions between the nodes because of the commodity hardware they run on.

24
Q

Overview of Amazon’s Dynamo

A

Key/value store; CRUD on a per-key basis. High availability - never reject a write. This leads to different versions of the data (partitions/concurrent writes). Vector clocks: reconcile conflicting reads/writes. Consistent hashing: data placement.

25
Q

Consistent Hashing

A
• Use of a hash function to place data on machines - Only local data movement when machines are added/removed - Load balancing: strong machines can get a larger share (multiple virtual servers) - Properties: Balance: with high probability, each bucket gets about the same share of items (10 machines, 100 inserts: ~10 items per machine). Monotonicity: if a new bucket (node) is added, an item might move from an old bucket to the new one, but never from an old bucket to another old one.
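
A minimal hash-ring sketch in Python (md5-based placement; virtual nodes omitted for brevity):

    import bisect
    import hashlib

    def ring_hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes):
            self.ring = sorted((ring_hash(n), n) for n in nodes)

        def node_for(self, key):
            # Walk clockwise to the first node at or after the key's position.
            hashes = [h for h, _ in self.ring]
            i = bisect.bisect(hashes, ring_hash(key)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))
    # Adding a node only remaps the keys between its predecessor and itself.
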
26
Q

What is the basic workflow in MR?

A

A MASTER NODE controls the computation. It receives the submitted job (task), computes the necessary map and reduce tasks, and selects and activates the WORKER NODES. The map workers are, if possible, selected close to the data. The reduce workers consume the intermediate results and create the final output.

26
Q

CAP theorem proof idea

A

Consider a system with multiple partitions. A failure prevents synchronization between node 1 and node 2. What now? Prohibiting reads until synced violates availability; letting clients read violates consistency.

28
Q

DBMS vs DSMS: data, access, queries, storage, order, update rate, time requirements, accuracy.

A

Data: persistent relations | volatile streams
Access: random | sequential
Queries: one-time | continuous
Storage: unlimited secondary storage | limited main memory
Order: only current state | order of input considered
Update rate: relatively low | extremely high
Time requirements: little or none | near real-time
Accuracy: exact data | outdated/inaccurate

29
Q

Column store

A

Logically look like RDB tables, physically organized in a per-column fashion. Good for analytical tasks over subsets of columns. Dynamic schema, sparse data. Examples: HBase, BigTable, Cassandra.

31
Q

MR example: naive n-gram count and the a-priori-based variant.

A

Task: find n-grams that occur at least Y times and consist of at most Z words. Naive n-gram count - one single job, less memory consumption, easy:

    map(did, content)
        for k in 1..Z
            for all k-grams g in content
                emit(g, did)

    reduce(n-gram g, list dids)
        if length(dids) >= Y
            emit(g, length(dids))

A-priori principle: a k-gram can only occur at least Y times if its constituent (k-1)-grams occur at least Y times. Iterative implementation: first the 1-grams that occur Y times, then the 2-grams that occur Y times, … Multiple MapReduce rounds.

33
Q

What is the difference between scale-out and scale-up distributed architectures?

A

Scale-out: have many small, commodity machines (hundreds, thousands). They are cheap but not reliable - failures will happen. Scale-up: replace a machine with a bigger, stronger (more powerful) machine.

34
Q

Pseudocode for WordCount MR with function types:

A

map: (k1, v1) -> list(k2, v2); reduce: (k2, list(v2)) -> list(k3, v3). For WordCount: k1 = doc identifier, v1 = doc content; k2 = term, v2 = count; k3 = term, v3 = final count.

36
Q

What is the key idea behind MapReduce?

A

Spread the task of processing data over multiple machines according to a map and a reduce function. The framework simplifies parallel programming because it deals with node failures, load balancing, etc. In the map phase, data is distributed to a number of machines and the output is partitioned (and sorted) by a key. In the reduce phase, the data of each key group is aggregated.

37
Q

Stream –> Relation?

A

Window specification. Three ways to construct: - Time-based: S [RANGE 30 seconds], S [NOW]; - Tuple-based: S [ROWS N], e.g., S [ROWS 1]; - Partitioned: S [PARTITION BY A1..Ak ROWS N] logically partitions S into substreams - a kind of GROUP BY.

38
Q

What is Storm and how does it work?

A

Storm is a fault-tolerant, distributed stream processing system. It uses custom-created “spouts” and “bolts” to define information sources and manipulations, allowing distributed processing of streaming data. - Spout: data source, e.g., a Twitter stream. - Bolt: operator that consumes the output of spouts or other bolts (e.g., filter stopwords). - Topology: the query plan; by connecting spouts and bolts, it determines the data flow. - Trident: a high-level abstraction on top of Storm.

39
Q

What is Pig? Compare it to RDBMS.

A

High-level tool for expressing data analysis programs. The compiler transforms a query into a sequence of MR jobs. Pig Latin vs SQL: - Pig Latin is a data-flow programming language: user-specified operations are put together to achieve a task. - SQL is declarative: the user specifies what the result should be (not how it is computed). Pig vs RDBMS: - RDBMS: tables with predefined schemas; supports transactions and indices; aims at fast response times. - Pig: schema at runtime (even optional); any source; no loading or indexing of data as pre-processing - data is loaded at execution time; aims at throughput, not at super-fast short queries.

41
Q

What is CQL?

A

It is the declarative query language to phrase continuous queries, SQL-like. It includes streams, windows, new semantics (three relation-to-stream operators: IStream, DStream and RStream) and sampling.

42
Q

Explain the Reduce-Side Join and give pseudocode for both map and reduce functions.

A

Map is responsible for emitting the join predicate value (the id) along with the corresponding record from each table, so that records having the same id in both tables end up at the same reducer. The reducer then joins the records having the same id. It is also necessary to TAG each record to indicate from which table it originated, so that joining happens between records of the two different tables.

    map(K table, V rec)
        id = rec.id
        tagged_rec.tag = table
        tagged_rec.rec = rec
        emit(id, tagged_rec)

    reduce(K id, list tagged_recs)
        for each tg_rec1 in tagged_recs
            if tg_rec1.tag == R
                for each tg_rec2 in tagged_recs
                    if tg_rec2.tag == S
                        emit(id, join(tg_rec1.rec, tg_rec2.rec))

44
Q

Give an overview of query execution in STREAM (the Stanford DSMS).

A

When a continuous query is registered, a query execution plan is generated. New plans can be merged with existing plans; users can also create and manipulate plans directly. Plans are composed of three main components: operators, queues (input and inter-operator) and state (windows, operators requiring history). A global scheduler drives plan execution.

45
Q

How can we process a STREAM query plan distributed?

A

To process it distributed, we just have to take care of the communication! Check the dependencies.

47
Q

What is a custom partitioner and when is it used?

A

When you have a composite key (k1, k2), you can write a custom partitioner that partitions by k1 only, plus a sort comparator for sorting by k2. This leads to a second problem: the reducer still consumes groups formed by the whole key within the correct partition. The solution is to define a custom grouping method that considers only k1 for grouping.
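
A small Python simulation of this secondary-sort pattern (partition by k1, sort by the composite key, group by k1); names and data are illustrative:

    from itertools import groupby

    # Simulated map output: composite keys (k1, k2) with values.
    # A custom partitioner would hash k1 only, so every (k1, *) pair
    # reaches the same reducer.
    pairs = [(("b", 2), "x"), (("a", 3), "y"), (("a", 1), "z")]

    # Within a partition, the sort comparator orders by (k1, k2) ...
    pairs.sort(key=lambda kv: kv[0])

    # ... while the grouping comparator groups by k1 only, so one
    # reduce call sees all its k2 values in sorted order.
    for k1, group in groupby(pairs, key=lambda kv: kv[0][0]):
        print(k1, [(k2, v) for (_, k2), v in group])
    # a [(1, 'z'), (3, 'y')]
    # b [(2, 'x')]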

48
Q

MR example: co-occurrences - Given a text file, how many times do a,b occur together? SOLUTION 1

A

Expected result: ([a,b], count). M = N x N (N = vocabulary size); Mij = number of times i and j co-occur in some context.

Solution 1 (pairs):

    map(str key, str file)
        for each word w1 in file
            for each word w2 in Neighbors(w1)
                emit([w1, w2], 1)

    reduce(str key, list values)
        int result = 0
        for each value v in values
            result += v
        emit(key, result)

+ easy to implement + easy to understand - lots of pairs to sort and shuffle - no use of combiners

49
Q

What are the default shuffle and sort behaviors?

A

The output of map is partitioned by key, and each reducer is guaranteed to get an entire partition. The output of the reducer is also sorted by key. The chosen key therefore affects both the partitioning and the sort order.

50
Q

Eventual Consistency Synchronization Process:

A

Given N nodes (replicas), each of them might or might not have the most recent value of an object. Communication between the nodes has to ensure a consistent view of the data (replicas). Two solutions: Naive: - broadcast; robust but inefficient - too many messages. - Epidemic algorithms (gossip): Anti-entropy: information is constantly exchanged with a randomly selected node; always exchange the currently stored versions of the items, and do that continuously. Rumor spreading: information is exchanged with randomly chosen nodes for multiple rounds, then stop; with high probability, the data is consistently replicated afterwards. Variants: push, pull and push/pull.

51
Q

How does the K-min estimator work?

A

Supposing we have a good hash function, the hash values will be evenly distributed across the hash space (say [0..1]). You could estimate the number of distinct values seen by knowing the average spacing between values in the hash space: for 10 distinct values, the average spacing is 1/10. You could do this cheaply by keeping track of only the smallest value; however, tracking only one value opens you up to a ton of variance, and you become dependent on how “good” your hash function is. To improve it, we keep track of the K smallest values. Estimate = (K - 1) / Kmax, e.g., (3 - 1) / 0.3 ≈ 6.7. Unions are “lossless”: merely take the two sketches, combine their values and keep the K smallest ones.
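
A minimal KMV sketch in Python (md5 mapped into [0, 1); k = 3 matches the example above, and a larger k gives a tighter estimate):

    import hashlib

    def to_unit(x):
        # Hash into [0, 1).
        return int(hashlib.md5(str(x).encode()).hexdigest(), 16) / 16**32

    def kmv_estimate(stream, k=3):
        smallest = sorted({to_unit(x) for x in stream})[:k]
        if len(smallest) < k:
            return len(smallest)        # fewer than k distinct values: exact
        return (k - 1) / smallest[-1]   # (k - 1) / k-th smallest hash

    print(round(kmv_estimate(range(1000), k=64)))   # rough estimate of 1000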

52
Q

What is the combiner and what is it good for?

A

The combiner function is used as an optimization for the MR job. It runs on the output of the map phase and is used as a filtering or aggregating step (the aggregation function must be associative and commutative) to lessen the number of intermediate keys being passed to the reducer. The combiner is not a replacement for the reducer because it sees only local information. Example: tasks that check whether the number of observations of a certain item (term) is beyond a threshold (n-grams).

53
Q

What are the differences between traditional data management and data stream management?

A

Traditionally, data is periodically loaded into a store for deeper analytics. At query time, data is accessed as a whole, and queries are mainly ad hoc. In a data stream, data is continuously moving, i.e., continuously being generated and assumed to be infinite. Queries are “standing”: registered once, observed “forever”. Answers in near real-time are often required. Probabilistic methods are used for efficiency or because we consider only part of the stream.

55
Q

Cite the three possible scenarios for data locality in a Hadoop cluster:

A

1) Data-local: map task and HDFS block on the same node. 2) Rack-local: map task and HDFS block in the same rack. 3) Off-rack: map task in one rack and HDFS block in another.

56
Q

Key-Value Stores

A

Store key/value pairs; the value can be a complex datatype. Examples: Dynamo, Redis, Voldemort. CRUD; range queries.

58
Q

What is BASE? What is the idea?

A

BASE means Basically Available, Soft State, Eventual Consistency. The idea is to sacrifice strong consistency to gain faster response times in a more scalable manner. - High availability for first-tier services; - background clean-up mechanisms; - resolve problems optimistically when an action violates consistency.

59
Q

Examples of data stream scenarios:

A

Distributed sensor networks, mobile ad-hoc networks, social sensors, stock market.

61
Q

Which are the 10 steps to execute an MR job in Hadoop?

A

1 - The MR program starts the job. 2 - The job gets a new JobID from the JobTracker. 3 - The job copies the job resources to HDFS. 4 - The job is submitted to the JobTracker. 5 - The JobTracker initializes the job. 6 - The JobTracker retrieves the input splits from HDFS. 7 - The TaskTrackers send heartbeats. 8 - The TaskTracker retrieves the job resources. 9 - The TaskTracker launches a child JVM. 10 - The child runs the map or reduce task.

62
Q

Rumor spreading: push x pull x push/pull

A
• Push: the holder of new info actively distributes it. PREDICTABLE. Good when few nodes are informed (exponential growth); slow to inform everyone - with high probability some uninformed node won't get called. O(log n) rounds, O(n log n) messages. - Pull: nodes actively call others to obtain news. FAST CONVERGENCE. When few nodes are informed, the startup is unpredictable: an informed node might not get called. If a fraction p is still uninformed in this round, then p² will remain uninformed in the next. O(log log n) rounds, O(n) messages. - Push/Pull: predictability AND fast convergence. A calls B to push a rumor and concurrently tries to pull it from B. O(log n) rounds, O(n) messages // not sure…
63
Q

How to compute the PageRank in MR?

A

See the lecture notes.
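
The card defers to the notes; as a hedged sketch, one PageRank iteration can be phrased in MR style as follows, assuming each input record carries a node, its current rank and its adjacency list (toy 3-node graph, damping factor 0.85):

    from collections import defaultdict

    graph = {"a": (1/3, ["b", "c"]), "b": (1/3, ["c"]), "c": (1/3, ["a"])}
    D = 0.85   # damping factor

    def map_fn(node, rank, links):
        yield node, ("links", links)                # pass the structure through
        for t in links:
            yield t, ("mass", rank / len(links))    # rank contribution to t

    def reduce_fn(node, values):
        links, mass = [], 0.0
        for tag, v in values:
            if tag == "links":
                links = v
            else:
                mass += v
        return node, ((1 - D) / len(graph) + D * mass, links)

    shuffled = defaultdict(list)                    # simulate the shuffle
    for n, (r, ls) in graph.items():
        for k, v in map_fn(n, r, ls):
            shuffled[k].append(v)
    graph = dict(reduce_fn(n, vs) for n, vs in shuffled.items())
    print(graph)   # one iteration; repeat until the ranks converge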

64
Q

MR example: Breadth-First Search

A

Performing computation on a graph data structure requires processing at each node. Each node contains node-specific data as well as links (edges) to other nodes. Computation must traverse the graph and perform the computation step. How do we traverse a graph in MR? BFS is an iterated algorithm over graphs: the frontier advances from the origin by one level with each pass. In MR: iterated passes through MR - map some nodes; the result includes additional nodes which are fed into successive MR passes. How do we represent a graph for this? Sending the entire graph to (thousands of) map tasks involves an enormous amount of memory, so we need to carefully consider the representation: - direct references: objects with references from each node to its neighbors; not easily serializable. - adjacency matrix: Mij = 1 implies a link from node i to j; problem: full of zeros. - sparse matrix: only include the non-zero elements.

65
Q

Vector Clocks

A

Assign each of your processes an ID, then make sure you include that ID and the last vector clock you saw for a given value when storing a modification or sending a message. It is an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations.
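
A minimal vector clock sketch in Python (dicts keyed by process ID; IDs are illustrative):

    def increment(clock, pid):
        # Local event at process pid.
        c = dict(clock)
        c[pid] = c.get(pid, 0) + 1
        return c

    def merge(a, b):
        # On message receipt: element-wise maximum of both clocks.
        return {p: max(a.get(p, 0), b.get(p, 0)) for p in a.keys() | b.keys()}

    def happened_before(a, b):
        # a -> b iff a <= b element-wise and a != b.
        return a != b and all(a.get(p, 0) <= b.get(p, 0)
                              for p in a.keys() | b.keys())

    v1 = increment({}, "p1")              # {'p1': 1}
    v2 = increment(merge(v1, {}), "p2")   # {'p1': 1, 'p2': 1}
    print(happened_before(v1, v2))        # True
    # If neither clock happened-before the other, the updates are
    # concurrent - a conflict that must be reconciled.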

66
Q

What is the Count-Min Sketch and how does it work?

A

The Count-Min sketch is a probabilistic counting algorithm. We keep a two-dimensional array (h rows, r columns) and h hash functions that map to the range 0..(r - 1). For every element, we compute all h hash values and increment the corresponding bucket in each row by 1. To know how often we saw an item, we compute its hashes, retrieve the value of each corresponding bucket, and take the minimum of those values. We take the minimum because more than one item can be mapped to the same bucket - collisions only inflate counts - so each bucket is an over-estimate and the smallest one is the tightest.
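
A minimal Count-Min sketch in Python (h salted md5 hash functions; h and r are illustrative):

    import hashlib

    class CountMin:
        def __init__(self, h=4, r=256):
            self.h, self.r = h, r
            self.table = [[0] * r for _ in range(h)]

        def _buckets(self, x):
            # Derive one bucket index per row from salted digests.
            for i in range(self.h):
                d = hashlib.md5(f"{i}:{x}".encode()).hexdigest()
                yield i, int(d, 16) % self.r

        def add(self, x):
            for i, b in self._buckets(x):
                self.table[i][b] += 1

        def count(self, x):
            # Minimum over the rows: collisions only inflate, never deflate.
            return min(self.table[i][b] for i, b in self._buckets(x))

    cm = CountMin()
    for w in ["a", "b", "a", "a"]:
        cm.add(w)
    print(cm.count("a"), cm.count("b"))   # 3 1 (possibly over-estimates)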