Algorithms and computations for big data Flashcards
The four parallel paradigms
- Multithreading
- Message passing interface (MPI)
- Map-Reduce
- Spark
Learning outcomes
- Knowledge and understanding
- discuss important technological aspects when designing and implementing analysis solutions for large-scale data,
- describe data models and software standards for sharing data on the web.
- Skills and abilities
- use Python to implement applications for transforming and analyzing large-scale data with appropriate software frameworks,
- provide access to and utilize structured data over the web with appropriate data models and software tools
- Judgement and approach
- suggest appropriate computational infrastructures for analysis tasks and discuss their advantages and drawbacks,
- discuss advantages and drawbacks of different strategies for dissemination of data,
- discuss large-scale data processing from an ethical point of view.
Levels of parallelism
- Multi-core CPUs
- Several CPUs per system
- Clusters of multiple systems
Speedup
Given two variants of a program solving the same problem, a baseline with running time t and an optimized implementation (faster algorithm or parallel version) with running time t', the speedup is
S = t/t'
For example, if t = 120 s and t' = 30 s, then S = 4.
Amdahl’s law
- f: the proportion of the code that is parallelizable; s: the speedup of the parallel part

S(f, s) = 1/((1 - f) + f/s)

As s goes to infinity, S approaches 1/(1 - f).
- No, only some programs benefit from parallelization, and their maximal acceleration is bounded by 1/(1 - f).
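A minimal Python sketch of the formula (the function name amdahl is just illustrative):

```python
def amdahl(f: float, s: float) -> float:
    """Overall speedup S(f, s) = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - f) + f / s)

# With 90% parallelizable code, even unlimited cores cannot beat 10x:
print(amdahl(0.9, 4))    # ~3.08 on 4 cores
print(amdahl(0.9, 1e9))  # ~10.0, the asymptotic bound 1/(1-f)
```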
Are multicore CPUs a technical necessity?
Yes. Cooling is a bottleneck when increasing the clock frequency of a CPU.
Flynn’s taxonomy
- SIMD (single instruction, multiple data): GPUs
- MIMD (multiple instruction, multiple data): multi-core processors

Memory hierarchy

Cache memory
Small, high-speed memory attached to a processor core
Symmetric Multiprocessor (SMP)
- Multiple CPUs (typically 2-8; each can have multiple cores) share the same main memory
- One address space
High performance computing (HPC)

A classical HPC compute cluster is an appropriate computer architecture for Monte Carlo simulations like the parallel Pi example. Assume that the parallelization across nodes is not a problem.
Yes, HPC is a good computer architecture for Monte Carlo simulations.
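A hedged sketch of the parallel Pi idea with Python's multiprocessing (sample sizes are made up; on a real HPC cluster the tasks would be distributed across nodes, e.g. via MPI, rather than local processes):

```python
import random
from multiprocessing import Pool

def count_hits(n: int) -> int:
    """Count random points in the unit square that land inside the quarter circle."""
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    n_per_task, tasks = 250_000, 8
    with Pool() as pool:  # workers sample independently: embarrassingly parallel
        hits = sum(pool.map(count_hits, [n_per_task] * tasks))
    print(4 * hits / (n_per_task * tasks))  # approaches pi
```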
Difference between HPC and commodity hardware

Distributed compute cluster (commodity hardware)

Workload comparison between HPC and data science

Data-intensive Compute Cluster

Latency vs computation
Computation is cheap; data movement is very expensive
Multithreading
- Threads communicate via variables in shared memory
- Simultaneous read access to data
- Write access to the same data requires a lock
In multi-threaded programming the time needed to communicate between two threads is typically on the order of
200 ns
In multi-threaded programming all threads can simultaneously…
…read and write, but not the same data
Threads writing to memory incorrectly

Threads writing to memory correctly

Locking
- Protects from errors due to parallel writes
- Lock
- is acquired before reading/writing data
- is released when done
- ensures serial access to shared memory (see the sketch below)
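A minimal sketch with Python's threading.Lock (the counter and thread counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:  # acquired before the write, released when done
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000; without the lock, parallel writes can lose updates
```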
Deadlocks
- Execution stops because two or more threads wait for each other (sketched below)
- Two threads need to write variables a & b
- Thread 1 locks a and waits for b
- Thread 2 locks b and waits for a
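A sketch of this pattern and the usual fix, acquiring locks in one consistent global order (names are illustrative):

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()

# Deadlock-prone: thread 1 holds lock_a and waits for lock_b,
# while thread 2 holds lock_b and waits for lock_a.
def thread1():
    with lock_a:
        with lock_b:  # may wait forever if thread 2 already holds lock_b
            pass

def thread2_deadlock_prone():
    with lock_b:
        with lock_a:  # may wait forever if thread 1 already holds lock_a
            pass

# Fix: every thread acquires the locks in the same order (a before b).
def thread2_fixed():
    with lock_a:
        with lock_b:
            pass
```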
Questions to ask when parallelizing
- Which sections can be parallelized?
- What needs to be serial?
- When is communication necessary between threads?
- How much data needs to be communicated?
Load balancing
Distributing the workload equally amongst the threads
Message passing interface (MPI)
- The message passing interface (MPI) is a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory.
- Used for high performance computing
- Usually on supercomputers
- Substantial latencies for thousands of cores
- Lower throughput for sharing large amounts of data
- Communication incl. exchange of data via high-speed network (5000 ns + 2× RAM access)
Passing a message over InfiniBand takes about 5000 ns. This is how many times slower than a memory access?
11-50 times slower
Word count and character count Map-Reduce

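A hedged word-count sketch with the mrjob library (character count is analogous: emit one pair per character instead of per word):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # After the shuffle, sum all the 1s collected for this word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()  # run as: python wordcount.py input.txt
```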
Combiners
- Semi-reduce step that runs locally on each node
- Word count
- Each combiner adds up n/k values
- Each reducer gets k values to add up
- n = number of inputs
- k = number of nodes
- Use combiners to utilize more cores (see the sketch below)
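The same word count with a combiner, as a hedged mrjob sketch; the combiner pre-sums each node's local (word, 1) pairs so each reducer receives roughly k partial sums instead of n raw values:

```python
from mrjob.job import MRJob

class MRWordCountCombiner(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)  # local partial sum on each node

    def reducer(self, word, partial_sums):
        yield word, sum(partial_sums)  # combine the k partial sums

if __name__ == "__main__":
    MRWordCountCombiner.run()
```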
Map-reduce
- Two parallel phases
- Map: map each input to a (key, value) pair
- Reduce: for each key, collect its respective values and aggregate them
- Shuffle phase after the map phase and before the reduce phase
- Mainly used for one-pass jobs (each data sample seen only once)
- Theoretical speedup for the map phase is the number of inputs
- Theoretical speedup for the reduce phase is the number of keys
- Practical speedup is the number of nodes
Multi step in MRjob
- e.g. using a second step with another reducer to find the most frequent item (see the sketch below)
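A hedged two-step sketch using mrjob's MRStep (class and method names are illustrative):

```python
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMostFrequentWord(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_count, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_max),
        ]

    def mapper_count(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_sum(self, word, counts):
        # Emit everything under one key so a single reducer sees all totals.
        yield None, (sum(counts), word)

    def reducer_max(self, _, count_word_pairs):
        count, word = max(count_word_pairs)
        yield word, count

if __name__ == "__main__":
    MRMostFrequentWord.run()
```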
Why can't two reducers communicate?
- They run on different machines
- They are not available at the same time
Spark
- General-purpose cluster-computing framework
- Compute clusters have each node set to perform the same task, controlled and scheduled by software
- Uses resilient distributed datasets (RDDs)
- Fault tolerant: an RDD can always be re-constructed if a node fails
- Obtained from the driver program (Python script), read from HDFS or other sources
- Good for iterative jobs: many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter
In case of a disk or hardware failure on one node, when using frameworks like Hadoop (MapReduce) or Spark with data on an HDFS file system, computations…
…only the results for the failing node have to be recomputed


Frequent keys Spark program

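A hedged PySpark sketch of one way such a program could look (the file path and the threshold of 100 are made-up placeholders):

```python
from pyspark import SparkContext

sc = SparkContext(appName="FrequentKeys")

counts = (sc.textFile("hdfs:///data/input.txt")    # RDD of lines
            .flatMap(lambda line: line.split())    # transformation (lazy)
            .map(lambda word: (word, 1))           # transformation (lazy)
            .reduceByKey(lambda a, b: a + b))      # transformation (lazy)

frequent = counts.filter(lambda kv: kv[1] >= 100)  # keep keys seen >= 100 times
print(frequent.take(10))                           # action: triggers the job
```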
Hadoop distributed file system (HDFS)
- Software framework for distributed storage and processing of big data using the MapReduce programming model
- The namenode is aware of the distribution of chunks and distributes map-reduce jobs accordingly
- Computations are performed where the data is stored
- Failure of a node or even a rack can be compensated without invalidating previous computations

Trie (prefix tree)

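A minimal trie sketch in Python using nested dicts (the "_end" marker is one implementation choice):

```python
def trie_insert(root: dict, word: str) -> None:
    node = root
    for ch in word:
        node = node.setdefault(ch, {})  # walk/create one child per character
    node["_end"] = True                 # mark a complete word

def trie_contains(root: dict, word: str) -> bool:
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "_end" in node

root = {}
for w in ["map", "mapper", "reduce"]:
    trie_insert(root, w)
print(trie_contains(root, "map"), trie_contains(root, "red"))  # True False
```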
Bloom filter
- A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The trade-off paid for this efficiency is that a Bloom filter is probabilistic.
- Operations:
- Insert item x into Bloom filter B
- Query: is x present in B?
- If x is present in B, the query will always be answered correctly
- With probability p, the query might be answered positively even if x is not present in B
Describe how a Bloom filter works

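A minimal sketch, assuming b bits and k hash functions; the independent hash functions are simulated here by salting Python's built-in hash, which a real implementation would replace with proper hash functions:

```python
class BloomFilter:
    def __init__(self, b: int, k: int):
        self.bits = [0] * b  # b-bit array, all zeros initially
        self.b, self.k = b, k

    def _positions(self, x):
        # k bit positions for item x, one per (salted) hash function
        return [hash((i, x)) % self.b for i in range(self.k)]

    def insert(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def query(self, x) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(x))

bf = BloomFilter(b=1024, k=3)
bf.insert("spark")
print(bf.query("spark"), bf.query("hadoop"))  # True, (almost surely) False
```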
Trie and parallelization

Bloom filter: Cache Behaviour
For sufficiently large b, accessing each bit will likely cause a cache miss. Is the expected number of cache misses for insert and query operations the same when the item is not in the Bloom filter?
- A cache miss is a state where the data requested for processing by a component or application is not found in the cache memory.
- Insert
- worst case: n cache misses per insert (one per hash function)
- Query
- item present / false positive: like insert
- item not present: on average much lower than the worst case (can stop at the first 0 bit)
Describe how the error in a Bloom filter arises
- False positives (the positive class is "the element is in the Bloom filter")
- A false positive occurs when other elements have already set all the bits that x maps to

Parallelization of a Bloom filter

Data analysis on large texts
- Text T:
- Natural language
- Biological sequences, any other sequence of discrete observations
- Error or event logs (error codes/event types = alphabet)
- Questions:
- Is s a substring of T?
- How often does s appear in T?
- Suffix tree
- Use tree traversal to populate leaf counts for all internal nodes (once!)
Spark optimization strategy
- Transformations are lazy
- Only actions trigger computations
- Spark maintains directed acyclic graphs to represent the workflow
- The scheduler assigns computational tasks to workers, optimizing compute-data co-location
Spark Architecture

Count-min sketch
An efficient algorithm for counting a stream of data. It uses hash functions to map events to frequencies but, unlike a hash table, uses less space, at the expense of over-counting some events due to collisions.

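A minimal sketch with d rows of w counters, again simulating independent hash functions by salting Python's hash:

```python
class CountMinSketch:
    def __init__(self, w: int, d: int):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]  # d rows of w counters

    def add(self, x):
        for row in range(self.d):
            self.table[row][hash((row, x)) % self.w] += 1

    def estimate(self, x) -> int:
        # Minimum over rows: over-counts on collisions, never under-counts.
        return min(self.table[row][hash((row, x)) % self.w]
                   for row in range(self.d))

cms = CountMinSketch(w=1000, d=5)
for event in ["a", "b", "a", "a"]:
    cms.add(event)
print(cms.estimate("a"))  # at least 3, usually exactly 3
```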
Resilient distributed dataset (RDD)
- Fault tolerant
- An RDD can always be reconstructed if a node fails
- Two operations on RDDs (see the sketch below)
- transformations: create a new RDD from input
- actions: produce output
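A short PySpark sketch distinguishing the two operation types:

```python
from pyspark import SparkContext

sc = SparkContext(appName="RDDDemo")

rdd = sc.parallelize(range(10))               # source RDD
squares = rdd.map(lambda x: x * x)            # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still lazy

print(evens.collect())  # action: the whole pipeline executes now
print(evens.count())    # action: recomputed unless the RDD is cached
```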
Multiprocessing queue
- Once a worker is done with its work, it can fetch more work from the queue
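A hedged sketch with multiprocessing.Queue, using a None sentinel to signal the end of the work (task sizes are illustrative):

```python
from multiprocessing import Process, Queue

def worker(tasks: Queue, results: Queue) -> None:
    while True:
        item = tasks.get()        # idle workers pull the next task
        if item is None:          # sentinel: no more work
            break
        results.put(item * item)  # stand-in for real work

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    procs = [Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for p in procs:
        p.start()
    for i in range(20):
        tasks.put(i)
    for _ in procs:
        tasks.put(None)  # one sentinel per worker
    out = sorted(results.get() for _ in range(20))  # drain before joining
    for p in procs:
        p.join()
    print(out)
```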