Algorithms and computations for big data Flashcards

1
Q

The four parallel paradigms

A
  1. Multithreading
  2. Message passing interface (MPI)
  3. Map-Reduce
  4. Spark
2
Q

Learning outcomes

A
  • Knowledge and understanding
    • discuss important technological aspects when designing and implementing analysis solutions for large-scale data,
    • describe data models and software standards for sharing data on the web.
  • Skills and abilities
    • use Python to implement applications for transforming and analyzing large-scale data with appropriate software frameworks,
    • provide access to and utilize structured data over the web with appropriate data models and software tools.
  • Judgement and approach
    • suggest appropriate computational infrastructures for analysis tasks and discuss their advantages and drawbacks,
    • discuss advantages and drawbacks of different strategies for dissemination of data,
    • discuss large-scale data processing from an ethical point of view.
3
Q

Levels of parallelism

A
  • Multi-core CPUs
  • Several CPUs per system
  • Clusters of multiple systems
4
Q

Speedup

A

Given two variants of a program solving the same problem, a baseline and an optimized version (faster algorithm or parallel implementation), with running times t (baseline) and t’ (optimized).

S = t/t’

5
Q

Amdahl’s law

A
  • f = proportion of the code that is parallelizable; the parallel part is accelerated by a factor s

S(f,s) = 1/((1-f)+f/s)

As s goes to infinity, S(f,s) approaches 1/(1-f)

  • Only some programs benefit from parallelization, and their maximal acceleration is bounded.
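The bound can be checked numerically; a minimal sketch (function name is illustrative):

```python
def amdahl_speedup(f, s):
    """Speedup when a fraction f of the program is parallelizable
    and that fraction is accelerated by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# 90% parallelizable code on 10 cores:
print(round(amdahl_speedup(0.9, 10), 2))   # 5.26
# Even with infinitely many cores the speedup is bounded by 1/(1-f):
print(round(1 / (1 - 0.9), 1))             # 10.0
```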
6
Q

Are multicore CPUs a technical necessity?

A

Yes. Cooling becomes a bottleneck when increasing the clock frequency of a CPU.

7
Q

Flynn’s taxonomy

A
  • SIMD: GPU
  • MIMD: Multi-core processors
8
Q

Memory hierarchy

A
9
Q

cache memory

A

Small high-speed memory attached to a processor core

10
Q

Symmetric Multiprocessor (SMP)

A
  • Multiple CPUs (typically 2-8; each can have multiple cores) share the same main memory
  • One address space
11
Q

High performance computing (HPC)

A
12
Q

A classical HPC compute cluster is an appropriate computer architecture for Monte Carlo simulations like the parallel Pi example. Assume that parallelization across nodes is not a problem.

A

Yes, an HPC cluster is a good computer architecture for Monte Carlo simulations.
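Monte Carlo Pi estimation is embarrassingly parallel: each worker can run the same sampling loop independently. A minimal sketch (names and seeds are illustrative):

```python
import random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi: fraction of random points in the
    unit square falling inside the quarter circle, times 4."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n

# Each node/core could run estimate_pi independently with its own seed;
# the final estimate is the average of the partial results.
estimates = [estimate_pi(100_000, seed=s) for s in range(4)]
print(sum(estimates) / len(estimates))
```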

13
Q

Difference between HPC and commodity hardware

A
14
Q

Distributed compute cluster (commodity hardware)

A
15
Q

Workload comparison between HPC and data science

A
16
Q

Data-intensive Compute Cluster

A
17
Q

Latency vs computation

A

Computation is cheap; data movement is very expensive

18
Q

Multithreading

A
  • Threads communicate via variables in shared memory
  • Simultaneous read access to data
  • Write access to the same data requires a lock
19
Q

In multi-threaded programming the time needed to communicate between two threads is typically on the order of

A

200ns

20
Q

In multithreaded programming all threads can simultaneously…

A

…read and write, but not the same data

21
Q

Threads writing to memory incorrectly

22
Q

Threads writing to memory correctly

23
Q

Locking

A
  • Protects from errors due to parallel writes
  • Lock
    • is acquired before reading/writing data
    • is released when done
    • ensures serial access to shared memory
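Python's `threading.Lock` follows this acquire/release pattern; a minimal sketch of serializing writes to a shared counter:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:        # acquired before the write, released when done
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- serial access keeps the shared variable consistent
```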
24
Q

Deadlocks

A
  • Execution stops because two or more threads wait for each other
  • Two threads need to write variables a & b
    • Thread 1 locks a and waits for b
    • Thread 2 locks b and waits for a
25
Questions to ask when parallelizing
* Which sections can be parallelized?
* What needs to be serial?
* When is communication necessary between the threads?
* How much data needs to be communicated?
26
*Load balancing*
Distributing the workload **equally** amongst the threads
27
Message passing interface (MPI)
* The message passing interface (MPI) is a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory.
* Used for high-performance computing, usually on supercomputers
* Substantial latencies for thousands of cores
* Lower throughput for sharing large amounts of data
* Communication incl. exchange of data via a high-speed network (5000 ns + 2x RAM access)
28
Passing a message over InfiniBand takes about 5000 ns. How many times slower than a memory access is this?
11-50 times slower
29
Word count and character count with Map-Reduce
30
*Combiners*
* Semi-reduce step (word count example)
* Each combiner adds up n/k values
* Each reducer gets k values to add up
* *n* = number of inputs
* *k* = number of nodes
* Use combiners to utilize more cores
31
*Map-reduce*
* Two parallel phases
  1. Map: map each input to a key and a value
  2. Reduce: for each key, collect its respective values and aggregate
* Shuffle phase after the *map* phase and before the *reduce* phase
* Mainly used for **one-pass jobs** (each data sample is seen only once)
* Theoretical speedup for mappers is the number of inputs
* Theoretical speedup for reducers is the number of keys
* Practical speedup is the number of nodes
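The map, shuffle, and reduce phases can be sketched in plain Python with word count as the example (function names are illustrative; a real framework runs each phase in parallel):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (key, value) pair for each word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values of each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big compute", "big data"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 3, 'data': 2, 'compute': 1}
```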
32
Multi step in MRjob
* e.g. using another reducer to find the most frequent item
33
Why can’t two reducers communicate?
* They run on different machines
* They are not available at the same time
34
Spark
* General-purpose cluster-computing framework
* Each node of the cluster performs the same task, controlled and scheduled by software
* Uses resilient distributed datasets (RDDs)
* Fault tolerant: an RDD can always be re-constructed if a node fails
* Obtained from the driver program (Python script), read from HDFS or other sources
* Good for iterative jobs: many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter
35
In case of a disk or hardware failure on one node, when using frameworks like Hadoop (MapReduce) or Spark with data on an HDFS file system, computations…
…only the results for the failing node have to be recomputed
36
37
Frequent keys Spark program
38
Hadoop distributed file system (HDFS)
* Software framework for distributed storage and processing of big data using the MapReduce programming model
* The namenode is aware of the distribution of chunks and distributes map-reduce jobs accordingly
* Computations are performed where the data is stored
* Failure of a node or even a rack can be compensated without invalidating previous computations
39
Trie (prefix tree)
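A trie stores strings character by character along tree edges, so lookups cost O(length of the query). A minimal sketch (class and method names are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # True if a stored word ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

t = Trie()
t.insert("spark")
t.insert("spam")
print(t.contains("spark"))  # True
print(t.contains("spa"))    # False -- a prefix only, not a stored word
```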
40
Bloom filter
* A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The trade-off paid for this efficiency is that a Bloom filter is probabilistic.
* Operations:
  * Insert item x into Bloom filter B
  * Query: is x present in B?
* If x is present in B, the query will always be answered correctly
* With probability p, the query might be answered positively even if x is not present in B
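The two operations can be sketched with a plain bit list and hashes derived from `hashlib` (parameter choices are illustrative):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def query(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.insert("spark")
print(bf.query("spark"))   # True -- inserted items are always found
print(bf.query("hadoop"))  # almost certainly False at this fill level
```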
41
Describe how a Bloom filter works
42
Trie and parallelization
43
Bloom filter: cache behaviour. For sufficiently large b, accessing each bit will likely cause a cache miss. The expected number of cache misses for insert and query operations, when the item is not in the Bloom filter, is the same.
* A cache miss is a state where the data requested for processing by a component or application is not found in the cache memory.
* Insert
  * worst case: *n* cache misses per insert
* Query
  * item present / false positive: like insert
  * item **not** present: on average much lower than the worst case (can stop at the first unset bit)
44
Describe the error in a Bloom filter
* False positives (the positive class: the element is reported present in the Bloom filter)
* A false positive occurs if other elements have already set the specific bits
45
Parallelization of the *Bloom filter*
46
Data analysis on large texts
* Text T:
  * Natural language
  * Biological sequences, or any other sequence of discrete observations
  * Error or event logs (error codes/event types = alphabet)
* Questions:
  * Is s a substring of T?
  * How often does s appear in T?
* Suffix tree
  * Use tree traversal to populate leaf counts for all internal nodes (once!)
47
Spark optimization strategy
* Transformations are **lazy**
* Only actions trigger computations
* Spark maintains directed acyclic graphs to represent the workflow
* The scheduler assigns computational tasks to workers, optimizing compute-data co-location
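Lazy transformations can be mimicked in plain Python with generators: nothing is computed until a terminal step consumes the pipeline, analogous to an RDD action (this is an analogy, not the Spark API):

```python
data = range(1, 6)

# "Transformations": lazy pipelines, analogous to RDD map/filter.
mapped = (x * x for x in data)              # nothing computed yet
filtered = (x for x in mapped if x % 2 == 1)  # still nothing computed

# "Action": collecting the results forces evaluation of the whole chain.
result = list(filtered)
print(result)  # [1, 9, 25]
```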
48
Spark Architecture
49
Count-min sketch
An efficient algorithm for counting a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table it uses less space, at the expense of over-counting some events due to collisions.
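A minimal sketch of the structure: a depth × width table of counters, one hash function per row, and a minimum over the rows at query time (parameter choices and names are illustrative):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One salted hash per row maps the item to a counter column.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += 1

    def count(self, item):
        # Minimum over rows: never underestimates, may overestimate
        # when collisions inflate every row's counter.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch(width=100, depth=4)
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.count("a"))  # at least 3; exact unless collisions occurred
```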
50
*Resilient distributed dataset* (RDD)
* Fault tolerant
  * an RDD can always be reconstructed if a node fails
* Two operations on RDDs
  * transformations: create a new RDD from input
  * actions: produce output
51
Multiprocessing queue
* Once a thread is done with its work, it can fetch more work from the queue
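This pull-based pattern can be sketched with Python's thread-safe `queue.Queue`: each worker loops, fetching items until it sees a sentinel (the sentinel convention is an illustrative choice, not part of the queue API):

```python
import queue
import threading

work = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        item = work.get()
        if item is None:      # sentinel: no more work for this thread
            break
        with results_lock:
            results.append(item * item)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for item in range(10):        # enqueue the workload
    work.put(item)
for _ in threads:             # one sentinel per thread to shut them down
    work.put(None)
for t in threads:
    t.join()
print(sorted(results))        # squares of 0..9
```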