Algorithms and computations for big data Flashcards
The four parallel paradigms
- Multithreading
- Message passing interface (MPI)
- Map-Reduce
- Spark
Learning outcomes
- Knowledge and understanding
- discuss important technological aspects when designing and implementing analysis solutions for large-scale data,
- describe data models and software standards for sharing data on the web.
- Skills and abilities
- use Python to implement applications for transforming and analyzing large-scale data with appropriate software frameworks,
- provide access to and utilize structured data over the web with appropriate data models and software tools
- Judgement and approach
- suggest appropriate computational infrastructures for analysis tasks and discuss their advantages and drawbacks,
- discuss advantages and drawbacks of different strategies for dissemination of data,
- discuss large-scale data processing from an ethical point of view.
Levels of parallelism
- Multi-core CPUs
- Several CPUs per system
- Clusters of multiple systems
Speedup
Given two variants of a program solving the same problem, a baseline with running time t and an optimized implementation (faster algorithm or parallel version) with running time t', the speedup is
S = t/t'
For example, if t = 120 s and t' = 30 s, then S = 4.
Amdahl’s law
- f: the proportion of the code that is parallelizable; s: the speedup of the parallel part

S(f, s) = 1/((1 - f) + f/s)

As s goes to infinity, S approaches 1/(1 - f).
- No, only some programs benefit from parallelization, and their maximal acceleration is bounded by 1/(1 - f).
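A minimal Python sketch of the formula (the function name amdahl is just illustrative):

```python
def amdahl(f: float, s: float) -> float:
    """Overall speedup S(f, s) = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - f) + f / s)

# With 90% parallelizable code, even unlimited cores cannot beat 10x:
print(amdahl(0.9, 4))    # ~3.08 on 4 cores
print(amdahl(0.9, 1e9))  # ~10.0, the asymptotic bound 1/(1-f)
```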
Are multicore CPUs a technical necessity?
Yes. Cooling is a bottleneck when increasing the clock frequency of a CPU.
Flynn’s taxonomy
- SIMD (single instruction, multiple data): GPUs
- MIMD (multiple instruction, multiple data): multi-core processors

Memory hierarchy

Cache memory
Small, high-speed memory attached to a processor core
Symmetric Multiprocessor (SMP)
- Multiple CPUs (typically 2-8; each can have multiple cores) share the same main memory
- One address space
High performance computing (HPC)

A classical HPC compute cluster is an appropriate computer architecture for Monte Carlo simulations like the parallel Pi example. Assume that the parallelization across nodes is not a problem.
Yes, HPC is a good computer architecture for Monte Carlo simulations.
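A hedged sketch of the parallel Pi idea with Python's multiprocessing (sample sizes are made up; on a real HPC cluster the tasks would be distributed across nodes, e.g. via MPI, rather than local processes):

```python
import random
from multiprocessing import Pool

def count_hits(n: int) -> int:
    """Count random points in the unit square that land inside the quarter circle."""
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    n_per_task, tasks = 250_000, 8
    with Pool() as pool:  # workers sample independently: embarrassingly parallel
        hits = sum(pool.map(count_hits, [n_per_task] * tasks))
    print(4 * hits / (n_per_task * tasks))  # approaches pi
```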
Difference between HPC and commodity hardware

Distributed compute cluster (commodity hardware)

Workload comparison between HPC and data science

Data-intensive Compute Cluster

Latency vs computation
Computation is cheap; data movement is very expensive
Multithreading
- Threads communicate via variables in shared memory
- Simultaneous read access to data
- Write access to the same data requires a lock
In multi-threaded programming the time needed to communicate between two threads is typically on the order of
200 ns
In multi-threaded programming all threads can simultaneously…
…read and write, but not the same data
Threads writing to memory incorrectly

Threads writing to memory correctly

Locking
- Protects from errors due to parallel writes
- Lock
- is acquired before reading/writing data
- is released when done
- ensures serial access to shared memory (see the sketch below)
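A minimal sketch with Python's threading.Lock (the counter and thread counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:  # acquired before the write, released when done
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000; without the lock, parallel writes can lose updates
```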
Deadlocks
- Execution stops because two or more threads wait for each other (sketched below)
- Two threads need to write variables a & b
- Thread 1 locks a and waits for b
- Thread 2 locks b and waits for a
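A sketch of this pattern and the usual fix, acquiring locks in one consistent global order (names are illustrative):

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()

# Deadlock-prone: thread 1 holds lock_a and waits for lock_b,
# while thread 2 holds lock_b and waits for lock_a.
def thread1():
    with lock_a:
        with lock_b:  # may wait forever if thread 2 already holds lock_b
            pass

def thread2_deadlock_prone():
    with lock_b:
        with lock_a:  # may wait forever if thread 1 already holds lock_a
            pass

# Fix: every thread acquires the locks in the same order (a before b).
def thread2_fixed():
    with lock_a:
        with lock_b:
            pass
```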
Questions to ask when parallelizing
- Which sections can be parallelized?
- What needs to be serial?
- When is communication necessary between threads?
- How much data needs to be communicated?
Load balancing
Distributing the workload equally amongst the threads
Message passing interface (MPI)
- The message passing interface (MPI) is a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory.
- Used for high performance computing
- Usually on supercomputers
- Substantial latencies for thousands of cores
- Lower throughput for sharing large amounts of data
- Communication incl. exchange of data via high-speed network (5000 ns + 2× RAM access)
Passing a message over InfiniBand takes about 5000 ns. This is how many times slower than a memory access?
11-50 times slower
Word count and character count Map-Reduce

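A hedged word-count sketch with the mrjob library (character count is analogous: emit one pair per character instead of per word):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # After the shuffle, sum all the 1s collected for this word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()  # run as: python wordcount.py input.txt
```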
Combiners
- Semi-reduce step that runs locally on each node
- Word count
- Each combiner adds up n/k values
- Each reducer gets k values to add up
- n = number of inputs
- k = number of nodes
- Use combiners to utilize more cores (see the sketch below)
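The same word count with a combiner, as a hedged mrjob sketch; the combiner pre-sums each node's local (word, 1) pairs so each reducer receives roughly k partial sums instead of n raw values:

```python
from mrjob.job import MRJob

class MRWordCountCombiner(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)  # local partial sum on each node

    def reducer(self, word, partial_sums):
        yield word, sum(partial_sums)  # combine the k partial sums

if __name__ == "__main__":
    MRWordCountCombiner.run()
```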
Map-reduce
- Two parallel phases
- Map: map each input to a (key, value) pair
- Reduce: for each key, collect its respective values and aggregate them
- Shuffle phase after the map phase and before the reduce phase
- Mainly used for one-pass jobs (each data sample seen only once)
- Theoretical speedup for the map phase is the number of inputs
- Theoretical speedup for the reduce phase is the number of keys
- Practical speedup is the number of nodes
Multi step in MRjob
- e.g. using a second step with another reducer to find the most frequent item (see the sketch below)
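A hedged two-step sketch using mrjob's MRStep (class and method names are illustrative):

```python
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMostFrequentWord(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_count, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_max),
        ]

    def mapper_count(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_sum(self, word, counts):
        # Emit everything under one key so a single reducer sees all totals.
        yield None, (sum(counts), word)

    def reducer_max(self, _, count_word_pairs):
        count, word = max(count_word_pairs)
        yield word, count

if __name__ == "__main__":
    MRMostFrequentWord.run()
```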
Why can't two reducers communicate?
- They run on different machines
- They are not available at the same time
Spark
- General-purpose cluster-computing framework
- Compute clusters have each node set to perform the same task, controlled and scheduled by software
- Uses resilient distributed datasets (RDDs)
- Fault tolerant: an RDD can always be re-constructed if a node fails
- Obtained from the driver program (Python script), read from HDFS or other sources
- Good for iterative jobs: many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter
In case of a disk or hardware failure on one node, when using frameworks like Hadoop (MapReduce) or Spark with data on an HDFS file system, computations…
…only the results for the failing node have to be recomputed


Frequent keys Spark program

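A hedged PySpark sketch of one way such a program could look (the file path and the threshold of 100 are made-up placeholders):

```python
from pyspark import SparkContext

sc = SparkContext(appName="FrequentKeys")

counts = (sc.textFile("hdfs:///data/input.txt")    # RDD of lines
            .flatMap(lambda line: line.split())    # transformation (lazy)
            .map(lambda word: (word, 1))           # transformation (lazy)
            .reduceByKey(lambda a, b: a + b))      # transformation (lazy)

frequent = counts.filter(lambda kv: kv[1] >= 100)  # keep keys seen >= 100 times
print(frequent.take(10))                           # action: triggers the job
```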
Hadoop distributed file system (HDFS)
- Software framework for distributed storage and processing of big data using the MapReduce programming model
- The namenode is aware of the distribution of chunks and distributes map-reduce jobs accordingly
- Computations are performed where the data is stored
- Failure of a node or even a rack can be compensated without invalidating previous computations

Trie (prefix tree)

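A minimal trie sketch in Python using nested dicts (the "_end" marker is one implementation choice):

```python
def trie_insert(root: dict, word: str) -> None:
    node = root
    for ch in word:
        node = node.setdefault(ch, {})  # walk/create one child per character
    node["_end"] = True                 # mark a complete word

def trie_contains(root: dict, word: str) -> bool:
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "_end" in node

root = {}
for w in ["map", "mapper", "reduce"]:
    trie_insert(root, w)
print(trie_contains(root, "map"), trie_contains(root, "red"))  # True False
```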
Bloom filter
- A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The trade-off paid for this efficiency is that a Bloom filter is probabilistic.
- Operations:
- Insert item x into Bloom filter B
- Query: is x present in B?
- If x is present in B, the query will always be answered correctly
- With probability p, the query might be answered positively even if x is not present in B
Describe how a Bloom filter works

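A minimal sketch, assuming b bits and k hash functions; the independent hash functions are simulated here by salting Python's built-in hash, which a real implementation would replace with proper hash functions:

```python
class BloomFilter:
    def __init__(self, b: int, k: int):
        self.bits = [0] * b  # b-bit array, all zeros initially
        self.b, self.k = b, k

    def _positions(self, x):
        # k bit positions for item x, one per (salted) hash function
        return [hash((i, x)) % self.b for i in range(self.k)]

    def insert(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def query(self, x) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(x))

bf = BloomFilter(b=1024, k=3)
bf.insert("spark")
print(bf.query("spark"), bf.query("hadoop"))  # True, (almost surely) False
```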
Trie and parallelization

Bloom filter: Cache Behaviour
For sufficiently large b, accessing each bit will likely cause a cache miss. Is the expected number of cache misses for insert and query operations the same when the item is not in the Bloom filter?
- A cache miss is a state where the data requested for processing by a component or application is not found in the cache memory.
- Insert
- worst case: n cache misses per insert (one per hash function)
- Query
- item present / false positive: like insert
- item not present: on average much lower than the worst case (can stop at the first 0 bit)
Describe how the error in a Bloom filter arises
- False positives (the positive class is "the element is in the Bloom filter")
- A false positive occurs when other elements have already set all the bits that x maps to

Parallelization of a Bloom filter

Data analysis on large texts
- Text T:
- Natural language
- Biological sequences, any other sequence of discrete observations
- Error or event logs (error codes/event types = alphabet)
- Questions:
- Is s a substring of T?
- How often does s appear in T?
- Suffix tree
- Use tree traversal to populate leaf counts for all internal nodes (once!)
Spark optimization strategy
- Transformations are lazy
- Only actions trigger computations
- Spark maintains directed acyclic graphs to represent the workflow
- The scheduler assigns computational tasks to workers, optimizing compute-data co-location
Spark Architecture

Count-min sketch
An efficient algorithm for counting a stream of data. It uses hash functions to map events to frequencies but, unlike a hash table, uses less space, at the expense of over-counting some events due to collisions.

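A minimal sketch with d rows of w counters, again simulating independent hash functions by salting Python's hash:

```python
class CountMinSketch:
    def __init__(self, w: int, d: int):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]  # d rows of w counters

    def add(self, x):
        for row in range(self.d):
            self.table[row][hash((row, x)) % self.w] += 1

    def estimate(self, x) -> int:
        # Minimum over rows: over-counts on collisions, never under-counts.
        return min(self.table[row][hash((row, x)) % self.w]
                   for row in range(self.d))

cms = CountMinSketch(w=1000, d=5)
for event in ["a", "b", "a", "a"]:
    cms.add(event)
print(cms.estimate("a"))  # at least 3, usually exactly 3
```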
Resilient distributed dataset (RDD)
- Fault tolerant
- An RDD can always be reconstructed if a node fails
- Two operations on RDDs (see the sketch below)
- transformations: create a new RDD from input
- actions: produce output
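A short PySpark sketch distinguishing the two operation types:

```python
from pyspark import SparkContext

sc = SparkContext(appName="RDDDemo")

rdd = sc.parallelize(range(10))               # source RDD
squares = rdd.map(lambda x: x * x)            # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still lazy

print(evens.collect())  # action: the whole pipeline executes now
print(evens.count())    # action: recomputed unless the RDD is cached
```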
Multiprocessing queue
- Once a worker is done with its work, it can fetch more work from the queue
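A hedged sketch with multiprocessing.Queue, using a None sentinel to signal the end of the work (task sizes are illustrative):

```python
from multiprocessing import Process, Queue

def worker(tasks: Queue, results: Queue) -> None:
    while True:
        item = tasks.get()        # idle workers pull the next task
        if item is None:          # sentinel: no more work
            break
        results.put(item * item)  # stand-in for real work

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    procs = [Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for p in procs:
        p.start()
    for i in range(20):
        tasks.put(i)
    for _ in procs:
        tasks.put(None)  # one sentinel per worker
    out = sorted(results.get() for _ in range(20))  # drain before joining
    for p in procs:
        p.join()
    print(out)
```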