Algorithms and computations for big data Flashcards
The four parallel paradigms
- Multithreading
- Message Passing Interface (MPI)
- Map-Reduce
- Spark
Learning outcomes
- Knowledge and understanding
- discuss important technological aspects when designing and implementing analysis solutions for large-scale data,
- describe data models and software standards for sharing data on the web.
- Skills and abilities
- use Python to implement applications for transforming and analyzing large-scale data with appropriate software frameworks,
- provide access to and utilize structured data over the web with appropriate data models and software tools
- Judgement and approach
- suggest appropriate computational infrastructures for analysis tasks and discuss their advantages and drawbacks,
- discuss advantages and drawbacks of different strategies for dissemination of data,
- discuss large-scale data processing from an ethical point of view.
Levels of parallelism
- Multi-core CPUs
- Several CPUs per system
- Clusters of multiple systems
Speedup
Given two variants of a program solving the same problem (a baseline, and an optimized implementation, faster algorithm, or parallel version) with running times t (baseline) and t' (optimized):
S = t/t’
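As a minimal sketch (the function name `speedup` is my own), the definition can be applied directly:

```python
def speedup(t, t_prime):
    """Speedup S = t / t' of an optimized variant over a baseline."""
    return t / t_prime

# A baseline run of 120 s reduced to 30 s gives a speedup of 4x.
print(speedup(120, 30))
```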
Amdahl’s law
- f is the proportion of the code that is parallelizable, s the speedup of that portion

S(f,s) = 1/((1-f)+f/s)
as s goes to infinity, S approaches 1/(1-f)
- No: only some programs benefit from parallelization, and their maximal acceleration is bounded.
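A small sketch of the formula above (the function name is my own) makes the bound concrete:

```python
def amdahl(f, s):
    """Amdahl's law: S(f, s) = 1 / ((1 - f) + f / s), where f is the
    parallelizable fraction and s the speedup of that fraction."""
    return 1.0 / ((1.0 - f) + f / s)

# With 90% parallelizable code, the speedup is capped at 1 / (1 - 0.9) = 10,
# no matter how large s becomes.
for s in (2, 10, 1000):
    print(s, amdahl(0.9, s))
```

Even with s = 1000, the total speedup stays just below the 1/(1-f) = 10 ceiling.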
Are multicore CPUs a technical necessity?
Yes. Cooling becomes a bottleneck when increasing the clock frequency of a CPU.
Flynn’s taxonomy
- SIMD: GPU
- MIMD: Multi-core processors

Memory hierarchy

Cache memory
Small, high-speed memory attached to a processor core
Symmetric Multiprocessor (SMP)
- Multiple CPUs (typically 2-8; each can have multiple cores) share the same main memory
- One address space
High performance computing (HPC)

Classical HPC compute cluster is an appropriate computer architecture for Monte Carlo simulations like the parallel Pi example. Assume that the parallelization across nodes is not a problem.
Yes, HPC is a good computer architecture for Monte Carlo simulations.
Difference between HPC and commodity

Distributed compute cluster (commodity hardware)

Workload comparison between HPC and Datascience

Data-intensive Compute Cluster

Latency vs computation
Computation is cheap; data movement is very expensive
Multithreading
- Threads communicate via variables in shared memory
- Simultaneous read access to data
- Write access to the same data requires a lock
In multi-threaded programming the time needed to communicate between two threads is typically on the order of
200ns
In multithreaded programming, all threads can simultaneously…
…read and write, but not the same data
Threads writing to memory incorrectly

Threads writing to memory correctly

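A minimal sketch of both cases (counter and thread counts are my own choices): an unprotected `+=` on shared data is a read-modify-write race in which updates can be lost, while a `Lock` serializes the writes.

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_add(n):
    global counter
    for _ in range(n):
        counter += 1        # read-modify-write without a lock: updates may be lost

def safe_add(n):
    global counter
    for _ in range(n):
        with lock:          # lock acquired before writing, released when done
            counter += 1

threads = [threading.Thread(target=safe_add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # with the lock, always 400000
```

Running `unsafe_add` instead may or may not show lost updates in CPython (the interpreter masks many races), but the code is incorrect either way.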
Locking
- Protects from errors due to parallel writes
- Lock
- is acquired before reading/writing data
- is released when done
- ensures serial access to shared memory
Deadlocks
- Execution stops because two or more threads wait for each other
- Two threads need to write variables a & b
- Thread 1 locks a and waits for b
- Thread 2 locks b and waits for a
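One common remedy (names here are my own) is to make every thread acquire the locks in the same global order, which prevents the circular wait described above:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
results = []

def worker(name):
    # Both threads take the locks in the same order (a, then b),
    # so neither can hold b while waiting for a: no circular wait.
    with lock_a:
        with lock_b:
            results.append(name)

t1 = threading.Thread(target=worker, args=("thread-1",))
t2 = threading.Thread(target=worker, args=("thread-2",))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)
```

If thread 2 instead took `lock_b` first, the interleaving from the card above could leave both threads blocked forever.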