Algorithms and computations for big data Flashcards

1
Q

The four parallel paradigms

A
  1. Multithreading
  2. Message passing interface (MPI)
  3. Map-Reduce
  4. Spark
2
Q

Learning outcomes

A
  • Knowledge and understanding
    • discuss important technological aspects when designing and implementing analysis solutions for large-scale data,
    • describe data models and software standards for sharing data on the web.
  • Skills and abilities
    • use Python to implement applications for transforming and analyzing large-scale data with appropriate software frameworks,
    • provide access and utilize structured data over the web with appropriate data models and software tools
  • Judgement and approach
    • suggest appropriate computational infrastructures for analysis tasks and discuss their advantages and drawbacks,
    • discuss advantages and drawbacks of different strategies for dissemination of data,
    • discuss large-scale data processing from an ethical point of view.
3
Q

Levels of parallelism

A
  • Multi-core CPUs
  • Several CPUs per system
  • Clusters of multiple systems
4
Q

Speedup

A

Given two variants of a program solving the same problem, a baseline and an optimized implementation (a faster algorithm or a parallel version), with running times t (baseline) and t' (optimized):

S = t/t’
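For example (illustrative timings, not from the card):

```python
# Speedup S = t / t' for a baseline and an optimized run.
# The timings below are made up for illustration.
t = 12.0      # baseline running time in seconds
t_opt = 3.0   # optimized running time in seconds

S = t / t_opt
print(S)  # 4.0
```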

5
Q

Amdahl’s law

A
  • f: proportion of the code that is parallelizable; s: speedup of the parallel part

S(f,s) = 1/((1-f) + f/s)

As s goes to infinity, S(f,s) approaches 1/(1-f)

  • Not every program benefits from parallelization, and the maximal acceleration is bounded by the serial fraction.
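A minimal sketch of the formula in Python (the values of f and s are chosen for illustration):

```python
def amdahl_speedup(f, s):
    """Amdahl's law: f = parallelizable fraction, s = speedup of that part."""
    return 1.0 / ((1.0 - f) + f / s)

# With 90% parallelizable code, even unbounded s caps the speedup at 1/(1-0.9) = 10.
print(amdahl_speedup(0.9, 10))   # ~5.26
print(amdahl_speedup(0.9, 1e9))  # approaches 10
```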
6
Q

Are multicore CPUs a technical necessity?

A

Yes. Cooling is a bottleneck when increasing the clock frequency of a CPU.

7
Q

Flynn’s taxonomy

A
  • SIMD (single instruction, multiple data): GPUs
  • MIMD (multiple instruction, multiple data): multi-core processors
8
Q

Memory hierarchy

A
9
Q

Cache memory

A

Small, high-speed memory attached to the processor core

10
Q

Symmetric Multiprocessor (SMP)

A
  • Multiple CPUs (typically 2-8; each can have multiple cores) share the same main memory
  • One address space
11
Q

High performance computing (HPC)

A
12
Q

Classical HPC compute cluster is an appropriate computer architecture for Monte Carlo simulations like the parallel Pi example. Assume that the parallelization across nodes is not a problem.

A

Yes, a classical HPC cluster is a good computer architecture for Monte Carlo simulations.

13
Q

Difference between HPC and commodity

A
14
Q

Distributed compute cluster (commodity hardware)

A
15
Q

Workload comparison between HPC and data science

A
16
Q

Data-intensive Compute Cluster

A
17
Q

Latency vs computation

A

Computation is cheap; data movement is very expensive

18
Q

Multithreading

A
  • Threads communicate via variables in shared memory
  • Simultaneous read access to data
  • Write access to the same data requires a lock
19
Q

In multi-threaded programming the time needed to communicate between two threads is typically on the order of

A

200 ns

20
Q

In multithreaded programming, all threads can simultaneously…

A

…read and write, but not to the same data

21
Q

Threads writing to memory incorrectly

22
Q

Threads writing to memory correctly

23
Q

Locking

A
  • Protects from errors due to parallel writes
  • Lock
    • is acquired before reading/writing data
    • is released when done
    • assures serial access to shared memory
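A minimal sketch of the pattern with Python's `threading.Lock` (the counter and thread counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:   # acquired before writing, released when done
            counter += 1

threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 40000 — the lock assures serial access to the shared variable
```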
24
Q

Deadlocks

A
  • Execution stops because 2 or more threads wait for each other
  • Two threads need to write variables a & b
    • Thread 1 locks a and waits for b
    • Thread 2 locks b and waits for a
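One common remedy (a convention, not stated on the card) is that every thread acquires the locks in the same fixed order, so the circular wait cannot occur; a sketch:

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
result = []

def worker(name):
    # Both threads take lock a before lock b, so neither can hold
    # one lock while waiting forever for the other.
    with lock_a:
        with lock_b:
            result.append(name)

t1 = threading.Thread(target=worker, args=("thread 1",))
t2 = threading.Thread(target=worker, args=("thread 2",))
t1.start(); t2.start()
t1.join(); t2.join()
print(result)
```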
25
Q

Questions to ask when parallelizing

A
  • Which sections can be parallelized?
  • What needs to be serial?
  • When is communication necessary between threads?
  • How much data needs to be communicated?
26
Q

Load balancing

A

Distributing the workload equally amongst the threads

27
Q

Message passing interface (MPI)

A
  • The message passing interface (MPI) is a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory.
  • Used for high performance computing
  • Usually on super computers
  • Substantial latencies for thousands of cores
  • Lower throughput for sharing large amount of data
  • Communication incl. exchange of data via high-speed network (5000 ns + 2× RAM access)
28
Q

Passing a message over Infiniband takes
about 5000 ns. This is how many times
slower than a memory access?

A

11-50 times slower

29
Q

Word count and character count in Map-Reduce

30
Q

Combiners

A
  • A semi-reduce step run on the mapper's local output
  • Word count example:
    • Each combiner adds up n/k values
    • Each reducer gets k values to add up
    • n = number of inputs
    • k = number of nodes
  • Use combiners to utilize more cores
31
Q

Map-reduce

A
  • Two parallel phases
  1. Map: map each input to a (key, value) pair
  2. Reduce: for each key, collect its respective values and aggregate
  • Shuffle phase between the map phase and the reduce phase
  • Mainly used for one-pass jobs (each data sample is seen only once)
  • Theoretical speedup for the map phase is the number of inputs
  • Theoretical speedup for the reduce phase is the number of keys
  • Practical speedup is the number of nodes
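A single-machine sketch of the map, shuffle, and reduce phases for word count (plain Python, not a real framework):

```python
from collections import defaultdict

def mapper(line):
    # Map: emit a (key, value) pair per word
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce: aggregate the values for one key
    return key, sum(values)

lines = ["big data big compute", "big data"]
pairs = [p for line in lines for p in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'compute': 1}
```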
32
Q

Multi step in MRjob

A
  • e.g., using another reducer to find the most frequent item
33
Q

Why can't two reducers communicate?

A
  • Run on different machines
  • Not available at the same time
34
Q

Spark

A
  • General-purpose cluster-computing framework
    • computer clusters have each node set to perform the same task, controlled and scheduled by software
  • Uses resilient distributed datasets (RDDs)
    • Fault tolerant: an RDD can always be re-constructed if it fails
    • Obtained from the driver program (Python script), read from HDFS or other sources
  • Good for iterative jobs: many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter

35
Q

In case of a disk or hardware failure on one node, when
using frameworks like Hadoop (MapReduce) or Spark with
data on a HDFS file system, computations

A

only the results for the failing node have to be recomputed

37
Q

Frequent keys Spark program

38
Q

Hadoop distributed file system (HDFS)

A
  • The distributed storage layer of Hadoop, a software framework for distributed storage and processing of big data using the MapReduce programming model
  • Namenode is aware of the distribution of chunks and distributes map-reduce jobs accordingly
  • Computations are performed where the data is stored
  • Failure of a node or even a rack can be compensated without invalidating previous computations
39
Q

Trie (prefix tree)

40
Q

Bloom filter

A
  • A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The trade-off paid for this efficiency is that a Bloom filter is probabilistic
  • Operations:
    • Insert item x into Bloom filter B
    • Query: is x present in B?
  • If x is present in B, the query will always be answered correctly
  • With probability p, the query might be answered positively even if x is not present in B.
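A toy sketch in plain Python (the sizes b and k, and the use of SHA-256 as the hash family, are illustrative choices, not from the card):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: b bit positions, k hash functions."""
    def __init__(self, b=1024, k=3):
        self.b, self.k = b, k
        self.bits = bytearray(b)  # one byte per bit, for simplicity

    def _positions(self, x):
        # Derive k positions from k salted hashes of x.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.b

    def insert(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def query(self, x):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(x))

B = BloomFilter()
B.insert("alice")
print(B.query("alice"))  # True — present items are always found
```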
41
Q

Describe how a Bloom filter works

42
Q

Trie and parallelization

43
Q

Bloom filter: Cache Behaviour
For sufficiently large b, accessing each bit will likely cause
a cache miss. The expected number of cache misses for
insert and query operations, when the item is not in the Bloom filter, is the same.

A
  • Cache miss is a state where the data requested for processing by a component or application is not found in the cache memory.
  • Insert
    • worst case: n cache misses per insert
  • Query
    • item present / false positive: like insert
    • item not present: average much lower than worst case
44
Q

Describe how the error in a Bloom filter arises

A
  • False positives (the positive class is "the element is in the Bloom filter")
  • A false positive occurs when other elements have already set the specific bits
45
Q

Parallelization of a Bloom filter

46
Q

Data analysis on large texts

A
  • Text T:
    • Natural language
    • Biological sequences, any other sequence of discrete observations
    • Error or event logs (error codes/event types = alphabet)
  • Questions:
    • Is s a substring of T?
    • How often does s appear in T?
  • Suffix tree
    • Use tree-traversal to populate leaf-counts for all internal nodes (once!)
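A naive scanning baseline for these questions (plain Python; a suffix tree answers them without rescanning T for each query):

```python
def count_occurrences(T, s):
    # Count (possibly overlapping) occurrences of s in T by scanning.
    # A suffix tree with leaf-counts answers this without rescanning T.
    return sum(1 for i in range(len(T) - len(s) + 1) if T[i:i + len(s)] == s)

T = "abababa"
print("aba" in T)                    # substring test: True
print(count_occurrences(T, "aba"))   # 3 overlapping occurrences
```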
47
Q

Spark optimization strategy

A

• Transformations are lazy
• Only actions trigger computations
• Spark maintains directed acyclic graphs (DAGs) to represent the workflow
• Scheduler assigns computational tasks to workers, optimizing compute-data co-location
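The laziness can be mimicked with Python generators (a sketch of the idea, not the Spark API):

```python
# "Transformations" build a pipeline without computing anything...
data = range(10)
mapped = (x * x for x in data)                 # lazy transformation
filtered = (x for x in mapped if x % 2 == 0)   # still nothing computed

# ...only the "action" at the end triggers the computation.
total = sum(filtered)
print(total)  # 0 + 4 + 16 + 36 + 64 = 120
```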

48
Q

Spark Architecture

49
Q

Count min sketch

A

An efficient algorithm for counting events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table it uses less space, at the expense of over-counting some events due to collisions.
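A toy sketch in plain Python (the table sizes and the SHA-256-based hash family are illustrative choices, not from the card):

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: d rows of w counters."""
    def __init__(self, w=256, d=4):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _col(self, row, x):
        # One salted hash per row maps the event to a counter column.
        h = hashlib.sha256(f"{row}:{x}".encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, x):
        for row in range(self.d):
            self.table[row][self._col(row, x)] += 1

    def count(self, x):
        # Minimum over the rows; collisions can only inflate the estimate.
        return min(self.table[row][self._col(row, x)] for row in range(self.d))

cms = CountMinSketch()
for item in ["a", "b", "a", "a"]:
    cms.add(item)
print(cms.count("a"))  # >= 3; exact unless "b" collides in every row
```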

50
Q

Resilient distributed dataset (RDD)

A
  • Fault tolerant
  • RDD can always be reconstructed if a node fails
  • Two operations on RDD
    • transformations: create new RDD from input
    • action: produce output
51
Q

Multi processing queue

A
  • Once a thread is done with its work, it can fetch more work from the queue
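A sketch of the pattern using Python's thread-safe `queue.Queue` (threads rather than processes, for brevity):

```python
import queue
import threading

work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            item = work_queue.get_nowait()  # a finished thread pulls more work
        except queue.Empty:
            return                          # queue drained: worker exits
        with results_lock:
            results.append(item * item)

for n in range(8):
    work_queue.put(n)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49]
```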