All Flashcards
DER algorithm
Step 1: replication: duplicate the small table to every processor
Step 2: local inner join
Step 3: select the ROW IDs of left-table rows with no local matches
Step 4: redistribute the ROW IDs (hash on ROW ID)
Step 5: keep each ROW ID that appears as many times as the number of processors (i.e. matched nowhere)
Step 6: join the kept ROW IDs back to the left table to output the NULL-padded rows (sketch below)
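A minimal single-process sketch of the six DER steps, simulating the processors with plain lists; the tables, the rid % P hash, and the NULL padding are illustrative assumptions, not from the card:

```python
from collections import Counter

P = 3
R = [(0, 'a'), (1, 'b'), (2, 'x')]                  # small left table: (row_id, key)
S = [[('a', 's0')], [('b', 's1'), ('a', 's2')], []] # partitioned large table

results, buckets = [], [[] for _ in range(P)]
for p in range(P):                                  # step 1: R is replicated, so every
    matched = set()                                 # processor sees all of R
    for rid, k in R:                                # step 2: local inner join R x S_p
        for sk, sv in S[p]:
            if sk == k:
                results.append((rid, k, sv))
                matched.add(rid)
    for rid, k in R:                                # step 3: ROW IDs with no local match
        if rid not in matched:
            buckets[rid % P].append(rid)            # step 4: redistribute by ROW ID

R_by_id = dict(R)
for p in range(P):
    for rid, n in Counter(buckets[p]).items():
        if n == P:                                  # step 5: unmatched on ALL P processors
            results.append((rid, R_by_id[rid], None))   # step 6: NULL-padded dangling row
```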
ROJA Algorithm
Step 1: reshuffle the data based on the join attribute
Step 2: each processor performs the local outer join
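A minimal sketch of the two ROJA steps above, simulating the processors with lists and assuming unique join keys in S (data and hash function are illustrative):

```python
P = 2
R = [('a', 'r0'), ('b', 'r1'), ('c', 'r2')]
S = [('a', 's0'), ('c', 's1')]

def reshuffle(table):                               # step 1: hash on the join
    parts = [[] for _ in range(P)]                  # attribute, so matching keys
    for row in table:                               # land on the same processor
        parts[hash(row[0]) % P].append(row)
    return parts

Rp, Sp = reshuffle(R), reshuffle(S)
results = []
for p in range(P):                                  # step 2: local left outer join
    s_local = dict(Sp[p])
    for k, v in Rp[p]:
        results.append((k, v, s_local.get(k)))      # None = NULL-padded dangling row
```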
DOJA Algorithm
Step 1: replication: duplicate the small table to every processor
Step 2: local inner join
Step 3: hash-redistribute the inner-join result based on the join attribute X
Step 4: local outer join (sketch below)
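A minimal sketch of the four DOJA steps under the same simulated-processor setup; the duplicated table, data, and hashing are illustrative assumptions:

```python
P = 2
R = [('a', 'r0'), ('x', 'r1')]                      # step 1: small table, duplicated
S = [[('a', 's0')], [('a', 's1')]]                  # partitioned large table

inner = [[] for _ in range(P)]
for p in range(P):
    for k, rv in R:                                 # step 2: local inner join
        for sk, sv in S[p]:
            if sk == k:
                inner[hash(k) % P].append((k, rv, sv))  # step 3: redistribute on key

results = []
for p in range(P):                                  # step 4: local outer join against
    local_keys = {k for k, _, _ in inner[p]}        # the redistributed inner result
    results += inner[p]
    for k, rv in R:
        if hash(k) % P == p and k not in local_keys:
            results.append((k, rv, None))           # dangling row, NULL-padded once
```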
OJSO
When joining 3 tables:
1. Do redistribution on join attribute (same as ROJA)
2. local join (same as ROJA)
3. redistribute the joined table & the third table based on the join attribute
(ignore dangling records: their join attribute is NULL, so they cannot match)
4. local join
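A rough sketch of this pipeline for (R LOJ S) LOJ T; the schemas, keys, and the way dangling rows are set aside are illustrative assumptions:

```python
# R carries the first join key a, S maps a -> the second join key b, T carries b.
P = 2
R = [('a1', 'r0'), ('a2', 'r1')]
S = [('a1', 'b1')]
T = [('b1', 't0')]

def part(rows, col):                   # hash-partition rows on column `col`
    return [[r for r in rows if hash(r[col]) % P == p] for p in range(P)]

Rp, Sp = part(R, 0), part(S, 0)        # 1. redistribute on the first join attribute
joined, dangling = [], []
for p in range(P):                     # 2. local outer join (same as ROJA)
    s = dict(Sp[p])
    for a, rv in Rp[p]:
        (joined if a in s else dangling).append((a, rv, s.get(a)))

Jp, Tp = part(joined, 2), part(T, 0)   # 3. redistribute matched rows & T on the
                                       #    second key; dangling rows are skipped
results = [(a, rv, None, None) for a, rv, _ in dangling]   # already final
for p in range(P):                     # 4. local outer join with T
    t = dict(Tp[p])
    for a, rv, b in Jp[p]:
        results.append((a, rv, b, t.get(b)))
print(results)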
local join: scan cost
(after divide & broadcast, optimizing main memory)
((R_i/P) + (S_i/P) - (M/P)) x IO
M = size of main memory, P = page size
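A worked instance of the formula, with all parameter values assumed for illustration:

```python
R_i = 80 * 2**20    # local fragment of R: 80 MB
S_i = 40 * 2**20    # local fragment of S: 40 MB
M   = 20 * 2**20    # main memory: 20 MB
P   = 4096          # page size in bytes
IO  = 2.2e-3        # cost of one page I/O in seconds

pages = R_i / P + S_i / P - M / P   # pages that must actually hit the disk
print(pages * IO)                   # 56.32 s for this configuration
```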
Speed up
concerned with increasing processing speed while keeping the workload the same
elapsed time on uniprocessor / elapsed time on multiprocessor
Scale up
concerned with increasing workload while maintaining processing speed
elapsed time on small system / elapsed time on large system (processing a proportionally larger workload)
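Worked instances of both ratios (all numbers are assumptions):

```python
speed_up = 100 / 12.5   # same job: 100 s on 1 processor, 12.5 s on 8 -> 8.0 (linear)
scale_up = 100 / 100    # 8x workload on an 8x system, still 100 s -> 1.0 (ideal)
print(speed_up, scale_up)
```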
Downside of a shared-nothing architecture
load balancing becomes difficult
Downside of Shared-Memory & Shared-Disk architectures
both suffer from memory and bus contention
Processor activation or involvement of parallel search algorithms
Key comparison of parallel search algorithms
Divide and Broadcast […] join: Transfer cost
(S_i/P) x (n-1) x (m_p + m_l)
Divide and Broadcast […] join: Receiving cost
(S/P - S_i/P) x m_p
Divide and Broadcast […] join: Scan cost
(S_i/P) x IO
Divide and Broadcast […] join: Select cost
|S_i| x (t_r + t_w)
Divide and Broadcast […] join: Disk storing cost
(S/P - S_i/P) x IO
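A worked instance that adds up all five divide-and-broadcast components; every parameter value below is an assumption chosen only for illustration (n = number of processors, P = page size):

```python
n   = 4                    # number of processors
S   = 400 * 2**20          # whole table S: 400 MB, so S_i = S/n per processor
S_i = S / n
P   = 4096                 # page size in bytes
IO  = 2.2e-3               # I/O cost per page (s)
m_p, m_l = 1.3e-3, 2.4e-3  # message protocol cost and latency per page (s)
t_r = t_w = 3e-7           # per-record read / write cost (s)
recs_i = S_i / 100         # |S_i|, assuming 100-byte records

scan     = (S_i / P) * IO                     # read the local fragment
select   = recs_i * (t_r + t_w)               # form the outgoing records
transfer = (S_i / P) * (n - 1) * (m_p + m_l)  # broadcast to the other n-1 processors
receive  = (S / P - S_i / P) * m_p            # receive the other fragments
store    = (S / P - S_i / P) * IO             # write received pages to disk
print(scan + select + transfer + receive + store)   # ~610 s in total
```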
Local join: Scan cost
(after divide & broadcast)
((R_i/P) + (S/P)) x IO
Local Join: Select cost
(after divide & broadcast)
(|R_i| + |S|) x (t_r + t_w)
Local Join: Join cost
(after divide & broadcast)
|R_i| x (t_r + t_h) + |S| x (t_r + t_h + t_j)
Local join: Generating result cost
(after divide & broadcast)
|R_i| x σ_j x |S| x t_w
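A worked instance adding up the four local-join components above (σ_j is the join selectivity; all values below are assumptions):

```python
R_i, S = 100 * 2**20, 400 * 2**20   # local R fragment + the whole broadcast S
P, IO  = 4096, 2.2e-3               # page size (bytes), I/O cost per page (s)
t_r = t_w = 3e-7                    # per-record read / write cost (s)
t_h, t_j = 2e-7, 1e-7               # per-record hash / comparison cost (s)
sigma_j  = 1e-8                     # join selectivity
R_recs, S_recs = R_i / 100, S / 100 # record counts, assuming 100-byte records

scan   = (R_i / P + S / P) * IO
select = (R_recs + S_recs) * (t_r + t_w)
join   = R_recs * (t_r + t_h) + S_recs * (t_r + t_h + t_j)  # build on R, probe with S
result = R_recs * sigma_j * S_recs * t_w                    # write the matches
print(scan + select + join + result)
```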
Redistribution Step: select cost
(|R_i| + |S_i|) x (t_r + t_w)
Redistribution step: finding destination cost
(|R_i| + |S_i|) x t_d
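A worked instance of the two redistribution components (t_d = cost of computing a record's destination; record counts and unit costs are assumptions):

```python
R_recs, S_recs = 1_000_000, 4_000_000
t_r = t_w = 3e-7                    # per-record read / write cost (s)
t_d = 1e-7                          # cost of computing a record's destination (s)
select      = (R_recs + S_recs) * (t_r + t_w)
destination = (R_recs + S_recs) * t_d
print(select, destination)          # 3.0 s and 0.5 s
```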
Parallel Merge-All Sort
Parallel Binary-Merge Sort
Parallel Redistribution Binary-Merge Sort
Parallel Redistribution Merge-All Sort
Parallel Partitioned Sort
Parallel Group-By
DT: data parallelism
Data is partitioned vertically
Locally
- each processor calculates entropy for its features
Globally
- each processor shares entropy and target class count
- determine the best splitting attribute
- share which records to include in the subsequent partitions
iterative process: repeat the steps above
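A sketch of the scheme above: features are partitioned vertically across simulated processors, each computes information gain for its own features, and the best split is chosen globally (toy data assumed):

```python
import math
from collections import Counter

target = ['y', 'y', 'n', 'n']                   # class column, known to all
features = {'outlook': ['s', 's', 'r', 'r'],    # feature columns, partitioned
            'windy':   [0, 1, 0, 1]}            # vertically across processors
partitions = [['outlook'], ['windy']]           # processor p owns partitions[p]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col, labels):                          # information gain of one feature
    g = entropy(labels)
    for v in set(col):
        sub = [l for c, l in zip(col, labels) if c == v]
        g -= len(sub) / len(labels) * entropy(sub)
    return g

local = [{f: gain(features[f], target) for f in part}   # local step, per processor
         for part in partitions]
best = max((g, f) for d in local for f, g in d.items()) # global exchange + decision
print(best)                                     # (1.0, 'outlook') here
```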
DT: result parallelism
- data is partitioned horizontally
- each processor needs to exchange counts with other processors
- determine the best splitting attribute for the root node
- redistribute records by attribute to the processor assigned to this attribute
- each processor shares entropy and information gain values
- determine the second splitting attribute
- …
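A compact sketch of the count-exchange step above on horizontally partitioned rows (toy data assumed):

```python
from collections import Counter

parts = [[('s', 'y'), ('s', 'y')], [('r', 'n'), ('r', 'n')]]   # (outlook, class) rows,
local_counts = [Counter(rows) for rows in parts]               # counted per processor
global_counts = sum(local_counts, Counter())                   # exchange + merge
print(global_counts)    # global (value, class) counts -> entropy / gain as usual
```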
Formula: Entropy
Entropy(S) = - Σ_i p_i log2(p_i), where p_i is the proportion of records in class i
ID3 algorithm
- compute entropy for dataset
- for every attribute/feature:
- calculate entropy for all categorical values
- take the average information entropy for the current attribute
- calculate the gain for the current attribute
- pick the highest gain attribute
- repeat until the tree is complete
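A minimal recursive ID3 following these steps, on assumed toy data:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attrs:                             # nothing left to split on -> majority
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                              # dataset entropy minus the average
        g = entropy(labels)                   # information entropy after splitting on a
        for v in {r[a] for r in rows}:
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g
    best = max(attrs, key=gain)               # pick the highest-gain attribute
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [l for r, l in zip(rows, labels) if r[best] == v],
                          [a for a in attrs if a != best])
                   for v in {r[best] for r in rows}}}

rows = [{'outlook': 's', 'windy': 0}, {'outlook': 's', 'windy': 1},
        {'outlook': 'r', 'windy': 0}, {'outlook': 'r', 'windy': 1}]
print(id3(rows, ['y', 'y', 'n', 'n'], ['outlook', 'windy']))
# {'outlook': {'s': 'y', 'r': 'n'}}
```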
k-means: data parallelism
Initialization:
- Divide the dataset among processors
- Replicate the initial centroids to each processor
In each processor:
- Compute the distance of each local data point to the centroids
- Construct local clusters
- Maintain a sum and a count of each local cluster
- At each iteration, the master process computes the new means and sends them to all processors
- Repeat steps 1-4 until convergence
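A single-process sketch of the data-parallel scheme above, in one dimension for brevity; the data, k, and the convergence test are assumptions:

```python
parts = [[1.0, 1.2, 8.0], [0.8, 8.2, 9.0]]   # dataset divided among 2 processors
centroids = [0.0, 10.0]                      # initial centroids, replicated to all

for _ in range(10):                          # master-driven iterations
    sums = [0.0] * len(centroids)            # global sum / count per cluster,
    counts = [0] * len(centroids)            # accumulated from the local ones
    for local in parts:                      # this loop runs in parallel in reality
        for x in local:
            c = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            sums[c] += x
            counts[c] += 1
    new = [sums[i] / counts[i] if counts[i] else centroids[i]
           for i in range(len(centroids))]
    if new == centroids:                     # master: converged?
        break
    centroids = new                          # master broadcasts the new means
print(centroids)                             # ~[1.0, 8.4]
```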
k-means: result parallelism
Initialization:
- Divide dataset D among P processors, and sort the data within each processor
- Divide the initial centroids among processors
- Allocate data points to the nearest cluster centroid
In each processor:
- For each cluster, calculate the distance between each local data point and the cluster centroid
- For extreme low and high data points in each cluster:
  - if they are closer to the centroid of another cluster on the same processor, move them into that cluster
  - if they are closer to the centroid of a cluster on a different processor, move them to that processor
- Repeat steps 1 & 2 until convergence
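A rough 1-D sketch of the boundary-exchange idea above, where each simulated processor owns one centroid plus a sorted slice of the data and only its extreme points are tested for migration (data and loop structure are assumptions):

```python
procs = [[1.0, 1.2, 4.9], [5.1, 8.0, 9.0]]      # sorted slices, one per processor
for _ in range(5):                              # repeat until no point moves
    cents = [sum(xs) / len(xs) for xs in procs] # one centroid per processor
    moved = False
    for p, xs in enumerate(procs):
        for x in (xs[0], xs[-1]):               # only extreme low / high points
            q = min(range(len(cents)), key=lambda c: abs(x - cents[c]))
            if q != p:                          # closer to another processor's
                xs.remove(x)                    # cluster: move the point there
                procs[q].append(x)
                procs[q].sort()
                moved = True
    if not moved:
        break
print(procs)                                    # [[1.0, 1.2], [4.9, 5.1, 8.0, 9.0]]
```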
Formula: Similarity
Bounded data stream
A bounded stream has a defined start and an end. […] we can ingest the entire data set before starting any computation
Dense Vector -> Sparse Vector
What is the purpose of watermarking?
to handle events that arrive late to the application
(e.g. event time ≠ received time)
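A minimal sketch of the watermark idea: the watermark trails the largest event time seen by an assumed allowed lateness, and events whose event time is already behind it are flagged as late:

```python
ALLOWED_LATENESS = 5
watermark = float('-inf')
for event_time in [10, 12, 11, 30, 14, 31]:    # arrival order != event-time order
    late = event_time <= watermark             # arrived after the watermark passed it
    watermark = max(watermark, event_time - ALLOWED_LATENESS)
    print(event_time, 'late' if late else 'on time', '| watermark =', watermark)
# event 14 is flagged late: event 30 already pushed the watermark to 25
```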