All Flashcards

1
Q

DER algorithm

A

Step 1: replication (duplicate the small left table to every processor)
Step 2: local inner join
Step 3: select the ROW IDs of left-table rows with no match
Step 4: redistribute (hash) the ROW IDs
Step 5: store each ROW ID that appears as many times as there are processors (its row matched nowhere, so it is dangling)
Step 6: join the dangling ROW IDs back to the left table, padding with NULLs
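
A minimal single-process sketch of the six steps above for R LEFT OUTER JOIN S, with the small left table R replicated and S left partitioned; the data and names are illustrative assumptions, not the textbook's code:

```python
from collections import Counter

N = 3                                                  # number of processors
R = [(0, 'a', 'r0'), (1, 'x', 'r1'), (2, 'b', 'r2')]   # (ROW ID, key, payload)
S_parts = [[('a', 's0')],                              # S horizontally partitioned
           [('b', 's1'), ('a', 's2')],
           [('c', 's3')]]

inner, unmatched = [], []
for S_i in S_parts:                # Steps 1-2: R replicated, local inner join
    index = {}
    for k, v in S_i:
        index.setdefault(k, []).append(v)
    for rid, k, payload in R:
        if k in index:
            inner += [(payload, sv) for sv in index[k]]
        else:
            unmatched.append(rid)  # Step 3: ROW IDs with no local match

# Steps 4-5: redistribute the ROW IDs (hash); an ID seen N times found no
# match on any processor, so its row is globally dangling.
dangling = {rid for rid, c in Counter(unmatched).items() if c == N}

# Step 6: join dangling ROW IDs back to R, padding the S side with NULLs.
result = inner + [(payload, None) for rid, _, payload in R if rid in dangling]
print(result)
```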

2
Q

ROJA Algorithm

A

Step 1: reshuffle (hash-redistribute) both tables on the join attribute
Step 2: each processor performs a local outer join
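
A minimal single-process sketch of both steps for a left outer join; the data, hash function, and helper names are illustrative assumptions:

```python
N = 3
R = [('a', 'r0'), ('x', 'r1'), ('b', 'r2')]    # (join key, payload)
S = [('a', 's0'), ('b', 's1')]

def redistribute(table, n):
    """Step 1: ship each row to processor hash(key) mod n."""
    parts = [[] for _ in range(n)]
    for key, payload in table:
        parts[hash(key) % n].append((key, payload))
    return parts

result = []
for R_i, S_i in zip(redistribute(R, N), redistribute(S, N)):
    index = {}                                  # Step 2: local left outer join
    for k, v in S_i:
        index.setdefault(k, []).append(v)
    for k, rv in R_i:
        result += [(rv, sv) for sv in index[k]] if k in index else [(rv, None)]
print(result)
```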

3
Q

DOJA Algorithm

A

Step 1: replication: duplicate the small table to every processor

Step 2: local inner join

Step 3: hash-redistribute the inner-join result on attribute X

Step 4: local outer join
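
A minimal single-process sketch of the four steps for R LEFT OUTER JOIN S on attribute X, assuming R is the small, duplicated table; the data layout and names are illustrative assumptions:

```python
N = 2
R = [('a', 'r0'), ('x', 'r1')]                 # (attribute X, payload)
S_parts = [[('a', 's0')], [('a', 's1')]]       # S partitioned over N processors

# Steps 1-2: replicate R and inner-join it with each local fragment of S.
inner = [[(k, rv, sv) for k, rv in R for sk, sv in S_i if k == sk]
         for S_i in S_parts]

# Step 3: hash-redistribute the inner-join result on X, so all matches
# for a given key end up on one processor.
redist = [[] for _ in range(N)]
for rows in inner:
    for row in rows:
        redist[hash(row[0]) % N].append(row)

# Step 4: local outer join. R is hash-partitioned on X the same way, so
# each processor can see which of its R rows produced no match anywhere.
result = []
for p in range(N):
    matched = {row[0] for row in redist[p]}
    result += redist[p]
    result += [(k, rv, None) for k, rv in R
               if hash(k) % N == p and k not in matched]
print(result)
```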

4
Q

OJSO

A

When joining 3 tables:
1. redistribution on the first join attribute (same as ROJA)
2. local outer join (same as ROJA)
3. redistribute the joined table & the third table on the second join attribute
(dangling records are ignored, i.e. not redistributed)
4. local join
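
Why step 3 can ignore dangling records (assuming a left-deep plan like (R LEFT OUTER JOIN S) LEFT OUTER JOIN T): records left dangling by the first join carry a NULL in the second join attribute, so they can never match the third table; they can stay where they are and are simply NULL-padded in the final result.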

5
Q

Local join: scan cost
(after divide & broadcast)
(with main-memory optimization)

A

((R_i/P) + (S_i/P) - (M/P)) x IO

M = size of memory
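
Worked example with illustrative numbers: with R_i/P = 1,000 pages, S_i/P = 200 pages, a main-memory buffer of M/P = 100 pages, and IO = 10 ms per page, the scan cost is (1,000 + 200 - 100) x 10 ms = 11 s; the pages that stay resident in memory never have to be re-read from disk.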

6
Q

Speed up

A

concerned with increasing processing speed while keeping the workload the same

elapsed time on uniprocessor / elapsed time on multiprocessor
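
E.g. a query that takes 120 s on a uniprocessor and 15 s on 8 processors achieves a speed up of 120 / 15 = 8, i.e. linear speed up.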

7
Q

Scale up

A

concerned with increasing workload while maintaining processing speed

time on small system / time on large system
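
E.g. if a 1 GB workload takes 60 s on 1 processor and an 8 GB workload takes 60 s on 8 processors, scale up = 60 / 60 = 1, i.e. linear scale up; values below 1 mean the system does not keep up as workload and resources grow together.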

8
Q

Downside of a shared-nothing architecture

A

load balancing becomes difficult

9
Q

Downside of Shared-Memory & Shared-Disk architectures

A

suffer from memory and bus contention

10
Q

Process activation or involvement of parallel search algorithms

A
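
It depends on the data partitioning: with range or hash partitioning on the search attribute, an exact-match search activates only the processor(s) whose partition can contain the value; with random/round-robin partitioning, or a search on a non-partitioning attribute, all processors must be involved.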
11
Q

Key comparison of parallel search algorithms

A
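
Within each activated processor: on unordered data every record's key must be compared (a linear scan, which can stop early on an exact match of a unique attribute); on locally sorted data a binary search can be used instead.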
12
Q

Divide and Broadcast […] join: Transfer cost

A

(S_i/P) x (n - 1) x (m_p + m_l)
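
(Notation, as commonly used in these cost models: S = total size of table S in bytes, S_i = size of its fragment at processor i, P = page size, n = number of processors, m_p = message protocol cost per page, m_l = message latency, IO = disk I/O cost per page, |R_i| and |S_i| = number of records in a fragment, t_r / t_w / t_h / t_j / t_d = time to read / write / hash / join-compare / compute the destination of one record, σ_j = join selectivity.)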

13
Q

Divide and Broadcast […] join: Receiving cost

A

(S/P - S_i/P) x m_p

14
Q

Divide and Broadcast […] join: Scan cost

A

(S_i/P) x IO

15
Q

Divide and Broadcast […] join: Select cost

A

|S_i| x (t_r + t_w)

16
Q

Divide and Broadcast […] join: Disk storing cost

A

(S/P - S_i/P) x IO

17
Q

Local join: Scan cost
(after divide & broadcast)

A

((R_i/P) + (S/P)) x IO

18
Q

Local Join: Select cost
(after divide & broadcast)

A

(|R_i| + |S|) x (t_r + t_w)

19
Q

Local Join: Join cost
(after divide & broadcast)

A

(|R_i| x (t_r + t_h)) + (|S| x (t_r + t_h + t_j))

20
Q

Local join: Generating result cost
(after divide & broadcast)

A

|R_i| x σ_j x |S| x t_w

21
Q

Redistribution Step: select cost

A

(|R_i| + |S_i|) x (t_r + t_w)

22
Q

Redistribution step: finding destination cost

A

(|R_i| + |S_i|) x t_d

23
Q

Parallel Merge-All Sort

A
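
1. each processor sorts its local fragment
2. one designated host processor merges all sorted fragments into the final result (the final merge is a bottleneck)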
24
Q

Parallel Binary-Merge Sort

A
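
1. each processor sorts its local fragment
2. sorted fragments are merged pairwise (two at a time) up a binary merge tree until one processor produces the final result (merging work is spread out, but the last merge is still serial)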
25
Q

Parallel Redistribution Binary-Merge Sort

A
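
1. local sort in each processor
2. redistribute the sorted fragments by range partitioning
3. pairwise (binary) merging then proceeds in parallel within each range; the concatenated ranges form the final sorted result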
26
Q

Parallel Redistribution Merge-All Sort

A
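
1. local sort in each processor
2. redistribute the sorted fragments by range partitioning
3. each processor performs one multi-way merge of the runs it receives; the concatenated ranges form the final result (a single merge level, done in parallel, but sensitive to skew)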
27
Q

Parallel Partitioned Sort

A
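
1. range-partition the raw data first
2. each processor sorts its range locally; no merging phase is needed (the concatenated ranges are already in order), but the method is sensitive to data skew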
28
Q

Parallel Group-By

A
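
1. each processor aggregates its local fragment (local aggregation)
2. redistribute the partial results on the group-by attribute
3. each processor merges the partial results it receives into the global aggregate values (global aggregation)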
29
Q

DT: data parallelism

A

Data is partitioned vertically

Locally

  1. each processor calculates entropy for its features

Globally

  1. each processor shares entropy and target class count
  2. determine the best splitting attribute
  3. share which records to include in the subsequent partitions

iterative process: repeat the steps above
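
A minimal single-process sketch of the local and global phases above: each "processor" owns a vertical slice of the features, scores only its own features, and exchanges just the scores. The toy dataset and all names are illustrative assumptions:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    """Entropy drop achieved by splitting on one feature column."""
    gain = entropy(labels)
    for value in set(column):
        subset = [l for v, l in zip(column, labels) if v == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

features = {'outlook': ['sunny', 'sunny', 'rain', 'rain', 'cloud', 'cloud'],
            'windy':   [True, False, True, False, False, True]}
labels = ['no', 'no', 'no', 'yes', 'yes', 'yes']
owned = [['outlook'], ['windy']]     # vertical partition: features per processor

# Local phase: each processor computes gains for its own features only.
local = [{f: info_gain(features[f], labels) for f in fs} for fs in owned]

# Global phase: processors exchange the scores (not the data) and agree
# on the best splitting attribute; the process then repeats per branch.
scores = {f: g for part in local for f, g in part.items()}
print(max(scores, key=scores.get), scores)
```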

30
Q

DT: result parallelism

A
  1. data is partitioned horizontally
  2. each processor exchanges class counts with the other processors
  3. determine the best splitting attribute for the root node
  4. redistribute the records to the processor assigned to each attribute/branch
  5. each processor shares its entropy and information-gain values
  6. determine the second splitting attribute
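
In other words, the result (the tree) is partitioned: once the root split is decided, the records for each branch are shipped to the processor assigned to that branch, and every processor then grows its own subtree, repeating steps 5-6 level by level.
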
31
Q

Formula: Entropy

A
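
The standard (ID3) definition:

Entropy(S) = - Σ_i p_i log2(p_i)

where p_i is the proportion of records in S belonging to class i. Entropy is 0 for a pure set and maximal (1 bit for two classes) for an even split.
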
32
Q

ID3 algorithm

A
  1. compute the entropy of the whole dataset
  2. for every attribute/feature:
    1. calculate the entropy for all categorical values
    2. take the average (weighted) information entropy of the current attribute
    3. calculate the gain for the current attribute
  3. pick the attribute with the highest gain
  4. repeat until the tree is complete
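
A compact, runnable sketch of these steps on a toy dataset; the rows-as-dicts layout, 'label' key, and helper names are assumptions:

```python
import math
from collections import Counter

def entropy(rows):
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r['label'] for r in rows).values())

def id3(rows, features):
    if len({r['label'] for r in rows}) == 1 or not features:
        return Counter(r['label'] for r in rows).most_common(1)[0][0]
    def gain(f):                      # steps 2.1-2.3: per-value entropy -> gain
        return entropy(rows) - sum(
            cnt / len(rows) * entropy([r for r in rows if r[f] == v])
            for v, cnt in Counter(r[f] for r in rows).items())
    best = max(features, key=gain)    # step 3: highest-gain attribute
    rest = [f for f in features if f != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest)
                   for v in {r[best] for r in rows}}}    # step 4: recurse

rows = [{'outlook': 'sunny', 'windy': 'no',  'label': 'no'},
        {'outlook': 'sunny', 'windy': 'yes', 'label': 'no'},
        {'outlook': 'rain',  'windy': 'no',  'label': 'yes'},
        {'outlook': 'rain',  'windy': 'yes', 'label': 'no'},
        {'outlook': 'cloud', 'windy': 'no',  'label': 'yes'}]
print(id3(rows, ['outlook', 'windy']))
```
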
33
Q

k-means: data parallelism

A

Initialization:

  1. Divide the dataset among processors
  2. Replicate the initial centroids to each processor

In each processor:

  1. Compute the distance of each local data point to the centroids
  2. Construct local clusters
  3. Maintain a sum and a count of each local cluster
  4. At each iteration, the master process computes the new means and sends them to all processors
  5. Repeat steps 1-4 until convergence
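
A minimal single-process simulation of the scheme above: points are divided across "processors", centroids are replicated, each partition reports per-cluster (sum, count) pairs, and a master step recomputes the means. The 1-D points and fixed iteration count are illustrative assumptions:

```python
K, N_PROC, ITERS = 2, 3, 10
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.9, 1.1, 8.1, 0.9]
parts = [points[i::N_PROC] for i in range(N_PROC)]   # step 1: divide the data
centroids = [0.0, 5.0]                               # step 2: replicated centroids

for _ in range(ITERS):
    sums, counts = [0.0] * K, [0] * K
    for part in parts:                               # "in each processor":
        for x in part:
            c = min(range(K), key=lambda k: abs(x - centroids[k]))
            sums[c] += x                             # local sum per cluster
            counts[c] += 1                           # local count per cluster
    # master step: new means from the aggregated (sum, count) pairs
    centroids = [sums[k] / counts[k] if counts[k] else centroids[k]
                 for k in range(K)]
print(centroids)    # converges to roughly [1.0, 8.05]
```
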
34
Q

k-means: result parallelism

A

Initialization

  1. Divide dataset D among P processors, and
    sort the data within each processor
  2. Divide the initial centroids among processors
  3. Allocate data points to the nearest cluster centroid

In each processor:

  1. For each cluster, calculate the distance between each local data point and the cluster centroid
  2. For the extreme low and high data points in each cluster:
    1. if they are closer to the centroid of another cluster in the same processor, move them into that cluster
    2. if they are closer to the centroid of a cluster on a different processor, move them to that processor
  3. Repeat steps 1 & 2 until convergence
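
In other words, the result (the set of clusters) is partitioned: with locally sorted data each processor owns the centroids for its value range, so only the extreme low/high points of a cluster can ever need to migrate to a neighbouring cluster or processor.
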
35
Q

Formula: Similarity

A
36
Q

Bounded data stream

A

A bounded stream has a defined start and end. […] the entire data set can be ingested before any computation starts

37
Q

Dense Vector -> Sparse Vector

A
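
A sparse vector keeps only the non-zero entries, typically as (size, indices, values). E.g., in Spark MLlib notation, the dense vector [0.0, 3.0, 0.0, 5.0] becomes SparseVector(4, [1, 3], [3.0, 5.0]); this saves memory when most entries are zero.
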
38
Q

What is the purpose of watermarking?

A

to handle events that arrive late at the application
(because event time ≠ arrival/processing time)
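
E.g., in Flink-style event-time processing, a watermark of "max event time seen - 5 s" asserts that no events with older timestamps are still expected; a window can close once the watermark passes its end, and anything arriving after that is treated as late data.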