Exam Flashcards

1
Q

When does big data become too big? 3 things

A

1) When analyses become too slow or too unreliable
2) When systems become unresponsive
3) When day-to-day business operations are impacted

2
Q

What are the 3 biggest changes in Big Data?

A

1) In the past, storage was expensive
2) Only crucial data was preserved
3) Companies only consulted historical data

3
Q

4 examples of data analysis that are computationally intensive and expensive

A

1) Online recommender systems
2) Frequent pattern mining
3) Multi-label classification
4) Subspace clustering

4
Q

3 aspects of big data

A

1) Volume: quantity
2) Variety: different types of data
3) Velocity: speed at which new data is coming in

5
Q

2 solutions for big data

A

1) Invest in hardware
2) Use intelligent algorithms

6
Q

Goal of parallel computing

A

Leveraging the full potential of a multicore or multi-computer system; the goal is to reduce computation time

7
Q

Embarrassingly parallel

A

When an algorithm can be split into smaller parts that run simultaneously, with little or no coordination between them

8
Q

Linear speedup

A

Executing two tasks in parallel on two cores should halve the running time; in general, running on n cores should divide the running time by n

9
Q

Task parallelism

A

Multiple tasks are applied to the same data in parallel

10
Q

Data parallelism

A

A calculation is performed in parallel on many different data chunks

11
Q

Data dependencies

A

These prevent you from achieving a linear speedup: the input of one segment of code depends on the output of another piece of code

12
Q

Distributed systems from a computational perspective

A

To solve a common problem, multiple systems work together using messages over the network to coordinate their actions

13
Q

Distributed systems from a service perspective

A

Multiple systems are available over the internet to deliver services to a vast number of users. This is used to spread the load, but also to guarantee a certain level of availability

14
Q

5 characteristics of distributed hardware

A

1) Heterogeneous hardware
2) Multiple systems work together using messages over the network to coordinate their actions
3) For the user it looks like one single system (transparent)
4) Tolerant to failures of one system
5) Scalable: it should be easy to add new nodes to increase computational capacity

15
Q

Difference between parallel and distributed computing

A

Parallel computing typically requires one computer with multiple processors. Distributed computing involves several autonomous computer systems working on divided tasks

16
Q

Definition of service

A

A single-task online operation with a request/response model. Availability is important. A load balancer distributes requests to different service instances, each hosted on a different node

17
Q

Batch processing

A

An offline operation that takes a large amount of input data, runs a job to process it and produces some output data. Throughput is important: the time needed to process a dataset of a given size

18
Q

Stream processing

A

Offers near real-time processing, somewhere between batch processing and online services. It is not request-driven; it consumes inputs and produces outputs. The idea is to process events shortly after they happen. Low delay is important

19
Q

What does a cluster architecture look like?

A

Commodity nodes, each with its own CPU, memory and disk, connected by a network

20
Q

3 challenges with large-scale computing for data mining

A

1) How to distribute computation?
2) How to make it easy to write distributed programs?
3) Machines fail

21
Q

Distributed file system

A

Provides a global file namespace. Typical usage: huge files (hundreds of GBs to TBs), data is rarely updated in place, and reads and appends are common

22
Q

4 characteristics of DFS (distributed file systems)

A

1) Chunk servers: file is split into contiguous chunks, each chunk is replicated for reliability, replicas are in different racks
2) Master node (name node): master node assigns map and reduce tasks to worker nodes, stores metadata about where files are stored, might be replicated
3) Client library for file access: talks to the master to find chunk servers, then connects directly to the chunk servers to access data
4) Reliable distributed file system

23
Q

Map-reduce: what does the programmer and the environment do?

A

Programmer: the Map and Reduce functions and the input files
Environment: partitioning, scheduling, handling node failures, etc.

24
Q

Workflow Map-Reduce

A

1) Read inputs as a set of key-value pairs
2) Map transforms the input (k, v) pairs into a new set of (k', v') pairs
3) The (k', v') pairs are sorted and shuffled to the output nodes
4) All (k', v') pairs with a given k' are sent to the same Reduce task
5) Reduce processes all (k', v') pairs grouped by key into a new set of (k'', v'') pairs
6) Write the resulting pairs to files (see the sketch below)
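
A minimal word-count sketch of this workflow in plain Python (the function and variable names are illustrative, not a real framework API; a real Map-Reduce environment would handle partitioning, shuffling and fault tolerance itself):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Map: emit a (word, 1) pair for every word on the input line
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts that were grouped under the same key
    yield (word, sum(counts))

def run_job(lines):
    # Simulate the shuffle phase: sort the intermediate pairs and group them by key
    intermediate = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
    intermediate.sort(key=itemgetter(0))
    for key, group in groupby(intermediate, key=itemgetter(0)):
        yield from reduce_fn(key, (v for _, v in group))

print(dict(run_job(["to be or not to be", "to see or not to see"])))
# {'be': 2, 'not': 2, 'or': 2, 'see': 2, 'to': 4}
```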

25
Q

3 types of Map Reduce nodes

A

1) Master node: coordination and dealing with node failures
2) Worker nodes: chunk servers, map tasks, reduce tasks
3) Dynamic load balancing: ideally many more Map tasks than machines; near the end of a phase, spawn backup copies of in-progress tasks

26
Q

Data flow

A

Input and final output are stored in a DFS. The scheduler tries to schedule map tasks ‘close’ to physical storage location of input data. Intermediate results are stored on local FS of Map and Reduce workers. Output is often input to another Map-Reduce task

27
Q

4 rules of thumb for Map and Reduce tasks

A

1) Make M much larger than the number of nodes in the cluster
2) One DFS chunk per map is common
3) Improves dynamic load balancing and speeds up recovery from worker failures
4) Usually R is smaller than M, because output is spread across R files

28
Q

Backup tasks

A

Slow workers significantly lengthen job completion time, so near the end of a phase, spawn backup copies of the remaining in-progress tasks. This can dramatically shorten job completion time

29
Q

Combiners

A

Often a Map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k, for example for popular words in the word-count example. Combining can save network time by pre-aggregating values in the mapper. This only works if the reduce function is commutative and associative
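
A hedged sketch of the idea in the word-count setting (plain Python, illustrative names): the mapper pre-aggregates its own output, so it emits one (word, count) pair per distinct word instead of one (word, 1) pair per occurrence, which is safe because addition is commutative and associative.

```python
from collections import Counter

def map_with_combiner(line):
    # Local pre-aggregation: one (word, count) pair per distinct word in this chunk
    return list(Counter(line.split()).items())

print(map_with_combiner("to be or not to be"))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```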

30
Q

What are the 3 cost measures?

A

1) Communication costs
2) Elapsed communication costs
3) (Elapsed) computation costs

31
Q

Relational Data Model

A

It uses the mathematical concept of a relation as the formalism for describing and representing data. Can be thought of as a table

32
Q

How are rows and columns named in a relational data model?

A

Rows are called tuples, columns are called attributes, defined by attribute names

33
Q

Cartesian product

A

Given two sets, the set of all possible combinations of their values, i.e. all pairs with one element taken from each set

34
Q

Relation schema

A

An expression R(A1, …, Ak) consisting of a relation name R and a sorted list (A1, …, Ak) of attributes. The relation is a set, so the order of the tuples does not matter, but the attributes must come in the same order in each tuple

35
Q

Database schema

A

Set of relation schemas

36
Q

Relational algebra

A

Based on a few basic operators that take one or two relations as input and map them to a new relation. These can then be combined into more complex mappings from database instances to relations. It forms the core of the query language SQL

37
Q

Selection

A

An operator defined by a condition C that is applied to a single relation R, keeping only the tuples that satisfy C. SQL: SELECT * FROM Drinkers WHERE Drink = 'Beer'

38
Q

Projection

A

An operator that keeps only the listed columns of a relation and removes the rest. SQL: SELECT Person FROM Drinkers

39
Q

Union

A

A set-theoretic operator applied to two relations R and S that share the same schema. Produces the set of all tuples present in at least one of the two relations

40
Q

Intersection

A

A set-theoretic operator applied to two relations R and S that share the same schema. Produces the set of all tuples present in both R and S

41
Q

Difference

A

A set-theoretic operator applied to two relations R and S that share the same schema. Produces the set of all tuples present in R, but not in S

42
Q

Natural Join

A

An operator applied to two relations R and S. The set of attributes of the resulting relation is the union of the sets of attributes of R and S. So Drinkers(Person, Drink) and Bars(Bar, Drink) are combined into a relation of all possible combinations of persons and bars such that the person drinks a drink that is served in that bar

43
Q

Grouping and aggregation

A

Produces some kind of aggregate of a particular attribute for each group of values of another attribute

44
Q

Importance of relational algebra

A

It's at the core of data analysis, and its operators/concepts are perfectly applicable to more flexibly stored data. Relational algebra is also implementable in Map-Reduce; it is naturally parallelizable, which yields massive runtime gains

45
Q

Matrix-vector multiplication using map reduce

A

1) Assume the matrix does not fit in memory
2) The vector fits in memory and every worker has a copy
3) The elements of the matrix are stored in a CSV file, one (i, j, value) record per element

Map > shuffle > reduce
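
A minimal sketch under the assumptions above (the vector v is held in memory by every worker; each matrix element arrives as an (i, j, value) record; the names and the in-process "shuffle" are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(entry, v):
    # entry is one matrix record (i, j, a_ij); v is the in-memory vector
    i, j, a_ij = entry
    yield (i, a_ij * v[j])

def reduce_fn(i, partial_products):
    # x_i = sum over j of a_ij * v_j
    yield (i, sum(partial_products))

# Toy run: A = [[1, 2], [3, 4]] stored as (row, col, value) records, v = [5, 6]
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
v = [5, 6]
pairs = sorted((kv for e in A for kv in map_fn(e, v)), key=itemgetter(0))
result = [out for i, grp in groupby(pairs, key=itemgetter(0))
          for out in reduce_fn(i, [p for _, p in grp])]
print(result)  # [(0, 17), (1, 39)]
```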

46
Q

Matrix-matrix multiplication example

A

Two matrices A = (a_ij) with dimensions p x q and B = (b_jk) with dimensions q x r.
C = A x B = (c_ik) with dimensions p x r, where c_ik is the sum over j of a_ij * b_jk

47
Q

Conditions for sorting numbers using map-reduce

A

The range and distribution of numbers have to be known.

48
Q

2 shortcomings of map-reduce

A

1) Inflexible: forces data processing into map and reduce steps
2) Based on acyclic data flow from Disk to Disk (HDFS), so you need to write to Disk before/after map and reduce. Not efficient for iterative tasks

49
Q

In what ways does spark extend the map-reduce framework?

A

1) It generalizes to more complex Directed Acyclic Graphs (DAGs) that are optimized by Spark as it has a global view of the process
2) It abstracts data into an object called a Resilient Distributed Dataset (RDD), which can be cached
3) For ML algorithms that do multiple iterations over the dataset, this is much faster (more work in memory and fewer disk accesses)
4) Spark is more flexible in workflow (not just map and reduce phases) and data format (not only key-value pairs)

50
Q

Resilient Distributed Dataset RDD plus 5 characteristics

A

Data containers that Spark transformations operate on
- You can mix different kinds of transformations to create new RDDs
- Created by parallelizing a collection, reading a file, or transforming other RDDs
- Fault tolerant
- Flexible
- Kept in memory instead of on disk

51
Q

Spark operations

A

Many more than just map and reduce
1) Transformations
2) Actions

52
Q

Spark transformations

A

Map, filter, sample, groupByKey, reduceByKey, sortByKey, intersection, flatMap, union, join, etc.

53
Q

Spark actions

A

Reduce, count, collect, first, foreach, etc.
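
A small PySpark word-count sketch tying the two together: flatMap, map and reduceByKey are lazy transformations that only build the DAG, and nothing executes until the collect action runs (assumes a local Spark installation and a hypothetical input file named input.txt):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

counts = (sc.textFile("input.txt")               # RDD of lines
            .flatMap(lambda line: line.split())  # transformation: RDD of words
            .map(lambda word: (word, 1))         # transformation: (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))    # transformation: counts per word

print(counts.collect())  # action: triggers the actual computation
```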

54
Q

Spark vs. map-reduce - 4 things

A

1) Spark can process data in-memory
2) Ease of use: Spark is easier to program
3) Data processing: Spark is more general
4) Maturity: Spark maturing, Hadoop map-reduce mature

55
Q

Spark vs. map-reduce performance

A

1) Spark can process data in-memory
2) Spark generally outperforms map-reduce, but needs a lot of memory to do so
3) Map-reduce easily runs alongside other services with minor performance differences and works well with the 1-pass jobs it was designed for

56
Q

4 limitations of Spark

A

1) For many simple use cases, map-reduce might be a more appropriate choice
2) Spark is not designed for a multi-user environment
3) Spark users are required to know whether the memory they have is sufficient for a dataset
4) Adding more users adds complications, since the users will have to coordinate memory usage to run code

57
Q

Hadoop (HDFS, map-reduce)

A

Provides an easy solution for processing Big Data and brings a paradigm shift in programming distributed systems

58
Q

Spark and Big Data

A

Spark has extended map-reduce for in-memory computations. For streaming, interactive, iterative and machine learning tasks

59
Q

Non-stationary data

A

The distribution of the data changes over time

60
Q

6 applications of data streams

A

1) Mining query streams
2) Mining click streams
3) Mining social network news feeds
4) Sensor networks
5) Telephone call records
6) IP packets monitored at a switch

61
Q

Reservoir sampling

A

Maintain a uniform sample of fixed size s over a stream of unknown length: keep the first s elements; each later element n is kept with probability s/n and, if kept, replaces a uniformly chosen element of the current sample
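
A minimal sketch of this fixed-size variant (plain Python, illustrative names):

```python
import random

def reservoir_sample(stream, s):
    # Keep the first s elements; afterwards keep element n with probability s/n,
    # evicting a uniformly chosen current member whenever a new element is kept.
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)
        elif random.random() < s / n:
            sample[random.randrange(s)] = x
    return sample

print(reservoir_sample(range(10_000), 5))  # a uniform sample of 5 of the 10,000 elements
```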

62
Q

Sliding windows

A

A useful model of stream processing in which queries are asked about a window of length N: the N most recent elements received

63
Q

DGIM method is for when the stream is non-uniform

A

The method is never off by more than 50%. The idea is to summarize blocks with a specific number of 1s: the block sizes increase exponentially, and the size of a block is the number of 1s in it, not its length in time

64
Q

A bucket in the DGIM method consists of:

A

1) Timestamp of its end
2) Number of 1s between its beginning and end

65
Q

Representing a stream by buckets - 4 things

A

1) Either 1 or 2 buckets with the same power of 2 number of 1s
2) Buckets don’t overlap in timestamps
3) Buckets are sorted by size
4) Buckets disappear when their end-time is > N time units in the past

66
Q

To estimate the number of 1s in the most recent N bits

A

1) Sum the sizes of all buckets but the last, i.e. the oldest bucket (note 'size' means the number of 1s in the bucket)
2) Add half the size of the last (oldest) bucket, since we don't know how much of it still falls inside the window (worked example below)
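
A worked example with hypothetical bucket sizes: if the buckets overlapping the window have sizes 16, 8, 4, 4, 2, 1 (oldest to newest, with only part of the size-16 bucket still inside the window), the estimate is 1 + 2 + 4 + 4 + 8 + 16/2 = 27; only the contribution of that oldest bucket is uncertain, which is what bounds the error.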

67
Q

Filtering data streams

A

Select elements of the stream that satisfy some property. Given a list of keys S, determine which tuples of the stream are in S

68
Q

Bloom filter

A

Guarantees no false negatives and uses limited memory. Great for pre-processing before more expensive checks. Suitable for hardware implementation
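
A toy sketch, assuming two hash functions over a bit array of hypothetical size m = 1000 (a real deployment would pick m and the number of hash functions from the expected number of keys and the target false-positive rate):

```python
import hashlib

M = 1000        # hypothetical bit-array size
bits = [0] * M  # the Bloom filter itself

def _positions(key):
    # Two hash positions derived from standard digests (illustrative choice)
    yield int(hashlib.md5(key.encode()).hexdigest(), 16) % M
    yield int(hashlib.sha1(key.encode()).hexdigest(), 16) % M

def add(key):
    for h in _positions(key):
        bits[h] = 1

def might_contain(key):
    # False => definitely not in S; True => probably in S (false positives possible)
    return all(bits[h] for h in _positions(key))

add("alice@example.com")
print(might_contain("alice@example.com"))    # True
print(might_contain("mallory@example.com"))  # almost certainly False
```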

69
Q

Flajolet-Martin approach

A

Gives an estimate that is ‘almost always’ close to the correct number, and picks a hash function h that maps each of the N elements to at least log2 N bits
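
A rough single-hash sketch of the idea (in practice many hash functions are used and their estimates are combined; the md5-based hash here is only an illustrative stand-in):

```python
import hashlib

def tail_length(n):
    # Number of trailing zero bits of n (0 is mapped to 0 here)
    return (n & -n).bit_length() - 1 if n else 0

def fm_estimate(stream):
    # Estimate the number of distinct elements as 2^R,
    # where R is the maximum tail length of any hashed element
    R = 0
    for x in stream:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        R = max(R, tail_length(h))
    return 2 ** R

print(fm_estimate(["a", "b", "c", "a", "b"]))  # a rough estimate of the 3 distinct values
```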

70
Q

0th moment

A

Number of distinct elements

71
Q

1st moment

A

Count of the number of elements = length of the stream

72
Q

2nd moment = surprise number S

A

Measure of how uneven the distribution is
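
Concretely, if element a occurs m_a times in the stream, S is the sum of m_a² over all distinct elements. For a hypothetical stream of length 5 with counts 3 and 2, S = 3² + 2² = 13, whereas a perfectly even stream of five distinct elements gives S = 1 + 1 + 1 + 1 + 1 = 5.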

73
Q

AMS method

A

Works for all moments, and gives an unbiased estimate

74
Q

Supermarket shelf management

A

Market-basket model. Goal is to identify items that are bought together by sufficiently many customers

75
Q

Support for itemset I

A

Number of baskets containing all items in I. Often expressed as a fraction of the total number of baskets

76
Q

Association rules

A

If/then rules about the contents of baskets. There are many rules in practice, we focus on the significant/interesting ones

77
Q

Confidence

A

Probability of j given I

78
Q

Interest of an association rule

A

Difference between its confidence and the fraction of baskets that contain j

79
Q

Apriori

A

Generates larger candidate itemsets from smaller frequent itemsets - can use triangular matrix approach for pairs

80
Q

PCY

A

Uses idle memory in the first pass to eliminate more candidate pairs - needs table of triples for pairs

81
Q

Random sampling

A

Find all frequent itemsets in memory on a sample of data - misses out on (potentially many) frequent itemsets. Second pass can ensure that there are no false positives

82
Q

SON

A

Find all frequent itemsets in memory on each chunk of data. Second pass to count actual frequencies of all such itemsets on full data - ideal for distributed processing

83
Q

Naïve algorithm

A

Read file once, counting in main memory the occurrences of each pair. Fails if (#items)² exceeds main memory

84
Q

Triangular matrix

A

Count all pairs using a matrix. Only requires 4 bytes per pair.
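
A sketch of the usual flat layout behind that claim (items numbered 1..n and pairs {i, j} with i < j; this follows the standard triangular-matrix indexing, so each of the n(n-1)/2 pairs gets exactly one slot):

```python
def pair_index(i, j, n):
    # Map pair {i, j} with 1 <= i < j <= n to a position in a flat array
    # of n*(n-1)/2 counts (1-indexed result).
    return int((i - 1) * (n - i / 2) + j - i)

n = 4
print([pair_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)])
# [1, 2, 3, 4, 5, 6] -- every pair gets a distinct slot
```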

85
Q

Triples

A

Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c". Only uses 12 bytes per pair (but only for pairs with count > 0). This beats the triangular matrix approach if fewer than 1/3 of the possible pairs actually occur

86
Q

Apriori algorithm - more in depth

A

A two-pass approach called apriori limits the need for main memory. The key idea is monotonicity: if a set of items I appears in at least s baskets, so does every subset J of I. Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets

87
Q

2 passes of apriori

A

1) Read baskets and count in main memory the occurrences of each individual item
2) Read baskets again and count in main memory only those pairs where both elements are frequent (see the sketch below)
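
A compact in-memory sketch of those two passes for frequent pairs (plain Python; s is the support threshold and the names are illustrative):

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # Pass 1: count individual items and keep the frequent ones
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose members are both frequent (monotonicity)
    pair_counts = Counter()
    for basket in baskets:
        candidates = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(candidates, 2))
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [{"beer", "chips"}, {"beer", "chips", "milk"}, {"beer", "milk"}]
print(apriori_pairs(baskets, 2))  # {('beer', 'chips'): 2, ('beer', 'milk'): 2}
```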

88
Q

PCY algorithm - more in depth

A

In pass 1 of apriori, most memory is idle. Pass 1 of PCY: in addition to item counts, maintain a hash table with as many buckets as fit in memory. Keep a count for each bucket into which pairs of items are hashed. For each bucket just keep the count, not the actual pairs that hash to the bucket
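
A sketch of that extra pass-1 bookkeeping (the bucket count is hypothetical; the bitmap of frequent buckets is what pass 2 later uses, on top of the apriori condition, to decide which pairs to count):

```python
from collections import Counter
from itertools import combinations

NUM_BUCKETS = 1_000_003  # hypothetical: as many buckets as the spare memory allows

def pcy_pass1(baskets, s):
    item_counts = Counter()
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(set(basket)), 2):
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1  # count buckets, not pairs
    # Bitmap: a bucket is "frequent" iff its count reaches the support threshold
    bitmap = [count >= s for count in bucket_counts]
    return item_counts, bitmap
```

In pass 2, a pair {i, j} is then counted only if both items are frequent and bitmap[hash((i, j)) % NUM_BUCKETS] is true.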

89
Q

Random sampling - more in depth

A

Take a random sample of the market baskets. Run apriori or one of its improvements in main memory, so we don't pay for disk I/O each time we increase the size of the itemsets. Optionally, verify that the candidate itemsets are truly frequent in the entire dataset with a second pass (avoids false positives). But you don't catch itemsets that are frequent in the whole dataset but not in the sample

90
Q

SON - more in depth

A

An algorithm suited for distributed computation. Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets. An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets. The key monotonicity idea: an itemset cannot be frequent in the entire dataset unless it is frequent in at least one of the subsets.