Exam Flashcards
When does big data become too big? 3 things
1) When analysis becomes too slow or too unreliable
2) When systems become unresponsive
3) When day-to-day business operations are impacted
What are the 3 biggest changes in Big Data?
1) In the past storage was expensive
2) Only crucial data was preserved
3) Companies only consulted historical data
4 examples of data analysis that are computationally intensive and expensive
1) Online recommender systems
2) Frequent pattern mining
3) Multi-label classification
4) Subspace clustering
3 aspects of big data
1) Volume: quantity
2) Variety: different types of data
3) Velocity: speed at which new data is coming in
2 solutions for big data
1) Invest in hardware
2) Use intelligent algorithms
Goal of parallel computing
Leveraging the full potential of your multicore, multi-computer system. The goal is to reduce computation time
Embarrassingly parallel
When an algorithm can be split into smaller parts that run simultaneously and independently, with no dependencies between them
Linear speedup
Executing two tasks in parallel on two cores should halve the running time; in general, linear speedup means a program runs n times faster on n cores (speedup = T(1)/T(n) = n)
Task parallelism
Multiple tasks are applied to the same data in parallel
Data parallelism
The same calculation is performed in parallel on many different data chunks
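A minimal Python sketch contrasting the two kinds of parallelism (the word_count and char_count functions and the 4-way chunking are illustrative assumptions, not part of the course material):

```python
from concurrent.futures import ProcessPoolExecutor

def word_count(chunk):
    # Count the words in one chunk of lines.
    return sum(len(line.split()) for line in chunk)

def char_count(chunk):
    # A second, different computation on the same data.
    return sum(len(line) for line in chunk)

if __name__ == "__main__":
    data = ["to be or not to be"] * 1000
    chunks = [data[i::4] for i in range(4)]  # split the data into 4 chunks

    with ProcessPoolExecutor() as pool:
        # Task parallelism: different tasks run on the *same* data in parallel.
        tasks = [pool.submit(word_count, data), pool.submit(char_count, data)]

        # Data parallelism: the *same* task runs on many data chunks in parallel.
        per_chunk = list(pool.map(word_count, chunks))

    print([t.result() for t in tasks], sum(per_chunk))
```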
Data dependencies
These prevent you from achieving a linear speedup: the input of one segment of code depends on the output of another piece of code
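A tiny illustrative sketch (hypothetical, not from the card) of why such a dependency blocks parallel execution:

```python
values = [3, 1, 4, 1, 5, 9]

# No data dependency: each square is independent of the others,
# so these iterations could run in parallel on different cores.
squares = [v * v for v in values]

# Data dependency: each prefix sum needs the previous result,
# so the iterations must run one after another (no linear speedup).
prefix_sums = []
total = 0
for v in values:
    total += v  # the input of this step is the output of the previous step
    prefix_sums.append(total)
```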
Distributed systems from a computational perspective
To solve a common problem, multiple systems work together using messages over the network to coordinate their actions
Distributed systems from a service perspective
Multiple systems are available over the internet to deliver services to a vast number of users. This is used to spread the load, but also to guarantee a certain level of availability
5 characteristics of distributed hardware
1) Heterogeneous hardware
2) Multiple systems work together using messages over the network to coordinate their actions
3) For the user it looks like one single system (transparent)
4) Tolerant to failures of one system
5) Scalable: it should be easy to add new nodes to increase computational capacity
Difference between parallel and distributed computing
Parallel computing typically requires one computer with multiple processors. Distributed computing involves several autonomous computer systems working on divided tasks
Definition of service
A single-task online operation with a request/response model. Availability is important. A load balancer distributes requests to different service instances, each hosted on a different node
Batch processing
An offline operation that takes a large amount of input data, runs a job to process it and produces some output data. Throughput is important: the time needed to process a dataset of a certain size
Stream processing
Offers near real-time processing somewhere between batch processing and online services. It’s not request driven, but consumes inputs and produces output. The idea is to process events shortly after they happen. Low delay is important
What does a cluster architecture look like?
Racks of commodity nodes, each with its own CPU, memory and disk, connected by network switches
3 challenges with large-scale computing for data mining
1) How to distribute computation?
2) How to make it easy to write distributed programs?
3) Machines fail
Distributed file system
Provides a global file namespace. Typical usage: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common
4 characteristics of DFS (distributed file systems)
1) Chunk servers: file is split into contiguous chunks, each chunk is replicated for reliability, replicas are in different racks
2) Master node (name node): assigns map and reduce tasks to worker nodes, stores metadata about where files are stored, might be replicated
3) Client library for file access: talks to the master to find chunk servers, then connects directly to chunk servers to access data
4) Reliable distributed file system: because chunks are replicated across machines, the system can recover from disk or machine failures
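A toy sketch of the chunking and replication idea (the 64 MB chunk size, the replica count of 3 and the server names are made-up assumptions):

```python
CHUNK_SIZE = 64 * 2**20  # assumed chunk size of 64 MB
REPLICAS = 3             # each chunk stored on several chunk servers

chunk_servers = ["rack1-node1", "rack1-node2", "rack2-node1", "rack2-node2"]

def split_into_chunks(data: bytes) -> list[bytes]:
    # Chunk servers hold the actual data; the master only keeps metadata
    # describing which chunks exist and where their replicas live.
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def place_replicas(chunk_id: int) -> list[str]:
    # Round-robin placement; a real DFS would make sure the replicas
    # of a chunk end up in different racks.
    return [chunk_servers[(chunk_id + r) % len(chunk_servers)] for r in range(REPLICAS)]
```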
Map-reduce: what does the programmer and the environment do?
Programmer: provides the Map and Reduce functions and the input files
Environment: partitioning, scheduling, node failures, etc.
Workflow Map-Reduce
1) Read inputs as a set of key-value pairs
2) Map transforms input kv-pairs into a new set of k’v’-pairs
3) Sorts & shuffles the k’v’-pairs to output nodes
4) All k’v’-pairs with a given k’ are sent to the same reduce
5) Reduce processes all k’v’-pairs grouped by key into new k’’v’’-pairs
6) Write the resulting pairs to files
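A minimal word-count sketch of this workflow in plain Python (an illustration only; the grouping dictionary stands in for the sort & shuffle of a real MapReduce framework):

```python
from collections import defaultdict

def map_fn(_, line):
    # 2) Map: transform an input (key, value) pair into new (k', v') pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # 5) Reduce: process all (k', v') pairs grouped by key into (k'', v'') pairs.
    yield word, sum(counts)

documents = {1: "to be or not to be", 2: "to do is to be"}

# 3) + 4) Sort & shuffle: group all intermediate values by key.
groups = defaultdict(list)
for key, value in documents.items():
    for k2, v2 in map_fn(key, value):
        groups[k2].append(v2)

# 6) The resulting pairs would normally be written to files in the DFS.
result = dict(pair for k2, values in groups.items() for pair in reduce_fn(k2, values))
print(result)  # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```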
3 types of Map Reduce nodes
1) Master node: coordination and dealing with node failures
2) Worker nodes: chunk servers, map tasks, reduce tasks
3) Dynamic load balancing: ideally a lot more Map tasks than machines, spawn copies of in-progress tasks at the end of a phase
Data flow
Input and final output are stored in a DFS. The scheduler tries to schedule map tasks ‘close’ to physical storage location of input data. Intermediate results are stored on local FS of Map and Reduce workers. Output is often input to another Map-Reduce task
4 rules of thumb for Map and Reduce tasks
1) Make M much larger than the number of nodes in the cluster
2) One DFS chunk per map is common
3) Improves dynamic load balancing and speeds up recovery from worker failures
4) Usually R is smaller than M, because output is spread across R files
Backup tasks
Slow workers significantly lengthen the job completion time. So near end of phase, spawn backup copies of tasks. This would dramatically shorten job completion time
Combiners
Often a Map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k, for example for popular words in the word count example. Combining can save network time by pre-aggregating values in the mapper. This only works if the reduce function is commutative and associative
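A sketch of a word-count mapper with local combining (illustrative only; it works here because summing the counts is commutative and associative):

```python
from collections import Counter

def map_with_combiner(line):
    # Without a combiner the mapper emits ('the', 1) once per occurrence;
    # the combiner pre-aggregates locally, so only ('the', 3) crosses the network.
    return Counter(line.split())

print(dict(map_with_combiner("the cat saw the dog and the bird")))
# {'the': 3, 'cat': 1, 'saw': 1, 'dog': 1, 'and': 1, 'bird': 1}
```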
What are the 3 cost measures?
1) Communication costs
2) Elapsed communication costs
3) (Elapsed) computation costs
Relational Data Model
It uses the mathematical concept of a relation as the formalism for describing and representing data. A relation can be thought of as a table
How are rows and columns named in a relational data model?
Rows are called tuples, columns are called attributes, defined by attribute names
Cartesian product
The set of all possible combinations of values, one taken from each of the given sets (the attribute domains)
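For example, with Python's itertools (the two attribute domains are made-up):

```python
from itertools import product

names = {"alice", "bob"}
ages = {25, 30}

# Cartesian product: every possible combination of a name with an age.
print(sorted(product(names, ages)))
# [('alice', 25), ('alice', 30), ('bob', 25), ('bob', 30)]
```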
Relation schema
An expression R(A1, …, Ak) consisting of a relation name R and an ordered list (A1, …, Ak) of attributes. The relation is a set, so the order of the tuples does not matter, but the attributes must come in the same order in each tuple
Database schema
Set of relation schemas
Relational algebra
It is based on a few basic operators that take one or two relations as input and map them to a new relation. These can then be combined into more complex mappings from database instances to relations. It forms the core of the query language SQL
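A small Python illustration of two basic operators, selection and projection, on a relation stored as a set of tuples (the Students relation and its attributes are made-up examples):

```python
# Relation Students(name, age, city) represented as a set of tuples.
students = {("alice", 25, "leuven"), ("bob", 30, "ghent"), ("carol", 25, "ghent")}

# Selection: keep only the tuples that satisfy a condition -> a new relation.
age_25 = {t for t in students if t[1] == 25}

# Projection: keep only some attributes of each tuple -> a new relation.
cities = {(city,) for (name, age, city) in students}

# Operators compose into more complex mappings, much like nested SQL queries.
print({(name,) for (name, age, city) in age_25})  # {('alice',), ('carol',)}
```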