Exam Flashcards
When does big data become too big? 3 things
1) When analysis becomes too slow or too unreliable
2) When systems become unresponsive
3) When day-to-day business operations are impacted
What are the 3 biggest changes in Big Data?
1) In the past storage was expensive
2) Only crucial data was preserved
3) Companies only consulted historical data
4 examples of data analysis that are computationally intensive and expensive
1) Online recommender systems
2) Frequent pattern mining
3) Multi-label classification
4) Subspace clustering
3 aspects of big data
1) Volume: quantity
2) Variety: different types of data
3) Velocity: speed at which new data is coming in
2 solutions for big data
1) Invest in hardware
2) Use intelligent algorithms
Goal of parallel computing
Leveraging the full potential of your multicore, multi-computer system. The goal is to reduce computation time
Embarrassingly parallel
When an algorithm can be split into smaller parts that run simultaneously and independently, with no dependencies between them
Linear speedup
Executing two tasks in parallel on two cores should halve the running time; in general, linear speedup means a program runs n times faster on n cores (speedup = T(1)/T(n) = n)
Task parallelism
Multiple tasks are applied to the same data in parallel
Data parallelism
The same calculation is performed in parallel on many different data chunks
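A minimal Python sketch contrasting the two kinds of parallelism (the word_count and char_count functions and the 4-way chunking are illustrative assumptions, not part of the course material):

```python
from concurrent.futures import ProcessPoolExecutor

def word_count(chunk):
    # Count the words in one chunk of lines.
    return sum(len(line.split()) for line in chunk)

def char_count(chunk):
    # A second, different computation on the same data.
    return sum(len(line) for line in chunk)

if __name__ == "__main__":
    data = ["to be or not to be"] * 1000
    chunks = [data[i::4] for i in range(4)]  # split the data into 4 chunks

    with ProcessPoolExecutor() as pool:
        # Task parallelism: different tasks run on the *same* data in parallel.
        tasks = [pool.submit(word_count, data), pool.submit(char_count, data)]

        # Data parallelism: the *same* task runs on many data chunks in parallel.
        per_chunk = list(pool.map(word_count, chunks))

    print([t.result() for t in tasks], sum(per_chunk))
```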
Data dependencies
These prevent you from achieving a linear speedup: the input of one segment of code depends on the output of another piece of code
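A tiny illustrative sketch (hypothetical, not from the card) of why such a dependency blocks parallel execution:

```python
values = [3, 1, 4, 1, 5, 9]

# No data dependency: each square is independent of the others,
# so these iterations could run in parallel on different cores.
squares = [v * v for v in values]

# Data dependency: each prefix sum needs the previous result,
# so the iterations must run one after another (no linear speedup).
prefix_sums = []
total = 0
for v in values:
    total += v  # the input of this step is the output of the previous step
    prefix_sums.append(total)
```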
Distributed systems from a computational perspective
To solve a common problem, multiple systems work together using messages over the network to coordinate their actions
Distributed systems from a service perspective
Multiple systems are available over the internet to deliver services to a vast number of users. This is used to spread the load, but also to guarantee a certain level of availability
5 characteristics of distributed hardware
1) Heterogeneous hardware
2) Multiple systems work together using messages over the network to coordinate their actions
3) For the user it looks like one single system (transparent)
4) Tolerant to failures of one system
5) Scalable: it should be easy to add new nodes to increase computational capacity
Difference between parallel and distributed computing
Parallel computing typically requires one computer with multiple processors. Distributed computing involves several autonomous computer systems working on divided tasks
Definition of service
A single-task online operation with a request/response model. Availability is important. A load balancer distributes requests to different service instances, each hosted on a different node
Batch processing
An offline operation that takes a large amount of input data, runs a job to process it and produces some output data. Throughput is important: the time needed to process a dataset of a certain size
Stream processing
Offers near real-time processing somewhere between batch processing and online services. It’s not request driven, but consumes inputs and produces output. The idea is to process events shortly after they happen. Low delay is important
What does a cluster architecture look like?
Racks of commodity nodes, each with its own CPU, memory and disk, connected by network switches
3 challenges with large-scale computing for data mining
1) How to distribute computation?
2) How to make it easy to write distributed programs?
3) Machines fail
Distributed file system
Provides a global file namespace. Typical usage: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common
4 characteristics of DFS (distributed file systems)
1) Chunk servers: file is split into contiguous chunks, each chunk is replicated for reliability, replicas are in different racks
2) Master node (name node): assigns map and reduce tasks to worker nodes, stores metadata about where files are stored, might be replicated
3) Client library for file access: talks to the master to find chunk servers, then connects directly to chunk servers to access data
4) Reliable distributed file system: because chunks are replicated across machines, the system can recover from disk or machine failures
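A toy sketch of the chunking and replication idea (the 64 MB chunk size, the replica count of 3 and the server names are made-up assumptions):

```python
CHUNK_SIZE = 64 * 2**20  # assumed chunk size of 64 MB
REPLICAS = 3             # each chunk stored on several chunk servers

chunk_servers = ["rack1-node1", "rack1-node2", "rack2-node1", "rack2-node2"]

def split_into_chunks(data: bytes) -> list[bytes]:
    # Chunk servers hold the actual data; the master only keeps metadata
    # describing which chunks exist and where their replicas live.
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def place_replicas(chunk_id: int) -> list[str]:
    # Round-robin placement; a real DFS would make sure the replicas
    # of a chunk end up in different racks.
    return [chunk_servers[(chunk_id + r) % len(chunk_servers)] for r in range(REPLICAS)]
```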
Map-reduce: what does the programmer and the environment do?
Programmer: provides the Map and Reduce functions and the input files
Environment: partitioning, scheduling, node failures, etc.
Workflow Map-Reduce
1) Read inputs as a set of key-value pairs
2) Map transforms input kv-pairs into a new set of k’v’-pairs
3) Sorts & shuffles the k’v’-pairs to output nodes
4) All k’v’-pairs with a given k’ are sent to the same reduce
5) Reduce processes all k’v’-pairs grouped by key into new k’’v’’-pairs
6) Write the resulting pairs to files
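A minimal word-count sketch of this workflow in plain Python (an illustration only; the grouping dictionary stands in for the sort & shuffle of a real MapReduce framework):

```python
from collections import defaultdict

def map_fn(_, line):
    # 2) Map: transform an input (key, value) pair into new (k', v') pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # 5) Reduce: process all (k', v') pairs grouped by key into (k'', v'') pairs.
    yield word, sum(counts)

documents = {1: "to be or not to be", 2: "to do is to be"}

# 3) + 4) Sort & shuffle: group all intermediate values by key.
groups = defaultdict(list)
for key, value in documents.items():
    for k2, v2 in map_fn(key, value):
        groups[k2].append(v2)

# 6) The resulting pairs would normally be written to files in the DFS.
result = dict(pair for k2, values in groups.items() for pair in reduce_fn(k2, values))
print(result)  # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```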
3 types of Map Reduce nodes
1) Master node: coordination and dealing with node failures
2) Worker nodes: chunk servers, map tasks, reduce tasks
3) Dynamic load balancing: ideally a lot more Map tasks than machines, spawn copies of in-progress tasks at the end of a phase
Data flow
Input and final output are stored in a DFS. The scheduler tries to schedule map tasks ‘close’ to physical storage location of input data. Intermediate results are stored on local FS of Map and Reduce workers. Output is often input to another Map-Reduce task
4 rules of thumb for Map and Reduce tasks
1) Make M much larger than the number of nodes in the cluster
2) One DFS chunk per map is common
3) Improves dynamic load balancing and speeds up recovery from worker failures
4) Usually R is smaller than M, because output is spread across R files
Backup tasks
Slow workers significantly lengthen the job completion time. So near end of phase, spawn backup copies of tasks. This would dramatically shorten job completion time
Combiners
Often a Map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k, for example for popular words in the word count example. Combining can save network time by pre-aggregating values in the mapper. This only works if the reduce function is commutative and associative
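A sketch of a word-count mapper with local combining (illustrative only; it works here because summing the counts is commutative and associative):

```python
from collections import Counter

def map_with_combiner(line):
    # Without a combiner the mapper emits ('the', 1) once per occurrence;
    # the combiner pre-aggregates locally, so only ('the', 3) crosses the network.
    return Counter(line.split())

print(dict(map_with_combiner("the cat saw the dog and the bird")))
# {'the': 3, 'cat': 1, 'saw': 1, 'dog': 1, 'and': 1, 'bird': 1}
```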
What are the 3 cost measures?
1) Communication costs
2) Elapsed communication costs
3) (Elapsed) computation costs
Relational Data Model
It uses the mathematical concept of a relation as the formalism for describing and representing data. A relation can be thought of as a table
How are rows and columns named in a relational data model?
Rows are called tuples, columns are called attributes, defined by attribute names
Cartesian product
The set of all possible combinations of values, one taken from each of the given sets (the attribute domains)
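For example, with Python's itertools (the two attribute domains are made-up):

```python
from itertools import product

names = {"alice", "bob"}
ages = {25, 30}

# Cartesian product: every possible combination of a name with an age.
print(sorted(product(names, ages)))
# [('alice', 25), ('alice', 30), ('bob', 25), ('bob', 30)]
```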
Relation schema
An expression R(A1, …, Ak) consisting of a relation name R and an ordered list (A1, …, Ak) of attributes. The relation is a set, so the order of the tuples does not matter, but the attributes must come in the same order in each tuple
Database schema
Set of relation schemas
Relational algebra
It is based on a few basic operators that take one or two relations as input and map them to a new relation. These can then be combined into more complex mappings from database instances to relations. It forms the core of the query language SQL
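A small Python illustration of two basic operators, selection and projection, on a relation stored as a set of tuples (the Students relation and its attributes are made-up examples):

```python
# Relation Students(name, age, city) represented as a set of tuples.
students = {("alice", 25, "leuven"), ("bob", 30, "ghent"), ("carol", 25, "ghent")}

# Selection: keep only the tuples that satisfy a condition -> a new relation.
age_25 = {t for t in students if t[1] == 25}

# Projection: keep only some attributes of each tuple -> a new relation.
cities = {(city,) for (name, age, city) in students}

# Operators compose into more complex mappings, much like nested SQL queries.
print({(name,) for (name, age, city) in age_25})  # {('alice',), ('carol',)}
```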