2. Cloud Flashcards
What is cloud computing?
Cloud computing is the on-demand delivery of computing resources (compute, storage, applications) over the Internet
What is the fat tree design?
Network design for datacenter:
- Three tier design: Edge, Aggregation, Core
- Defined by single parameter k = number of ports on a switch
- All layers use the same switch
- Supports k³/4 hosts
- High redundancy: (k/2)² = k²/4 paths between two hosts in different pods
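The formulas above can be checked with a small calculation. This is only a sketch: the function name `fat_tree_size` and the per-layer switch counts (k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)² core switches) are not on the card and are filled in from the usual fat-tree construction.

```python
def fat_tree_size(k: int) -> dict:
    """Sizes of a fat tree built from identical k-port switches (k even)."""
    if k % 2 != 0:
        raise ValueError("k must be even")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,          # k pods, k/2 edge switches each
        "aggregation_switches": k * half,   # k pods, k/2 aggregation switches each
        "core_switches": half * half,
        "hosts": k ** 3 // 4,               # each edge switch serves k/2 hosts
        "paths_between_pods": half * half,  # one path per core switch
    }


if __name__ == "__main__":
    for k in (4, 48):
        print(k, fat_tree_size(k))
```

With k = 4 this gives 16 hosts and 4 paths between pods; with k = 48 it gives 27,648 hosts and 576 paths.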
What is the jellyfish network design?
Drop the fixed network structure and connect switches randomly:
- Each switch with 4L ports connects to
– L hosts
– 3L other random switches
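A rough sketch of the random wiring, assuming a simple greedy procedure: keep linking random pairs of switches that still have free ports and are not yet connected. The function name `jellyfish` and the greedy stop condition are assumptions for illustration; a few ports can stay unused when no valid pair is left, and this sketch does not try to fix that.

```python
import random


def jellyfish(num_switches: int, L: int, seed: int = 0):
    """Randomly wire switches that each reserve 3L ports for other switches."""
    rng = random.Random(seed)
    free = {s: 3 * L for s in range(num_switches)}  # free switch-facing ports
    links = set()                                   # pairs (a, b) with a < b
    while True:
        open_switches = [s for s in range(num_switches) if free[s] > 0]
        candidates = [
            (a, b)
            for i, a in enumerate(open_switches)
            for b in open_switches[i + 1:]
            if (a, b) not in links
        ]
        if not candidates:
            break  # no two unconnected switches with free ports remain
        a, b = rng.choice(candidates)
        links.add((a, b))
        free[a] -= 1
        free[b] -= 1
    return links, free


if __name__ == "__main__":
    links, free = jellyfish(num_switches=20, L=1)
    print(len(links), "links,", sum(free.values()), "ports left unused")
```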
What is the CAP theorem?
In a distributed system you can satisfy at most two out of the following three properties:
1. Consistency: all nodes see the same data at any time
2. Availability: the system allows operations all the time
3. Partition-tolerance: the system continues to work in spite of network partitions
How does Cassandra handle the CAP theorem?
It favors availability and partition-tolerance and offers only weak (eventual) consistency
What are the characteristics of Cassandra?
- Key-Value Pair Storage
- “NoSQL”
- Supports get(key) and put(key,value) operations
How is data stored in Cassandra?
- Key-value pair
- Nodes form a ring; each key is hashed to determine its position on the ring (distributed hash table, DHT)
- Similar to Chord
- Each key is replicated on n nodes
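A minimal sketch of the ring lookup, assuming SHA-256 as the hash and replicas placed on the nodes that follow the key's position clockwise. The class `Ring`, its method names, and the node names are hypothetical and not Cassandra's API.

```python
import bisect
import hashlib


class Ring:
    """Toy hash ring: nodes and keys are hashed onto the same number space."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.tokens = sorted((self._hash(n), n) for n in nodes)
        self.positions = [token for token, _ in self.tokens]

    @staticmethod
    def _hash(value: str) -> int:
        # Assumption: any stable hash works for the illustration.
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def replicas_for(self, key: str):
        """First node clockwise from the key, plus the following nodes."""
        start = bisect.bisect_right(self.positions, self._hash(key))
        count = min(self.replicas, len(self.tokens))
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(count)
        ]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
    print(ring.replicas_for("user:42"))
```

The modulo on the index makes the lookup wrap around, so a key hashed past the last node lands on the first one.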
What are the replica policies in Cassandra?
- Rack Unaware: replicate data on the n-1 successor nodes on the ring
- Rack Aware: coordinator tells nodes the range they are replicas for
- Datacenter Aware: same as rack aware, but on datacenter level
How does a write operation in Cassandra work?
- The partitioner (hash function) determines the node responsible for the key
- Append the write to the on-disk commit log
- Update the in-memory memtable
- When memtables are old or full, flush them to disk
– Data file, index file
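The write steps can be sketched on a single node as follows. Everything here is a simplified stand-in: the class `Node`, the in-memory list used as the "commit log", and the size-based flush trigger are assumptions for illustration, not Cassandra's implementation.

```python
class Node:
    """Toy write path: commit log first, then memtable, then flush."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stand-in for the append-only log on disk
        self.memtable = {}     # in-memory table of recent writes
        self.flushed = []      # (data file, index file) pairs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))  # 1. log for durability
        self.memtable[key] = value            # 2. update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. memtable is full

    def flush(self):
        data = dict(sorted(self.memtable.items()))          # data file
        index = {key: pos for pos, key in enumerate(data)}  # index file
        self.flushed.append((data, index))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for data, _ in reversed(self.flushed):  # newest flushed file first
            if key in data:
                return data[key]
        return None


if __name__ == "__main__":
    node = Node(memtable_limit=2)
    node.put("a", 1)
    node.put("b", 2)   # triggers a flush
    node.put("a", 3)   # newer value stays in the memtable
    print(node.get("a"), node.get("b"), len(node.flushed))
```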
How do Bloom filters work and what are they used for in Cassandra?
Bloom filter: a bitmap and a set of hash functions.
- Use the set of hash functions to create a fingerprint for a given key:
– h(x) = y -> BIT[y] = 1
- A key is possibly present only if all of its bits are set
- Used to check cheaply whether data may be present on a node
- May produce false positives, but never false negatives
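A minimal Bloom filter sketch. The class name, the bitmap size, the number of hash functions, and the use of salted SHA-256 to derive them are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Bitmap plus several hash functions; answers 'maybe' or 'no'."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, key: str):
        # Derive num_hashes hash functions by salting one hash with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1          # h(x) = y -> BIT[y] = 1

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("row-17")
    print(bf.might_contain("row-17"))
```

Because bits are only ever set and never cleared, a key that was added always answers `True`; a key that was never added usually answers `False`, but can collide with bits set by other keys.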
How is a delete operation done in Cassandra?
- Don’t delete the item right away
- Mark the item with a tombstone (it is removed later during compaction)
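A small sketch of tombstone-based deletion on a plain dictionary. The helper names `delete`, `read`, and `compact` are hypothetical, and the `compact` step (physically dropping tombstoned items in a later clean-up pass) goes beyond what the card states.

```python
TOMBSTONE = object()  # unique marker meaning "this key was deleted"


def delete(store: dict, key):
    """Mark the key as deleted instead of removing it."""
    store[key] = TOMBSTONE


def read(store: dict, key):
    """Tombstoned keys look absent to readers."""
    value = store.get(key)
    return None if value is TOMBSTONE else value


def compact(store: dict) -> dict:
    """Later clean-up pass: physically drop tombstoned items."""
    return {k: v for k, v in store.items() if v is not TOMBSTONE}


if __name__ == "__main__":
    store = {"a": 1, "b": 2}
    delete(store, "a")
    print(read(store, "a"), read(store, "b"), compact(store))
```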
How is a read operation done in Cassandra?
- Fetch the data from the closest replica
- Also query several other replicas
– If the data differs, initiate a read repair
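A sketch of a read with read repair, assuming each replica stores `(timestamp, value)` pairs and the newest timestamp wins. The function name and the dictionary-based replicas are hypothetical, and the timestamp rule is an assumption that the card does not state.

```python
def read_with_repair(replicas, key):
    """Read from all given replicas; push the newest version to stale ones."""
    answers = [replica.get(key) for replica in replicas]
    found = [a for a in answers if a is not None]
    if not found:
        return None
    newest = max(found, key=lambda a: a[0])  # highest timestamp wins
    for replica, answer in zip(replicas, answers):
        if answer != newest:
            replica[key] = newest            # read repair
    return newest[1]


if __name__ == "__main__":
    r1 = {"k": (2, "new")}
    r2 = {"k": (1, "old")}
    r3 = {}
    print(read_with_repair([r1, r2, r3], "k"), r2, r3)
```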
How is the potential speed-up of parallelization computed?
- Amdahl's law (upper bound on the speed-up S):
– n = number of processors
– p = fraction of the program that is parallelizable
– S = 1 / ((1-p) + p/n)
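A short worked example of the formula; the function name `amdahl_speedup` and the sample values are chosen only for illustration.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Upper bound on speed-up with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    # 90% parallelizable code on 10 processors: 1 / (0.1 + 0.09)
    print(round(amdahl_speedup(0.9, 10), 2))
    # More processors help less and less; the limit is 1 / (1 - p) = 10
    print(round(amdahl_speedup(0.9, 1000), 2))
```

With p = 0.9 and n = 10 the bound is 1 / 0.19, roughly 5.26. However many processors are added, the speed-up stays below 1 / (1 - p) = 10, because the serial 10% always remains.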
Describe the two methods of parallelization in cloud computing
Request Level Parallelism (RLP):
- Concurrent processing of multiple requests: e.g. Google
– Distribute indexing, images, documents, ads, … to multiple nodes
Data Level Parallelism (DLP):
- Concurrent processing of multiple data items: e.g. MapReduce
– Distribute data across map and reduce nodes
Explain the main principle of MapReduce
- Data in key-value format
- A chunk of data is processed by a Mapper (map function) into intermediate output
- The intermediate output is assigned by the Partitioner to a Reducer (reduce function)
– Same intermediate key -> same reducer
- The Reducer produces the final output
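The flow can be sketched as a single-process word count. This is only an illustration: the function names, the byte-sum partitioner, and the in-memory buckets standing in for reducer nodes are all assumptions, and nothing here is distributed.

```python
from collections import defaultdict


def mapper(chunk: str):
    """Map function: emit (word, 1) for every word in the chunk."""
    for word in chunk.split():
        yield word.lower(), 1


def partition(key: str, num_reducers: int) -> int:
    """Partitioner: the same intermediate key always maps to the same reducer."""
    return sum(key.encode()) % num_reducers


def reducer(key: str, values):
    """Reduce function: sum the counts for one key."""
    return key, sum(values)


def map_reduce(chunks, num_reducers=2):
    # One bucket per reducer, grouping intermediate values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for chunk in chunks:
        for key, value in mapper(chunk):
            buckets[partition(key, num_reducers)][key].append(value)
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            out_key, out_value = reducer(key, values)
            result[out_key] = out_value
    return result


if __name__ == "__main__":
    print(map_reduce(["a b a", "b c"]))
```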