Lecture 1 Flashcards
Big Data & Cloud Computing
Three Vs of big data
(1) Volume: the data is large, (2) Variety: the data is often not clean and tabular, but messy (text, even images), (3) Velocity: new data keeps arriving continuously.
Power Law in Big Data
- What problem do power laws cause in parallel computing?
The mass of the data is in the long tail. The long tail cannot be ignored, as it may represent the majority of all datapoints.
The head might go to one or two nodes, while the tail is spread over all other nodes. The workers on the tail finish quickly, but the few workers holding the head take much longer, and the whole job has to wait for them. In this way power laws can turn parallel algorithms into nearly sequential ones.
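The skew effect can be made concrete with a small simulation. This is only a sketch under stated assumptions: record counts per key follow a Zipf-like 1/rank law, all records of a key must land on the same worker (as in a group-by), and keys are assigned with a simple modulo. The function names, the 10,000 keys, and the 16 workers are illustrative choices, not from the lecture.

```python
from collections import Counter


def zipf_key_counts(num_keys, total_records):
    """Deterministic Zipf-like counts: the key with rank r gets a share proportional to 1/r."""
    weights = [1.0 / rank for rank in range(1, num_keys + 1)]
    norm = sum(weights)
    return {rank: int(total_records * w / norm)
            for rank, w in enumerate(weights, start=1)}


def load_per_worker(key_counts, num_workers):
    """Send every record of a key to one worker (assumed: key modulo worker count)."""
    load = Counter()
    for key, count in key_counts.items():
        load[key % num_workers] += count
    return [load[w] for w in range(num_workers)]


counts = zipf_key_counts(num_keys=10_000, total_records=1_000_000)
loads = load_per_worker(counts, num_workers=16)
ideal = sum(loads) / len(loads)

# The job finishes only when the busiest worker finishes.
print(f"busiest worker: {max(loads)} records, ideal share: {ideal:.0f} records")
print(f"effective speedup: {sum(loads) / max(loads):.1f}x on {len(loads)} workers")
```

The worker that receives the most frequent key ends up with far more than its fair share, so the achievable speedup stays well below the number of workers.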
CAP theorem
A global software system cannot guarantee all three of the following at the same time: Consistency (reads always reflect the latest updates), Availability (the system is always up), Partition tolerance (the system keeps working despite loss of communication between data centers).
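The trade-off can be illustrated with a toy model, assuming a key-value store with one replica in each of two data centers. The class `TwoSiteStore` and its "CP" and "AP" modes are hypothetical names made up for this sketch. During a partition, the store must either refuse updates (giving up availability) or accept them locally (giving up consistency).

```python
class TwoSiteStore:
    """Toy key-value store with one replica in each of two data centers."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False  # True means the two sites cannot communicate

    def write(self, site, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by refusing the update: availability is lost.
                raise RuntimeError("unavailable: cannot reach the other replica")
            # Stay available by updating only the local copy: consistency is lost.
            self.replicas[site][key] = value
            return
        for replica in self.replicas:
            replica[key] = value

    def read(self, site, key):
        return self.replicas[site].get(key)


# AP mode: the write succeeds, but the other site now serves a stale value.
ap = TwoSiteStore("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)
stale = ap.read(1, "x")

# CP mode: no stale reads, but the write is rejected during the partition.
cp = TwoSiteStore("CP")
cp.write(0, "x", 1)
cp.partitioned = True
try:
    cp.write(0, "x", 2)
    rejected = False
except RuntimeError:
    rejected = True
```

Real systems are far more nuanced (quorums, conflict resolution, and so on); the sketch only shows why both guarantees cannot hold once the sites are cut off from each other.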
Replication and sharding
Replication is a form of clustering where all nodes in the cluster have an identical schema and identical data. In sharding, all nodes in the cluster also have an identical schema, but the data is divided across the nodes such that each node holds only a subset of the data.
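The difference can be sketched in a few lines, assuming records are stored in a dictionary and shards are chosen by hashing the key. The function names and the use of `zlib.crc32` as the hash are illustrative assumptions; real systems may use range partitioning or consistent hashing instead.

```python
import zlib


def replicate(records, num_nodes):
    """Replication: every node holds a full copy of the data."""
    return [dict(records) for _ in range(num_nodes)]


def shard(records, num_nodes):
    """Sharding: each record lives on exactly one node, picked by hashing its key."""
    nodes = [{} for _ in range(num_nodes)]
    for key, value in records.items():
        nodes[zlib.crc32(key.encode()) % num_nodes][key] = value
    return nodes


records = {f"user{i}": {"id": i} for i in range(12)}
replicas = replicate(records, 3)
shards = shard(records, 3)

print("records per replica:", [len(node) for node in replicas])
print("records per shard:  ", [len(node) for node in shards])
```

Each replica stores all 12 records, while the shard sizes add up to 12. Replication buys read capacity and fault tolerance; sharding buys capacity for data that does not fit on one node.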
Three areas in large computer infrastructures
Super Computing (performance is king), Cluster Computing (quickly get things done on a large cluster of unreliable machines), Cloud Computing (operated by third party, sold as a service to users based on their actual use, elastic)
S3 object storage
Virtually unlimited, scalable storage. Implemented by spreading storage, with replication, over the very large number of machines Amazon owns. Stores data as objects in a flat namespace. Each object consists of a header and an associated sequence of bytes.
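The "flat" object model can be illustrated with an in-memory stand-in. This is a toy sketch, not the real S3 API: `ToyBucket` and its methods are hypothetical names. It assumes a bucket is a plain map from key to header plus bytes, where a slash in a key is just a character and a "folder" is nothing more than a shared key prefix.

```python
class ToyBucket:
    """In-memory stand-in for a bucket: a flat map from key to (header, bytes)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body, **header):
        # The header is free-form metadata; the size is recorded automatically.
        self._objects[key] = ({**header, "size": len(body)}, bytes(body))

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # No directories exist: listing is a filter on key prefixes.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = ToyBucket()
bucket.put("logs/2024/01.txt", b"hello", content_type="text/plain")
bucket.put("logs/2024/02.txt", b"world!")
bucket.put("images/cat.png", b"\x89PNG")

listing = bucket.list("logs/")
header, body = bucket.get("logs/2024/01.txt")
print(listing, header, body)
```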
EBS block storage
Virtual disks for EC2 virtual machines. The data is stored remotely and accessed over the network, with a lot of caching to avoid network traffic; snapshots of a volume are kept in S3. Can only be used with EC2.
EC2 service
Lets users start up virtual machines in the cloud.