Lecture 1 Flashcards

Big Data & Cloud Computing

1
Q

The three Vs of big data

A

(1) Volume: the data is large. (2) Variety: the data is often not clean and tabular, but messy (text, even images). (3) Velocity: new data keeps arriving continuously.

2
Q

Power Law in Big Data

  • What is the problem caused by power law in parallel computing?
A

The mass of the data is in the long tail. The long tail cannot be ignored, as it may represent the majority of all data points.

The head might go to one or two nodes while the tail is spread over all the other nodes. The workers on the tail finish quickly and then sit idle, waiting for the few workers stuck with the head, so a power law can turn a parallel algorithm into a nearly sequential one.
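
A minimal sketch of this skew effect, under stated assumptions: key frequencies follow a Zipf-like law (the key of rank r occurs about head_count / r times), all records of one key must go to the same worker, and keys are dealt out round-robin as a deterministic stand-in for hash partitioning. The function names and the numbers are hypothetical and chosen only for illustration.

```python
def zipf_counts(num_keys, head_count):
    """Zipf-like frequencies: the key of rank r occurs about head_count / r times."""
    return {f"key{r}": max(1, head_count // r) for r in range(1, num_keys + 1)}


def partition_by_key(counts, num_workers):
    """Send all records of a key to one worker; return the record count per worker."""
    load = [0] * num_workers
    for rank, (_key, n) in enumerate(counts.items()):
        load[rank % num_workers] += n  # round-robin stand-in for hash(key) % num_workers
    return load


counts = zipf_counts(num_keys=1000, head_count=10000)
load = partition_by_key(counts, num_workers=10)

ideal = sum(load) / len(load)   # per-worker load if the work were perfectly balanced
slowdown = max(load) / ideal    # the job is only done when the busiest worker is done

print("load per worker:", load)
print("slowest worker vs. ideal:", round(slowdown, 2))
```

The worker that receives the most frequent key carries far more than the ideal share, and the whole job takes as long as that one worker, no matter how quickly the others finish.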

3
Q

CAP theorem

A

A global (distributed) software system cannot achieve all three of: Consistency (reads always reflect the latest updates), Availability (the system is always up), and Partition tolerance (the system is resistant against loss of communication between data centers).
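
A toy illustration of the trade-off, not a real replication protocol: two sites each hold a copy of one value. While the link between them is down, a store in the hypothetical "CP" mode refuses writes (consistent but not available), and one in "AP" mode accepts them (available but the copies diverge). The class and mode names are assumptions made for this sketch.

```python
class TwoSiteStore:
    """One value replicated at sites "a" and "b"."""

    def __init__(self, mode):
        self.mode = mode                 # "CP" or "AP"
        self.values = {"a": None, "b": None}
        self.partitioned = False         # True while the sites cannot communicate

    def write(self, site, value):
        other = "b" if site == "a" else "a"
        if not self.partitioned:
            # Normal operation: update both copies together.
            self.values[site] = value
            self.values[other] = value
            return True
        if self.mode == "CP":
            return False                 # refuse the write: consistent, not available
        self.values[site] = value        # accept locally: available, copies diverge
        return True

    def read(self, site):
        return self.values[site]


cp = TwoSiteStore("CP")
cp.write("a", 1)
cp.partitioned = True
cp_accepted = cp.write("a", 2)           # rejected during the partition

ap = TwoSiteStore("AP")
ap.write("a", 1)
ap.partitioned = True
ap_accepted = ap.write("a", 2)           # accepted, but site "b" still holds the old value
```

Once a partition occurs, the store has to give up either consistency or availability; it cannot keep both.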

4
Q

Replication and sharding

A

Replication is a form of clustering where all nodes in the cluster have an identical schema and identical data. In sharding, all nodes in the cluster have an identical schema, but the data is divided across the nodes such that each node holds only a subset of the data.
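
The difference can be sketched with plain dictionaries standing in for nodes. This is an illustration only: the helper names are hypothetical, and assigning a key to a node by CRC32 modulo the node count is just one simple placement rule assumed here.

```python
import zlib


def replicate(records, num_nodes):
    """Replication: every node holds a full copy of the data."""
    return [dict(records) for _ in range(num_nodes)]


def shard(records, num_nodes):
    """Sharding: every record lives on exactly one node, chosen from its key."""
    nodes = [{} for _ in range(num_nodes)]
    for key, value in records.items():
        node_id = zlib.crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id][key] = value
    return nodes


records = {"alice": 1, "bob": 2, "carol": 3, "dave": 4}
replicas = replicate(records, 3)
shards = shard(records, 3)

# Merging all shards gives back the full data set, with no record stored twice.
merged = {}
for node in shards:
    merged.update(node)
```

With replication any single node can answer any read; with sharding a read must first be routed to the node that owns the key.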

5
Q

Three areas in large computer infrastructures

A

Supercomputing (performance is king), cluster computing (quickly get things done on a large cluster of unreliable machines), and cloud computing (operated by a third party, sold as a service to users based on their actual use, elastic).

6
Q

S3 object storage

A

Virtually unlimited, scalable storage. Implemented by spreading the data, with replication, over the hundreds of thousands of machines Amazon owns. Stores data as objects in a flat namespace. Each object consists of a header (metadata) with an associated sequence of bytes.
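
A minimal in-memory sketch of the "flat" object model described above. This is not the S3 API; the class and method names are hypothetical. Each key maps to a header (metadata) plus a byte sequence, and anything that looks like a folder is only a shared key prefix.

```python
class FlatObjectStore:
    """Toy object store: one flat mapping from key to (header, bytes)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # Store the header and the raw bytes together under a single key.
        self._objects[key] = (dict(metadata or {}), bytes(data))

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # There are no directories: "listing a folder" is just a prefix filter.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("logs/2024/01.txt", b"first", {"content-type": "text/plain"})
store.put("logs/2024/02.txt", b"second")
store.put("images/cat.png", b"\x89PNG")

header, body = store.get("logs/2024/01.txt")
log_keys = store.list_keys("logs/")
```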

7
Q

EBS block storage

A

Virtual disks that serve as storage for virtual machines. The data is stored remotely and accessed over the network (snapshots are kept on S3), with a lot of caching to avoid network traffic. Can only be used with EC2.

8
Q

EC2 service

A

Lets you launch (power up) virtual machines in the cloud.
