00 intro: General info Flashcards
Random access
Random access refers to the ability to access any particular piece of data from a storage device directly, without the need to sequentially read through the entire storage. It allows for immediate retrieval of data from any location within the storage, regardless of the order in which the data is stored.
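As a small illustration of the definition above, the following sketch reads one record from the middle of a file by seeking straight to its byte offset. The fixed record width `RECORD_SIZE`, the helper names, and the demo file layout are all assumptions made for this example, not part of the original notes.

```python
import os
import tempfile

RECORD_SIZE = 8  # hypothetical fixed record width in bytes (assumption)


def write_demo_file(num_records=100):
    """Create a temporary file of fixed-width records such as b'rec00042'."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(num_records):
            f.write(f"rec{i:05d}".encode("ascii"))  # 3 + 5 = 8 bytes each
    return path


def read_record(path, index):
    """Read record `index` directly, without reading the records before it."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)  # jump straight to the record's offset
        return f.read(RECORD_SIZE)


if __name__ == "__main__":
    demo_path = write_demo_file()
    print(read_record(demo_path, 42))
    os.remove(demo_path)
```

Sequential access, by contrast, would have to read records 0 to 41 before reaching record 42.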
What does “Big Data” look like?
CSV, TSV, and JSON files, web pages, graphs, Twitter tweets, server access logs, etc.
The 4 Big V’s of “Big Data”
Volume: Lots of data
Velocity: Changing / growing data
Variety: Heterogeneity of data
Veracity: Is the data correct / true or not?
Scale-Out vs Scale-Up
Scale-Out:
use of hundreds or thousands of small machines, vs.
Scale-Up:
a single, rather powerful server
If P is the probability that a single machine fails during a certain period of time, what is the probability that at least one of N machines fails during that period (assuming independent failures)?
P_N = 1 - (1 - P)^N
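A minimal sketch of the formula 1 - (1 - P)^N, read as the probability that at least one of N machines fails, assuming the machines fail independently. The function name and the sample numbers are illustrative assumptions.

```python
def prob_any_failure(p, n):
    """Probability that at least one of n machines fails in the period.

    p: failure probability of a single machine in that period.
    Assumes failures are independent (an assumption of this sketch).
    """
    # (1 - p) ** n is the probability that all n machines survive.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # With a 1% per-machine failure probability, compare cluster sizes.
    for machines in (1, 10, 100, 1000):
        print(machines, round(prob_any_failure(0.01, machines), 3))
```

Even with a 1% per-machine failure probability, a cluster of 100 machines sees at least one failure with probability of about 0.63, which is why scale-out systems must treat failure as the normal case.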
Fallacies of Distributed Computing
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology does not change
- There is only one administrator
- Transport cost is zero
- The network is homogeneous
Name a few Cloud Computing Platforms
- Amazon Elastic Compute Cloud (EC2)
- Microsoft Azure
- Google Cloud Platform (GCP)
What is MapReduce?
Map Phase:
The input data is divided into chunks that are processed in a parallel and distributed manner; a mapping function is applied to each chunk, creating
key-value pairs.
Reduce Phase:
The key-value pairs are grouped by key, and a reduce function is applied to each group to compute aggregates, summaries, or other results.
The output of the reduce tasks is typically written to a file or storage system.
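The two phases can be sketched as a word count running in a single Python process. This is only an illustration of the data flow, not how Hadoop or any real framework implements it: the function names, the explicit grouping step between the phases, and the sample input are assumptions of this example, and nothing here is actually distributed.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def group_by_key(pairs):
    """Group all values that share the same key (done between the phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values of one key, here by summing the counts."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, group by key, then reduce each group."""
    pairs = []
    for chunk in chunks:  # a real framework would run these in parallel
        pairs.extend(map_phase(chunk))
    groups = group_by_key(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each chunk is mapped independently and each key is reduced independently, both loops could in principle be spread over many machines.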
Apache Hadoop
is a popular open-source implementation of the MapReduce model, providing a scalable and reliable framework for distributed data processing.
What is Spark?
In addition to simple MapReduce operations, Spark supports SQL
queries, streaming data, and complex analytics such as machine
learning and graph algorithms out-of-the-box.