Lecture 3-5 Flashcards
The Spark Framework
What is the difference between MapReduce and Spark?
MapReduce writes all data that Mappers emit to disk first. Spark attempts to avoid doing so, keeping as much data in RAM as possible.
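A minimal PySpark sketch of the difference, assuming a local Spark session; cache() asks Spark to keep the intermediate RDD in RAM between actions instead of writing it out:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical input: a local range standing in for a large dataset.
numbers = sc.parallelize(range(1_000_000))

# The intermediate result stays in RAM (when it fits) between stages,
# unlike MapReduce, which would materialize it on disk.
squares = numbers.map(lambda x: x * x).cache()

print(squares.sum())    # first action: computes and caches
print(squares.count())  # second action: reuses the cached data
```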
Resilient Distributed Dataset (RDD). How does Spark address fault tolerance in distributed computation?
RDDs are either persistent datasets or declarative recipes describing how to create a dataset from other RDDs. If a partition is lost, Spark recomputes it from this recipe (its lineage).
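A small sketch of the recipe idea, assuming a local Spark session: the derived RDDs below are only lineage until an action runs, and Spark can rebuild a lost partition by replaying that lineage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

# base is a persistent dataset; the RDDs derived from it are only
# "recipes" until an action forces computation.
base = sc.parallelize(range(100))
doubled = base.map(lambda x: 2 * x)            # recipe: map over base
evens = doubled.filter(lambda x: x % 4 == 0)   # recipe: filter over doubled

# The lineage Spark would replay to recompute a lost partition:
print(evens.toDebugString())
print(evens.count())
```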
How can Spark be interpreted as a generalization of MapReduce?
It offers 20+ communication patterns beyond plain map and reduce (for example, joins between datasets).
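A sketch of one such pattern, a join between two pair RDDs keyed on a hypothetical user id (assuming a local Spark session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()
sc = spark.sparkContext

# Two small pair RDDs keyed by a hypothetical user id.
purchases = sc.parallelize([(1, "book"), (2, "laptop"), (1, "pen")])
users = sc.parallelize([(1, "Alice"), (2, "Bob")])

# join() is a communication pattern that plain MapReduce does not offer directly.
joined = purchases.join(users)  # yields (user_id, (item, name))
print(joined.collect())
```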
What is lambda in Spark?
Anonymous, inline function definitions. A lambda is typically passed as a parameter to Spark's high-level operators.
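A minimal sketch, assuming a local Spark session, with lambdas passed to map and filter:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lambda-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "rdd", "lambda", "mapreduce"])

# Lambdas are anonymous functions handed to high-level operators.
lengths = words.map(lambda w: (w, len(w)))
long_words = lengths.filter(lambda pair: pair[1] > 4)
print(long_words.collect())
```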
What is a DataFrame?
A DataFrame is a conceptual layer on top of RDDs. It is a table with column names and types. Thanks to this information, Spark can execute queries on DataFrames more efficiently.
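A small sketch of a DataFrame with known column names and types, assuming a local Spark session; the schema is what lets Spark plan the query efficiently.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Column names and types are known to Spark, so the optimizer can plan
# the query (e.g. column pruning, predicate pushdown).
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 23), ("Carol", 41)],
    ["name", "age"],
)
df.filter(df.age > 30).select("name").show()
```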
What is MLlib?
Set of ready-to-run machine learning algorithms that work on RDDs.
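A minimal MLlib sketch, assuming a local Spark session and toy 2-D points; KMeans.train from the RDD-based pyspark.mllib API runs directly on an RDD.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
sc = spark.sparkContext

# Toy 2-D points; in practice the RDD would come from a large dataset.
points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 9.1], [9.2, 8.9]])

# Ready-to-run clustering algorithm operating directly on an RDD.
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)
```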
OLAP (Online Analytical Processing)
Users run a few complex read-only queries that search through huge tables.
Why are column stores better than row-wise storage?
When data is compressed per column, the value distribution is more regular than when values from different columns are mixed, so it compresses better.
What is a zone map?
A storage trick: keep simple statistics (Min, Max) for large ranges of tuples. A zone of the table is not read at all if the WHERE condition asks for values outside the zone's [Min, Max] range.
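A plain-Python sketch (not Spark) of the idea, with a hypothetical zone size and column data:

```python
# Zone-map sketch: keep Min/Max per zone of rows and skip zones that
# cannot possibly match the WHERE condition.
rows = list(range(1, 10_001))          # hypothetical column values
ZONE_SIZE = 1_000

zones = []
for start in range(0, len(rows), ZONE_SIZE):
    zone = rows[start:start + ZONE_SIZE]
    zones.append((min(zone), max(zone), zone))

def select_equal(value):
    """WHERE col = value, skipping zones whose [Min, Max] excludes it."""
    result = []
    for zmin, zmax, zone in zones:
        if value < zmin or value > zmax:
            continue                   # zone skipped, nothing read
        result.extend(r for r in zone if r == value)
    return result

print(select_equal(4242))
```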
What are the goals of Distribution and Partitioning?
Distribution: which machine gets which rows. Ensure that data is split evenly. Partitioning: often done on a time-related column. Speed up queries by skipping partitions. A second goal is data lifecycle management: keep the last X days of data in X partitions and drop the oldest partition each day.
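A sketch of time-based partitioning with PySpark, assuming a local session and a hypothetical /tmp/events output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

events = spark.createDataFrame(
    [("2024-05-01", "login"), ("2024-05-01", "click"), ("2024-05-02", "login")],
    ["day", "event"],
)

# One directory per day: queries filtering on `day` skip other partitions,
# and lifecycle management can simply drop the oldest day's directory.
events.write.mode("overwrite").partitionBy("day").parquet("/tmp/events")

spark.read.parquet("/tmp/events").filter("day = '2024-05-02'").show()
```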
ORC and Parquet
Columnar data storage formats. Data is placed column after column. Applications that read only a few columns can skip over the unused columns.
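A small sketch of column skipping with Parquet, assuming a local Spark session and a hypothetical /tmp/people path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 34.0), (2, "Bob", 23.0)],
    ["id", "name", "salary"],
)
df.write.mode("overwrite").parquet("/tmp/people")

# Because Parquet stores data column after column, reading only `name`
# lets Spark skip the bytes of the `id` and `salary` columns entirely.
spark.read.parquet("/tmp/people").select("name").show()
```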
What is the difference between database systems in the cloud and classic on-premise architectures?
Queries are not run on the machines where the data resides. Each database system has to fetch data from a cloud storage service such as S3.
How can cloud database systems achieve locality?
By running on cloud instances that have a local disk. Such disks are used for caching only and are empty at start.
What is a serverless database system?
The scaling and start/stop of the database is fully handled by the system.
What is the problem with updating data in the cloud, and how is it handled?
Each persistent update must write a new S3 file. Data may arrive continuously in small quantities, which leads to very many small files. Solution: batch data in the update pipeline and only write to cloud storage once roughly 100 MB has accumulated.
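A plain-Python sketch of the batching idea; UpdateBuffer and upload_to_cloud are hypothetical names standing in for the update pipeline and an S3 PUT, and the 100 MB threshold follows the card above.

```python
import io

BATCH_BYTES = 100 * 1024 * 1024  # flush once roughly 100 MB has accumulated

class UpdateBuffer:
    """Collects small updates and writes one large file instead of many tiny ones."""

    def __init__(self, upload_to_cloud):
        self.upload = upload_to_cloud  # hypothetical callable, e.g. an S3 PUT
        self.buffer = io.BytesIO()

    def add(self, record: bytes):
        self.buffer.write(record)
        if self.buffer.tell() >= BATCH_BYTES:
            self.flush()

    def flush(self):
        if self.buffer.tell() == 0:
            return
        self.upload(self.buffer.getvalue())
        self.buffer = io.BytesIO()

# Usage with a dummy uploader:
buf = UpdateBuffer(lambda blob: print(f"uploading {len(blob)} bytes"))
for i in range(1000):
    buf.add(f"event {i}\n".encode())
buf.flush()  # push whatever is left at shutdown
```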