Lecture 3-5 Flashcards
The Spark Framework
What is the difference between MapReduce and Spark?
MapReduce writes all data that Mappers emit to disk first. Spark attempts to avoid doing so, keeping as much data in RAM as possible.
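A minimal PySpark sketch of the difference, assuming a local Spark session; cache() asks Spark to keep the intermediate RDD in RAM between actions instead of writing it out:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical input: a local range standing in for a large dataset.
numbers = sc.parallelize(range(1_000_000))

# The intermediate result stays in RAM (when it fits) between stages,
# unlike MapReduce, which would materialize it on disk.
squares = numbers.map(lambda x: x * x).cache()

print(squares.sum())    # first action: computes and caches
print(squares.count())  # second action: reuses the cached data
```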
Resilient Distributed Dataset (RDD). How does Spark address fault tolerance in distributed computation?
RDDs are either persistent datasets or declarative recipes describing how to create a dataset from other RDDs. If a partition is lost, Spark recomputes it from this recipe (its lineage).
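A small sketch of the recipe idea, assuming a local Spark session: the derived RDDs below are only lineage until an action runs, and Spark can rebuild a lost partition by replaying that lineage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

# base is a persistent dataset; the RDDs derived from it are only
# "recipes" until an action forces computation.
base = sc.parallelize(range(100))
doubled = base.map(lambda x: 2 * x)            # recipe: map over base
evens = doubled.filter(lambda x: x % 4 == 0)   # recipe: filter over doubled

# The lineage Spark would replay to recompute a lost partition:
print(evens.toDebugString())
print(evens.count())
```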
How can Spark be interpreted as a generalization of MapReduce?
It offers 20+ communication patterns beyond plain map and reduce (for example, joins between datasets).
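A sketch of one such pattern, a join between two pair RDDs keyed on a hypothetical user id (assuming a local Spark session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()
sc = spark.sparkContext

# Two small pair RDDs keyed by a hypothetical user id.
purchases = sc.parallelize([(1, "book"), (2, "laptop"), (1, "pen")])
users = sc.parallelize([(1, "Alice"), (2, "Bob")])

# join() is a communication pattern that plain MapReduce does not offer directly.
joined = purchases.join(users)  # yields (user_id, (item, name))
print(joined.collect())
```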
What is lambda in Spark?
Anonymous, inline function definitions. A lambda is typically passed as a parameter to Spark's high-level operators.
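A minimal sketch, assuming a local Spark session, with lambdas passed to map and filter:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lambda-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "rdd", "lambda", "mapreduce"])

# Lambdas are anonymous functions handed to high-level operators.
lengths = words.map(lambda w: (w, len(w)))
long_words = lengths.filter(lambda pair: pair[1] > 4)
print(long_words.collect())
```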
What is a DataFrame?
A DataFrame is a conceptual layer on top of RDDs. It is a table with column names and types. Thanks to this information, Spark can execute queries on DataFrames more efficiently.
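A small sketch of a DataFrame with known column names and types, assuming a local Spark session; the schema is what lets Spark plan the query efficiently.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Column names and types are known to Spark, so the optimizer can plan
# the query (e.g. column pruning, predicate pushdown).
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 23), ("Carol", 41)],
    ["name", "age"],
)
df.filter(df.age > 30).select("name").show()
```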
What is MLlib?
Set of ready-to-run machine learning algorithms that work on RDDs.
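A minimal MLlib sketch, assuming a local Spark session and toy 2-D points; KMeans.train from the RDD-based pyspark.mllib API runs directly on an RDD.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
sc = spark.sparkContext

# Toy 2-D points; in practice the RDD would come from a large dataset.
points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 9.1], [9.2, 8.9]])

# Ready-to-run clustering algorithm operating directly on an RDD.
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)
```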
OLAP (Online Analytical Processing)
Users run a few complex read-only queries that search through huge tables.
Why are column stores better than row-wise storage?
When data is compressed per column, the value distribution is more regular than when values from different columns are mixed, so it compresses better.
What is a zone map?
A storage trick: keep simple statistics (Min, Max) for large ranges of tuples. A zone of the table is not read at all if the WHERE condition asks for values outside the zone's [Min, Max] range.
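A plain-Python sketch (not Spark) of the idea, with a hypothetical zone size and column data:

```python
# Zone-map sketch: keep Min/Max per zone of rows and skip zones that
# cannot possibly match the WHERE condition.
rows = list(range(1, 10_001))          # hypothetical column values
ZONE_SIZE = 1_000

zones = []
for start in range(0, len(rows), ZONE_SIZE):
    zone = rows[start:start + ZONE_SIZE]
    zones.append((min(zone), max(zone), zone))

def select_equal(value):
    """WHERE col = value, skipping zones whose [Min, Max] excludes it."""
    result = []
    for zmin, zmax, zone in zones:
        if value < zmin or value > zmax:
            continue                   # zone skipped, nothing read
        result.extend(r for r in zone if r == value)
    return result

print(select_equal(4242))
```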
What are the goals of Distribution and Partitioning?
Distribution: which machine gets which rows. Ensure that data is split evenly. Partitioning: often done on a time-related column. Speed up queries by skipping partitions. A second goal is data lifecycle management: keep the last X days of data in X partitions and drop the oldest partition each day.
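A sketch of time-based partitioning with PySpark, assuming a local session and a hypothetical /tmp/events output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

events = spark.createDataFrame(
    [("2024-05-01", "login"), ("2024-05-01", "click"), ("2024-05-02", "login")],
    ["day", "event"],
)

# One directory per day: queries filtering on `day` skip other partitions,
# and lifecycle management can simply drop the oldest day's directory.
events.write.mode("overwrite").partitionBy("day").parquet("/tmp/events")

spark.read.parquet("/tmp/events").filter("day = '2024-05-02'").show()
```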
ORC and Parquet
Columnar data storage formats. Data is placed column after column. Applications that read only a few columns can skip over the unused columns.
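A small sketch of column skipping with Parquet, assuming a local Spark session and a hypothetical /tmp/people path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 34.0), (2, "Bob", 23.0)],
    ["id", "name", "salary"],
)
df.write.mode("overwrite").parquet("/tmp/people")

# Because Parquet stores data column after column, reading only `name`
# lets Spark skip the bytes of the `id` and `salary` columns entirely.
spark.read.parquet("/tmp/people").select("name").show()
```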
What is the difference between database systems in the cloud and classic on-premise architectures?
Queries are not run on the machines where the data resides. Each database system has to fetch data from a cloud storage service such as S3.
How can cloud database systems achieve locality?
By running on cloud instances that have a local disk. Such disks are used for caching only and are empty at start.
What is a serverless database system?
The scaling and start/stop of the database is fully handled by the system.
What is the problem with updating data in the cloud, and how is it handled?
Each persistent update must write a new S3 file. Data may arrive continuously in small quantities, which leads to very many small files. Solution: batch data in the update pipeline and only write to cloud storage once roughly 100 MB has accumulated.
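A plain-Python sketch of the batching idea; UpdateBuffer and upload_to_cloud are hypothetical names standing in for the update pipeline and an S3 PUT, and the 100 MB threshold follows the card above.

```python
import io

BATCH_BYTES = 100 * 1024 * 1024  # flush once roughly 100 MB has accumulated

class UpdateBuffer:
    """Collects small updates and writes one large file instead of many tiny ones."""

    def __init__(self, upload_to_cloud):
        self.upload = upload_to_cloud  # hypothetical callable, e.g. an S3 PUT
        self.buffer = io.BytesIO()

    def add(self, record: bytes):
        self.buffer.write(record)
        if self.buffer.tell() >= BATCH_BYTES:
            self.flush()

    def flush(self):
        if self.buffer.tell() == 0:
            return
        self.upload(self.buffer.getvalue())
        self.buffer = io.BytesIO()

# Usage with a dummy uploader:
buf = UpdateBuffer(lambda blob: print(f"uploading {len(blob)} bytes"))
for i in range(1000):
    buf.add(f"event {i}\n".encode())
buf.flush()  # push whatever is left at shutdown
```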