Lecture 3-5 Flashcards

The Spark Framework

1
Q

What is the difference between MapReduce and Spark?

A

MapReduce writes all data that Mappers emit to files first. Spark attempts to avoid doing so, keeping as much data in RAM as possible.
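
A minimal PySpark sketch of this difference (the input path is hypothetical): cache() asks Spark to keep an intermediate RDD in RAM, so later actions reuse it instead of going back to files.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# build an RDD and mark it for in-memory reuse
words = sc.textFile("hdfs:///data/example.txt").flatMap(lambda line: line.split())
words.cache()                      # keep the computed partitions in RAM

print(words.count())               # first action: reads the file and fills the cache
print(words.distinct().count())    # second action: served from memory, no re-read
```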

2
Q

Resilient Distributed Datasets - How does Spark address fault tolerance in distributed computation?

A

RDDs are either persistent datasets or declarative recipes for how to create a dataset from other RDDs. If a partition is lost, Spark recomputes it by replaying that recipe (its lineage) instead of relying on replicated copies.
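
A sketch of the "recipe" idea (the path is hypothetical): the transformations below are only recorded as lineage, and Spark replays this recipe per partition if one is lost.

```python
from pyspark import SparkContext

sc = SparkContext(appName="lineage-demo")

logs = sc.textFile("hdfs:///logs/2024-01-01")        # persistent base dataset
errors = logs.filter(lambda line: "ERROR" in line)   # derived RDD: recipe step 1
codes = errors.map(lambda line: line.split()[0])     # derived RDD: recipe step 2

# the lineage graph Spark would use to recompute lost partitions
print(codes.toDebugString())
```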

3
Q

How can Spark be interpreted as a generalization of MapReduce?

A

It offers 20+ additional communication patterns (for example, joins between datasets).
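
One example of such a pattern, as a small PySpark sketch with made-up data: a join between two keyed datasets, which plain MapReduce does not offer as a built-in operator.

```python
from pyspark import SparkContext

sc = SparkContext(appName="join-demo")

orders = sc.parallelize([("alice", 30), ("bob", 12)])            # (user, amount)
cities = sc.parallelize([("alice", "Zurich"), ("bob", "Bern")])  # (user, city)

# join is one of the operators beyond map and reduce
print(orders.join(cities).collect())
# [('alice', (30, 'Zurich')), ('bob', (12, 'Bern'))]
```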

4
Q

What is a lambda in Spark?

A

Anonymous, inline function definitions. A lambda is typically passed as a parameter to Spark's high-level operators.
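
A minimal PySpark sketch: lambdas passed directly as parameters to the high-level operators filter and map.

```python
from pyspark import SparkContext

sc = SparkContext(appName="lambda-demo")

numbers = sc.parallelize(range(10))

# the lambdas are the per-element logic handed to Spark's operators
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(even_squares.collect())   # [0, 4, 16, 36, 64]
```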

5
Q

What is a DataFrame?

A

A DataFrame is a conceptual layer on top of RDDs. It is a table with column names and types. Thanks to this information, Spark can execute queries on DataFrames more efficiently.
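
A small sketch with made-up data: the DataFrame carries column names and types, which Spark can use when planning the query.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# a table with named, typed columns on top of RDDs
df = spark.createDataFrame([("alice", 34), ("bob", 29)],
                           schema="name: string, age: int")

df.printSchema()                              # name (string), age (int)
df.filter(df.age > 30).select("name").show()  # queries use the schema information
```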

6
Q

What is MLlib?

A

Set of ready-to-run machine learning algorithms that work on RDDs.
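
A hedged sketch using the RDD-based KMeans from pyspark.mllib (the feature vectors are made up):

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="mllib-demo")

# an RDD of feature vectors, fed directly into a ready-to-run algorithm
points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 9.1], [9.2, 8.9]])
model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)   # two cluster centers, one per group of points
```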

7
Q

OLAP (Online Analytical Processing)

A

Users run a few complex read-only queries that search through huge tables.
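
The shape of such a query, as a Spark SQL sketch (table and column names are invented): one complex, read-only aggregation scanning a large fact table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("olap-demo").getOrCreate()

sales = spark.read.parquet("s3a://warehouse/sales")   # hypothetical huge fact table
sales.createOrReplaceTempView("sales")

spark.sql("""
    SELECT region, year, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region, year
    ORDER BY total_revenue DESC
""").show()
```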

8
Q

Why are column stores better than row-wise storage?

A

When data is compressed per column, the value distribution is more regular than when values from different columns are mixed, so it compresses better.
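
A small self-contained illustration with synthetic data (not from the lecture): compressing the same records column by column usually yields a smaller result than compressing them row by row, because each column's values are more regular.

```python
import json
import random
import zlib

random.seed(0)
rows = [{"country": "CH", "status": "OK", "amount": random.randint(0, 999_999)}
        for _ in range(10_000)]

row_wise = json.dumps(rows).encode()          # values of different columns interleaved
col_wise = json.dumps({
    "country": [r["country"] for r in rows],  # 10,000 identical strings stored together
    "status":  [r["status"] for r in rows],
    "amount":  [r["amount"] for r in rows],
}).encode()

# the column-wise layout typically compresses noticeably better
print(len(zlib.compress(row_wise)), len(zlib.compress(col_wise)))
```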

9
Q

What is a zone map?

A

A storage trick: keep simple statistics (Min, Max) for large ranges of tuples, and avoid reading a zone of the table if the WHERE condition asks for values outside the zone's [Min, Max] range.
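
A plain-Python sketch of the trick (names and zone size are invented): keep only (Min, Max) per zone and skip zones that cannot contain matching values.

```python
ZONE_SIZE = 1_000

def build_zone_map(values):
    """Per zone of ZONE_SIZE values, remember only (min, max)."""
    return [(min(values[i:i + ZONE_SIZE]), max(values[i:i + ZONE_SIZE]))
            for i in range(0, len(values), ZONE_SIZE)]

def scan_with_zone_map(values, zones, lo, hi):
    """Return values in [lo, hi], reading only zones that may contain them."""
    hits = []
    for i, (zmin, zmax) in enumerate(zones):
        if zmax < lo or zmin > hi:
            continue                  # WHERE range is outside the zone's [Min, Max]: skip it
        start = i * ZONE_SIZE
        hits.extend(v for v in values[start:start + ZONE_SIZE] if lo <= v <= hi)
    return hits
```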

10
Q

What are the goals of Distribution and Partitioning?

A

Distribution: decides which machine gets which rows; the goal is to split the data evenly. Partitioning: often done on a time-related column; one goal is to speed up queries by skipping partitions, the second is data lifecycle management - keep the last X days of data in X partitions and drop the oldest partition each day.
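
A Spark sketch of the partitioning side (paths and column names are hypothetical): writing one partition per day speeds up date-filtered queries and makes dropping the oldest day cheap.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

events = spark.read.parquet("s3a://lake/raw/events")

# one directory per day: .../events_by_day/event_date=2024-01-01/...
events.write.partitionBy("event_date").mode("append").parquet("s3a://lake/events_by_day")

# a filter on the partition column only reads the matching partitions
(spark.read.parquet("s3a://lake/events_by_day")
      .filter("event_date = '2024-01-01'")
      .count())

# lifecycle management: dropping the oldest day means deleting one partition directory
```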

11
Q

ORC and Parquet

A

Columnar data storage formats. Data is placed column after column, so applications that read only a few columns can skip over the unused ones.
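
A short sketch of column pruning (hypothetical path and columns): selecting two columns from a Parquet file leaves the bytes of all other columns untouched.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.read.parquet("s3a://lake/events_by_day")

# only these two columns are read from storage; the rest are skipped
df.select("user_id", "event_date").show(5)
```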

12
Q

What is the difference between database systems in the cloud and classic on-premise architectures?

A

Queries are not run on the machines where the data is stored. Each database system has to fetch data from a cloud storage service such as S3.
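
A minimal sketch of that pattern with boto3 (bucket and keys are invented): the query node holds no local copy of the table and must first pull its files from S3.

```python
import boto3

s3 = boto3.client("s3")

# compute and storage are separate: fetch the table's file over the network first
s3.download_file("analytics-bucket",
                 "warehouse/sales/part-0001.parquet",
                 "/tmp/part-0001.parquet")

# only now can the local query engine scan /tmp/part-0001.parquet
```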

13
Q

How can cloud database systems achieve locality?

A

They can achieve locality if they run on cloud instances with a local disk. Such disks are used for caching only and are empty at start.
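
A plain-Python sketch of the caching idea (paths and names are invented): the local disk starts empty and only holds copies of objects already fetched from S3, so repeated reads become local.

```python
import os
import boto3

CACHE_DIR = "/mnt/local-ssd/cache"
s3 = boto3.client("s3")

def read_object(bucket, key):
    local_path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if not os.path.exists(local_path):          # cold cache: fetch from cloud storage
        os.makedirs(CACHE_DIR, exist_ok=True)
        s3.download_file(bucket, key, local_path)
    with open(local_path, "rb") as f:           # warm cache: read from the local disk
        return f.read()
```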

14
Q

What is a serverless database system?

A

Scaling and starting/stopping of the database are handled fully by the system.

15
Q

What is the problem with updating data in the cloud, and how should updates be done?

A

Each persistent update must write a new S3 file, and data may arrive all the time in small quantities. This leads to very many small files. Solution: batch data in the update pipeline and only go to the cloud once you have about 100 MB.
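
A plain-Python sketch of such batching (class name, bucket and threshold are invented): buffer small incoming records and write one S3 object only once roughly 100 MB has accumulated.

```python
import uuid
import boto3

FLUSH_THRESHOLD = 100 * 1024 * 1024   # ~100 MB

class BatchingWriter:
    def __init__(self, bucket, prefix):
        self.bucket, self.prefix = bucket, prefix
        self.buffer = bytearray()
        self.s3 = boto3.client("s3")

    def append(self, record: bytes):
        self.buffer.extend(record)
        if len(self.buffer) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        key = f"{self.prefix}/{uuid.uuid4()}.bin"   # one new S3 file per full batch
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=bytes(self.buffer))
        self.buffer.clear()
```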

16
Q

What is vectorization (as opposed to just-in-time compilation) as a way to overhaul a SQL query engine?

A

Take a batch of rows and process them together. A columnar format is a must: each column is treated as an array, and the engine iterates over that array. (Just-in-time compilation is the alternative approach: compile the query itself into machine code.)
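
A NumPy sketch of the vectorized style (synthetic columns): each column is an array, and one pass over the arrays evaluates the expression for a whole batch of rows at once.

```python
import numpy as np

rng = np.random.default_rng(0)
price = rng.random(1_000_000) * 100              # one column stored as an array
quantity = rng.integers(1, 10, size=1_000_000)   # another column

# one vectorized pass over the column arrays, instead of interpreting row by row
revenue = price * quantity
print(revenue[price > 50.0].sum())
```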

17
Q

Why is MapReduce a step backwards for database access?

A
MapReduce is:
  • a sub-optimal implementation, in that it uses brute force instead of indexing;
  • not novel at all: it represents a specific implementation of well-known techniques developed nearly 25 years ago;
  • missing most of the features that are routinely included in current DBMS;
  • incompatible with all of the tools DBMS users have come to depend on.