Apache Spark Programming with Databricks Flashcards

1
Q

Databricks is a

A

SaaS company created to make big data and AI simple for organizations. Databricks offers a Unified Data Analytics Platform, an online environment where data practitioners can collaborate on data science projects and workflows.

2
Q

Spark is a

A

unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. It is the de facto standard for big data processing and the largest open-source project in that space. Spark was created by the founders of Databricks.

3
Q

Spark uses clusters of machines to

A

process big data by breaking a large task into smaller ones and distributing the work among several machines.
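The split-and-distribute idea can be sketched in plain Python. This is only an analogy, not Spark's API: the worker threads stand in for cluster machines, and the helper names `split_into_partitions`, `partial_sum`, and `distributed_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Cut a list into roughly equal contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum(partition):
    """The small task each worker runs on its own chunk."""
    return sum(partition)


def distributed_sum(data, num_workers=4):
    """Split the big task, hand the pieces to workers, combine the results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return sum(partials)


total = distributed_sum(list(range(1, 101)))
print(total)  # 5050
```

Each worker only ever sees its own partition; the final `sum(partials)` plays the role of combining per-machine results into one answer.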

4
Q

Apache Parquet

A

Apache Parquet is a columnar storage format that provides a compressed, efficient representation of data. Unlike CSV, Parquet lets you load only the columns you need, because values are stored together by column rather than by record.
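A toy illustration of why a columnar layout allows reading only some columns. This is plain Python, not the Parquet file format or any Parquet library; the sample records and the helpers `to_columnar` and `read_columns` are made up for the sketch.

```python
# Row-oriented layout (like CSV): each record's values are stored together.
records = [
    {"id": 1, "name": "Ada", "score": 91.5},
    {"id": 2, "name": "Grace", "score": 88.0},
    {"id": 3, "name": "Alan", "score": 79.5},
]


def to_columnar(rows):
    """Regroup row-oriented records so each column's values sit together."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def read_columns(columnar, wanted):
    """Fetch only the requested columns; the others are never touched."""
    return {name: columnar[name] for name in wanted}


columnar = to_columnar(records)
scores = read_columns(columnar, ["score"])
print(scores)  # {'score': [91.5, 88.0, 79.5]}
```

In the row layout, getting every `score` means walking through every full record. In the columnar layout it is a single lookup, which is the property the card describes.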

5
Q

Delta Lake is

A

Delta Lake is an open-source technology designed to work with Spark to bring reliability to data lakes. Delta Lake runs on top of your existing data lake to provide ACID transactions, scalable metadata handling, and unified streaming and batch processing.

6
Q

Streaming Concepts

A

In batch processing, computation runs on a fixed-input dataset. Stream processing is the act of continuously incorporating new data to compute a result. In stream processing, the input data is unbounded and has no predetermined beginning or end.
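The batch versus stream distinction can be sketched with a plain Python generator. This is not Spark Structured Streaming; `batch_average`, `streaming_average`, and the sample readings are hypothetical names and data chosen for the illustration.

```python
def batch_average(dataset):
    """Batch: the whole fixed dataset is available before computing."""
    return sum(dataset) / len(dataset)


def streaming_average(source):
    """Stream: fold each new value into a running result as it arrives."""
    count = 0
    total = 0.0
    for value in source:
        count += 1
        total += value
        yield total / count  # updated result after every new input


readings = [10, 20, 30, 40]
print(batch_average(readings))  # 25.0
print(list(streaming_average(iter(readings))))  # [10.0, 15.0, 20.0, 25.0]
```

The streaming version never needs to know where the input ends: it keeps only a count and a total, so the same loop would work on an unbounded source.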
