Apache Spark Programming with Databricks Flashcards
Databricks is a
SaaS company created to make big data and AI simple for organizations. Databricks offers a Unified Data Analytics Platform, an online environment where data practitioners can collaborate on data science projects and workflows.
Spark is a
unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. It is the de facto standard unified analytics engine and the largest open-source project in data processing. This technology was created by the founders of Databricks.
Spark uses clusters of machines to
process big data by breaking a large task into smaller ones and distributing the work among several machines.
Apache Parquet
Apache Parquet is a columnar storage format that provides compressed, efficient columnar data representation. Unlike CSV, Parquet lets you load only the columns you need, since values are stored together by column rather than by record.
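The row-versus-column distinction can be shown with a toy sketch in plain Python. This is an illustration of the layout idea only, not the actual Parquet file format: projecting one column from a columnar store reads a single list, while a row store forces a scan over every full record.

```python
# Toy contrast of row-oriented (CSV-like) vs column-oriented
# (Parquet-like) storage; not the real Parquet encoding.
records = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

# Row-oriented: values of one record stored together.
row_store = records

# Column-oriented: values of one column stored together.
column_store = {
    "id":    [r["id"] for r in records],
    "name":  [r["name"] for r in records],
    "score": [r["score"] for r in records],
}

# Reading one column from the columnar layout touches one list...
scores = column_store["score"]
# ...while the row layout must visit every full record.
scores_from_rows = [r["score"] for r in row_store]
print(scores)  # → [10, 20, 30]
```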
Delta Lake is
Delta Lake is an open source technology designed to work with Spark to bring reliability to data lakes. Delta Lake runs on top of your existing data lake to provide ACID transactions, scalable metadata handling, and unified streaming and batch processing.
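The atomicity half of those ACID guarantees comes from an append-only transaction log of numbered commit files. The sketch below is a simplified illustration, not Delta Lake's actual protocol: each commit is staged to a temporary file and then renamed into the log, so readers see either the whole commit or none of it.

```python
# Simplified illustration of an append-only transaction log with
# atomic commits; not Delta Lake's real implementation.
import json
import os
import tempfile

def commit(log_dir, version, actions):
    """Atomically add one numbered commit file to the log."""
    final_path = os.path.join(log_dir, f"{version:020d}.json")
    # Stage the commit to a temp file, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=log_dir)
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    os.rename(tmp_path, final_path)  # atomic on POSIX filesystems

log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": {"path": "part-000.parquet"}}])
commit(log_dir, 1, [{"add": {"path": "part-001.parquet"}}])

# Readers reconstruct table state by replaying commits in order.
versions = sorted(os.listdir(log_dir))
print(versions)
```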
Streaming Concepts
In batch processing, computation runs on a fixed-input dataset. Stream processing is the act of continuously incorporating new data to compute a result. In stream processing, the input data is unbounded and has no predetermined beginning or end.
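The batch/stream contrast can be made concrete with a minimal Python sketch (an illustration, not Spark's Structured Streaming API): a batch job computes one result from a fixed dataset, while a streaming job keeps a running result and updates it as each new record arrives from an unbounded source.

```python
# Minimal contrast of batch vs stream processing.
def batch_sum(dataset):
    # Batch: the entire fixed input is known up front.
    return sum(dataset)

def streaming_sums(stream):
    # Streaming: input is unbounded; emit an updated result
    # after every newly arriving record.
    running = 0
    for record in stream:
        running += record
        yield running

fixed = [1, 2, 3, 4]
batch_result = batch_sum(fixed)            # one result for the batch
print(batch_result)                        # → 10

unbounded = iter([1, 2, 3, 4])             # stand-in for an endless source
stream_results = list(streaming_sums(unbounded))
print(stream_results)                      # → [1, 3, 6, 10]
```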