Spark and Databricks Flashcards
Can you explain the design schemas relevant to data modeling?
There are three data modeling design schemas: Star, Snowflake, and Galaxy.
The star schema contains various dimension tables connected to a single fact table in the center.
The snowflake schema is an extension of the star schema: it consists of a fact table and dimension tables that are further normalised into snowflake-like layers.
The Galaxy (fact constellation) schema contains two or more fact tables that share dimension tables between them.
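For example, a typical star-schema query joins the central fact table to its dimension tables. A minimal Spark SQL sketch, assuming a SparkSession named spark and hypothetical fact_sales, dim_date, and dim_product tables:
    // Star-schema query: join the central fact table to two dimension tables
    val salesByCategory = spark.sql("""
      SELECT d.year, p.category, SUM(f.amount) AS total_sales
      FROM fact_sales f
      JOIN dim_date d    ON f.date_key = d.date_key
      JOIN dim_product p ON f.product_key = p.product_key
      GROUP BY d.year, p.category
    """)
    salesByCategory.show()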
Why do data systems require a disaster recovery plan?
Disaster recovery planning involves backing up files and media in real time. The backup storage is used to restore files in case of a cyber-attack or equipment failure. Security protocols are put in place to monitor, trace, and restrict both incoming and outgoing traffic.
What is data orchestration, and what tools can you use to perform it?
Data orchestration is an automated process for accessing raw data from multiple sources, performing data cleaning, transformation, and modeling techniques, and serving it for analytical tasks. The most popular tools are Apache Airflow, Prefect, Dagster, and AWS Glue.
What issues does Apache Airflow resolve?
Apache Airflow allows you to manage and schedule pipelines for the analytical workflow, data warehouse management, and data transformation and modeling under one roof.
You can monitor execution logs in one place, and callbacks can be used to send failure alerts to Slack and Discord. Finally, it is easy to use, provides a helpful user interface and robust integrations, and is free and open source.
What are the various modes in Hadoop?
Hadoop mainly runs in three modes:
Standalone Mode: used for debugging; it does not use HDFS and relies on the local file system for input and output.
Pseudo-distributed Mode: a single-node cluster where the NameNode and DataNode run on the same machine. It is mainly used for testing purposes.
Fully-Distributed Mode: the production-ready mode, where data is distributed across the multiple nodes of a cluster and separate nodes run the master and slave daemons.
What are the three V’s of big data?
Volume (of data)
Velocity (how fast it’s coming in)
Variety (diversity of structure and content)
Additional V’s:
Veracity (accuracy, trustworthiness)
Value
Validity
Visualisation
Variability
Vulnerability
Visibility
Volatility
What is the definition of big data?
Depends on situation, but typically any of:
- > 100TB
- Requires parallel processing
- Too large for operational databases
- Requires big data technology (even if it’s ‘small’ data)
What is data gravity?
The tendency of data accumulating on a single cloud platform to attract more of everything:
- The more data there is, the more value (and the more applications and services) it attracts
- The harder and more expensive it becomes to move that data elsewhere
What is MapReduce?
MapReduce is a programming model or pattern within the Hadoop framework that is used to process big data stored in the Hadoop Distributed File System (HDFS).
- A single large dataset is split into multiple smaller datasets
- Each dataset is sent to a node in the compute cluster (a mapper)
- Each mapper converts its data into key-value pairs, processes them, and writes a series of output files
- The data is then collated (shuffled) by key: all data for a given key goes into the same file; different keys can share a file, but a single key is never split across files
- These files are sent to other nodes in the cluster (reducers)
- Reducers reduce the series of values for each key into a single value (aggregation)
- The reducer outputs are combined into a single output for the job (see the sketch below)
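A minimal sketch of the map-shuffle-reduce flow in plain Scala (a local word count standing in for a real distributed job; names are illustrative):
    // Map phase: each "mapper" emits (key, value) pairs
    val lines  = Seq("spark and hadoop", "spark and kafka")
    val mapped = lines.flatMap(_.split(" ").map(word => (word, 1)))
    // Shuffle phase: collate all pairs sharing the same key
    val shuffled = mapped.groupBy { case (word, _) => word }
    // Reduce phase: aggregate each key's values into a single value
    val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    println(reduced)   // e.g. Map(spark -> 2, and -> 2, hadoop -> 1, kafka -> 1)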
What is Massively Parallel Processing?
Massively parallel is the term for using a large number of computer processors to simultaneously perform a set of coordinated computations in parallel.
GPUs are a massively parallel architecture, with tens of thousands of threads.
- The user submits a single SQL query to the data warehouse (cluster) master node
- The master node breaks the SQL query down into sub-queries, which are sent to each worker node
- Worker nodes execute the sub-queries (all sharing the same data and storage), all in parallel
- The worker node results are sent back to the master node, combined into a single result, and returned to the user
What is the difference between ETL and ELT pipelines?
ETL = Extract, Transform, Load
- The traditional warehousing approach
- Data is transformed in memory (in the pipeline) and then loaded into the destination
ELT = Extract, Load, Transform
- Raw data is moved to the destination first
- More efficient processing at the destination
- More resilient (data movement and processing are separated); see the sketch below
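A minimal ETL-style Spark sketch (extract from CSV, transform in memory, load to the destination); paths and columns are illustrative, and a SparkSession named spark is assumed:
    import org.apache.spark.sql.functions.col
    val raw = spark.read.option("header", "true").csv("/data/raw/orders.csv")   // Extract
    val cleaned = raw                                                            // Transform
      .filter(col("amount").isNotNull)
      .withColumn("amount", col("amount").cast("double"))
    cleaned.write.mode("overwrite").parquet("/data/warehouse/orders")            // Load
With ELT the raw extract would be written to the destination first and transformed there (e.g. with SQL).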
What is Data Virtualisation?
Combine and transform data sources without physically modifying data (leave data where it is)
- Good when too many data sources for ETL/ELT to be sustainable
- Good when data movement too expensive
- Good for highly regulated data
- Federated querying (multiple data sources) is possible: connectivity to multiple backends
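A sketch of federated querying with Spark: joining a relational table and a data lake file in place, without copying either up front (connection details, paths, and columns are illustrative):
    val customers = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/crm")
      .option("dbtable", "customers")
      .option("user", "reader")
      .option("password", sys.env("DB_PASSWORD"))
      .load()
    val orders = spark.read.parquet("s3a://lake/orders/")
    val joined = customers.join(orders, "customer_id")   // evaluated lazily; no bulk copy up front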
What is Spark SQL?
Allows developers to write declarative code in Spark jobs
- Abstracts out distributed nature
- Is to Spark what Hive is to Hadoop, but MUCH faster than Hive and easier to unit test
- Creates DataFrames as containers for the resulting data: the same structures are used by Spark Streaming and Spark ML (so jobs can be mixed and matched)
Compatible with multiple data sources: Hive, JSON, CSV, Parquet, etc.
Additional optimisations:
- Predicate pushdown
- Column pruning
- Uniform API
- Code generation (performance gains, esp. for Python)
- Can hop in and out of RDDs and SQL as needed
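A minimal sketch, assuming a SparkSession named spark: the same logic expressed as SQL over a temp view and as DataFrame operations:
    import spark.implicits._
    val people = Seq(("Ada", 36), ("Grace", 45)).toDF("name", "age")
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age >= 18").show()   // declarative SQL
    people.filter($"age" >= 18).select("name").show()             // DataFrame API equivalent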
What is predicate pushdown?
The parts of SQL queries that filter data are called 'predicates'.
Predicate pushdown pushes those filters down into the source query, reducing the number of entries retrieved and improving query performance. By default, the Spark Dataset API will automatically push down valid WHERE clauses to the underlying database (or columnar file format such as Parquet).
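A small illustration (path and column are hypothetical): reading Parquet with a filter and checking the physical plan for the pushed predicate:
    import org.apache.spark.sql.functions.col
    val events = spark.read.parquet("/data/events")
    val recent = events.filter(col("year") >= 2023)
    recent.explain()   // the scan should list something like PushedFilters: [GreaterThanOrEqual(year,2023)]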
What is column pruning in Spark SQL?
The analyser determines that only a subset of columns is required for the output and drops the unnecessary columns from the scan
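A small illustration (path and columns are hypothetical): selecting a subset of columns so that only those columns are read from the Parquet scan:
    val events = spark.read.parquet("/data/events")
    val names  = events.select("user_id", "name")   // unnecessary columns are pruned
    names.explain()                                  // ReadSchema should list only user_id and name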
What is Apache Parquet?
The data lake format of choice
- Stores data in columns
- Efficient for querying
- Enables compression
- Easy partitioning
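A minimal sketch of writing and reading partitioned Parquet (df is any existing DataFrame; path and columns are illustrative):
    df.write
      .mode("overwrite")
      .partitionBy("year", "month")          // easy partitioning on disk
      .option("compression", "snappy")       // columnar layout compresses well
      .parquet("/data/lake/sales")
    val sales2023 = spark.read.parquet("/data/lake/sales").where("year = 2023")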
What is PrestoDB?
An MPP 'SQL-on-anything' and data virtualisation engine (it has no storage engine of its own)
- Displacing Hive
- Increasingly popular for data lakes
- Functions like a data warehouse, but without storage
- Connects to multiple back end data sources
- Blurs lines between data lakes and warehouses
What is Apache Kafka?
Event streaming engine
- Uses a message queue (publish/subscribe) paradigm to model streaming data through 'topics'
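A minimal Spark Structured Streaming sketch that consumes a Kafka topic (requires the spark-sql-kafka connector; the broker address and topic name are illustrative):
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "page-views")     // the Kafka 'topic'
      .load()
    // Kafka records arrive as binary key/value columns
    val messages = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")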
What is cluster computing?
A collection of servers (nodes) that are federated and can operate together
- One Driver node and multiple Worker nodes
- Apps talk to Driver, which controls Workers
- Workers parallelise the work (horizontal scaling)
- Designed for failure - redundancy and fault tolerance
What are Containers?
‘deployment packages’ or ‘lightweight virtual machines’
In contrast to virtual machines, which are digital images of entire computers, containers include only the software and dependencies required for a specific application (no full OS, etc.)
- Much faster than VMs
- Can deploy groups to orchestrate together
- Portable between cloud/on prem etc
What are container orchestration (cluster manager) options for Spark?
Cluster manager: oversees multiple processes
- Spark Standalone: built in manager
- YARN: Hadoop manager
- Mesos: Comparable to YARN but more flexible
- Kubernetes: added more recently (natively supported since Spark 2.3)
What are the key benefits of Spark vs Hadoop?
- Increased efficiency (fewer machines for the same results as Hadoop)
- Much faster
- Less code (generalised abstractions)
- Caches data in memory
- Abstracts away distributed nature (can write code ignoring this)
- Interactive (can play with data on the fly)
- Fault tolerance
- Unifies big data needs (an answer to the explosion of specialised MapReduce-based tools)
What is Databricks' relationship to Spark?
- Founded by Spark creators
- Major contributors to and maintainers of the Spark repo and ecosystem
What languages can you use for Spark?
Spark is written in Scala, and this is its native language
Java and Python can also be used
Python API mirrors Scala most closely
What is an RDD?
Resilient Distributed Dataset
- Low-level API
- The most basic data abstraction in Spark
- Collection of elements (similar to list/array) partitioned across nodes of the cluster
- Can be operated on in parallel
- Immutable - though what is actually stored is the lineage (how to build the RDD), not the materialised data itself
- Resilient:
  - A failure at any point does not affect all of the data and can be recovered from
  - Each RDD knows how it was built, allowing it to choose the best path for recovery
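A minimal RDD sketch, assuming a SparkSession named spark (as in spark-shell):
    val sc = spark.sparkContext
    val numbers = sc.parallelize(1 to 10, numSlices = 4)   // local collection split into 4 partitions
    val doubled = numbers.map(_ * 2)                       // a new RDD; the original is immutable
    println(doubled.getNumPartitions)                      // 4
    println(doubled.collect().mkString(","))               // operated on in parallel across partitions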
What are the two categories of core operation in Spark?
Transformations and Actions
Transformations - instructions for RDD modification; together they make up the DAG (e.g. map, filter)
Actions - instructions that trigger execution of the DAG (e.g. collect, count, reduce). They usually result in data being transferred back to the Driver
What are the qualities of Spark Transformations?
- Lazily evaluated (only intent is stored)
- Triggered by Actions
- Combine to form DAG graphs
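A small sketch of laziness, assuming a SparkSession named spark:
    val sc = spark.sparkContext
    val words   = sc.parallelize(Seq("spark", "is", "lazy"))
    val lengths = words.map(_.length)     // transformation: only the intent is recorded
    val longish = lengths.filter(_ > 2)   // transformation: still nothing has executed
    val total   = longish.reduce(_ + _)   // action: the DAG is executed now
    println(total)                        // 9 ("spark" + "lazy")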
What is a DAG?
Directed Acyclic Graph
Vertices represent Resilient Distributed Datasets (RDDs)
Edges represent the operations to be applied to the RDDs (transformations or actions)
The result is a functional lineage that is sent to the worker nodes (which makes fault handling easy)
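For example, a chain of transformations forms a small DAG whose lineage can be inspected with toDebugString (path is illustrative; a SparkSession named spark is assumed):
    val sc = spark.sparkContext
    val counts = sc.textFile("/data/logs.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    println(counts.toDebugString)   // prints the RDD lineage; stages split at the shuffle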
What are the three memory loading input data methods in Spark?
parallelize, range and makeRDD
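A quick sketch of the three (Scala API; makeRDD is only available in Scala), assuming a SparkSession named spark:
    val sc = spark.sparkContext
    val a = sc.parallelize(Seq(1, 2, 3, 4))   // from a local collection
    val b = sc.range(0, 1000, step = 2)       // RDD[Long] of generated numbers
    val c = sc.makeRDD(Seq("x", "y", "z"))    // effectively an alias for parallelize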
What is the underlying method for most file ingestion methods in Spark?
hadoopFile - this handles any Hadoop-supported file format
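For example, textFile is roughly a thin wrapper over hadoopFile with TextInputFormat (a sketch; the path is illustrative and a SparkSession named spark is assumed):
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    val sc = spark.sparkContext
    val viaTextFile   = sc.textFile("/data/logs.txt")
    val viaHadoopFile = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/data/logs.txt")
      .map { case (_, line) => line.toString }   // drop the byte-offset key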
How does Spark use Lambda functions?
Lambdas are anonymous functions
Most Spark transformations accept lambdas as arguments
e.g.
.filter(wikiToken => wikiToken.length > 2)
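A slightly fuller sketch in context (the RDD contents are illustrative; a SparkSession named spark is assumed):
    val sc = spark.sparkContext
    val wikiTokens = sc.parallelize(Seq("a", "of", "spark", "data"))
    val longTokens = wikiTokens.filter(wikiToken => wikiToken.length > 2)   // lambda passed to filter
    println(longTokens.collect().mkString(","))                             // spark,data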