Introduction Flashcards
What language is Spark written in?
Scala, which runs on the Java Virtual Machine (JVM) and interoperates with Java.
Cluster Manager
Acquires the resources on the cluster, such as worker nodes and executors, that are required to perform the work.
Partition
A chunk of a large dataset. Large datasets are split into partitions so that each one can be sent to a different node for processing.
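As an illustration of the chunking idea, here is a minimal plain-Python sketch. It is an analogy only and does not use the Spark API; the helper name split_into_partitions and the contiguous, roughly equal-sized split are assumptions made for this example.

```python
def split_into_partitions(records, num_partitions):
    """Split a list into roughly equal contiguous chunks (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    size, remainder = divmod(len(records), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra record.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


records = list(range(10))
partitions = split_into_partitions(records, 3)
print(partitions)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each inner list stands in for one partition that could be handed to a different node.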
Executors
Executors run on each worker node and perform tasks in parallel with each other. Each executor uses a separate Java Virtual Machine (JVM).
Worker Node
Any node that can run application code in the cluster.
Task
A unit of work that will be sent to one executor.
Job
A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action.
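The relationship between a job, its tasks, and the executors that run them can be sketched in plain Python. This is an analogy under stated assumptions, not Spark code: run_task and run_job are hypothetical names, and worker threads stand in for executors, which in a real cluster are separate JVM processes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """One task: process a single partition (here, sum its numbers)."""
    return sum(partition)


def run_job(partitions, num_executors=2):
    """One job: launch one task per partition and combine the results."""
    # Worker threads stand in for executors; real executors are separate JVMs.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, partitions))
    # Combining the partial results plays the role of the action's return value.
    return sum(partial_results)


partitions = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
total = run_job(partitions)
print(total)  # 45
```

One task is created per partition, and the pool size limits how many tasks run at the same time.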
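Stage
A smaller set of tasks that a job is divided into. Stages depend on each other, similar to the map and reduce stages in MapReduce.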
Cluster Managers
A program that controls how the cluster processes data.
Spark Standalone
A basic built-in cluster manager.
Apache Mesos
A general cluster manager that can also run Hadoop MapReduce and service applications.
Hadoop YARN
The resource manager used in Hadoop 2 and later.
Kubernetes
An open-source system for automating deployment, scaling, and management of containerized applications.
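An application selects one of these cluster managers through its master URL, for example the value passed to spark-submit --master. The sketch below builds that URL in plain Python. The helper name master_url and the host names and ports in the usage lines are hypothetical; the URL prefixes follow the formats described in the Spark documentation.

```python
def master_url(manager, host=None, port=None):
    """Build a Spark master URL for a cluster manager (hypothetical helper)."""
    prefixes = {
        "standalone": "spark://",
        "mesos": "mesos://",
        "kubernetes": "k8s://https://",
    }
    if manager == "yarn":
        # YARN's location comes from the Hadoop configuration, not the URL.
        return "yarn"
    if manager not in prefixes:
        raise ValueError(f"unknown cluster manager: {manager}")
    if host is None or port is None:
        raise ValueError(f"{manager} needs a host and a port")
    return f"{prefixes[manager]}{host}:{port}"


# Host names and ports below are placeholders, not real endpoints.
print(master_url("standalone", "master.example.com", 7077))
print(master_url("yarn"))
print(master_url("kubernetes", "k8s.example.com", 6443))
```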
What are the 4 Spark core services?
Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX.
What type of DataFrame does Spark use?
Spark SQL uses a distributed DataFrame: a distributed collection of data organized into named columns.
What is Spark Streaming?
Spark Streaming offers real-time data processing that can take inputs from multiple sources, integrate with machine learning, and then output to different data storage systems.
What is Spark MLlib?
Spark MLlib provides a set of machine learning tools that are optimized for parallelized execution, which enables processing of big data.
What is GraphX?
GraphX is a library for graphs and graph-parallel computation, used to traverse networks, find paths, and analyze connections. (Think relationship data between entities, like flights or social networks.)
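To make "traversing a network" concrete, here is a small plain-Python sketch that finds the route with the fewest hops between two airports using breadth-first search. It is not the GraphX API; the function name shortest_path and the flight data are made up for illustration.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return the path with the fewest hops, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical directed flight routes: airport -> airports reachable directly.
flights = {
    "SEA": ["DEN", "SFO"],
    "SFO": ["JFK"],
    "DEN": ["ORD"],
    "ORD": ["JFK"],
    "JFK": [],
}
print(shortest_path(flights, "SEA", "JFK"))  # ['SEA', 'SFO', 'JFK']
```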
What is Databricks?
A commercial, for-profit company founded by the original creators of Apache Spark. Its platform provides a complete, collaborative data engineering and data science environment for developing Spark applications.