Introduction Flashcards

Question 1

Q

What language is Spark written in?

Answer

A

Scala, which runs on the Java Virtual Machine (JVM).

Question 2

Q

Cluster Manager

Answer

A

Acquires the cluster resources (worker nodes and executors) required to run the tasks that perform the work.

Question 3

Q

Partition

Answer

A

A chunk of a large dataset or task. The work is split into chunks so that each one can be sent to a different node for processing.
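
As a rough illustration in plain Python (not the Spark API), the sketch below splits a list of records into near-equal chunks, the way a large dataset is divided before each chunk goes to a different node. The function name `split_into_partitions` and the contiguous-chunk strategy are assumptions made for this example; Spark chooses its own partitioning.

```python
def split_into_partitions(records, num_partitions):
    """Split records into num_partitions contiguous chunks of near-equal size."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    base, extra = divmod(len(records), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional record.
        size = base + (1 if i < extra else 0)
        partitions.append(records[start:start + size])
        start += size
    return partitions


# Ten records spread over three partitions (hypothetical data).
partitions = split_into_partitions(list(range(10)), 3)
```

Each inner list stands in for one partition that could be processed on its own node.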

Question 4

Q

Executors

Answer

A

Executors are contained within each worker node and perform tasks, working in parallel with each other. Each executor uses a separate Java Virtual Machine (JVM).
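
The idea of several executors working through tasks at the same time can be sketched in plain Python. This is only an analogy: the thread pool below stands in for executors (real Spark executors are separate JVM processes), and `run_task` is a hypothetical task that sums one partition.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """One task: process a single partition (here, sum its values)."""
    return sum(partition)


# Hypothetical partitions of a larger dataset.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Each worker thread plays the role of an executor; this is an assumption
# for illustration, since Spark executors run in their own JVMs.
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in input order, so they line up with partitions.
    partial_sums = list(pool.map(run_task, partitions))

# The partial results are combined once every task has finished.
total = sum(partial_sums)
```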

Question 5

Q

Worker Node

Answer

A

Any node that can run application code in the cluster.

Question 6

Q

Task

Answer

A

A unit of work that will be sent to one executor.

Question 7

Q

Job

Answer

A

A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action.
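
A toy model in plain Python may help show how an action triggers a job made of tasks. `LazyDataset` is a hypothetical class invented for this sketch, not a Spark class, and its tasks run one after another rather than in parallel. The point it illustrates is that `map` only records work, while the `collect` action spawns one task per partition.

```python
class LazyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, not a Spark class)."""

    def __init__(self, partitions, transformations=()):
        self.partitions = partitions
        self.transformations = tuple(transformations)

    def map(self, fn):
        # Transformation: only recorded, nothing is computed yet.
        return LazyDataset(self.partitions, self.transformations + (fn,))

    def collect(self):
        # Action: spawns a "job" consisting of one task per partition.
        results = []
        for partition in self.partitions:
            results.extend(self._run_task(partition))
        return results

    def _run_task(self, partition):
        # A task applies every recorded transformation to one partition.
        out = list(partition)
        for fn in self.transformations:
            out = [fn(x) for x in out]
        return out


calls = []


def double(x):
    calls.append(x)  # record each call so we can see when work happens
    return x * 2


dataset = LazyDataset([[1, 2], [3, 4]]).map(double)
work_done_before_action = len(calls)  # no task has run yet
collected = dataset.collect()         # the action runs two tasks
```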

Question 8

Q

Stage

Answer

A

Each job is divided into smaller sets of tasks, called stages, that depend on each other.

Question 9

Q

Cluster Managers

Answer

A

Programs that control how the cluster processes data.

Question 10

Q

Spark Standalone

Answer

A

A basic built-in cluster manager.

Question 11

Q

Apache Mesos

Answer

A

A general cluster manager that can also run Hadoop MapReduce and service applications.

Question 12

Q

Hadoop YARN

Answer

A

The resource manager introduced in Hadoop 2.

Question 13

Q

Kubernetes

Answer

A

An open-source service for automating deployment, scaling, and management of containerized applications.

Question 14

Q

What are the 4 Spark core services?

Answer

A

Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX.

Question 15

Q

What type of DataFrame does Spark use?

Answer

A

Spark SQL uses a distributed DataFrame.

Question 16

Q

What is Spark Streaming?

Answer

A

Spark Streaming offers real-time data processing that can take inputs from multiple sources, integrate with machine learning, and then output to different data storage systems.

Question 17

Q

What is Spark MLlib?

Answer

A

Spark MLlib provides a set of machine learning tools optimized for parallelized execution, which enables processing of big data.

Question 18

Q

What is GraphX?

Answer

A

GraphX is a library of graph-processing tools used to traverse networks, find paths, and analyze connections. (Think of relationship data between entities, such as flights or social networks.)

Question 19

Q

What is Databricks?

Answer

A

A commercial, for-profit company founded by several of the original creators of Apache Spark. Its platform provides a complete collaborative environment for data engineering and data science in which to develop Spark applications.

(19 cards)