Book Concepts Flashcards
What are the 4 key characteristics of Spark?
- Speed
- Ease of Use
- Modularity
- Extensibility
Why is Spark faster than its predecessors?
Because it performs its intermediate computations in memory rather than writing them to disk.
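A minimal PySpark sketch of this idea (the dataset is made up): cache() keeps a computed result in executor memory, so later actions reuse it instead of recomputing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# A hypothetical dataset, just for illustration.
df = spark.range(1_000_000)

# cache() asks Spark to keep the computed result in memory,
# so later actions reuse it instead of recomputing from scratch.
df.cache()

df.count()   # first action: computes and materializes the cache
df.count()   # second action: served from memory

spark.stop()
```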
In-memory calculation is related to which Spark characteristic?
Speed
DAGs are related to which Spark characteristic?
Speed
Reading data from different sources is related to which Spark characteristic?
Extensibility
Spark's core modules are related to which Spark characteristic?
Modularity
RDDs are related to which Spark characteristic?
Ease of use
What is an RDD?
A Resilient Distributed Dataset: an immutable, partitioned collection of data distributed across the cluster.
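A minimal PySpark sketch, assuming a local session, showing an RDD's partitioning and immutability (transformations return a new RDD rather than modifying the original):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection as an RDD split into 4 partitions.
rdd = sc.parallelize(range(10), numSlices=4)
print(rdd.getNumPartitions())   # -> 4

# RDDs are immutable: map() returns a new RDD; `rdd` is untouched.
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())        # -> [0, 2, 4, ..., 18]

spark.stop()
```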
Which languages does Spark support?
Scala, Java, Python, SQL, and R
Which modules does Spark have?
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
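As a small illustration of one of these modules, a Spark SQL sketch (the table name and rows are invented for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# A tiny in-memory DataFrame; names and ages are made up.
df = spark.createDataFrame([("Alice", 34), ("Bob", 41)], ["name", "age"])

# The Spark SQL module lets you query DataFrames with plain SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 35").show()

spark.stop()
```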
(T or F) Spark can't deal with late data.
False
(T or F) Spark is fault-tolerant.
True
What are the 2 main components of Spark?
- Driver
- Executor
Explain Spark Driver
The component responsible for orchestrating parallel operations on the Spark cluster.
Explain Spark Executor
The component responsible for executing the tasks assigned by the driver.
Which component is responsible for requesting more CPU and memory from the cluster manager?
Driver
Which component creates the DAG?
Driver
Which component communicates with the cluster manager?
Driver
Explain SparkSession
The conduit to all Spark operations and data; the single entry point for a Spark application.
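A minimal sketch of creating one with the builder pattern; the app name and the local master URL are placeholder choices, not requirements:

```python
from pyspark.sql import SparkSession

# The SparkSession is the single entry point to Spark; the builder
# pattern creates a new session or reuses an existing one.
spark = (SparkSession.builder
         .appName("my-app")        # arbitrary application name
         .master("local[*]")       # local mode, all cores (for testing)
         .getOrCreate())

# All operations flow through the session object.
spark.range(5).show()
spark.stop()
```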
Explain Cluster Manager
The component responsible for managing and allocating resources for the cluster nodes.
Explain Spark Architecture
The driver creates a SparkSession, which serves as the interface between Spark and the data. The SparkContext communicates with the cluster manager, which is responsible for allocating resources and managing task execution. Tasks are sent to the workers (executors), which do the actual work.
Which component creates the SparkContext?
Driver
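A short sketch showing that the SparkContext lives inside the SparkSession the driver creates (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("context-demo").getOrCreate()

# The driver's SparkSession wraps the SparkContext, which is what
# actually talks to the cluster manager.
sc = spark.sparkContext
print(sc.appName)              # -> "context-demo"
print(sc.defaultParallelism)   # parallelism hint from the cluster

spark.stop()
```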
What are the four cluster managers supported by Spark?
- Hadoop YARN
- Mesos
- Kubernetes
- Standalone
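A sketch of how the cluster manager is chosen via the master URL; the hosts and ports below are placeholders, not real endpoints:

```python
from pyspark.sql import SparkSession

# The master URL selects the cluster manager:
#   Standalone:  .master("spark://host:7077")
#   Hadoop YARN: .master("yarn")
#   Mesos:       .master("mesos://host:5050")
#   Kubernetes:  .master("k8s://https://host:6443")
spark = (SparkSession.builder
         .appName("master-demo")
         .master("local[*]")   # local fallback, handy for testing
         .getOrCreate())

spark.stop()
```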
Explain Job
A parallel computation made up of multiple tasks, created in response to a Spark action. Each job is transformed into a DAG, which is Spark's execution plan.
Explain Stage
A smaller part of a job. Jobs are divided into stages based on which operations can run serially or in parallel, typically at shuffle boundaries.
Explain task
A single unit of work that is sent to a Spark executor; each task works on a single partition of data.
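To tie job, stage, and task together, a minimal PySpark sketch (the column name is made up): the action triggers a job; the shuffle introduced by groupBy splits it into stages; each stage runs one task per partition.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("job-demo").getOrCreate()

df = spark.range(100).withColumn("key", F.col("id") % 3)

# Transformations are lazy: nothing runs yet.
result = df.groupBy("key").count()   # groupBy forces a shuffle

# The action below triggers a job; the driver turns its DAG into
# stages (split at the shuffle) and one task per partition.
result.show()

spark.stop()
```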