Book Concepts Flashcards

1
Q

What are the 4 key characteristics of Spark?

A
  • Speed
  • Ease of Use
  • Modularity
  • Extensibility
2
Q

Why is Spark faster than its predecessors?

A

Because it performs its intermediate calculations in memory.

3
Q

In-memory calculation relates to which Spark characteristic?

A

Speed

4
Q

DAGs relate to which Spark characteristic?

A

Speed

5
Q

Reading data from different sources relates to which Spark characteristic?

A

Extensibility

6
Q

The Spark core modules relate to which Spark characteristic?

A

Modularity

7
Q

RDDs relate to which Spark characteristic?

A

Ease of use

8
Q

What is an RDD?

A

Resilient Distributed Dataset: a partitioned, immutable collection of data.

9
Q

Which languages does Spark support?

A

Scala, Java, Python, SQL, and R

10
Q

Which modules does Spark have?

A
  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX
11
Q

(T or F) Spark can't deal with late data.

A

False

12
Q

(T or F) Spark is fault-tolerant.

A

True

13
Q

What are the 2 main components of Spark?

A
  • Driver
  • Executor
14
Q

Explain Spark Driver

A

The component responsible for orchestrating parallel operations on the Spark cluster.

15
Q

Explain Spark Executor

A

The component responsible for executing the tasks.

16
Q

Which component is responsible for requesting more CPU and memory from the cluster?

A

Driver

17
Q

Which component creates the DAG?

A

Driver

18
Q

Which component communicates with the cluster?

A

Driver

19
Q

Explain SparkSession

A

The conduit to all Spark operations and data.

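A minimal PySpark sketch of obtaining a SparkSession; the app name and file path are hypothetical:

    from pyspark.sql import SparkSession

    # Build (or reuse) the single SparkSession for this application.
    spark = (SparkSession.builder
             .appName("flashcards-demo")      # hypothetical app name
             .getOrCreate())

    # Every interaction with data goes through this session.
    df = spark.read.json("data/people.json")  # hypothetical path
    df.show()
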
20
Q

Explain Cluster Manager

A

The component responsible for managing and allocating resources for the cluster nodes.

21
Q

Explain Spark Architecture

A

The driver creates a SparkSession, which is the interface between Spark and the data. Through the SparkContext, the driver communicates with the cluster manager, which is responsible for allocating resources and managing the cluster nodes. The tasks are sent to executors on the worker nodes, which do the work.

22
Q

Which component creates SparkContext?

A

Driver

23
Q

What are the four cluster managers supported by Spark?

A
  • Hadoop YARN
  • Mesos
  • Kubernetes
  • Standalone
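
A hedged sketch of how the cluster manager above is chosen via the master URL; hostnames and ports are placeholders:

    from pyspark.sql import SparkSession

    # The master URL selects the cluster manager.
    spark = (SparkSession.builder
             .master("yarn")                        # Hadoop YARN
             # .master("mesos://host:5050")         # Mesos
             # .master("k8s://https://host:6443")   # Kubernetes
             # .master("spark://host:7077")         # Standalone
             .appName("cluster-manager-demo")
             .getOrCreate())
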
24
Q

Explain Job

A

Multiple tasks executed in parallel. Each job is transformed into a DAG; the DAG is Spark's execution plan.

25
Q

Explain Stage

A

The smaller parts into which a job is divided.

26
Q

Explain task

A

A single unit of work that is sent to a Spark executor.

27
Q

What are the 2 types of Spark operations?

A

Transformations and actions

28
Q

Explain transformations

A

They transform a Spark DataFrame into a new one; DataFrames in Spark are immutable.

29
Q

(T or F) Actions are evaluated lazily.

A

False. It is the transformations that are evaluated lazily.

30
Q

Explain actions

A

They trigger the transformations and process the data.

31
Q

Explain lazy evaluation

A

The delaying of computation until an action is invoked.
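
A short sketch of lazy evaluation, assuming a hypothetical CSV file with customer and amount columns:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("data/sales.csv", header=True)   # hypothetical file

    # Transformations: only recorded in the query plan, nothing is computed yet.
    high = df.filter(col("amount").cast("double") > 100).select("customer", "amount")

    # Action: triggers the transformations above and processes the data.
    print(high.count())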

32
Q

Transformation or action? orderBy()

A

transformation

33
Q

Transformation or action? show()

A

action

34
Q

Transformation or action? groupBy()

A

transformation

35
Q

Transformation or action? take()

A

action

36
Q

Transformation or action? filter()

A

transformation

37
Q

Transformation or action? select()

A

transformation

38
Q

Transformation or action? count()

A

action

39
Q

Transformation or action? collect()

A

action

40
Q

Transformation or action? join()

A

transformation

41
Q

Transformation or action? save()

A

action

42
Q

Explain narrow transformation

A

A transformation that operates on a single partition and produces its result in a single partition, without exchanging data between partitions.

43
Q

Explain wide transformation

A

A transformation that reads data from multiple partitions and requires shuffling data across partitions.
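
A quick illustration, using a hypothetical people_df DataFrame:

    from pyspark.sql.functions import col

    # Narrow: filter() works on each partition independently, no shuffle needed.
    adults = people_df.filter(col("age") >= 18)

    # Wide: groupBy() must combine rows from many partitions, so data is shuffled.
    by_city = people_df.groupBy("city").count()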

44
Q

Wide or narrow transformation? filter()

A

narrow

45
Q

Wide or narrow transformation? contains()

A

narrow

46
Q

Wide or narrow transformation? groupBy()

A

wide

47
Q

Wide or narrow transformation? orderBy()

A

wide

48
Q

Explain the Catalyst optimizer

A

It transforms Spark operations into an execution plan to be run on RDDs.

49
Q

What are the 4 phases of the Catalyst optimizer?

A
  • Analysis
  • Logical Optimization
  • Physical planning
  • Code Generation
50
Q

Explain logical optimization

A

The logical plan is optimized, for example by reordering operations to gain performance.

51
Q

Explain physical planning

A

The optimized logical plan is turned into one or more physical execution plans, and the most efficient one is chosen for execution.

52
Q

Explain explain()

A

Shows a query's logical and physical plans.
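
For example, on a hypothetical DataFrame df:

    # extended=True prints the parsed, analyzed and optimized logical plans
    # as well as the physical plan.
    df.selectExpr("amount * 2 AS doubled").explain(True)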

53
Q

List 3 ways to cast a column type

A
  • Using col('col_name').cast()
  • Using select
  • Using selectExpr
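
A sketch of the three casting options above, assuming a hypothetical df with a string column amount:

    from pyspark.sql.functions import col

    df1 = df.withColumn("amount", col("amount").cast("double"))    # col(...).cast()
    df2 = df.select(col("amount").cast("double").alias("amount"))  # select
    df3 = df.selectExpr("CAST(amount AS double) AS amount")        # selectExpr
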
54
Q

Why should we pass a schema when reading or creating a DataFrame?

A

It avoids the cost of schema inference and allows early detection of data errors.

55
Q

Explain 2 ways of creating a schema

A
  • Using StructType and StructField
  • Using a DDL string: 'column1 type, column2 type'
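
A sketch of both options, with hypothetical column names and file path:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # 1) Programmatically, with StructType / StructField
    schema1 = StructType([
        StructField("name", StringType(), True),
        StructField("age",  IntegerType(), True),
    ])

    # 2) As a DDL-formatted string
    schema2 = "name STRING, age INT"

    df = spark.read.csv("data/people.csv", header=True, schema=schema1)
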
56
Q

List 3 ways to get the schema

A
  • df.dtypes
  • df.schema
  • df.printSchema()
57
Q

What is the main difference between CSV and Parquet?

A

Parquet is columnar, whereas CSV stores rows.

58
Q

What are the 3 ways to select a column, and which is the best?

A
  • col('column_name') (BEST, because the column can be referenced at the moment of its creation)
  • df.column_name
  • df['column_name']
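
The same selection written all three ways, with a hypothetical df and column amount:

    from pyspark.sql.functions import col

    df.select(col("amount"))   # col("column_name")
    df.select(df.amount)       # df.column_name
    df.select(df["amount"])    # df["column_name"]
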
59
Q

What is the difference between select and selectExpr?

A

The latter eliminates the need for expr() in the select expression.
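
For example, with a hypothetical amount column:

    from pyspark.sql.functions import expr

    df.select(expr("amount * 2 AS doubled"))   # select needs expr() for SQL expressions
    df.selectExpr("amount * 2 AS doubled")     # selectExpr takes the SQL string directly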

60
Q

Cite 2 ways to order data

A
  • orderBy
  • sort
61
Q

Explain the usage of asc_nulls_first, asc_nulls_last, desc_nulls_first, desc_nulls_last

A

Used to control where null values appear when ordering a column.
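
A small sketch on a hypothetical amount column:

    from pyspark.sql.functions import col

    df.orderBy(col("amount").asc_nulls_first())   # nulls before the non-null values
    df.orderBy(col("amount").asc_nulls_last())    # nulls after the non-null values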

62
Q

Explain lit

A

Literal: puts the same value in all rows of a column.
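
For instance, tagging every row with the same (hypothetical) value:

    from pyspark.sql.functions import lit

    # Adds a column holding the literal "csv" on every row.
    df2 = df.withColumn("source", lit("csv"))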

63
Q

Explain least and greatest

A

They compare columns row-wise to get the maximum (greatest) and minimum (least) values.
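
A sketch with hypothetical quarterly columns:

    from pyspark.sql.functions import least, greatest

    df.select(
        least("q1", "q2", "q3").alias("min_quarter"),      # row-wise minimum
        greatest("q1", "q2", "q3").alias("max_quarter"),   # row-wise maximum
    )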

64
Q

Explain initcap

A

Returns the string with the first character of each word in upper case.

65
Q

Cite 2 ways to get an element from an array

A
  • [x]
  • getItem(x)
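
Both forms on a hypothetical array column tags:

    from pyspark.sql.functions import col

    df.select(col("tags")[0])           # [x]
    df.select(col("tags").getItem(0))   # getItem(x)
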
66
Q

Explain explode

A

Creates a new row for every element in the array.
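
For example, on a hypothetical DataFrame with an id column and an array column tags:

    from pyspark.sql.functions import explode

    # One output row per element of the tags array, paired with its id.
    df.select("id", explode("tags").alias("tag"))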