Apache Spark Flashcards
What things are, what they do, core concepts, and building blocks.
What is HDFS?
What are its features?
Hadoop Distributed File System.
Has a master/slave architecture: one NameNode (master) and n DataNodes (slaves).
Files are split into blocks; blocks are replicated across DataNodes.
Optimized for high-throughput reads of large datasets. Not good for updating data (writing) or for low-latency access.
Good for reliability: normal failures need no immediate action.
Enables parallel reading and processing of files.
How does Spark deal with data analytics?
Divides computations into small tasks.
Restarting one data analytics task has no effect on other active tasks.
What is Apache Spark?
A base platform for advanced analytics, provided through higher-level libraries: (a) Spark SQL; (b) Spark Streaming; (c) MLlib; and (d) GraphX.
What is “RDD”?
Resilient Distributed Dataset.
- A fault-tolerant collection of elements that can be operated on in parallel.
- Enables working with distributed data collections as if they were local collections.
Difference between Spark and Hadoop?
Why the difference?
Spark is up to 100x faster, because it keeps intermediate results in RAM instead of writing them to disk between steps.
The trade-off: Spark requires more RAM than Hadoop.
What is YARN? What does it do?
Yet Another Resource Negotiator.
- Is Hadoop's cluster resource manager and scheduler for jobs (e.g. map-reduce).
- Splits up resource management and job scheduling.
The ResourceManager (scheduler) controls resources throughout the system.
A NodeManager on each machine monitors that machine's resources and reports to the ResourceManager.
What is a Spark DataFrame?
How is it different from RDD?
An immutable distributed collection of data (like an RDD).
Difference: the data is organized into named columns, closer to a relational database table.
What is Google Dataproc?
How does it relate to Apache Spark?
A managed Spark and Hadoop service.
It enables scalable & automated cluster management for Spark.
What is Apache Hive?
How does it relate to Spark?
Apache Hive is a data warehouse system for "big data queries": it gives SQL-like access to data stored in HDFS.
A Spark session with Hive support enabled can run SQL queries against Hive tables.
What is MLlib?
MLlib is a machine learning library for Spark.
What are Pipelines? What does: pipeline = Pipeline(stages=[tokenizer, hashingTF, logReg]); model = pipeline.fit(labeledData) do?
Pipelines are part of MLlib. They offer a high-level API built on DataFrames.
First the pipeline generates features: the tokenizer splits each text into tokens, and hashingTF converts each set of tokens into a feature vector. Logistic regression is then applied to the feature vectors.
pipeline.fit() trains on the labeled dataset (supervised learning), producing a logistic regression binary classification model that can then classify new, unseen data.
What is Logistic Regression for?
Supervised classification: predicting a binary (yes/no) label for new data, learned from labeled examples.
What is K-Means clustering for?
Finding unknown groups (clusters) in unlabeled datasets (unsupervised learning).
What is GraphX?
What does it do?
GraphX is a Spark library that enables iterative graph computations within a single system.
A unified data structure that provides the ability to view big data as tables, graphs, etc. efficiently.
What is PageRank?
Where does the name come from?
How does it work?
What is its goal?
PageRank is a tool for evaluating the importance of Web pages.
The name comes from Larry Page (Google founder).
PageRank is driven by probability of landing on a page.
Simplified basic algorithm:
- Start each page at a rank of 1.
- On each iteration, have page p contribute rank_of_p / |neighbors_of_p| to each of its neighbors.
- Set each page's rank to 0.15 + 0.85 × (sum of contributions it received).
Goal is to rank pages without being tricked by approaches that fool search engines.
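The simplified algorithm above can be sketched in plain Python (the three-page link graph is made up for illustration; GraphX runs the same computation at scale):

```python
def pagerank(links, iterations=50):
    """Simplified PageRank over a dict mapping page -> list of neighbors."""
    ranks = {page: 1.0 for page in links}          # start each page at rank 1
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, neighbors in links.items():
            for n in neighbors:                    # p gives rank_of_p / |neighbors_of_p|
                contribs[n] += ranks[page] / len(neighbors)
        # New rank: 0.15 + 0.85 x (sum of received contributions).
        ranks = {page: 0.15 + 0.85 * contribs[page] for page in links}
    return ranks

# Hypothetical graph: a links to b and c, b links to c, c links back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # c, linked from a and b, ranks first
```

Because rank flows along real links, a page becomes important only when other (important) pages link to it, which is what makes the ranking hard to fool.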