Apache Spark Flashcards

What is/does, concepts, building blocks of

1
Q

What is HDFS?

What are its features?

A

Hadoop Distributed File System.
Has a master/slave architecture: one NameNode (metadata) and n DataNodes (storage).
Files are split into blocks; each block is replicated across several DataNodes.
Optimized for high-throughput access to large volumes of data. Not suited to in-place updates (writes) or low-latency access.
Good for reliability: no immediate action is needed after (normal) node failures, because other replicas of each block remain available.
Enables parallel reading and processing of files.
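
The block-and-replica idea can be sketched in plain Python (this is not an HDFS API). The 128 MB block size and replication factor of 3 mirror common HDFS defaults; the function names and the round-robin placement are assumptions made for illustration, and real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # bytes; assumed default block size
REPLICATION = 3                 # assumed default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (hypothetical round-robin)."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file -> 3 blocks
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

In this sketch, losing any single DataNode still leaves two replicas of every block, which is why a normal failure needs no immediate action.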

2
Q

How does Spark deal with data analytics?

A

Divides computations into small tasks.

Restarting one data analytics task has no effect on other active tasks.
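
A minimal sketch of this idea in plain Python, with threads standing in for Spark executors. The names run_with_retry and flaky_sum are hypothetical, and Spark's own scheduler is considerably more involved; the point is only that a failed task is restarted on its own partition while the others proceed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, partition, max_attempts=3):
    """Run one task on one partition, restarting only this task on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(partition, attempt)
        except RuntimeError:
            if attempt == max_attempts:
                raise


def flaky_sum(partition, attempt):
    """Hypothetical task: fails on its first attempt for one partition only."""
    if partition[0] == 4 and attempt == 1:
        raise RuntimeError("simulated task failure")
    return sum(partition)


data = list(range(1, 10))                             # 1..9
partitions = [data[i:i + 3] for i in range(0, 9, 3)]  # three small partitions

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(lambda p: run_with_retry(flaky_sum, p), partitions))

total = sum(partial_sums)
```

Only the task for partition [4, 5, 6] is rerun; the other two partial sums are computed once and are unaffected.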

3
Q

What is Apache Spark?

A

A general-purpose cluster computing engine that serves as the base platform for advanced analytics, which is provided by higher-level libraries: (a) Spark SQL; (b) Spark Streaming; (c) MLlib; and (d) GraphX.

4
Q

What is “RDD”?

A

Resilient Distributed Dataset.
- A fault-tolerant collection of elements that can be operated on in parallel.
- Enables working with distributed data collections as if they were local data collections.

5
Q

Difference between Spark and Hadoop?

Why the difference?

A

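Spark can be up to 100x faster than Hadoop MapReduce for some workloads, because it keeps intermediate results in RAM where possible instead of writing them to disk between steps.
So Spark typically requires more RAM than Hadoop MapReduce.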

6
Q

What is YARN? What does it do?

A

Yet Another Resource Negotiator.
- Hadoop's cluster resource manager; it schedules MapReduce jobs as well as other frameworks, including Spark.
- Splits resource management and job scheduling/monitoring into separate components.
ResourceManager (scheduler) controls resources throughout the cluster.
NodeManager monitors resources on a single machine and reports to the ResourceManager.

7
Q

What is a Spark DataFrame?

How is it different from RDD?

A

An immutable distributed collection of data (like RDD).

Difference: data is organized into named columns with a schema (closer to a table in a relational database).

8
Q

What is Google Dataproc?

How does it relate to Apache Spark?

A

A managed Spark and Hadoop service.

It enables scalable & automated cluster management for Spark.

9
Q

What is Apache Hive?

How does it relate to Spark?

A

Apache Hive is a data warehouse system built on Hadoop that runs SQL-like queries (HiveQL) over large datasets stored in tables, often in optimized columnar formats.
Spark SQL can query Hive tables.

10
Q

What is MLlib?

A

MLlib is a machine learning library for Spark.

11
Q
What are Pipelines?
What does:
pipeline = Pipeline(stages=[tokenizer, hashingTF, logReg])
model = pipeline.fit(labeledData)
do?
A

Pipelines are part of MLlib. They offer a uniform set of high-level APIs built on DataFrames.
The pipeline first generates features with the tokenizer (splits text into tokens) and hashingTF (converts each set of tokens into a feature vector), and then applies logistic regression to the feature vectors.
pipeline.fit() trains the whole pipeline on a labeled dataset (supervised learning). The result is a fitted model, here a logistic regression binary classifier, that can then be used to classify new, unseen data.
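
The feature-generation step can be sketched in plain Python. This is a stand-in for the idea only, not MLlib's implementation: the functions tokenize and hashing_tf are hypothetical, the vector length of 16 is arbitrary, and crc32 is used merely because it is deterministic (Spark's HashingTF uses a different hash function).

```python
import zlib


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector via the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        # crc32 is a stand-in hash chosen for determinism; an assumption of this sketch.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


tokens = tokenize("Spark is fast and Spark is general")
features = hashing_tf(tokens)
```

Every document ends up as a vector of the same length regardless of its vocabulary, which is what lets a classifier such as logistic regression consume it. Distinct tokens can collide in the same slot; a larger num_features makes that less likely.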

12
Q

What is Logistic Regression for?

A

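Predicting the probability of a categorical (typically binary) outcome from input features. It is a supervised classification method: it is trained on labeled data and then classifies new data.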

13
Q

What is K-Means clustering for?

A

Finding unknown groups in datasets.
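
A minimal sketch of the standard k-means iteration (Lloyd's algorithm) on one-dimensional points, in plain Python rather than MLlib. The function name kmeans_1d, the sample points, and the fixed starting centers are assumptions chosen to keep the example deterministic.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
# centers  -> [2.0, 11.0]
# clusters -> [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

No labels are supplied: the two groups emerge from the distances alone, which is what makes this an unsupervised method.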

14
Q

What is GraphX?

What does it do?

A

GraphX is a Spark library that enables iterative graph computations within a single system.
It provides a unified data structure that lets the same big data be viewed efficiently as tables, as graphs, etc.

15
Q

What is PageRank?
Where does the name come from?
How does it work?
What is its goal?

A

PageRank is an algorithm for evaluating the importance of Web pages.
The name comes from Larry Page (Google co-founder).
PageRank is driven by the probability of landing on a page.
Simplified basic algorithm:
- Start each page at a rank of 1.
- On each iteration, have page p contribute rank_of_p / |nr_of_neighbors_of_p| to each of its neighbors.
- Set each page's rank to 0.15 + 0.85 × contribs, where contribs is the sum of the contributions the page received.
The goal is to rank pages without being tricked by approaches that fool search engines.
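
The simplified algorithm above can be sketched in plain Python. The function name pagerank, the four-page graph, and the choice of 10 iterations are assumptions for illustration; the sketch also assumes every page appears as a key in links and has at least one outgoing link.

```python
def pagerank(links, iterations=10):
    """Simplified PageRank: links maps each page to the pages it links to."""
    # Start each page at a rank of 1.
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        # Each page splits its current rank evenly among its neighbors.
        for page, neighbors in links.items():
            for neighbor in neighbors:
                contribs[neighbor] += ranks[page] / len(neighbors)
        # New rank: 0.15 + 0.85 x the contributions received.
        ranks = {page: 0.15 + 0.85 * contribs[page] for page in links}
    return ranks


links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
```

Page d has no incoming links, so it receives no contributions and stays at the floor rank of 0.15, while pages that are linked to by others accumulate higher ranks.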
