Apache Spark Flashcards
What things are, what they do, core concepts, and building blocks.
What is HDFS?
What are its features?
Hadoop Distributed File System.
Has a master/slave architecture: one NameNode (master) and n DataNodes (slaves).
Files are split into blocks; blocks are replicated across DataNodes.
Optimized for high-throughput reads of large datasets. Not good for updating data (writing) or for low-latency access.
Good for reliability: normal failures need no immediate action.
Enables parallel reading and processing of files.
How does Spark deal with data analytics?
Divides computations into small tasks.
Restarting one data analytics task has no effect on other active tasks.
What is Apache Spark?
A base platform for advanced analytics, provided through higher-level libraries: (a) Spark SQL; (b) Spark Streaming; (c) MLlib; and (d) GraphX.
What is “RDD”?
Resilient Distributed Dataset.
- A fault-tolerant collection of elements that can be operated on in parallel.
- Enables working with distributed data collections as if they were local collections.
Difference between Spark and Hadoop?
Why the difference?
Spark is up to 100x faster, because it keeps intermediate results in RAM instead of writing them to disk between steps.
The trade-off: Spark requires more RAM than Hadoop.
What is YARN? What does it do?
Yet Another Resource Negotiator.
- Is Hadoop's cluster resource manager and scheduler for jobs (e.g. map-reduce).
- Splits up resource management and job scheduling.
The ResourceManager (scheduler) controls resources throughout the system.
A NodeManager on each machine monitors that machine's resources and reports to the ResourceManager.
What is a Spark DataFrame?
How is it different from RDD?
An immutable distributed collection of data (like an RDD).
Difference: the data is organized into named columns, closer to a relational database table.
What is Google Dataproc?
How does it relate to Apache Spark?
A managed Spark and Hadoop service.
It enables scalable & automated cluster management for Spark.
What is Apache Hive?
How does it relate to Spark?
Apache Hive is a data warehouse system for "big data queries": it gives SQL-like access to data stored in HDFS.
A Spark session with Hive support enabled can run SQL queries against Hive tables.
What is MLlib?
MLlib is a machine learning library for Spark.
What are Pipelines? What does: pipeline = Pipeline(stages=[tokenizer, hashingTF, logReg]); model = pipeline.fit(labeledData) do?
Pipelines are part of MLlib. They offer a high-level API built on DataFrames.
First the pipeline generates features: the tokenizer splits each text into tokens, and hashingTF converts each set of tokens into a feature vector. Logistic regression is then applied to the feature vectors.
pipeline.fit() trains on the labeled dataset (supervised learning), producing a logistic regression binary classification model that can then classify new, unseen data.
What is Logistic Regression for?
Supervised classification: predicting a binary (yes/no) label for new data, learned from labeled examples.
What is K-Means clustering for?
Finding unknown groups (clusters) in unlabeled datasets (unsupervised learning).
What is GraphX?
What does it do?
GraphX is a Spark library that enables iterative graph computations within a single system.
A unified data structure that provides the ability to view big data as tables, graphs, etc. efficiently.
What is PageRank?
Where does the name come from?
How does it work?
What is its goal?
PageRank is a tool for evaluating the importance of Web pages.
The name comes from Larry Page (Google founder).
PageRank is driven by probability of landing on a page.
Simplified basic algorithm:
- Start each page at a rank of 1.
- On each iteration, have page p contribute rank_of_p / |neighbors_of_p| to each of its neighbors.
- Set each page's rank to 0.15 + 0.85 × (sum of contributions it received).
Goal is to rank pages without being tricked by approaches that fool search engines.
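The simplified algorithm above can be sketched in plain Python (the three-page link graph is made up for illustration; GraphX runs the same computation at scale):

```python
def pagerank(links, iterations=50):
    """Simplified PageRank over a dict mapping page -> list of neighbors."""
    ranks = {page: 1.0 for page in links}          # start each page at rank 1
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, neighbors in links.items():
            for n in neighbors:                    # p gives rank_of_p / |neighbors_of_p|
                contribs[n] += ranks[page] / len(neighbors)
        # New rank: 0.15 + 0.85 x (sum of received contributions).
        ranks = {page: 0.15 + 0.85 * contribs[page] for page in links}
    return ranks

# Hypothetical graph: a links to b and c, b links to c, c links back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # c, linked from a and b, ranks first
```

Because rank flows along real links, a page becomes important only when other (important) pages link to it, which is what makes the ranking hard to fool.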