Week 6 - Apache Spark Flashcards
What does RDD stand for?
Resilient Distributed Dataset
Is an RDD read-only?
Yes
RDDs can only be created through? (2)
1) Data in stable storage
2) Other RDDs
RDD is a restricted form of distributed shared ____ what?
Memory (a cached-dataset, shared-memory abstraction)
What does an RDD contain (dataset)?
Atomic pieces of the dataset (partitions)
What does an RDD contain (dependencies)?
Dependencies on parent RDDs
(for fault tolerance)
How does an RDD compute its dataset?
A function for computing the dataset based on its parents (used for fault tolerance),
plus metadata about its partitioning scheme and data placement
RDDs are read-only and?
Partitioned collections of records
Two important features of RDDs and Apache Spark
1) Fault Tolerance
2) Lazy Evaluation
Describe RDD Fault Tolerance
It is achieved through lineage: a lost partition is recomputed by replaying the transformations that produced it from its parent RDDs
Describe RDD Lazy Evaluation
An RDD is not actually computed until a reduce-like or persist action needs it to produce meaningful output (see the sketch after the Actions card below)
What two classes of operations can you do on RDDs?
1) Transformations
2) Actions
RDD Transformations
Build RDDs through operations on other RDDs
1) map, filter, join
2) Lazy operations (nothing executes until an action)
RDD Actions
1) count, collect, save
2) Trigger execution
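A minimal sketch of transformations vs. actions (assumes a SparkContext named sc, e.g. from spark-shell; the file names are hypothetical):

val lines  = sc.textFile("data.txt")          // creates an RDD lazily
val longer = lines.filter(_.length > 80)      // transformation: lazy, nothing runs yet
val words  = longer.flatMap(_.split(" "))     // transformation: still lazy

val n = words.count()                         // action: triggers execution of the whole lineage
words.saveAsTextFile("out")                   // action: re-runs the lineage unless persisted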
HDFS is?
1) Hadoop Distributed File System
2) A distributed file system
3) Contains text files, e.g. logs and error output
How to find errors in HDFS files?
file.filter(_.contains("ERROR"))
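The same filter as a runnable sketch (assumes a SparkContext sc; the HDFS URL and path are hypothetical):

val file   = sc.textFile("hdfs://namenode:9000/logs/app.log")
val errors = file.filter(_.contains("ERROR"))
errors.take(5).foreach(println)   // action: only now is the file actually read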
DAG Scheduler
Partitions the DAG into efficient stages (think narrow vs. wide dependencies)
Narrow Dependencies
Transformation whose output needs input from only one partition (very little communication)
1) map
2) union
Wide Dependencies
Transformations with multiple dependencies: they need data from other partitions
1) groupByKey
2) join with inputs that are not co-partitioned
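A sketch contrasting the two dependency types (assumes a SparkContext sc):

val nums = sc.parallelize(1 to 1000, 4)         // 4 partitions

// Narrow: each output partition reads exactly one input partition (no shuffle)
val doubled = nums.map(_ * 2)

// Wide: groupByKey needs rows from every partition, so it forces a shuffle
// and starts a new stage in the DAG scheduler
val grouped = doubled.map(n => (n % 10, n)).groupByKey()

grouped.count()                                 // action: runs both stages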
Should wide dependencies come early or late in the DAG?
Late (less data to shuffle by that point)
Hadoop scheduler
Only 2 stages:
1) Map
2) Reduce
Hadoop: where is data stored?
Assumes all data is on disk (intermediate data has to be written to disk)
Why? Fault tolerance
Hadoop API
Only Map and Reduce procedural programming
Hadoop storage
Only on the HDFS file system (now extended)
Does Hadoop use more or less memory than Spark?
Less; it stores data locally on disk
Apache Spark has what APIs
1) Java
2) Scala
3) Python
4 Main Apache Spark Libraries
1) Spark SQL
2) Spark Streaming
3) MLlib (machine learning)
4) GraphX
What is a DataFrame?
Looks like a table (can run SQL operations on it)
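A minimal DataFrame sketch (assumes Spark 2.x+; the app name and data are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFrameSketch").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")   // looks like a table
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()         // SQL on the DataFrame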
Which Spark library is used for Google PageRank and Shortest Path?
GraphX
What Spark Library is used for Streaming
Spark Streaming (process data in real time)
Spark DStream
An abstraction that represents a streaming data source.
What does a Spark DStream do?
1) Chops the data into batches of x seconds
2) Processes each batch like an RDD
3) Returns the processed RDD results in batches
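A minimal DStream sketch (assumes spark-streaming on the classpath; the socket source, host, and port are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))        // chop the stream into 1-second batches

val lines  = ssc.socketTextStream("localhost", 9999)     // the streaming source
val errors = lines.filter(_.contains("ERROR"))           // processed like an RDD transformation
errors.count().print()                                   // one result per batch

ssc.start()
ssc.awaitTermination()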
Spark DStream batch sizes
1) As low as 1/2 second
2) Latency of about 1 second
Apache Kafka
Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
DStream - Stateful Operations / Transformations
1) Window (timed windows, counting, finding first or last); see the windowed sketch below
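A windowed variant of the DStream sketch above (the 30-second window and 10-second slide are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))

val words  = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))  // counts over the last 30s, recomputed every 10s
counts.print()

ssc.start()
ssc.awaitTermination()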
Apache Hadoop - Apache Hive
SQL-like - a data warehouse infrastructure that provides data summarization and ad hoc querying (HiveQL)
Apache Hadoop - Apache Pig
A high-level data-flow language and execution framework for parallel computation (less rigid than plain Hadoop MapReduce)
Apache Hadoop - Apache HBase
NoSQL database (based on BigTable).
A scalable, distributed database that supports structured data storage for large tables.
Apache Hadoop - Apache ZooKeeper
A high-performance coordination service for distributed architectures (configuration, naming, synchronization)
Apache Mahout
Machine learning algorithms - a distributed linear algebra framework and mathematically expressive Scala DSL
What does GeoSpark do?
It takes the RDD layer (generic data processing) and extends it with spatial data processing operations.
Spatial query processing layer
Out-of-the-box implementations of the de facto standard spatial queries: range queries, kNN (k-nearest-neighbor) queries, and join queries
What might cause data skew (GeoSpark)?
Creating a uniform grid and inserting data based on the grid coordinates
This creates a load-balancing problem
Load-balancing problem
A grid problem caused by data skew: some boxes are empty, some boxes are too full
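A small sketch of why a uniform grid skews (plain Scala, no Spark; the grid extent, cell count, and point counts are all made-up assumptions):

import scala.util.Random

case class Point(x: Double, y: Double)

// Assign each point a cell id on a cellsPerSide x cellsPerSide uniform grid over [0,10) x [0,10)
def gridId(p: Point, cellsPerSide: Int): Int = {
  val col = math.min((p.x / 10.0 * cellsPerSide).toInt, cellsPerSide - 1)
  val row = math.min((p.y / 10.0 * cellsPerSide).toInt, cellsPerSide - 1)
  row * cellsPerSide + col
}

// Clustered data: 1000 points packed into one corner, 10 spread out
val points = Seq.fill(1000)(Point(Random.nextDouble() * 0.1, Random.nextDouble() * 0.1)) ++
             Seq.fill(10)(Point(Random.nextDouble() * 10, Random.nextDouble() * 10))

val histogram = points.groupBy(gridId(_, 4)).map { case (id, ps) => id -> ps.size }
println(histogram) // one cell holds ~99% of the points: some boxes empty, one far too full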
Four types of grids
1) Uniform grid
2) Quad-tree - based on density
3) KDB-tree - no overlap
4) R-tree - based on clusters (overlap)
When do you build a local index (GeoSpark)?
When the same computation runs many times -
hundreds of thousands of times overall, or thousands of times per partition
Spatial Join Query (GeoSpark)
Count the number of points within an area
(GeoSpark) Is a join or a filter more expensive?
Join
What are space-filling curves?
1) They map 2D data to one number
2) They lose detail
A space-filling curve is a geometrical/mathematical property of 2D data:
partition the space into cells, map each 2D cell to a single number,
and the numbering preserves proximity - cells with close numbers are close in space.
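A sketch of one concrete space-filling curve, the Z-order (Morton) curve (plain Scala; the 16-bit coordinate width is an assumption):

// Spread the low 16 bits of v so there is a gap between consecutive bits
def spread(v: Int): Long = {
  var x = v.toLong & 0xFFFFL
  x = (x | (x << 8)) & 0x00FF00FFL
  x = (x | (x << 4)) & 0x0F0F0F0FL
  x = (x | (x << 2)) & 0x33333333L
  x = (x | (x << 1)) & 0x55555555L
  x
}

// Interleave the bits of the two grid coordinates into one number;
// cells whose codes are close tend to be close in 2D space (with some detail lost)
def mortonCode(x: Int, y: Int): Long = spread(x) | (spread(y) << 1)

println(mortonCode(3, 5)) // 39
println(mortonCode(4, 5)) // 50 - a neighboring cell gets a nearby code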
Spatial indexes
1) Similar to hash tables (uniform grid): index based on grid id
2) Can also use a quad-tree (partition into 4, then partition each quadrant into 4, ...)
3) R-tree - bounding rectangles
4) Voronoi diagram
A Voronoi diagram partitions the space into cells with the mathematical property that every point inside a cell is closer to that cell's seed point than to any other seed, which makes it useful for k-nearest-neighbor queries.