Kafka Streaming Flashcards
Spark vs Hadoop
Hadoop is good for doing MapReduce on data in batches. In Hadoop, batch processing is done on data that has been stored over a period of time, so there is a lag in processing.
Spark does real-time processing: data is processed as it is received, so there is no lag. Spark can also run batch processing up to 100 times faster than Hadoop MapReduce. Spark builds on the MapReduce model and extends it to efficiently support more types of computations.
Some of Spark's features:
1) Polyglot programming - APIs available in several languages (Scala, Java, Python, R)
2) Speed
3) Powerful caching
4) Multiple Formats - Spark supports many data sources such as Parquet, JSON, Hive, and Cassandra, apart from CSV, RDBMS tables, etc. The Data Source API provides a pluggable mechanism to access structured data through Spark SQL.
5) Lazy Evaluation - Spark is lazy and computes only when needed, which is one reason for its speed. Transformations are added to a DAG of computation, and only when the driver requests some data does the DAG actually get evaluated (see the sketch below).
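A minimal sketch of lazy evaluation; the app name and local[*] master are placeholders just for illustration. The filter and map below only extend the DAG; work happens when count() is called.

```scala
import org.apache.spark.sql.SparkSession

// Local session just for illustration; app name and master are placeholders.
val spark = SparkSession.builder()
  .appName("lazy-eval-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 1000000)

// Transformations: nothing runs yet, Spark only records them in the DAG.
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Action: the driver requests a result, so the DAG is actually evaluated now.
println(squares.count())
```

The later sketches in these notes assume a spark (SparkSession) and sc (SparkContext) like the ones created here.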
What are Spark Data sources?
The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. It is used to read and store structured and semi-structured data in Spark SQL. Data sources can be more than simple pipes that convert data and pull it into Spark.
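A small sketch of the Data Source API, assuming a SparkSession named spark; the paths, JDBC URL, and table name are made-up placeholders, not from these notes.

```scala
// Reading structured data through pluggable sources via Spark SQL.
val events = spark.read
  .format("json")
  .load("/data/events.json")                         // hypothetical input path

val users = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost/appdb")   // hypothetical RDBMS source
  .option("dbtable", "users")
  .load()

// The same mechanism is used for writing structured data back out.
events.write
  .format("parquet")
  .save("/data/events_parquet")                      // hypothetical output path
```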
What is RDD (Resilient Distributed Dataset)?
An RDD is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
It has certain characteristics:
1) Logical abstraction for a distributed dataset across the entire cluster, which may be divided into partitions
2) Resilient and Immutable - an RDD can be recreated at any point in the execution cycle; each stage of the DAG is an RDD
3) Compile time safety
4) Unstructured/Structured data
5) Lazy - an action starts the execution, much like Java streams (see the sketch below)
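A brief sketch of these characteristics, assuming the sc (SparkContext) from the earlier sketch; the numbers are arbitrary.

```scala
val numbers = sc.parallelize(1 to 100, 4)   // distributed across 4 logical partitions

// Transformations produce new immutable RDDs; `numbers` itself never changes.
val doubled = numbers.map(_ * 2)

println(doubled.getNumPartitions)   // 4
println(doubled.toDebugString)      // lineage used to recreate the RDD if a partition is lost
```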
What are DataFrames?
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases or existing RDDs.
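A sketch of a few construction paths, assuming spark and its implicits; the file path, Hive table name, and the Person case class are illustrative only.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.avg

case class Person(name: String, age: Int)

val fromFile = spark.read.json("/data/people.json")   // structured data file (hypothetical path)
val fromHive = spark.table("warehouse.people")        // hypothetical Hive table
val fromRdd  = spark.sparkContext
  .parallelize(Seq(Person("Ana", 34), Person("Bo", 29)))
  .toDF()                                             // existing RDD -> DataFrame

// Named columns allow relational-style operations, optimized by Catalyst underneath.
fromRdd.groupBy("name").agg(avg("age")).show()
```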
Apache Spark Components
1) Spark Core
2) Spark Streaming
3) Spark SQL
4) GraphX
5) MLlib (Machine Learning)
What is Spark Core?
Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Further, additional libraries which are built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:
1) Memory management and fault recovery
2) Scheduling, distributing and monitoring jobs on a cluster
3) Interacting with storage systems
What is Spark Streaming?
Spark Streaming is the component of Spark which is used to process real-time streaming data. Thus, it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is DStream which is basically a series of RDDs (Resilient Distributed Datasets) to process the real-time data.
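A minimal DStream sketch (word counts over a socket source); the host and port are placeholders, and in practice the source could just as well be Kafka or another receiver.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))     // each 5-second micro-batch is an RDD in the DStream

val lines  = ssc.socketTextStream("localhost", 9999)  // hypothetical live source
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()          // output operation applied to each micro-batch RDD
ssc.start()
ssc.awaitTermination()
```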
What is Spark SQL?
Spark SQL is the Spark module that integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those familiar with an RDBMS, Spark SQL is an easy transition from earlier tools, while extending the boundaries of traditional relational data processing.
Spark SQL integrates relational processing with Spark's functional programming. Further, it provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool (see the sketch after the list below).
The following are the four libraries of Spark SQL.
Data Source API
DataFrame API
Interpreter & Optimizer
SQL Service
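A sketch of weaving SQL with code transformations, assuming spark and its implicits; the JSON path, view name, and column names are made up.

```scala
import spark.implicits._

val orders = spark.read.json("/data/orders.json")     // hypothetical source
orders.createOrReplaceTempView("orders")

// Query via SQL...
val bigOrders = spark.sql("SELECT customer_id, total FROM orders WHERE total > 100")

// ...then keep transforming the result with the functional API.
bigOrders.filter($"customer_id".isNotNull)
         .groupBy("customer_id")
         .count()
         .show()
```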
Stream vs Table
A stream is an unbounded collection of facts; facts are immutable things that keep coming at you.
A table is a collection of evolving facts; it represents the latest fact about something.
Say I am using a music streaming service: I create a profile and set my current location to X. That becomes an event in the profile-update event stream. Some time later I move, stream a song from location Y, and change my current location to Y; that is also an event in the profile-update event stream. In a table, however, because it is a collection of evolving facts, the new profile-update event overrides the previous one, so the table holds only the latest location, i.e. location Y.
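A plain Scala sketch of the idea (not the Kafka Streams API): folding the profile-update stream into a table keeps only the latest fact per key; the names and values are made up.

```scala
case class ProfileUpdate(userId: String, location: String)

// The stream: an append-only sequence of immutable facts (unbounded in reality).
val stream = Seq(
  ProfileUpdate("alice", "X"),   // profile created at location X
  ProfileUpdate("alice", "Y")    // later update: location changed to Y
)

// The table view: each new event for a key overrides the previous one.
val table: Map[String, String] =
  stream.foldLeft(Map.empty[String, String]) { (acc, e) =>
    acc + (e.userId -> e.location)   // upsert the latest location
  }

println(table("alice"))   // "Y" - only the latest fact survives
```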
What is the primary unit for working with data in Spark?
RDD (Resilient Distributed Dataset) - it is immutable. Any transformation you apply produces a new RDD in memory, which stays immutable and is used in the next stage of the pipeline. This in-memory design is what makes Spark faster than Hadoop MapReduce.
Types of nodes in Spark
1) Driver - the brain or orchestrator, where the query plan is generated
2) Executor - there can be 100s or 1000s of executors / workers
3) Shuffle service
In what ways do Spark nodes communicate?
1) Broadcast - information you have on the driver that you want to put on all nodes. Used when the data is small enough, say 1 GB or less (see the sketch after this list).
2) Take - information the executors have that you want to bring back to the driver, for example to print it or write it to a file. The problem is that there are many executors and only one driver, so the driver can run out of memory or take too long to process that data.
3) DAG Action (Directed Acyclic Graph) - DAGs are instructions describing what should be done and in what order. They are passed from the driver to the executors and are small, so this is a very inexpensive operation.
4) Shuffle - needed for joins, group by, or reduce. As in an RDBMS, joining two large tables requires partitioning and ordering the data so the two sides can find their matches. In a distributed shuffle, mappers write to multiple shuffle temp locations broken into partitions, and each reducer reads its partition from every mapper. A lot of communication happens (mappers × reducers), which makes shuffle the most expensive operation.
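A sketch of broadcast, take, and a shuffle-triggering aggregation, assuming the sc from earlier; the lookup map and data are made up.

```scala
// Broadcast: small driver-side data pushed once to every executor.
val countryNames = Map("DE" -> "Germany", "IN" -> "India")
val lookup = sc.broadcast(countryNames)

val visits = sc.parallelize(Seq("DE", "IN", "DE"))
val named  = visits.map(code => lookup.value.getOrElse(code, "unknown"))

// Take: pull a small sample of executor data back to the single driver.
named.take(2).foreach(println)

// A key-based aggregation triggers a shuffle: the map side writes partitioned
// temp files, the reduce side fetches its partition from every mapper.
val counts = visits.map((_, 1)).reduceByKey(_ + _).collect()
```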
What makes Spark, Spark?
1) RDDs
2) DAGs
3) FlumeJava APIs
4) Long Lived Jobs
What is DAG?
A DAG represents the steps needed to get from point A to point B, for example joining two tables, grouping, and then aggregating; that is a multi-step DAG (see the sketch after the list below). The main thing Spark provides is that the DAG lives in the engine.
DAG contains:
1) Source
2) RDDs - which are formed after each transformation
3) Transformations - the edges of DAG
4) Actions (end nodes) - like foreach, print, count, take
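A sketch of a multi-step DAG (join, then group by, then aggregate), assuming spark; the Parquet paths and column names are illustrative.

```scala
import org.apache.spark.sql.functions.sum

val orders    = spark.read.parquet("/data/orders")      // hypothetical source
val customers = spark.read.parquet("/data/customers")   // hypothetical source

val totals = orders
  .join(customers, "customer_id")   // transformation: an edge in the DAG
  .groupBy("country")               // transformation
  .agg(sum("total"))                // transformation

totals.explain()   // the whole plan (DAG) lives in the engine before anything runs
totals.show()      // action: the end node that triggers execution
```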
Examples of Actions
Actions are what trigger execution, since Spark is lazy
1) foreach
2) reduce
3) take
4) count
Examples of Transformations (see the combined sketch below)
1) map
2) reduceByKey
3) groupByKey
4) join
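A sketch that ties the listed transformations and actions together, assuming the sc from earlier; the data is made up.

```scala
val sales = sc.parallelize(Seq(("apples", 3), ("pears", 5), ("apples", 2)))
val price = sc.parallelize(Seq(("apples", 1.2), ("pears", 0.8)))

// Transformations (lazy): map, reduceByKey, groupByKey, join
val totals  = sales.reduceByKey(_ + _)                          // ("apples", 5), ("pears", 5)
val grouped = sales.groupByKey()                                // ("apples", [3, 2]), ("pears", [5])
val joined  = totals.join(price)                                // ("apples", (5, 1.2)), ...
val revenue = joined.map { case (k, (qty, p)) => (k, qty * p) }

// Actions (eager): count, take, reduce, foreach
println(revenue.count())
revenue.take(1).foreach(println)
println(sales.map(_._2).reduce(_ + _))
```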