Apache Spark Flashcards
Learn about Apache Spark
What is Apache Spark?
An open-source, distributed computing system for big data processing
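For context, a minimal sketch of a Spark application in Scala, assuming the spark-sql dependency is on the classpath; the app name and the local master URL are illustrative, not prescribed by the card:

```scala
import org.apache.spark.sql.SparkSession

object SparkHello {
  def main(args: Array[String]): Unit = {
    // Create a local SparkSession; on a real cluster the master is usually set by the launcher
    val spark = SparkSession.builder()
      .appName("spark-hello")
      .master("local[*]")
      .getOrCreate()

    // Distribute a small collection and sum it in parallel
    val sum = spark.sparkContext.parallelize(1 to 100).reduce(_ + _)
    println(s"Sum of 1..100 = $sum")

    spark.stop()
  }
}
```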
How does Spark achieve fast data processing?
By performing computations in-memory instead of writing intermediate results to disk
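A small sketch of that idea, assuming a hypothetical text file at data/logs.txt: the dataset is cached in memory after the first action, so the second action is served from memory instead of re-reading the file from disk.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-demo").master("local[*]").getOrCreate()

// Hypothetical input path; replace with real data
val logs = spark.read.textFile("data/logs.txt")

// Keep the dataset in memory so later actions reuse it instead of re-reading from disk
logs.persist(StorageLevel.MEMORY_ONLY)

val errorCount = logs.filter(_.contains("ERROR")).count() // first action materializes the cache
val warnCount  = logs.filter(_.contains("WARN")).count()  // served from memory

println(s"errors=$errorCount warnings=$warnCount")
spark.stop()
```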
What is the scalability capability of Spark?
It can scale from a single machine to thousands of nodes
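A sketch of how the same application code scales by changing only the master URL; the SPARK_MASTER environment variable here is a made-up convenience, since in practice spark-submit usually supplies the master for you.

```scala
import org.apache.spark.sql.SparkSession

// The same code runs at different scales depending on the cluster manager:
//   local[*]                  -> all cores of a single machine
//   spark://host:7077         -> a standalone cluster
//   yarn / k8s://https://...  -> YARN or Kubernetes clusters with thousands of nodes
val spark = SparkSession.builder()
  .appName("scale-demo")
  .master(sys.env.getOrElse("SPARK_MASTER", "local[*]")) // illustrative; normally set via spark-submit
  .getOrCreate()

val df = spark.range(0, 1000000000L) // a billion rows, partitioned across whatever executors exist
println(df.selectExpr("sum(id)").first().getLong(0))
spark.stop()
```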
What is Spark Core responsible for?
Handles scheduling, memory management, fault tolerance, and task dispatching
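A classic word-count sketch that exercises Spark Core: the transformations only build a lazy plan, and the final action is what makes the core engine schedule, dispatch, and (if needed) retry tasks across the cluster.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("core-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext // Spark Core's entry point

// Transformations are lazy; collect() triggers task scheduling and execution
val words  = sc.parallelize(Seq("spark core schedules tasks", "spark manages memory"))
val counts = words
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

counts.collect().foreach { case (w, n) => println(s"$w -> $n") }
spark.stop()
```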
What functionality does Spark SQL provide?
Lets you query structured data with SQL and the DataFrame API, and integrates with data sources such as Hive, Parquet, and JSON
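A sketch of Spark SQL usage, assuming a hypothetical Parquet file at data/sales.parquet with region and amount columns:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()

// Hypothetical Parquet path; Spark SQL can also read Hive tables, JSON, ORC, JDBC sources, etc.
val sales = spark.read.parquet("data/sales.parquet")
sales.createOrReplaceTempView("sales")

// Query structured data with plain SQL
val topRegions = spark.sql(
  """SELECT region, SUM(amount) AS total
    |FROM sales
    |GROUP BY region
    |ORDER BY total DESC""".stripMargin)

topRegions.show()
spark.stop()
```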
What is the purpose of Spark Streaming?
Enables real-time stream processing for continuously incoming data
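A minimal Spark Streaming (DStream) sketch that counts words arriving on a TCP socket in 5-second micro-batches, assuming something like `nc -lk 9999` is feeding the port:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the socket receiver, one for processing
val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

// Count words arriving on a TCP socket
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```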
What is MLlib?
Spark's scalable machine learning library, providing distributed algorithms for classification, regression, clustering, and recommendation, plus utilities for feature engineering and pipelines
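A small MLlib sketch with a toy, made-up training set; it fits a logistic regression model and prints its predictions:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mllib-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Tiny illustrative training set: (label, features)
val training = Seq(
  (1.0, Vectors.dense(2.0, 1.5)),
  (0.0, Vectors.dense(0.5, 0.3)),
  (1.0, Vectors.dense(1.8, 2.2)),
  (0.0, Vectors.dense(0.2, 0.9))
).toDF("label", "features")

// Fit a logistic regression model; MLlib distributes the optimization across executors
val model = new LogisticRegression().setMaxIter(10).fit(training)
model.transform(training).select("label", "prediction").show()
spark.stop()
```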
What does GraphX do?
Spark's graph processing API, used to analyze relationships in data, such as social networks or recommendation systems
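A GraphX sketch over a tiny invented follower graph; running PageRank surfaces the most influential user:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("graphx-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// A tiny "who follows whom" graph: vertices are users, edges are follow relationships
val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows"), Edge(2L, 1L, "follows")))
val graph   = Graph(users, follows)

// PageRank highlights the most influential users in the network
val ranks = graph.pageRank(0.001).vertices
users.join(ranks).collect().foreach { case (_, (name, rank)) =>
  println(f"$name%-6s $rank%.3f")
}

spark.stop()
```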
Which programming languages does Spark support?
Java, Scala, Python, and R
What is Spark’s built-in fault tolerance mechanism?
Resilient Distributed Datasets (RDDs) record their lineage (the chain of transformations that produced them), so lost partitions can be recomputed on other nodes instead of relying on data replication
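A sketch showing that lineage: `toDebugString` prints the chain of transformations Spark would replay to rebuild a lost partition.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lineage-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Each transformation records how its partitions derive from the parent RDD
val base     = sc.parallelize(1 to 1000, numSlices = 8)
val squared  = base.map(x => x * x)
val filtered = squared.filter(_ % 2 == 0)

// The lineage graph Spark would replay if a partition were lost
println(filtered.toDebugString)
println(filtered.count())
spark.stop()
```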
True or False: Apache Spark processes data slower than traditional MapReduce frameworks.
False
Fill in the blank: Apache Spark is designed to handle _______ data processing and analytics.
[large-scale]
What makes Spark suitable for big data processing?
Its speed, scalability, ease of use, and fault tolerance