Spark and Databricks Flashcards
Can you explain the design schemas relevant to data modeling?
There are three data modeling design schemas: Star, Snowflake, and Galaxy.
The star schema contains various dimension tables connected to a single fact table in the center.
The snowflake schema is an extension of the star schema: it consists of a fact table and dimension tables that are further normalised into snowflake-like layers.
The Galaxy (fact constellation) schema contains two or more fact tables that share dimension tables between them.
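For example, a typical star-schema query joins the central fact table to its dimension tables. A minimal Spark SQL sketch, assuming a SparkSession named spark and hypothetical fact_sales, dim_date, and dim_product tables:
    // Star-schema query: join the central fact table to two dimension tables
    val salesByCategory = spark.sql("""
      SELECT d.year, p.category, SUM(f.amount) AS total_sales
      FROM fact_sales f
      JOIN dim_date d    ON f.date_key = d.date_key
      JOIN dim_product p ON f.product_key = p.product_key
      GROUP BY d.year, p.category
    """)
    salesByCategory.show()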
Why do data systems require a disaster recovery plan?
Disaster recovery planning involves backing up files and media in real time. The backup storage is used to restore files in case of a cyber-attack or equipment failure. Security protocols are put in place to monitor, trace, and restrict both incoming and outgoing traffic.
What is data orchestration, and what tools can you use to perform it?
Data orchestration is an automated process for accessing raw data from multiple sources, performing data cleaning, transformation, and modeling techniques, and serving it for analytical tasks. The most popular tools are Apache Airflow, Prefect, Dagster, and AWS Glue.
What issues does Apache Airflow resolve?
Apache Airflow allows you to manage and schedule pipelines for the analytical workflow, data warehouse management, and data transformation and modeling under one roof.
You can monitor execution logs in one place, and callbacks can be used to send failure alerts to Slack and Discord. Finally, it is easy to use, provides a helpful user interface and robust integrations, and is free and open source.
What are the various modes in Hadoop?
Hadoop mainly runs in three modes:
Standalone Mode: used for debugging; it does not use HDFS and relies on the local file system for input and output.
Pseudo-distributed Mode: a single-node cluster where the NameNode and DataNode run on the same machine. It is mainly used for testing purposes.
Fully-Distributed Mode: the production-ready mode, where data is distributed across the multiple nodes of a cluster and separate nodes run the master and slave daemons.
What are the three V’s of big data?
Volume (of data)
Velocity (how fast it’s coming in)
Variety (diversity of structure and content)
Additional V’s:
Veracity (accuracy, trustworthiness)
Value
Validity
Visualisation
Variability
Vulnerability
Visibility
Volatility
What is the definition of big data?
Depends on situation, but typically any of:
- > 100TB
- Requires parallel processing
- Too large for operational databases
- Requires big data technology (even if it’s ‘small’ data)
What is data gravity?
The tendency of data accumulating on a single cloud platform to attract more of everything:
- The more data there is, the more value (and the more applications and services) it attracts
- The harder and more expensive it becomes to move that data elsewhere
What is MapReduce?
MapReduce is a programming model or pattern within the Hadoop framework that is used to process big data stored in the Hadoop Distributed File System (HDFS).
- A single large dataset is split into multiple smaller datasets
- Each dataset is sent to a node in the compute cluster (a mapper)
- Each mapper converts its data into key-value pairs, processes them, and writes a series of output files
- The data is then collated (shuffled) by key: all data for a given key goes into the same file; different keys can share a file, but a single key is never split across files
- These files are sent to other nodes in the cluster (reducers)
- Reducers reduce the series of values for each key into a single value (aggregation)
- The reducer outputs are combined into a single output for the job (see the sketch below)
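A minimal sketch of the map-shuffle-reduce flow in plain Scala (a local word count standing in for a real distributed job; names are illustrative):
    // Map phase: each "mapper" emits (key, value) pairs
    val lines  = Seq("spark and hadoop", "spark and kafka")
    val mapped = lines.flatMap(_.split(" ").map(word => (word, 1)))
    // Shuffle phase: collate all pairs sharing the same key
    val shuffled = mapped.groupBy { case (word, _) => word }
    // Reduce phase: aggregate each key's values into a single value
    val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    println(reduced)   // e.g. Map(spark -> 2, and -> 2, hadoop -> 1, kafka -> 1)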
What is Massively Parallel Processing?
Massively parallel is the term for using a large number of computer processors to simultaneously perform a set of coordinated computations in parallel.
GPUs are a massively parallel architecture, with tens of thousands of threads.
- The user submits a single SQL query to the data warehouse (cluster) master node
- The master node breaks the SQL query down into sub-queries, which are sent to each worker node
- Worker nodes execute the sub-queries (all sharing the same data and storage), all in parallel
- The worker node results are sent back to the master node, combined into a single result, and returned to the user
What is the difference between ETL and ELT pipelines?
ETL = Extract, Transform, Load
- The traditional warehousing approach
- Data is transformed in memory (in the pipeline) and then loaded into the destination
ELT = Extract, Load, Transform
- Raw data is moved to the destination first
- More efficient processing at the destination
- More resilient (data movement and processing are separated); see the sketch below
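A minimal ETL-style Spark sketch (extract from CSV, transform in memory, load to the destination); paths and columns are illustrative, and a SparkSession named spark is assumed:
    import org.apache.spark.sql.functions.col
    val raw = spark.read.option("header", "true").csv("/data/raw/orders.csv")   // Extract
    val cleaned = raw                                                            // Transform
      .filter(col("amount").isNotNull)
      .withColumn("amount", col("amount").cast("double"))
    cleaned.write.mode("overwrite").parquet("/data/warehouse/orders")            // Load
With ELT the raw extract would be written to the destination first and transformed there (e.g. with SQL).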
What is Data Virtualisation?
Combine and transform data sources without physically modifying data (leave data where it is)
- Good when too many data sources for ETL/ELT to be sustainable
- Good when data movement too expensive
- Good for highly regulated data
- Federated querying (multiple data sources) is possible: connectivity to multiple backends
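A sketch of federated querying with Spark: joining a relational table and a data lake file in place, without copying either up front (connection details, paths, and columns are illustrative):
    val customers = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/crm")
      .option("dbtable", "customers")
      .option("user", "reader")
      .option("password", sys.env("DB_PASSWORD"))
      .load()
    val orders = spark.read.parquet("s3a://lake/orders/")
    val joined = customers.join(orders, "customer_id")   // evaluated lazily; no bulk copy up front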
What is Spark SQL?
Allows developers to write declarative code in Spark jobs
- Abstracts out distributed nature
- Is to Spark what Hive is to Hadoop, but MUCH faster than Hive and easier to unit test
- Creates DataFrames as containers for the resulting data: the same structures are used by Spark Streaming and Spark ML (so jobs can be mixed and matched)
Compatible with multiple data sources: Hive, JSON, CSV, Parquet, etc.
Additional optimisations:
- Predicate pushdown
- Column pruning
- Uniform API
- Code generation (performance gains, esp. for Python)
- Can hop in and out of RDDs and SQL as needed
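A minimal sketch, assuming a SparkSession named spark: the same logic expressed as SQL over a temp view and as DataFrame operations:
    import spark.implicits._
    val people = Seq(("Ada", 36), ("Grace", 45)).toDF("name", "age")
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age >= 18").show()   // declarative SQL
    people.filter($"age" >= 18).select("name").show()             // DataFrame API equivalent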
What is predicate pushdown?
The parts of SQL queries that filter data are called 'predicates'.
Predicate pushdown pushes those filters down into the source query, reducing the number of entries retrieved and improving query performance. By default, the Spark Dataset API will automatically push down valid WHERE clauses to the underlying database (or columnar file format such as Parquet).
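A small illustration (path and column are hypothetical): reading Parquet with a filter and checking the physical plan for the pushed predicate:
    import org.apache.spark.sql.functions.col
    val events = spark.read.parquet("/data/events")
    val recent = events.filter(col("year") >= 2023)
    recent.explain()   // the scan should list something like PushedFilters: [GreaterThanOrEqual(year,2023)]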
What is column pruning in Spark SQL?
The analyser determines that only a subset of columns is required for the output and drops the unnecessary columns from the scan
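A small illustration (path and columns are hypothetical): selecting a subset of columns so that only those columns are read from the Parquet scan:
    val events = spark.read.parquet("/data/events")
    val names  = events.select("user_id", "name")   // unnecessary columns are pruned
    names.explain()                                  // ReadSchema should list only user_id and name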
What is Apache Parquet?
The data lake format of choice
- Stores data in columns
- Efficient for querying
- Enables compression
- Easy partitioning
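A minimal sketch of writing and reading partitioned Parquet (df is any existing DataFrame; path and columns are illustrative):
    df.write
      .mode("overwrite")
      .partitionBy("year", "month")          // easy partitioning on disk
      .option("compression", "snappy")       // columnar layout compresses well
      .parquet("/data/lake/sales")
    val sales2023 = spark.read.parquet("/data/lake/sales").where("year = 2023")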
What is PrestoDB?
An MPP 'SQL-on-anything' and data virtualisation engine (it has no storage engine of its own)
- Displacing Hive
- Increasingly popular for data lakes
- Functions like a data warehouse, but without storage
- Connects to multiple back end data sources
- Blurs lines between data lakes and warehouses
What is Apache Kafka?
Event streaming engine
- Uses a message queue (publish/subscribe) paradigm to model streaming data through 'topics'
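A minimal Spark Structured Streaming sketch that consumes a Kafka topic (requires the spark-sql-kafka connector; the broker address and topic name are illustrative):
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "page-views")     // the Kafka 'topic'
      .load()
    // Kafka records arrive as binary key/value columns
    val messages = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")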
What is cluster computing?
A collection of servers (nodes) that are federated and can operate together
- One Driver node and multiple Worker nodes
- Apps talk to Driver, which controls Workers
- Workers parallelise the work (horizontal scaling)
- Designed for failure - redundancy and fault tolerance
What are Containers?
‘deployment packages’ or ‘lightweight virtual machines’
In contrast to virtual machines, which are digital images of entire computers, containers include only the software and dependencies required for a specific application (no full OS, etc.)
- Much faster than VMs
- Can deploy groups to orchestrate together
- Portable between cloud/on prem etc
What are container orchestration (cluster manager) options for Spark?
Cluster manager: oversees multiple processes
- Spark Standalone: built in manager
- YARN: Hadoop manager
- Mesos: Comparable to YARN but more flexible
- Kubernetes: added more recently (natively supported since Spark 2.3)
What are the key benefits of Spark vs Hadoop?
- Increased efficiency (fewer machines for the same results as Hadoop)
- Much faster
- Less code (generalised abstractions)
- Caches data in memory
- Abstracts away distributed nature (can write code ignoring this)
- Interactive (can play with data on the fly)
- Fault tolerance
- Unifies big data needs (an answer to the explosion of specialised MapReduce-based tools)
What is Databricks' relationship to Spark?
- Founded by Spark creators
- Major contributors to and maintainers of the Spark repo and ecosystem
What languages can you use for Spark?
Spark is written in Scala, and this is its native language
Java and Python can also be used
Python API mirrors Scala most closely
What is an RDD?
Resilient Distributed Dataset
- Low-level API
- The most basic data abstraction in Spark
- Collection of elements (similar to list/array) partitioned across nodes of the cluster
- Can be operated on in parallel
- Immutable - though what is actually stored is the lineage (how to build the RDD), not the materialised data itself
- Resilient:
  - A failure at any point does not affect all of the data and can be recovered from
  - Each RDD knows how it was built, allowing it to choose the best path for recovery
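A minimal RDD sketch, assuming a SparkSession named spark (as in spark-shell):
    val sc = spark.sparkContext
    val numbers = sc.parallelize(1 to 10, numSlices = 4)   // local collection split into 4 partitions
    val doubled = numbers.map(_ * 2)                       // a new RDD; the original is immutable
    println(doubled.getNumPartitions)                      // 4
    println(doubled.collect().mkString(","))               // operated on in parallel across partitions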
What are the two categories of core operation in Spark?
Transformations and Actions
Transformations - instructions for RDD modification; together they make up the DAG (e.g. map, filter)
Actions - instructions that trigger execution of the DAG (e.g. collect, count, reduce). They usually result in data being transferred back to the Driver
What are the qualities of Spark Transformations?
- Lazily evaluated (only intent is stored)
- Triggered by Actions
- Combine to form DAG graphs
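A small sketch of laziness, assuming a SparkSession named spark:
    val sc = spark.sparkContext
    val words   = sc.parallelize(Seq("spark", "is", "lazy"))
    val lengths = words.map(_.length)     // transformation: only the intent is recorded
    val longish = lengths.filter(_ > 2)   // transformation: still nothing has executed
    val total   = longish.reduce(_ + _)   // action: the DAG is executed now
    println(total)                        // 9 ("spark" + "lazy")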
What is a DAG?
Directed Acyclic Graph
Vertices represent Resilient Distributed Datasets (RDDs)
Edges represent the operations to be applied to the RDDs (transformations or actions)
The result is a functional lineage that is sent to the worker nodes (which makes fault handling easy)
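For example, a chain of transformations forms a small DAG whose lineage can be inspected with toDebugString (path is illustrative; a SparkSession named spark is assumed):
    val sc = spark.sparkContext
    val counts = sc.textFile("/data/logs.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    println(counts.toDebugString)   // prints the RDD lineage; stages split at the shuffle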
What are the three memory loading input data methods in Spark?
parallelize, range and makeRDD
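A quick sketch of the three (Scala API; makeRDD is only available in Scala), assuming a SparkSession named spark:
    val sc = spark.sparkContext
    val a = sc.parallelize(Seq(1, 2, 3, 4))   // from a local collection
    val b = sc.range(0, 1000, step = 2)       // RDD[Long] of generated numbers
    val c = sc.makeRDD(Seq("x", "y", "z"))    // effectively an alias for parallelize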
What is the underlying method for most file ingestion methods in Spark?
hadoopFile - this handles any Hadoop-supported file format
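For example, textFile is roughly a thin wrapper over hadoopFile with TextInputFormat (a sketch; the path is illustrative and a SparkSession named spark is assumed):
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    val sc = spark.sparkContext
    val viaTextFile   = sc.textFile("/data/logs.txt")
    val viaHadoopFile = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/data/logs.txt")
      .map { case (_, line) => line.toString }   // drop the byte-offset key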
How does Spark use Lambda functions?
Lambdas are anonymous functions
Most Spark transformations accept lambdas as arguments
e.g.
.filter(wikiToken => wikiToken.length > 2)
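A slightly fuller sketch in context (the RDD contents are illustrative; a SparkSession named spark is assumed):
    val sc = spark.sparkContext
    val wikiTokens = sc.parallelize(Seq("a", "of", "spark", "data"))
    val longTokens = wikiTokens.filter(wikiToken => wikiToken.length > 2)   // lambda passed to filter
    println(longTokens.collect().mkString(","))                             // spark,data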