Spark and Databricks Flashcards

1
Q

Can you explain the design schemas relevant to data modeling?

A

There are three data modeling design schemas: Star, Snowflake, and Galaxy.

The star schema has a central fact table connected to a number of dimension tables.

The snowflake schema is an extension of the star schema in which the dimension tables are normalised into additional, snowflake-like layers.

The Galaxy (fact constellation) schema contains two or more fact tables that share dimension tables between them.

2
Q

Why do data systems require a disaster recovery plan?

A

Disaster recovery planning involves backing up files and media in (near) real time. The backup storage is used to restore files in the event of a cyber-attack or equipment failure. Security protocols are put in place to monitor, trace, and restrict both incoming and outgoing traffic.

3
Q

What is data orchestration, and what tools can you use to perform it?

A

Data orchestration is an automated process for accessing raw data from multiple sources, performing data cleaning, transformation, and modeling techniques, and serving it for analytical tasks. The most popular tools are Apache Airflow, Prefect, Dagster, and AWS Glue.

4
Q

What issues does Apache Airflow resolve?

A

Apache Airflow lets you manage and schedule pipelines for analytical workflows, data warehouse management, and data transformation and modeling under one roof.

You can monitor execution logs in one place, and callbacks can be used to send failure alerts to Slack or Discord. It is easy to use, provides a helpful user interface and robust integrations, and is free and open source.

5
Q

What are the various modes in Hadoop?

A

Hadoop mainly works on 3 modes:

Standalone Mode: used for debugging; it does not use HDFS, instead using the local file system for input and output.

Pseudo-distributed Mode: a single-node cluster where the NameNode and DataNode run on the same machine. It is mainly used for testing.

Fully-Distributed Mode: the production-ready mode, where data is distributed across multiple nodes and the master and slave daemons run on separate nodes.

6
Q

What are the three V’s of big data?

A

Volume (of data)
Velocity (how fast it’s coming in)
Variety (diversity of structure and content)

Additional V’s:
Veracity (accuracy, trustworthiness)
Value
Validity
Visualisation
Variability
Vulnerability
Visibility
Volatility

7
Q

What is the definition of big data?

A

Depends on the situation, but typically any of:

  • > 100TB
  • Requires parallel processing
  • Too large for operational databases
  • Requires big data technology (even if it’s ‘small’ data)
8
Q

What is data gravity?

A

As more and more data accumulates on a single cloud platform:
- It attracts more applications and services, so it generates more value
- It becomes progressively harder (and more expensive) to move

9
Q

What is Map Reduce?

A

MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop File System (HDFS).

  1. Split single large dataset into multiple smaller datasets
  2. Each dataset is sent to a node in compute cluster (called mapper)
  3. Mapper converts data to key-value format, processes and puts in series of output files
  4. Data is collated by key - all data for given key is put in same file - can put different keys into same file, but never split key across files
  5. Files are sent to other nodes in cluster called reducer nodes
  6. Reducers reduce the series of values for each key into a single value (aggregation)
  7. Outputs are combined into single output for the job
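
The same map → shuffle-by-key → reduce pattern can be sketched with the Spark RDD API (not Hadoop MapReduce itself); the file paths below are placeholder assumptions:

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    lines = sc.textFile("hdfs:///data/input.txt")             # input split across partitions
    pairs = (lines.flatMap(lambda line: line.split())         # mappers emit elements
                  .map(lambda word: (word, 1)))               # key-value format
    counts = pairs.reduceByKey(lambda a, b: a + b)            # collate by key and reduce
    counts.saveAsTextFile("hdfs:///data/wordcounts")          # combined job output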
10
Q

What is Massively Parallel Processing?

A

Massively parallel is the term for using a large number of computer processors to simultaneously perform a set of coordinated computations in parallel.

GPUs are a massively parallel architecture with tens of thousands of threads.

  1. User submits single SQL query to data warehouse (cluster) master node
  2. Master node takes SQL query and breaks down into sub-queries which are sent to each worker node
  3. Worker nodes execute sub-queries (all sharing same data and storage), and all queries are executed in parallel
  4. Worker node results sent to master node and combined into single result which is sent to user
11
Q

What is the difference between ETL and ELT pipelines?

A

ETL = Extract, Transform, Load
- Traditional warehousing approach
- Data is transformed in memory / a staging area before being loaded into the warehouse

ELT = Extract, Load, Transform
- Move data to destination first
- More efficient processing at destination
- More resilient (separation of data moving and processing)

12
Q

What is Data Virtualisation?

A

Combine and transform data sources without physically modifying data (leave data where it is)

  • Good when too many data sources for ETL/ELT to be sustainable
  • Good when data movement too expensive
  • Good for highly regulated data
  • Federated querying (multiple data sources) is possible: connectivity to multiple backends
13
Q

What is Spark SQL?

A

Allows developers to write declarative code in Spark jobs

  • Abstracts out distributed nature
  • Is to Spark what HIVE is to Hadoop; but MUCH faster than HIVE and easier to unit test
  • Creates dataframes as containers for resulting data: same structures used for Spark Streaming and Spark ML (can mix and match jobs)

Compatible with multiple data sources: HIVE, JSON, CSV, Parquet etc

Additional optimisations:
- Predicate pushdown
- Column pruning
- Uniform API
- Code generation (performance gains, esp. for Python)
- Can hop in and out of RDDs and SQL as needed
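
A minimal PySpark sketch of the declarative style (file path, view and column names are illustrative assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

    people = spark.read.json("people.json")        # could equally be Hive, CSV, Parquet...
    people.createOrReplaceTempView("people")

    adults = spark.sql("SELECT name, age FROM people WHERE age > 18")
    adults.show()       # same DataFrame structure used by Spark Streaming and Spark ML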

14
Q

What is predicate pushdown?

A

Parts of SQL queries that filter data are called ‘predicates’

A predicate push down filters the data in the database query, reducing the number of entries retrieved from the database and improving query performance. By default the Spark Dataset API will automatically push down valid WHERE clauses to the database
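
For example (path and column are assumptions): with a Parquet source, the physical plan from explain() will normally list the filter under PushedFilters, confirming it was pushed to the scan rather than applied after reading everything:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    events = spark.read.parquet("/data/events")
    recent = events.filter(col("year") >= 2023)     # the predicate

    recent.explain()    # look for 'PushedFilters: [...GreaterThanOrEqual(year,2023)...]'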

15
Q

What is column pruning in Spark SQL?

A

An analyser decides if only a subset of columns are required for the output and drops unnecessary columns
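
Illustrative sketch (path and columns are assumptions): selecting a subset of columns from a columnar source means only those columns need to be read, which should show up as a reduced ReadSchema in the physical plan:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    events = spark.read.parquet("/data/events")         # wide table on disk
    slim = events.select("user_id", "event_type")       # only these columns are needed

    slim.explain()      # the scan's ReadSchema should now list only user_id and event_type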

16
Q

What is Apache Parquet?

A

The data lake format of choice

  • Stores data in columns
  • Efficient for querying
  • Enables compression
  • Easy partitioning
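
A short sketch of writing and reading partitioned Parquet (paths and the partition column are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)

    # Columnar, compressed, and partitioned on disk (one directory per country value)
    df.write.mode("overwrite").partitionBy("country").parquet("/data/lake/events")

    # Reading back with a partition filter avoids scanning the other partitions
    uk = spark.read.parquet("/data/lake/events").filter("country = 'UK'")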
17
Q

What is PrestoDB?

A

MPP SQL on anything and data virtualisation engine (no storage engine)

  • Displacing HIVE
  • Increasingly popular for data lakes
  • Functions like a data warehouse, but without storage
  • Connects to multiple back end data sources
  • Blurs lines between data lakes and warehouses
18
Q

What is Apache Kafka?

A

Event streaming engine

  • Uses message queue paradigm to model streaming data through ‘topics’
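
Spark can consume Kafka topics directly; a minimal Structured Streaming sketch (assumes the spark-sql-kafka connector is on the classpath; the broker address and topic name are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    stream = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "broker:9092")
                   .option("subscribe", "events")            # the Kafka topic
                   .load())

    # Kafka records arrive with binary key/value columns
    messages = stream.selectExpr("CAST(value AS STRING) AS value")

    query = messages.writeStream.format("console").start()
    query.awaitTermination()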
19
Q

What is cluster computing?

A

A collection of servers (nodes) that are federated and can operate together

  • One Driver node and multiple Worker nodes
  • Apps talk to Driver, which controls Workers
  • Workers parallelise the work (horizontal scaling)
  • Designed for failure - redundancy and fault tolerance
20
Q

What are Containers?

A

‘deployment packages’ or ‘lightweight virtual machines’

In contrast to virtual machines, which are digital images of entire computers, containers contain only the software and dependencies required for a specific application (no full guest OS, etc.)

  • Much faster than VMs
  • Can deploy groups to orchestrate together
  • Portable between cloud/on prem etc
21
Q

What are container orchestration (cluster manager) options for Spark?

A

Cluster manager: oversees multiple processes

  • Spark Standalone: built in manager
  • YARN: Hadoop manager
  • Mesos: Comparable to YARN but more flexible
  • Kubernetes: recently added
22
Q

What are the key benefits of Spark vs Hadoop?

A
  1. Increased efficiency (fewer machines for the same results as Hadoop)
  2. Much faster
  3. Less code (generalised abstractions)
  4. Caches data in memory
  5. Abstracts away distributed nature (can write code ignoring this)
  6. Interactive (can play with data on the fly)
  7. Fault tolerance
  8. Unify big data needs (answer to MapReduce explosion)
23
Q

What is Databricks' relationship to Spark?

A
  1. Founded by Spark creators
  2. Maintain Spark repo and ecosystem
24
Q

What languages can you use for Spark?

A

Spark is written in Scala and this is its native language

Java and Python can also be used
Python API mirrors Scala most closely

25
Q

What is an RDD?

A

Resilient Distributed Dataset

  • Low-level API
  • The most basic data abstraction in Spark
  • Collection of elements (similar to list/array) partitioned across nodes of the cluster
  • Can be operated on in parallel
  • Immutable - but this is just the lineage, not the data driving it
  • Resilient:
    • Any point of failure doesn’t affect all data and can be fixed
    • Each RDD knows how it was built allowing it to choose best path for recovery
26
Q

What are the two categories of core operation in Spark?

A

Transformations and Actions

Transformations - instructions for RDD modification; comprise DAG (e.g. map, filter)

Actions - instructions to trigger execution of DAG (e.g. collect, count, reduce). Usually result in data transfer back to Driver
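
A small sketch of the distinction - the transformations only record intent, and nothing runs until an action is called:

    from pyspark import SparkContext

    sc = SparkContext(appName="LazyDemo")

    rdd = sc.parallelize(range(1_000_000))
    evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing executes yet
    doubled = evens.map(lambda x: x * 2)       # still just building up the DAG

    print(doubled.count())                     # action: triggers execution of the DAG
    print(doubled.take(5))                     # another action; data returned to the Driver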

27
Q

What are the qualities of Spark Transformations?

A
  • Lazily evaluated (only intent is stored)
  • Triggered by Actions
  • Combine to form a DAG
28
Q

What is a DAG?

A

Directed Acyclic Graph

Vertices represent Resilient Distributed Datasets (RDDs)
Edges represent the operations to be applied to the RDDs (transformations or actions)

Ends up as functional lineage that is sent to worker nodes (handles faults easily)

29
Q

What are the three memory loading input data methods in Spark?

A

parallelize, range and makeRDD
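
For example, in PySpark (makeRDD is Scala-only, noted in the comment):

    from pyspark import SparkContext

    sc = SparkContext(appName="LoadFromMemory")

    rdd1 = sc.parallelize([1, 2, 3, 4, 5])   # distribute an in-memory collection
    rdd2 = sc.range(0, 100, step=2)          # RDD over a numeric range

    # In Scala there is also sc.makeRDD(Seq(1, 2, 3)), an alias for parallelize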

30
Q

What is the underlying method for most file ingestion methods in Spark?

A

hadoopFile - this handles any Hadoop supported file format

31
Q

How does Spark use Lambda functions?

A

Lambdas are anonymous functions

Most Spark functions use lambdas

e.g.
.filter(wikiToken => wikiToken.length > 2)

32
Q

What does Spark .map() do?

A

Transformation

Executes provided function against each data item in RDD

The result is a new RDD

Each node runs the function separately, and it is run on each row in the partition

33
Q

What does Spark .flatMap() do?

A

Transformation

Similar to map(), but the supplied function returns a sequence (zero or more elements) for each input, and the results are flattened into a single RDD

34
Q

What does Spark .sample() do?

A

Transformation

Take sample of RDD collection

35
Q

What does Spark .filter() do?

A

Transformation

Filter the collection based on the given one or multiple conditions or SQL expression.

Interchangeable with where()

36
Q

What does Spark keyBy() do?

A

Transformation

Creates a pair RDD by applying a function to each element to derive its key, keeping the element as the value. Unlike groupBy it does not shuffle the data, so it preserves partitioning and is more computationally efficient

37
Q

What does Spark mapPartitions() do?

A

Transformation

Like map(), but the supplied function is called once per partition with an iterator over that partition's rows, rather than once per row. This enables heavy initializations (for example, a database connection) to be done once per partition instead of on every DataFrame row.

This helps the performance of the job when dealing with heavyweight initialization on larger datasets (see the sketch below).
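
A sketch of the per-partition initialisation pattern (open_db_connection and lookup are hypothetical helpers, not Spark APIs):

    from pyspark import SparkContext

    sc = SparkContext(appName="MapPartitionsDemo")
    rdd = sc.parallelize(range(100), numSlices=4)

    def enrich_partition(rows):
        conn = open_db_connection()      # hypothetical heavy init - once per partition
        for row in rows:                 # 'rows' is an iterator over one partition
            yield lookup(conn, row)      # hypothetical per-row work
        conn.close()

    enriched = rdd.mapPartitions(enrich_partition)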

38
Q

What union methods are available in Spark?

A

Transformation

Union: combines RDD datasets from different sources; should be used with distinct() to prevent duplicates

Related set-style methods:
intersection
subtract
cartesian
zip

39
Q

What is an Action in Spark?

A

Instructions to trigger execution of DAG (e.g. collect, count, reduce).

Usually result in data transfer back to Driver

Don’t return RDD object

40
Q

What does map-side combine do in Spark?

A

Optimisation (applied inside aggregating actions and key-based aggregations such as reduceByKey)

Combines values at the partition level before results are sent to the Driver or shuffled across the network
- Performance gains as there is less data movement
- Relies on the associative property - the result is always the same regardless of the sequence of execution

41
Q

What is collect() in Spark?

A

Action

Collects entire RDD collection back to an array
- Pulls ENTIRE dataset back to Driver
- Can be huge - ensure what collecting will fit on Driver

42
Q

What is take() in Spark?

A

Action

‘collect’ a fixed number of elements - similar to sample

  • takeOrdered and top are alternatives
  • first() is equivalent to take(1)
43
Q

What is reduce() in Spark?

A

Action

Accumulates (aggregates) the values in a collection using a supplied function

e.g.
sc.parallelize([1, 2, 3]).reduce(lambda a, b: a + b)   # returns 6

1 + 2 = 3
3 + 3 = 6

  • fold() is an alternative that takes an explicit zero/seed value
44
Q

How is data persistence handled in Spark?

A

Saving data is usually done in a distributed way by the Workers

  • Can be written direct to data source
  • Different to computation which returns results to Driver node

saveAsObjectFile(path)
saveAsTextFile(path)
foreach(T => Unit) - process/save elements one at a time
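
For example, in PySpark (the output path is an assumption):

    from pyspark import SparkContext

    sc = SparkContext(appName="SaveDemo")
    rdd = sc.parallelize(["a", "b", "c"])

    rdd.saveAsTextFile("hdfs:///output/letters")    # each Worker writes its own partitions

    # foreach runs a function on every element on the Workers (nothing returns to the Driver)
    rdd.foreach(lambda record: print(record))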

45
Q

What is an Implicit Conversion in Spark?

A

Implicit conversions in Scala are methods that the compiler applies automatically when an object of the 'wrong' type is used, converting it from one type to another.
Implicit conversions are applied in two situations:

  1. If an expression of type S does not match the expected type T
  2. In a selection e.m of an expression e of type S, if the selector m is not a member of S
46
Q

What are the properties of key-value methods in Spark?

A

Repartition data using a hash partitioner, so that all duplicate keys are on the same node (keys aren’t unique in Spark)
- Once complete, key-value methods work on value portion as keys are the same

Automatically available on RDDs containing Tuple2 objects, created by simply writing (a, b).

Available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.

47
Q

What are some key-pair methods examples in Spark?

A

collectAsMap - like collect, but returns the results as a key→value map
mapValues - like map, but applied only to the value of each pair
flatMapValues - like flatMap, applied to the values of each pair

Transformations (rather than Actions):
reduceByKey
foldByKey
aggregateByKey

groupByKey

48
Q

How does SQL-Like Pairing work in Spark?

A

This enables joining of 2 key-value RDDs

  • Join based on key equality
  • Creates new pair RDDs
  • Value side of new pair = tuple(value left RDD, value right RDD)

Options:
- join (inner join of RDD values)
- fullOuterJoin (join all values)
- leftJoin or rightJoin (inner join plus left or right distinct values)
- cogroup or groupWith (pairs one RDD with another):

RDD1
1 A
1 B
2 C
2 D

RDD2
1 E
3 F
3 G

groupWith result
1 ([A,B], [E])
2 ([C,D], []) - empty seq when no equivalent key
3 ([], [F,G])
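
A PySpark sketch reproducing the example above (output order may vary):

    from pyspark import SparkContext

    sc = SparkContext(appName="PairJoins")

    rdd1 = sc.parallelize([(1, "A"), (1, "B"), (2, "C"), (2, "D")])
    rdd2 = sc.parallelize([(1, "E"), (3, "F"), (3, "G")])

    rdd1.join(rdd2).collect()        # inner join: [(1, ('A', 'E')), (1, ('B', 'E'))]

    for key, (left, right) in rdd1.cogroup(rdd2).collect():
        print(key, list(left), list(right))
    # 1 ['A', 'B'] ['E']
    # 2 ['C', 'D'] []
    # 3 [] ['F', 'G']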

49
Q

Why is cache important in Spark?

A
  1. Enables huge performance gains over Hadoop
  2. Cache distributed data in memory
50
Q

What are the qualities of the RDD cache/persist method?

A
  1. Like a transformation, it only records instructions against the RDD (lazy)
  2. An Action is needed to execute the DAG and actually populate the cache
  3. Returns the same RDD rather than creating a new one - you can use either the returned or the original reference (they are the same); the system doesn't care where the data comes from
51
Q

What are the different levels of persist/caching in RDDs?

A
  1. MEMORY_ONLY - the default for cache(): store data in memory
  2. MEMORY_AND_DISK - use memory, with any overflow spilled to disk
  3. DISK_ONLY - store data on disk only
    • _SER suffix - store the data serialised (smaller, but more CPU to read)
    • _2 suffix - replicate the data on a second machine

Use or combine with disk storage (may be faster than fully re-running the DAG - it depends on where the bottleneck is)
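
For example, in PySpark (the right level depends on memory pressure and the cost of recomputation):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="PersistDemo")
    rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * x)

    # cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); a level can only be
    # assigned once per RDD, so choose the one that suits the workload:
    rdd.persist(StorageLevel.MEMORY_AND_DISK)   # or DISK_ONLY, MEMORY_AND_DISK_2, ...

    rdd.count()        # an action is needed before anything is actually cached
    rdd.unpersist()    # release the cached data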

52
Q

How do you clear RDD cache in Spark?

A

Use the ‘unpersist’ method

  • No parameters = default (block execution thread until memory is cleared)
  • False parameter = allows execution to continue without waiting for memory to clear

Cache should automatically clear when RDD falls out of scope

53
Q

What are Accumulators in Spark?

A

Variables shared across cluster (like Global variables)

  • Used to track data that is tangential to run, e.g. error #
  • Create Accumulator in sparkContext, then add it to code on worker nodes
  • Final accumulation/count occurs on Driver node
  • Needs to be used in Action methods (otherwise not fault tolerant)
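
A minimal sketch counting 'error' records with an accumulator (the error condition is illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="AccumulatorDemo")
    error_count = sc.accumulator(0)              # created via the SparkContext on the Driver

    def check(record):
        if record < 0:                           # illustrative error condition
            error_count.add(1)                   # incremented on the Worker nodes

    sc.parallelize([1, -2, 3, -4, 5]).foreach(check)   # used inside an action
    print(error_count.value)                     # 2 - final count is read on the Driver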
54
Q

How do you start Spark console?

A

spark-shell

55
Q

How do you execute Spark job?

A

spark-submit

56
Q

What are the steps of a spark-submit?

A
  1. Application submitted - launches Driver which runs through main
  2. Driver asks cluster manager for specified number of resources
  3. Cluster spins up resources for use
  4. Driver runs through main application, building up RDD DAG until it hits an Action
  5. Action causes Driver to trigger execution of DAG and manage workflow
  6. Driver continues through main until execution is complete
  7. Resources (Workers) are cleaned up
57
Q

What are the steps for spinning up a Spark cluster in AWS?

A
  1. Set up AWS environment
    • install AWS CLI
    • Set user permissions for EC2, EMR, S3 and IAM
    • Get key credentials
  2. Spin up cluster via CLI and EMR
    • aws emr create-cluster
    • SSH into Driver node
    • SPARK_PUBLIC_DNS = Driver node address
  3. Package scala script into Jar file
    • Upload to S3 bucket
  4. Open spark shell
    • spark-submit jar file
  5. Can then use Spark UI to track job
58
Q

What are Spark DataFrames?

A

Conceptually similar to pandas DataFrames, but optimised for distributed processing and lazy evaluation

Easy to convert pandas to Spark and vice versa:
- sqlContext.createDataFrame(pandas)
- dataFrame.toPandas()

59
Q

What is Spark Streaming?

A

Enables Big and Fast data

  • GB/sec
  • Real time use cases
  • Exactly once transformation semantics (no duplication)
60
Q

What is the process for Spark Streaming?

A
  1. Data contained in input streams, e.g. Kafka, Flume, Twitter
  2. Spark Streaming Receiver receives stream data - one per input stream*
  3. Incoming data packaged into series of RDDs delineated by specified window of time
  4. RDDs passed into Spark Core for processing as normal
  • Can also package up multiple streams into single uber stream - increases throughput
61
Q

What is checkpointing in Spark?

A

Important for Spark Streaming

Stateful operations (such as countByValueAndWindow) need to set a checkpoint directory for streaming content

  • Sink to reliable storage of current state
  • Required because stateful methods are driven by RDD relying on chain of previously batched RDDs
    • this can result in huge dependency graph
    • checkpoint resets origination point for this graph
62
Q

What are the key objects needed for running spark job?

A

SparkConf, SparkContext and for streaming StreamingContext

63
Q

What is the Spark Context?

A

SparkContext is the entry point for interacting with Spark and represents the connection to a Spark cluster.

  • Task creator: builds execution graph (DAG) sent to each worker
  • Scheduler: schedules work across worker nodes
  • Data locality: takes advantage of existing data location knowledge, and sends work to the data (avoids movement)
  • Fault tolerance: monitors tasks for failures and coordinates rebuilds

Note: You can create multiple SparkContexts for the same job, but this is best avoided - can result in unexpected behaviour

64
Q

How does Spark GraphX work?

A

Converts table structures to a graph - usually built from two RDDs: one for vertices and one for edges

  • Executions are run through graph parallel pattern (each node’s computation depends on its neighbours)
  • Impressive performance gains
  • Enables built in methods such as pageRank
65
Q

What are Closures in Spark?

A

A closure is a function together with the variables it references from its enclosing scope

e.g. a function contains a counter variable, and a for loop inside the function updates it (e.g. +1 per iteration)

Once the code is distributed across nodes this causes problems: Spark ships a separate copy of the closure to each executor, so each node updates its own copy of the variable and the Driver never sees the combined result

Better to use built-in aggregations or Accumulators for these sorts of operations

66
Q

What is broadcasting in Spark?

A

Broadcasting sends read-only data to each worker node once per job, rather than shipping it with every task/execution

This means the memory/transfer footprint is roughly:
size_object * #_workers
Rather than:
size_object * #_workers * #_executions

More recently, once a worker node has received the data it helps the Driver distribute it to the other nodes (BitTorrent-style), which is faster
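
A small sketch - a lookup table is broadcast once per Worker instead of being shipped with every task (the table contents are illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="BroadcastDemo")

    country_names = {"UK": "United Kingdom", "FR": "France"}   # small lookup table
    lookup = sc.broadcast(country_names)                       # shipped once per Worker

    codes = sc.parallelize(["UK", "FR", "UK"])
    names = codes.map(lambda code: lookup.value.get(code, "unknown"))
    print(names.collect())        # ['United Kingdom', 'France', 'United Kingdom']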

67
Q

How can you optimise partitioning in Spark?

A

If you have big data which is filtered or otherwise reduced in size during a DAG, the number of partitions you needed at the beginning might be overkill by the end

This can have big performance impacts (running tasks against empty or near-empty partitions)

You can use .coalesce(num_partitions) to reduce the number of partitions during a job

e.g.
rdd = sc.textFile("...", minPartitions=1000)   # starts with 1,000 partitions
filtered = rdd.filter(some_predicate)          # far fewer rows, still 1,000 partitions
small = filtered.coalesce(8)                   # merge down to 8 partitions (no shuffle;
                                               # use repartition to force a rebalance)
# subsequent operations run on 8 partitions

68
Q

Can you explain the architecture and components of Apache Spark?

A

Apache Spark is an open-source distributed computing system designed for processing large-scale data sets across clusters of computers.

Core components:

  1. Driver Program: defines the application and coordinates the execution across the cluster (via the SparkContext)
  2. Cluster Manager: allocates resources to Spark applications
  3. SparkContext: connects to the cluster manager and coordinates the execution of tasks and data
  4. Resilient Distributed Datasets (RDDs): the fundamental data structure
  5. Transformations: operations applied to RDDs to create a new RDD. Lazy
  6. Actions: trigger the execution of computations and return results to the driver program/storage. Eager
  7. DAG Scheduler: the DAG (Directed Acyclic Graph) scheduler breaks the Spark application's execution plan into stages. It analyzes the dependencies between RDDs and turns them into a logical execution plan
  8. Task Scheduler: assigns tasks to workers in the cluster. It takes the output of the DAG scheduler (stages), divides it into individual tasks, and schedules them on worker nodes
  9. Worker Nodes: perform the actual computations
  10. Executors: worker-node processes responsible for executing tasks and storing data in memory or on disk
  11. Shuffle: the process of redistributing data across partitions or nodes to perform operations
  12. Storage: persist data in memory, on disk, or a combination of both
69
Q

How does Spark handle data partitioning and distribution across a cluster?

A

A partition is a logical division of the data, and each partition is stored on a separate machine in the cluster.

Spark uses RDDs and their partitioning strategy

  1. Data Source Partitioning: Spark automatically partitions data based on the source characteristics. e.g. if read from a file, each block or split of the file may become a partition
  2. Transformations and Dependency Tracking: Spark tracks lineage of transformations and dependencies between RDDs. This informs partitioning based on the execution plan
  3. Narrow and Wide Transformations: Narrow transformations do not require shuffling or data movement across partitions, e.g. map, filter, etc. Wide transformations, e.g. reduceByKey or groupBy, require data shuffling and may result in the re-partitioning of the data.
70
Q

What are the differences between RDDs, DataFrames, and Datasets in Spark?

A

RDDs provide a low-level, fine-grained API

DataFrames offer a high-level structured data manipulation API with optimization opportunities

Datasets combine the benefits of RDDs and DataFrames, adding compile-time type safety (Scala and Java only)

71
Q

What is lazy evaluation?

A

Lazy evaluation refers to the delayed execution of transformations on data until an action is called.

Spark builds a directed acyclic graph (DAG) representing the sequence of transformations applied to the data, which is then executed by an action.

Benefits:
1. optimization opportunities: execute the DAG as efficiently as possible; group transformations and use predicate pushdown, column pruning, and operator fusion
2. efficient resource utilization
3. reduced I/O overhead: pipelining and in-memory caching
4. selective computation
5. data recovery and fault tolerance: RDDs are immutable, so they and their lineage can be recovered in case of failures
6. programmatic control: enables modular and reusable code

72
Q

How does Delta Lake improve storage and performance in Spark?

A
  1. Prevent Data Corruption - supports multi-statement transactions, allowing atomic commits and rollbacks. This enables consistent and reliable data updates
  2. Faster Queries - optimises Parquet; efficient compression and ordering, Delta Engine (vectorisation)
  3. Increase Data Freshness
  4. Reproduce ML Models - MLFlow and Time Travel feature (snapshot of data as it was at point in time)
  5. Achieve Compliance - includes DELETE and UPDATE actions for the easy manipulation of data in a table (GDPR etc)
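
A short sketch using Delta Lake from PySpark (assumes the delta-spark package is configured; paths and column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("/data/raw/events")

    # Atomic, transactional write in the Delta format (Parquet files + transaction log)
    df.write.format("delta").mode("overwrite").save("/data/delta/events")

    # Time Travel: query the table as it was at an earlier version
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/delta/events")

    # In-place DELETE/UPDATE for compliance use cases
    spark.sql("DELETE FROM delta.`/data/delta/events` WHERE user_id = 'u123'")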
73
Q

What are the advantages of using Databricks over traditional Apache Spark clusters?

A
  1. Fully Managed Service
  2. Ease of Use - UI, notebooks, language compatibility
  3. Scalability and Elasticity
  4. Performance Optimization - Databricks Runtime ensures that you benefit from the latest performance improvements without the need for manual configuration
  5. Integrated Workspace - collaborative environment for data scientists, engineers, and analysts
  6. Integration with Ecosystem
  7. Security and Compliance - IAM
  8. Managed Delta Lake
74
Q

How do you optimize Spark jobs for performance and efficiency?

A
  1. Data Partitioning - based on data characteristics and operations to be performed
  2. Data Skew - identify skewed keys and distribute them evenly: data repartitioning, key salting/rebalancing, or skew-join handling (e.g. Adaptive Query Execution)
  3. Transformations and Actions -
    - chain operations together
    - use narrow transformations (e.g. map, filter) whenever possible
    - use wide transformations (e.g. reduceByKey, groupBy) judiciously
    - apply filtering or aggregation before wide transformations to reduce data size
  4. Data Caching and Persistence -
    Selectively cache data based on the size of the data and its reuse frequency.
  5. Data Serialization - choose an efficient serialization format like Apache Parquet to reduce data size and improve I/O performance.
  6. Resource Configuration and Allocation - utilize available compute resources effectively
  7. Broadcasting and Accumulators
  8. Partitioning and Bucketing - keep related data together, improving query performance on large datasets
  9. Monitoring and Tuning
  10. Code Optimization and Profiling
75
Q

How do you set up Spark using the Apache Spark Distribution?

A
  1. Setup Infrastructure (cluster of machines or virtual machines)
  2. Install Java on all machines in cluster. Apache Spark requires Java to run.
  3. Download Apache Spark, choose the package type that matches your cluster setup (e.g., pre-built for Hadoop, without Hadoop dependencies).
  4. Distribute Spark Package to all machines in cluster
  5. Configure Spark configuration files to suit your cluster setup (spark-defaults.conf) - memory allocation, cluster manager, and executor settings
  6. Start the Master Node - start the Spark master by running the following command on the chosen machine:
    ./sbin/start-master.sh
  7. Start Worker Nodes: On the remaining machines in the cluster, run:
    ./sbin/start-worker.sh <master-url>
  8. Verify Cluster Setup: Access the Spark web UI by opening a web browser and navigating to the URL of the Spark master node (e.g., http://<master-node-ip>:8080).
    The Spark web UI provides information about the cluster's status, worker nodes, and resource allocation.
  9. Once the Spark cluster is up and running, you can submit Spark applications or run Spark jobs by utilizing the Spark APIs or submitting applications using the spark-submit script.
76
Q

How does Spark handle data shuffling?

A

Data shuffling refers to the redistribution of data across partitions during certain operations, such as groupByKey or join.

Spark uses a combination of in-memory computing and disk-based operations to efficiently handle data shuffling, minimizing the amount of data transferred over the network.

77
Q

How do you connect to a Hive table with PySpark?

A

Step 1 – Import PySpark
Step 2 – Create SparkSession with Hive enabled
Step 3 – Read Hive table into Spark DataFrame using spark.sql()
Step 4 – Read using spark.read.table()
Step 5 – Connect to a remote Hive metastore (if required).
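
A minimal sketch of those steps (database/table names and the metastore URI are placeholder assumptions):

    from pyspark.sql import SparkSession                      # Step 1

    spark = (SparkSession.builder                             # Step 2: Hive support enabled
             .appName("HiveExample")
             .config("hive.metastore.uris", "thrift://metastore-host:9083")  # Step 5: remote Hive
             .enableHiveSupport()
             .getOrCreate())

    df1 = spark.sql("SELECT * FROM sales_db.orders LIMIT 10")  # Step 3: via spark.sql()
    df2 = spark.read.table("sales_db.orders")                  # Step 4: via spark.read.table()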

78
Q

How can you optimise jobs in Databricks?

A
  1. Determine the optimal cluster size
  2. Auto-Scaling (automatically adjust cluster size)
  3. Choose the appropriate instance type for the workload and tune Spark configuration parameters (e.g. # cores)
  4. Dynamic Allocation of resources based on task demands
  5. Parallelism and Data Partitioning
  6. Caching and Persistence
  7. Data Compression
  8. Broadcast Variables (share small datasets across the cluster)
  9. Efficient Data I/O (read and write operations)
  10. Query Optimization (predicate pushdown, join optimization, and aggregate pushdown, to minimize data movement)
  11. Monitoring and Tuning- Analyze performance bottlenecks
  12. Lifecycle Management (start and stop cluster when required)