quiz3 Flashcards
what is a compute cluster
several computers working together to do some work
define the job of YARN
HDFS
Spark
How do they all work together
YARN: manages the compute jobs in a cluster
HDFS: stores data on cluster nodes
Spark: a framework to do computation on YARN
use Spark to express the computation we want to do in a way that can be sent to a cluster and done in parallel. YARN takes that job, organizes it, and gets it done; HDFS stores all the pieces of our data files across the cluster nodes
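A minimal sketch of what that looks like from the Spark side (the input file, column names, and output path are made up for illustration):

from pyspark.sql import SparkSession, functions

# The driver program only *describes* the computation; on a cluster, YARN
# schedules the executors that run it, and HDFS holds the input/output files.
spark = SparkSession.builder.appName('quiz3 example').getOrCreate()

weather = spark.read.csv('weather.csv', header=True, inferSchema=True)
averages = weather.groupBy('city').agg(functions.avg('temperature'))
averages.write.csv('output', mode='overwrite')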
true or false: a Spark DataFrame is the same as a Pandas DataFrame
false
when writing a Spark DataFrame, what does it create
it creates a directory with several part files, not just a single file (roughly one file per partition)
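For example (the output path is arbitrary, and the exact part-file names will differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('write example').getOrCreate()
df = spark.range(100)   # tiny example DataFrame with one column, 'id'

# 'output' is created as a directory containing something like:
#   output/_SUCCESS
#   output/part-00000-....json
#   output/part-00001-....json   (roughly one part file per partition)
df.write.json('output', mode='overwrite')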
what does .filter() do on a DataFrame
.filter() is similar to SQL's WHERE
it keeps only the rows that satisfy the filter condition
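A quick sketch, assuming a SparkSession named spark already exists:

from pyspark.sql import functions

df = spark.range(10)                                   # column 'id' with values 0..9
evens = df.filter(df['id'] % 2 == 0)                   # keep only rows where the condition holds
also_evens = df.filter(functions.col('id') % 2 == 0)   # equivalent spelling
evens.show()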
what is a driver and executor in spark
the driver is the program you write, and the executors are the processes/threads that run the work it describes
true or false: parallelism is controlled by the way the data is partitioned
true
there are a few ways you can control the number of partitions
describe them
spark.range(10000, numPartitions=6)
.coalesce(num)
.repartition(num)
- sets the number of partitions explicitly when the DataFrame is created
- concatenates partitions if there are too many. It lowers the number of partitions, but not necessarily in the way you expect (the resulting partitions may be uneven). It can also be used to clean up your output if you know there is only a small amount of data left in each partition
- rearranges the data into evenly-sized partitions, but it is expensive: getting perfect partitions means a lot of memory movement
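Putting the three together (assuming an existing SparkSession named spark):

df = spark.range(10000, numPartitions=6)   # created with exactly 6 partitions
print(df.rdd.getNumPartitions())           # 6

fewer = df.coalesce(2)        # cheaply merge down to 2 (possibly uneven) partitions
balanced = df.repartition(8)  # full shuffle into 8 evenly-sized partitions
print(fewer.rdd.getNumPartitions(), balanced.rdd.getNumPartitions())   # 2 8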
name examples of shuffle operations
.repartition
.coalesce
.groupBy
.sort
what is a pipeline operation
the opposite of a shuffle operation:
the partitions can be handled completely independently of each other. Ideally each row is handled independently; most DataFrame operations are in this category.
why is the groupby shuffling not as bad
because it reduces the number of rows before the shuffling occurs: each partition aggregates its own rows first, so only the per-group partial results have to be shuffled.
ex. if we have a billion rows and we group by a column with 10 distinct values, then only about 10 rows from each partition have to be shuffled
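A small sketch of that idea, assuming a SparkSession named spark:

from pyspark.sql import functions

df = spark.range(1000000).withColumn('group', functions.col('id') % 10)

# each partition computes its own per-group partial counts first, so only
# about 10 small rows per partition have to cross the shuffle
counts = df.groupBy('group').agg(functions.count('id').alias('n'))
counts.show()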
why can Spark create DataFrames instantly
Spark uses lazy evaluation
when you create a DataFrame, the calculation needed to produce it hasn't actually been done yet
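For example (assuming a SparkSession named spark), none of these lines do any real work until an action is called:

from pyspark.sql import functions

df = spark.range(100000000)                                     # returns immediately: nothing computed yet
doubled = df.select((functions.col('id') * 2).alias('two_id'))  # still nothing computed

# only an action (show, count, write, ...) actually triggers the computation
print(doubled.count())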
why is coalesce not as smart as repartition
coalesce does lazy evaluation
repartition waits for everything beforehand
what is the downfall with lazy evaluation
how do you solve it
Spark doesn't know if you are going to use the same data twice, so it can throw away values before they're needed again; but you can't keep all intermediate results just in case, because they could be large
solution: cache it with the .cache() method, which tells Spark to store the result because we are going to use it later
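A minimal sketch of the pattern, assuming a SparkSession named spark:

from pyspark.sql import functions

squares = spark.range(10000000).select(
    (functions.col('id') * functions.col('id')).alias('sq'))
squares = squares.cache()   # keep the result around once it has been computed

print(squares.count())                                     # first action: does the work and caches it
print(squares.filter(functions.col('sq') > 100).count())   # reuses the cached data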
whats the downfall of join
too much memory moving (potentially)
true or false: column expressions in Spark are not lazily evaluated
false, all column expressions on a Spark DataFrame are lazily evaluated
in a Spark DataFrame, the actual implementation is in _____ which compiles to the ___________
Scala, the Java Virtual Machine (JVM)
what are user-defined functions in Spark, why are they used, and what happens behind the scenes to make them work
when you want to use Python to do the work, you can use functions.udf to create functions that work on Column objects, similar to np.vectorize
the UDF is sent to the executors, data from the JVM is converted into Python, the function is called in a Python process, and the results are sent back to the JVM
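A sketch of defining and using a UDF, assuming a SparkSession named spark (the function and data are invented for illustration):

from pyspark.sql import functions, types

def first_letter(s):
    return s[0] if s else None

# wrap the Python function so it can be applied to a Column; Spark ships it to
# the executors, converts each value JVM -> Python, calls it, and converts the
# result back
first_letter_udf = functions.udf(first_letter, returnType=types.StringType())

df = spark.createDataFrame([('alice',), ('bob',)], schema=['name'])
df.select(df['name'], first_letter_udf(df['name']).alias('initial')).show()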
what are RDDs: resilient distributed datasets
what are the key things to remember about an RDD
the fundamental data structure that holds the row entries.
while working on an RDD, you do the work in Python
each row is treated as a string
generally slower than DataFrames
can be easier for extracting data in a non-DataFrame-friendly format
slower because you lose the JVM and the optimizer
which is faster, row-oriented or column oriented?
column-oriented: most operations you want to do are on columns, and you want good memory locality. Columns in memory are arrays, pre-created and stored; rows need to be assembled first
how can you turn a Pandas DataFrame into a Spark DataFrame
how can you turn a Spark DataFrame into a Pandas DataFrame
use spark.createDataFrame and give it a Pandas DataFrame
use the .toPandas() method on the Spark DataFrame
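For example, assuming a SparkSession named spark:

import pandas as pd

pd_data = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

spark_data = spark.createDataFrame(pd_data)   # Pandas -> Spark
back_again = spark_data.toPandas()            # Spark -> Pandas (collects everything to the driver)
print(back_again)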
Polars is a new DataFrame tool. What is it implemented in and how is it evaluated?
what does it not have
implemented in Rust, and strictly evaluated, but you can also evaluate it lazily; you need to create a lazy DataFrame first, though. Either read/write commands for strict, or scan/collect/write for lazy
no partitioning or clustering
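A rough sketch of the two styles in Polars (the file and column names are made up):

import polars as pl

# strict (eager) evaluation: read ... write
eager = pl.read_csv('weather.csv')
warm = eager.filter(pl.col('temperature') > 0)
warm.write_csv('warm.csv')

# lazy evaluation: scan ... collect
lazy = pl.scan_csv('weather.csv')                          # a LazyFrame: nothing read yet
warm = lazy.filter(pl.col('temperature') > 0).collect()    # plan is optimized, then executed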
what is duckDB and what is it good for
what does it not have
DuckDB is an in-process SQL database: it lets you create a relational database and do analytics with it without having to install an entire database server, and it can also be used as a fast tool to manipulate tabular data
doesn’t have a compute cluster
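A small example of the in-process idea (assuming a reasonably recent DuckDB; the data is invented):

import duckdb
import pandas as pd

pets = pd.DataFrame({'name': ['Rex', 'Mittens'], 'species': ['dog', 'cat']})

# query the in-memory DataFrame directly with SQL; no database server involved
result = duckdb.sql("SELECT species, COUNT(*) AS n FROM pets GROUP BY species").df()
print(result)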
what is dask
a Python data tool that recreates as much of Pandas/NumPy/etc. as possible, but does it with lazy evaluation and allows for distributed computation like Spark. Dask can also be deployed on a cluster
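A quick sketch of the Pandas-like but lazy style (file and column names are made up):

import dask.dataframe as dd

# looks like Pandas, but builds a lazy task graph over many partitions/files
df = dd.read_csv('weather-*.csv')
means = df.groupby('city')['temperature'].mean()

print(means.compute())   # .compute() actually runs the (possibly distributed) work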
what is broadcast used for in spark
if you have one small DataFrame and want to join it with another large DataFrame, rather than shuffling a bunch of data you can broadcast the small one, so each executor essentially has a lookup table instead
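A minimal sketch, assuming a SparkSession named spark (the data is invented):

from pyspark.sql import functions

big = spark.range(10000000).withColumn('code', functions.col('id') % 3)
small = spark.createDataFrame([(0, 'red'), (1, 'green'), (2, 'blue')],
                              schema=['code', 'colour'])

# hint that 'small' should be copied to every executor (like a lookup table),
# so the big DataFrame does not have to be shuffled for the join
joined = big.join(functions.broadcast(small), on='code')
joined.show(5)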
in rdd, explain these functions
df.rdd
rdd.take(n)
rdd.map(f)
rdd.filter(f)
df.rdd : get the equivalent RDD from a DataFrame
rdd.take(n): retrieve the first n elements from the RDD as a Python list.
rdd.map(f): apply function f to each element, creating a new RDD from the returned values.
rdd.filter(f): apply function f to each element, keep rows where it returned True.
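A tiny end-to-end example of these, assuming a SparkSession named spark:

df = spark.range(5)
rdd = df.rdd                                    # RDD of Row objects

print(rdd.take(3))                              # first 3 elements as a Python list
doubled = rdd.map(lambda row: row['id'] * 2)    # new RDD of plain Python ints
evens = doubled.filter(lambda x: x % 2 == 0)    # keep elements where the function returns True
print(evens.take(5))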