quiz3 Flashcards
what is a compute cluster
several computers working together to do some work
define the job of YARN
HDFS
Spark
How do they all work together
YARN: manages the compute jobs in a cluster
HDFS: stores data on cluster nodes
Spark: a framework to do computation on YARN
we use Spark to express the computation we want to do in a way that can be sent to a cluster and done in parallel. YARN takes that job, organizes it, and gets it done; HDFS stores all the pieces of our data files across the cluster
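a quick PySpark sketch of how the pieces connect (the HDFS path is made up, and 'yarn' assumes a configured cluster; this is just illustration, not the course's exact setup):
from pyspark.sql import SparkSession

# 'yarn' asks YARN to manage this job on the cluster; locally you'd use 'local[*]'
spark = SparkSession.builder.master('yarn').appName('quiz3').getOrCreate()

# the input lives in HDFS, split into pieces stored on the cluster nodes (hypothetical path)
df = spark.read.csv('hdfs:///user/me/data.csv', header=True)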
true or false: a Spark DataFrame is the same as a pandas DataFrame
false
when writing out a spark dataframe, what does it create
it creates a directory containing several files (roughly one per partition), not just a single file
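e.g. a small sketch (the output directory name is made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10000)
# creates a directory named 'output' holding one part-* file per partition, plus a _SUCCESS marker
df.write.csv('output', mode='overwrite')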
what does .filter() do on a dataframe
.filter() is similar to SQL's WHERE: it keeps only the rows that satisfy the filter condition
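e.g.:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10000)
# keeps only the rows where id is even, like SQL's WHERE id % 2 = 0
evens = df.filter(df['id'] % 2 == 0)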
what is a driver and an executor in spark
the driver is the program you write (it builds and coordinates the job); the executors are the processes/threads on the cluster nodes that actually run the work
true or false: parallelism is controlled by the way the data is partitioned
true
there are a few ways you can control the number of partitions
describe them
spark.range(10000, numPartitions=6)
.coalesce(num)
.repartition(num)
- spark.range(..., numPartitions=...): sets the number of partitions explicitly when the data is created
- .coalesce(num): for concatenating partitions if there are too many. It lowers the number of partitions, but not in the way you might expect: existing partitions are merged as-is. It can also be used to clean up your output if you know there is only a small amount of data left in each partition
- .repartition(num): rearranges the partitions, but it is expensive, since it does a full shuffle to get evenly-sized partitions: lots of memory/data moving
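a small sketch of all three:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10000, numPartitions=6)     # partition count set explicitly at creation
fewer = df.coalesce(2)                       # merges existing partitions, no full shuffle
balanced = df.repartition(12)                # full shuffle into evenly-sized partitions
print(fewer.rdd.getNumPartitions(), balanced.rdd.getNumPartitions())   # 2 12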
name examples of shuffle operations
.repartition
.coalesce
.groupBy
.sort
what is a pipeline operation
the opposite of a shuffle operation:
one where the partitions can be handled completely independently of each other. Ideally each row is handled independently; most dataframe operations are in this category.
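e.g. the first two operations below are pipeline operations; the sort at the end forces a shuffle (column names are just made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10000)
# each row can be transformed and filtered without looking at any other partition
pipelined = df.withColumn('double', df['id'] * 2).filter(df['id'] > 100)
# sorting needs rows to move between partitions: a shuffle
shuffled = pipelined.sort('double', ascending=False)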
why is the groupBy shuffle not as bad
because groupBy reduces the number of rows within each partition before the shuffling occurs.
ex. if we have a billion rows and you group by a column with 10 distinct values, then at most 10 rows from each partition have to be shuffled
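a sketch of this (grouping a range into 10 groups, purely for illustration):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10**6)
# each partition is first reduced to at most 10 rows (one per group), then only those are shuffled
counts = df.groupBy((df['id'] % 10).alias('group')).agg(F.count('*').alias('n'))
counts.show()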
why can spark create dataframe after dataframe so cheaply
spark uses lazy evaluation:
when you create a dataframe, you haven't actually done the calculation needed to produce it yet
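e.g.:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# nothing is computed here: Spark only records the plan for these dataframes
big = spark.range(10**8)
doubled = big.select((big['id'] * 2).alias('double'))
# the work only happens when an action (count, show, write, ...) needs the result
print(doubled.count())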
why is coalesce not as smart as repartition
coalesce is done lazily: it merges existing partitions without knowing how much data ends up in each, so the result can be unbalanced
repartition waits for everything beforehand and does a full shuffle, so it can balance the partitions evenly
what is the downfall with lazy evaluation
how do you solve it
spark doesn't know if you are going to use the same data twice, so it can throw away values before they're needed again, but it can't keep all intermediate results just in case because they could be large
solution: cache it with the .cache() method. This tells Spark to store the result because we are going to use it later
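e.g.:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10**6)
# mark the filtered result to be kept in memory once it has been computed
evens = df.filter(df['id'] % 2 == 0).cache()
# the second action reuses the cached rows instead of redoing the filter
print(evens.count())
evens.agg({'id': 'max'}).show()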
what's the downfall of join
too much memory moving (potentially)
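e.g. a sketch of why: both sides have to be shuffled so that matching keys end up in the same partition (the columns here are made up):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
left = spark.range(10**6).withColumn('value', F.rand())
right = spark.range(10**6).withColumn('label', F.lit('x'))
# rows with the same id must meet in the same partition, so (potentially) a lot of data moves
joined = left.join(right, on='id')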