quiz3 Flashcards
what is a compute cluster
several computers working together to do some work
define the job of YARN
HDFS
Spark
How do they all work together
YARN: manages the compute jobs in a cluster
HDFS: stores data on cluster nodes
Spark: a framework to do computation on YARN
we use Spark to express the computation we want to do in a way that can be sent to a cluster and done in parallel. YARN takes that job, organizes it, and gets it done; HDFS stores all the pieces of our data files across the cluster
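a quick PySpark sketch of how the pieces connect (the HDFS path is made up, and 'yarn' assumes a configured cluster; this is just illustration, not the course's exact setup):
from pyspark.sql import SparkSession

# 'yarn' asks YARN to manage this job on the cluster; locally you'd use 'local[*]'
spark = SparkSession.builder.master('yarn').appName('quiz3').getOrCreate()

# the input lives in HDFS, split into pieces stored on the cluster nodes (hypothetical path)
df = spark.read.csv('hdfs:///user/me/data.csv', header=True)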
true or false: a Spark DataFrame is the same as a pandas DataFrame
false
when writing out a spark dataframe, what does it create
it creates a directory containing several files (roughly one per partition), not just a single file
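e.g. a small sketch (the output directory name is made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10000)
# creates a directory named 'output' holding one part-* file per partition, plus a _SUCCESS marker
df.write.csv('output', mode='overwrite')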
what does .filter() do on a dataframe
.filter() is similar to SQL's WHERE: it keeps only the rows that satisfy the filter condition
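e.g.:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10000)
# keeps only the rows where id is even, like SQL's WHERE id % 2 = 0
evens = df.filter(df['id'] % 2 == 0)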
what is a driver and an executor in spark
the driver is the program you write (it builds and coordinates the job); the executors are the processes/threads on the cluster nodes that actually run the work
true or false: parallelism is controlled by the way the data is partitioned
true
there are a few ways you can control the number of partitions
describe them
spark.range(10000, numPartitions=6)
.coalesce(num)
.repartition(num)
- spark.range(..., numPartitions=...): sets the number of partitions explicitly when the data is created
- .coalesce(num): for concatenating partitions if there are too many. It lowers the number of partitions, but not in the way you might expect: existing partitions are merged as-is. It can also be used to clean up your output if you know there is only a small amount of data left in each partition
- .repartition(num): rearranges the partitions, but it is expensive, since it does a full shuffle to get evenly-sized partitions: lots of memory/data moving
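a small sketch of all three:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10000, numPartitions=6)     # partition count set explicitly at creation
fewer = df.coalesce(2)                       # merges existing partitions, no full shuffle
balanced = df.repartition(12)                # full shuffle into evenly-sized partitions
print(fewer.rdd.getNumPartitions(), balanced.rdd.getNumPartitions())   # 2 12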
name examples of shuffle operations
.repartition
.coalesce
.groupBy
.sort
what is a pipeline operation
the opposite of a shuffle operation:
one where the partitions can be handled completely independently of each other. Ideally each row is handled independently; most dataframe operations are in this category.
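e.g. the first two operations below are pipeline operations; the sort at the end forces a shuffle (column names are just made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10000)
# each row can be transformed and filtered without looking at any other partition
pipelined = df.withColumn('double', df['id'] * 2).filter(df['id'] > 100)
# sorting needs rows to move between partitions: a shuffle
shuffled = pipelined.sort('double', ascending=False)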
why is the groupBy shuffle not as bad
because groupBy reduces the number of rows within each partition before the shuffling occurs.
ex. if we have a billion rows and you group by a column with 10 distinct values, then at most 10 rows from each partition have to be shuffled
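a sketch of this (grouping a range into 10 groups, purely for illustration):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10**6)
# each partition is first reduced to at most 10 rows (one per group), then only those are shuffled
counts = df.groupBy((df['id'] % 10).alias('group')).agg(F.count('*').alias('n'))
counts.show()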
why can spark create dataframe after dataframe so cheaply
spark uses lazy evaluation:
when you create a dataframe, you haven't actually done the calculation needed to produce it yet
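e.g.:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# nothing is computed here: Spark only records the plan for these dataframes
big = spark.range(10**8)
doubled = big.select((big['id'] * 2).alias('double'))
# the work only happens when an action (count, show, write, ...) needs the result
print(doubled.count())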
why is coalesce not as smart as repartition
coalesce is done lazily: it merges existing partitions without knowing how much data ends up in each, so the result can be unbalanced
repartition waits for everything beforehand and does a full shuffle, so it can balance the partitions evenly
what is the downfall with lazy evaluation
how do you solve it
spark doesn't know if you are going to use the same data twice, so it can throw away values before they're needed again, but it can't keep all intermediate results just in case because they could be large
solution: cache it with the .cache() method. This tells Spark to store the result because we are going to use it later
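e.g.:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10**6)
# mark the filtered result to be kept in memory once it has been computed
evens = df.filter(df['id'] % 2 == 0).cache()
# the second action reuses the cached rows instead of redoing the filter
print(evens.count())
evens.agg({'id': 'max'}).show()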
what's the downfall of join
too much memory moving (potentially)
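e.g. a sketch of why: both sides have to be shuffled so that matching keys end up in the same partition (the columns here are made up):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
left = spark.range(10**6).withColumn('value', F.rand())
right = spark.range(10**6).withColumn('label', F.lit('x'))
# rows with the same id must meet in the same partition, so (potentially) a lot of data moves
joined = left.join(right, on='id')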