PySpark Flashcards

Question 1

Q

A distributed process has

Answer

A

access to the computational resources across a number of machines connected through a network

Distributed systems also scale easily: you can just add more machines.

They also include fault tolerance…

Question 2

Q

Hadoop is a way …

MapReduce

Answer

A

to distribute very large files across multiple machines.

MapReduce distributes a computational task across a distributed data set.
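
As a rough illustration of the MapReduce idea, here is a minimal single-machine sketch in plain Python. It is not Hadoop code: the `chunks` list and the helper names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, chosen only to show how a task is split into a map step per chunk and a reduce step per key.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for a chunk of a large file
# that would live on a different machine.
chunks = ["spark is fast", "hadoop is big", "spark is flexible"]


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Each chunk is mapped independently (on a real cluster, in parallel).
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

On a real cluster the map calls run on the machines that hold each chunk, and only the grouped intermediate pairs travel over the network.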

Question 3

Q

Spark is

Answer

A

an open-source framework used to handle Big Data quickly and easily.

You can think of Spark as a flexible alternative to MapReduce.

Question 4

Q

Spark vs MapReduce

Answer

A

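MapReduce requires files to be stored in HDFS; Spark doesn't.

Spark can also perform operations up to 100x faster than MapReduce, largely because it can keep intermediate data in memory instead of writing it to disk after every step.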

Question 5

Q

Scala

Answer

A

Spark itself is not a programming language. It is a framework for handling large data sets and distributing the calculations across a network of machines. Spark is written in a programming language called Scala.

So the Scala API for Spark is the one that gets the latest features first, which makes sense because Spark itself is written in Scala.

Scala runs on the Java Virtual Machine (JVM) and compiles to Java bytecode.

Question 6

Q

Databricks : AWS

Answer

A

Databricks is a hosted Spark platform that runs on top of Amazon Web Services (AWS).

Question 7

Q

show DataFrame

Answer

A

df.show()

display(df)  (Databricks notebooks only)

Question 8

Q

schema

Answer

A

df.printSchema()

Question 9

Q

df column names

Answer

A

df.columns

Question 10

Q

df summary statistics

Answer

A

df.describe().show()

Question 11

Q

values of one column

Answer

A

df.select('net_bkd_pax').show()

Question 12

Q

values of two columns

Answer

A

df_gdd.select(['net_bkd_pax', 'net_ia_pax']).show()

Question 13

Q

Add a new column

Answer

A

df.withColumn('new', df['net_ia_pax'] * 2).show()

withColumn returns a new DataFrame; the original DataFrame is not modified. To keep the change, assign the result to a variable.

Question 14

Q

Rename a column

Answer

A

df.withColumnRenamed('net_ia_pax', 'new_name').show()

Question 15

Q

create temp view

Answer

A

df_gdd.createOrReplaceTempView('gdd')
result = spark.sql("SELECT * FROM gdd")
result.show()

Question 16

Q

read csv

Answer

A

df = spark.read.csv("address", header=True, inferSchema=True)

Question 17

Q

head

Answer

A

df.head(3)[0]

Question 18

Q

filter

Answer

A

df.filter("runid<10").show()

Question 19

Q

filter and show one column

Answer

A

df.filter("runid<10").select("forecastname").show()

df.filter(df["runid"] < 10).select("forecastname").show()

Question 20

Q

filter based on two conditions

Answer

A

df.filter((df["runid"] < 10) & (df["start_date"] < 20)).select("forecastname").show()

Wrap each condition in parentheses, because & binds more tightly than comparison operators such as <.
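
The reason for the parentheses is Python operator precedence, which can be seen without Spark at all. The sketch below uses plain integers as hypothetical stand-ins for column values.

```python
# Hypothetical plain-integer stand-ins for one row's column values.
runid, start_date = 3, 30

# Intended meaning: both conditions must hold.
with_parens = (runid < 10) & (start_date < 20)

# Without parentheses, Python evaluates 10 & start_date first and then
# chains the comparisons: runid < (10 & start_date) < 20.
without_parens = runid < 10 & start_date < 20
```

Here `with_parens` is false (30 is not below 20), while `without_parens` comes out true because `10 & 30` is 10 and `3 < 10 < 20` holds. With PySpark columns the unparenthesized form typically fails with an error instead, since the chained comparison tries to convert a Column to a boolean.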

Question 21

Q

.collect

Answer

A

If we use .collect() instead of .show(), we get back a list of Row objects, so the data can be used in later operations. .show() only prints the rows and returns nothing.

Question 22

Q

Row to dict

Answer

A

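result[0].asDict()

asDict() is a method of a single Row, not of the list returned by .collect(), so index into the list first.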

(22 cards)