Pyspark Flashcards
A distributed process has
access to the computational resources across a number of machines connected through a network
Distributed machines also have the advantage of easily scaling, you can just add more machines
They also include fault tolerance…
Hadoop is a way …
MapReduce
to distribute very large files across multiple machines.
MapReduce distribute a computational task to a distributed data set
Spark is
one of the latest technologies being used to quickly and easily handle Big Data.
You can think of Spark as a flexible alternative to MapReduce.
Spark vs MapReduce
MapReduce requires files to be stored in HDFS, Spark doesn’t.
Spark also can perform operations up to 100x faster than MapReduce
Scala
Spark itself is not a programming language. It’s just a framework for dealing with large data and distributing it and doing those calculations across a distributor network Spark itself is written in a programming language known as Scala.
So the Scala API for Spark is the one that gets the latest features which makes sense because Spark has literally written in Scala.
Scala is written in Java.
DataBreaks : AWS
DataBreaks basically running on top of Amazon Web Services.
show DataFrame
df.show()
display(df)
schema
df.printSchema()
df column names
df.columns
df stat
df.describe().show()
a column values
df.select(‘net_bkd_pax’).show()
two column values
df_gdd.select([‘net_bkd_pax’,’net_ia_pax’]).show()
Add a new column
df.withColumn(‘new’, df[‘net_ia_pax’]*2).show()
These changes are not permanent on our original dataframe. You would have to save this to a new variable.
Rename a column
df.withColumnRenamed(‘net_ia_pax’, ‘new_name’).show()
create temp view
df_gdd.createOrReplaceTempView(‘gdd’)
result = spark.sql(“SELECT * FROM gdd”)
result.show()