Instructor's Method - 6/16/2021 Flashcards
Azure Data Factory
Ingest - copy data from 90+ different sources
Transform - mapping data flows; code-free workflows
Orchestrate - end-to-end workflows
This is a standalone tool
Difference between Azure Data Factory and Synapse Pipelines
Barely any - they share the same codebase
Mapping Data Flows
Visually design ETL pipelines
Once data flow is created, run in
Spark
Open-source data processing engine built around speed, ease of use, and sophisticated analytics
Compute engine designed for distributed data processing at scale
In-memory engine that can be up to 100 times faster than Hadoop MapReduce on some workloads
One of the largest open-source data processing projects
Multi-language support - Scala, Java, SQL, R & Python
Spark SQL
Batch processing
Spark Streaming
Stream processing
Spark Pool
Node size
Number of nodes
Apache Spark version
Different library versions that will be installed
Auto-pause (auto-termination time)
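The settings above can be pictured as one small record. The following is only an illustrative sketch: `SparkPoolSettings` and its field names are hypothetical, not a Synapse API, and the example values (including the 15-minute auto-pause default) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SparkPoolSettings:
    """Hypothetical container mirroring the Spark pool settings on this card."""

    node_size: str                      # e.g. "Small", "Medium", "Large"
    node_count: int                     # number of nodes in the pool
    spark_version: str                  # Apache Spark version of the pool
    library_versions: dict = field(default_factory=dict)  # package -> version
    auto_pause_minutes: int = 15        # assumed idle time before auto-pause

    def summary(self):
        """Return a one-line, human-readable description of the pool."""
        return (
            f"{self.node_count} x {self.node_size} nodes, "
            f"Spark {self.spark_version}, "
            f"auto-pause after {self.auto_pause_minutes} min"
        )


# Example values are illustrative only.
pool = SparkPoolSettings(node_size="Small", node_count=3, spark_version="3.4")
```

Calling `pool.summary()` gives a compact description of the pool for notes or logs.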
How to treat the first row as a header when reading a file
.option("header", "true")
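A minimal sketch of where the header option sits in a read call, assuming PySpark's CSV reader. The function name `read_csv_with_header` and the file path in the usage comment are hypothetical; `spark` is expected to be an existing SparkSession, such as the one a Synapse notebook provides.

```python
def read_csv_with_header(spark, path):
    """Read a CSV file, using its first row as the column names.

    `spark` is an existing SparkSession; `path` is the file location.
    Without the header option, Spark names the columns _c0, _c1, ...
    and treats the first row as data.
    """
    return spark.read.option("header", "true").csv(path)


# Usage inside a notebook (the path is a made-up example):
#   df = read_csv_with_header(spark, "data/sales.csv")
```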
Which native Spark function gives summary statistics for the data
describe()
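As a short sketch of how describe() is typically called, assuming an existing Spark DataFrame: the wrapper name `summarize` and the column name in the usage comment are hypothetical.

```python
def summarize(df, *columns):
    """Return a DataFrame of summary statistics for `df`.

    DataFrame.describe() computes count, mean, stddev, min, and max.
    With no column names it covers all numeric and string columns.
    """
    return df.describe(*columns)


# Usage inside a notebook, with `df` an existing Spark DataFrame:
#   summarize(df).show()
#   summarize(df, "price").show()   # "price" is a hypothetical column
```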
What is the display() function used for
Built-in notebook function (provided by Synapse, not native Spark) that renders a data frame as a table
Difference between show() and display()
show() is a native Spark function and prints the data as plain text
display() comes from the Synapse notebook and renders the data as HTML
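The difference can be sketched with a small helper that uses display() when the notebook supplies it and falls back to show() otherwise. The helper name `preview` and its `display_fn` parameter are hypothetical, introduced only for illustration.

```python
def preview(df, rows=5, display_fn=None):
    """Show the first `rows` rows of a Spark DataFrame.

    If `display_fn` is given (for example the notebook's display()),
    the rows are rendered by the notebook. Otherwise Spark's native
    show() prints them as plain text. Returns which path was used.
    """
    if display_fn is not None:
        display_fn(df.limit(rows))
        return "display"
    df.show(rows)
    return "show"


# Usage inside a Synapse notebook, with `df` an existing DataFrame:
#   preview(df)                      # plain-text output via show()
#   preview(df, display_fn=display)  # HTML table via display()
```

Passing display() in as a parameter keeps the helper usable outside a notebook, where display() is not defined.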
Show data in text format
spark.read.parquet("filename").show()
Show data in tabular format
using display(Dframe, True)