04 Working With Spark Data Sources & Sinks Flashcards
What are Spark data sources?
The systems from which Spark reads data.
What are the different types of data sources?
- External data sources.
- Internal data sources.
Name some external data sources.
Oracle, SQL Server, Cassandra, Snowflake, Redshift, Kafka
Give examples of internal storage systems.
HDFS, Azure Blob Storage, Amazon S3
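What are data sinks?
The destinations where Spark writes its output. A sink can be either internal or external.
Name file formats with a well-defined schema.
Avro and Parquet.
Can we store the schema in a JSON file?
No. A JSON file carries no embedded schema, so the reader has to infer it or be given one explicitly.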
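To see why a JSON file has no schema of its own, here is a minimal sketch in plain Python (it does not call Spark). The sample records and the helper `infer_schema` are hypothetical and only illustrate that field types must be discovered by scanning the data.

```python
import json

# Two hypothetical JSON Lines records; nothing in the text declares field types.
raw = '{"id": 1, "city": "Pune"}\n{"id": 2, "city": "Oslo", "temp": 3.5}'

def infer_schema(text):
    """Scan every record and collect the Python type names seen per field."""
    schema = {}
    for line in text.splitlines():
        for key, value in json.loads(line).items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema

schema = infer_schema(raw)
print(schema)
```

Note that the field `temp` only shows up in the second record, so a reader that stopped after the first record would miss it.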
Which file formats store schema information in the file itself?
Parquet and Avro.
What is the recommended file format for writing output?
Parquet
How can we manage the data layout?
By partitioning. We can partition the data on one or more columns, so that the output is organized into a subdirectory for each partition value.
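The resulting directory layout can be sketched in plain Python. This is an illustration only, not Spark code: in Spark the equivalent step is `DataFrameWriter.partitionBy`, for example `df.write.partitionBy("country", "year").parquet(path)`. The sample rows, the column names, the file name `part-0.csv`, and the helper `write_partitioned` are all assumptions made for the example.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical sample rows; "country" and "year" are the partition columns.
rows = [
    {"country": "IN", "year": "2023", "amount": "10"},
    {"country": "IN", "year": "2024", "amount": "20"},
    {"country": "US", "year": "2024", "amount": "30"},
]

def write_partitioned(rows, root, partition_cols):
    """Write rows into column=value subdirectories, one CSV file per partition."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in partition_cols)
        groups[key].append(row)
    for key, members in groups.items():
        parts = [f"{c}={v}" for c, v in zip(partition_cols, key)]
        part_dir = Path(root).joinpath(*parts)
        part_dir.mkdir(parents=True, exist_ok=True)
        # Partition columns live in the directory names, so drop them from the file.
        data_cols = [c for c in members[0] if c not in partition_cols]
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=data_cols, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(members)

root = tempfile.mkdtemp()
write_partitioned(rows, root, ["country", "year"])
layout = sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*.csv"))
print(layout)
```

Each distinct combination of partition values gets its own nested directory, such as `country=IN/year=2023`.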
If the data is partitioned, how will the DataFrameReader read it?
The DataFrameReader will read all the subdirectories.
What happens when we apply a filter on partitioned data?
When the filter is on a partition column, Spark does not scan every file. It reads only the partition directories that match the filter.
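The idea of skipping non-matching directories can be sketched in plain Python. This is a simplified illustration, not Spark's implementation; the sample data, the `country` partition column, and the helper `read_with_pruning` are hypothetical.

```python
import tempfile
from pathlib import Path

# Build a tiny partitioned layout by hand (hypothetical data, one file per partition).
root = Path(tempfile.mkdtemp())
for country, amount in [("IN", "10"), ("US", "30"), ("DE", "50")]:
    part_dir = root / f"country={country}"
    part_dir.mkdir()
    (part_dir / "part-0.csv").write_text(f"amount\n{amount}\n")

def read_with_pruning(root, column, wanted):
    """Open only the files under directories whose name matches column=wanted."""
    opened, values = [], []
    for part_dir in sorted(root.iterdir()):
        name, _, value = part_dir.name.partition("=")
        if name != column or value != wanted:
            continue  # pruned: files in this directory are never opened
        for path in sorted(part_dir.glob("*.csv")):
            opened.append(path.relative_to(root).as_posix())
            values.extend(path.read_text().splitlines()[1:])  # skip the header line
    return opened, values

opened, values = read_with_pruning(root, "country", "US")
print(opened, values)
```

Only the file under `country=US` is opened; the decision to skip the other two directories is made from their names alone, without reading their contents.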