Adswerve Study Guide Flashcards
What is BigQuery?
BigQuery is Google Cloud's fully managed, petabyte-scale data warehouse
BigQuery Formats
Avro, CSV, JSON (newline-delimited), ORC, Parquet, Cloud Datastore exports, Cloud Firestore exports
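To illustrate the newline-delimited JSON format: each line of the file is one complete JSON object, with no enclosing array and no commas between records. A minimal sketch using only the Python standard library; the `to_ndjson` helper name and the sample `rows` are made up for this example.

```python
import json

def to_ndjson(rows):
    """Serialize an iterable of dicts as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows) + "\n"

# Hypothetical sample records, not from any real dataset.
rows = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]

ndjson_text = to_ndjson(rows)
print(ndjson_text, end="")
```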
Parquet
A columnar data format common on HDFS; BQ can load it
ORC
(Optimized Row Columnar) A columnar data format common on HDFS; BQ can load it
Hadoop
Software framework for distributed storage and processing of big data
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
Dataproc: HDFS or GCS
Google recommends using GCS for storage instead of HDFS
Why does Google recommend PubSub over Kafka?
Better scaling, fully managed service
How long can Kafka retain messages?
As long as you configure it to
How long can PubSub retain messages?
7 days by default for unacknowledged messages
Is Kafka push or pull?
Pull
Is PubSub push or pull?
Both
Does Kafka guarantee ordering?
Yes in a partition
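Why ordering holds only within a partition: messages with the same key go to the same partition, and each partition is an append-only log, but nothing orders one partition relative to another. The sketch below is a toy model in plain Python, not Kafka client code; `partition_for` uses CRC32 as a stand-in for the client's real partitioner, and the partition count is arbitrary.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # arbitrary value for illustration

def partition_for(key: str) -> int:
    """Map a key to a partition. CRC32 is a stand-in hash, not Kafka's actual partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Each partition is modeled as an append-only list.
partitions = defaultdict(list)

# Hypothetical events: (key, sequence number) in publish order.
events = [("user-a", 1), ("user-b", 1), ("user-a", 2), ("user-b", 2), ("user-a", 3)]
for key, seq in events:
    partitions[partition_for(key)].append((key, seq))

# Within any one partition, each key's events appear in publish order.
for pid, log in sorted(partitions.items()):
    print(pid, log)
```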
Does PubSub guarantee ordering?
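Not by default. Ordering is guaranteed only for messages published with the same ordering key, on a subscription with message ordering enabled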
Kafka Delivery Guarantee
At most once, at least once, exactly once (limited)
PubSub Delivery Guarantee
At least once for each subscription
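At-least-once delivery means a subscriber can receive the same message more than once, so handlers are usually written to be idempotent. A minimal sketch of deduplication by message ID, in plain Python with no Pub/Sub client; the `handle` function, the in-memory `seen_ids` set, and the sample messages are assumptions for illustration, and a real subscriber would need durable storage for the seen IDs.

```python
seen_ids = set()   # in-memory only; hypothetical stand-in for durable storage
processed = []     # payloads that were actually applied

def handle(message_id: str, payload: str) -> bool:
    """Apply a message once; return False if it was a redelivery and was skipped."""
    if message_id in seen_ids:
        return False
    seen_ids.add(message_id)
    processed.append(payload)
    return True

# Simulated delivery stream in which message "m1" is redelivered.
deliveries = [("m1", "create order"), ("m2", "charge card"), ("m1", "create order")]
results = [handle(mid, body) for mid, body in deliveries]
print(results)
print(processed)
```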
Spark
A distributed data processing framework that keeps working data in RAM. Commonly runs on Hadoop clusters
What is BQ's default encoding?
UTF-8
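Because UTF-8 is the default, a file exported in another encoding may need converting before it is loaded. A minimal sketch of re-encoding in Python; the `to_utf8` helper, the sample string, and the assumption that the source is ISO-8859-1 are illustrative only.

```python
def to_utf8(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Decode bytes from the assumed source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

# "café" in ISO-8859-1 stores é as one byte; in UTF-8 it takes two bytes.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes)

print(latin1_bytes)
print(utf8_bytes)
```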
Dataproc: is HDFS data persistent?
No, it goes away when the Dataproc cluster is shut down
Dataproc: is GCS data persistent?
Yes, it remains even when a cluster is shut down
BigTable: What causes hotspotting?
Contiguous (sequential) row keys, which concentrate reads and writes on one node. Example keys: 20190101 20190102 20190103 …
BigTable: How to prevent hotspotting?
Make row keys non-contiguous. Example keys: a93js-20190101, vomdn-20190102, odsjs-20190103
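One way to produce non-contiguous keys of this shape is to prefix each key with a short hash of itself, so that consecutive dates land in different parts of the key space. A minimal sketch; the `salted_key` helper, the 5-character prefix length, and the choice of SHA-256 are assumptions for illustration, not a BigTable API. The trade-off is that a hash prefix makes range scans by date harder.

```python
import hashlib

def salted_key(natural_key: str, prefix_len: int = 5) -> str:
    """Prefix a key with a deterministic hash fragment so sequential keys spread out."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{natural_key}"

dates = ["20190101", "20190102", "20190103"]
keys = [salted_key(d) for d in dates]
for k in keys:
    print(k)

# The prefix is deterministic, so a reader can recompute the full key from the date.
```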
What is ANSI SQL?
The standardized SQL specification. BQ's Standard SQL dialect complies with it (SQL:2011)
Dataproc: when to create a cluster?
Google recommends job-specific clusters: create a separate cluster for each job