Adswerve Study Guide Flashcards
What is BigQuery?
BigQuery is the fully managed, petabyte-scale data warehouse on Google Cloud
BigQuery Formats
Avro, CSV, JSON (newline-delimited), ORC, Parquet, Cloud Datastore exports, Cloud Firestore exports
Parquet
A columnar data format from the Hadoop ecosystem, often stored on HDFS. BQ can load it
ORC
(Optimized Row Columnar) A columnar data format from the Hadoop ecosystem, often stored on HDFS. BQ can load it
Hadoop
Software framework for distributed storage and processing of big data
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
Dataproc: HDFS or GCS
Google recommends using GCS for storage instead of HDFS
Why does Google recommend PubSub over Kafka?
It scales automatically and is a fully managed service
How long can Kafka retain messages?
For as long as you configure it to
How long can PubSub retain messages?
Up to 7 days for unacknowledged messages in a subscription (7 days is also the default)
Is Kafka push or pull?
Pull
Is PubSub push or pull?
Both
Does Kafka guarantee ordering?
Yes, but only within a partition. There is no ordering guarantee across partitions
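The per-partition guarantee is easier to see with a small simulation. This is a minimal sketch, not Kafka's real partitioner: the function name pick_partition, the sample keys, and the use of CRC32 as the hash are all assumptions made for illustration. The only point is that messages with the same key land in the same partition, so their relative order is preserved there.

```python
import zlib
from collections import defaultdict

def pick_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition (hypothetical stand-in hash)."""
    # CRC32 is used only for illustration; Kafka's default partitioner
    # uses a different hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Each partition is modeled as an append-only list.
partitions = defaultdict(list)
events = [("user-a", 1), ("user-b", 1), ("user-a", 2), ("user-b", 2), ("user-a", 3)]
for key, seq in events:
    partitions[pick_partition(key, 3)].append((key, seq))

# Within any one partition, the messages for a given key keep their send order.
for number, messages in sorted(partitions.items()):
    print(number, messages)
```

Messages with different keys may end up in different partitions, and nothing orders one partition relative to another.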
Does PubSub guarantee ordering?
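Not by default. Ordering is only available for messages published with the same ordering key, and only when message ordering is enabled on the subscription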
Kafka Delivery Guarantee
At most once, at least once, exactly once (limited)
PubSub Delivery Guarantee
At least once for each subscription
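At-least-once delivery means a subscriber can see the same message more than once, so handlers are usually written to be idempotent. A minimal sketch, assuming a hypothetical IdempotentHandler class that remembers message IDs in memory; a real subscriber would more likely track IDs in a durable store, and none of these names come from the PubSub client library.

```python
from typing import Callable, Set

class IdempotentHandler:
    """Wrap a processing function so redelivered messages are skipped."""

    def __init__(self, process: Callable[[bytes], None]) -> None:
        self._process = process
        # Assumption: an in-memory set is enough for this illustration.
        self._seen: Set[str] = set()

    def handle(self, message_id: str, data: bytes) -> bool:
        """Process a message once; return False for a duplicate delivery."""
        if message_id in self._seen:
            return False
        self._process(data)
        # Record the ID only after processing succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen.add(message_id)
        return True

received = []
handler = IdempotentHandler(received.append)
first = handler.handle("msg-1", b"hello")
second = handler.handle("msg-1", b"hello")  # simulated redelivery
```

In either case the subscriber would still acknowledge the message, so that the duplicate is not delivered again.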
Spark
Distributed processing framework that keeps data in memory (RAM). Commonly runs on Hadoop clusters
What is BQ's default encoding?
UTF-8
Dataproc: is HDFS data persistent?
No, it goes away when the Dataproc cluster is shut down
Dataproc: is GCS data persistent?
Yes, it remains even when a cluster is shut down
BigTable: What causes hotspotting?
Contiguous (sequential) row keys, which concentrate reads and writes on one node. Example keys: 20190101, 20190102, 20190103 …
BigTable: How to prevent hotspotting?
Make row keys non-contiguous. Example keys: a93js-20190101, vomdn-20190102, odsjs-20190103
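One way to produce non-contiguous keys like these is to prefix each date with a short hash of another field. A minimal sketch, assuming a hypothetical device ID field and a helper named make_row_key; neither appears in the flashcards, and hash prefixing is only one of several possible key designs.

```python
import hashlib

def make_row_key(device_id: str, date: str, prefix_len: int = 5) -> str:
    """Build a row key whose leading characters do not follow date order."""
    # The prefix is derived from the device ID, so consecutive dates for
    # different devices no longer sort next to each other.
    prefix = hashlib.md5(device_id.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{prefix}-{date}"

dates = ["20190101", "20190102", "20190103"]
keys = [make_row_key(f"device-{i}", d) for i, d in enumerate(dates)]
print(keys)
```

Because the prefix is deterministic, a reader who knows the device ID can rebuild the key. The trade-off is that a scan over a date range across all devices is no longer one contiguous range.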
What is ANSI SQL?
The standardized form of SQL. Standard SQL in BQ follows it (SQL:2011)
Dataproc: when to create a cluster?
Google recommends job-specific clusters. Create a separate cluster for each job and delete it when the job finishes
Which is simpler: cbt or the HBase shell?
cbt
Databases: What does an index do?
Speeds up lookups on the indexed column, at the cost of extra storage and slower writes
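The idea can be sketched without a database. In this illustration, which uses made-up rows and hypothetical helper names, a full scan checks every row, while an index built once on the city column answers the same question with a single dictionary lookup.

```python
from collections import defaultdict
from typing import Dict, List

rows = [
    {"id": 1, "city": "Denver"},
    {"id": 2, "city": "Austin"},
    {"id": 3, "city": "Denver"},
    {"id": 4, "city": "Boston"},
]

def scan(table: List[dict], city: str) -> List[int]:
    """Without an index: inspect every row."""
    return [row["id"] for row in table if row["city"] == city]

def build_index(table: List[dict], column: str) -> Dict[str, List[int]]:
    """Build a simple index: column value -> list of matching row IDs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for row in table:
        index[row[column]].append(row["id"])
    return index

city_index = build_index(rows, "city")
denver_ids = city_index["Denver"]  # one lookup instead of a full scan
```

The index has to be updated on every insert or change to the city column, which is where the write cost comes from.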
What is an MID?
Machine-generated IDentifier. A unique identifier for an entity in Google’s Knowledge Graph
BQ: Does LIMIT clause reduce cost?
No. LIMIT only caps the rows returned. All the data will still be scanned and billed
BQ: Can you change an existing table to use partitions?
No. A table cannot be converted in place. You must create a new partitioned table and load or copy the data into it
BQ: which column specifies a partition?
_PARTITIONTIME (a pseudo-column on ingestion-time partitioned tables)
BQ: which column specifies a shard (wildcard)?
_TABLE_SUFFIX
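Both pseudo-columns are normally used in a WHERE clause so that BQ reads only the partitions or shards that are needed. A minimal sketch that builds the two queries as strings; the project, dataset, and table names are hypothetical, and the helper functions are illustrative rather than part of any BQ client library.

```python
def partition_query(table: str, start_date: str, end_date: str) -> str:
    """Count rows in an ingestion-time partitioned table for [start_date, end_date)."""
    # Plain string formatting is fine for a fixed example, but is not safe
    # for untrusted input.
    return (
        f"SELECT COUNT(*) AS row_count FROM `{table}` "
        f"WHERE _PARTITIONTIME >= TIMESTAMP('{start_date}') "
        f"AND _PARTITIONTIME < TIMESTAMP('{end_date}')"
    )

def wildcard_query(table_prefix: str, first_suffix: str, last_suffix: str) -> str:
    """Count rows across date-sharded tables whose names share a prefix."""
    return (
        f"SELECT COUNT(*) AS row_count FROM `{table_prefix}*` "
        f"WHERE _TABLE_SUFFIX BETWEEN '{first_suffix}' AND '{last_suffix}'"
    )

# Hypothetical names: one partitioned table, and shards named events_YYYYMMDD.
partition_sql = partition_query("my_project.my_dataset.events", "2019-01-01", "2019-01-04")
wildcard_sql = wildcard_query("my_project.my_dataset.events_", "20190101", "20190103")
print(partition_sql)
print(wildcard_sql)
```

Unlike LIMIT, filters on these pseudo-columns reduce the amount of data scanned, because whole partitions or shards are skipped.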
BQ: At what levels can you control access?
Project, dataset, and table or view
BQ: Can you limit access to a table?
Yes. Access can be granted on an individual table or view. Older guidance treated the dataset as the most granular level of access