7_BigQuery Flashcards
BigQuery Basics
- Petabyte-scale, serverless, highly scalable cloud enterprise data warehouse.
- In-memory BI Engine (BigQuery BI Engine).
- Accepts batch and streaming loads.
- Machine learning capabilities (BigQuery ML).
- Support for geospatial data storage and processing.
BigQuery Key Features
- High Availability: Automatic data replication to multiple locations for high availability.
- Supports Standard SQL: BigQuery supports a standard SQL dialect that is ANSI SQL:2011 compliant.
- Federated Data: BigQuery is able to connect to, and process, data stored outside of BigQuery (external sources).
- Automatic Backups: BigQuery automatically replicates data and keeps a seven-day history of changes.
- Governance and Security: Fine-grained identity and access management. Data encrypted at rest and in transit.
- Separation of Storage and Compute: Leads to a number of benefits including ACID-compliant storage operations, cost-effective scalable storage and stateless resilient compute.
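As an illustration of the standard SQL dialect, a minimal aggregate query against one of Google's public datasets (the table name is an assumption and may have been superseded by a newer snapshot):

```sql
-- Top 10 most common Texas baby names in a BigQuery public dataset
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name
ORDER BY total DESC
LIMIT 10;
```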
BigQuery Structure
- Project: contains users and datasets.
  - Use a project to:
    - Limit access to datasets and jobs
    - Manage billing
- Dataset: contains tables/views
  - Access Control Lists for Reader/Writer/Owner
  - Applied to all tables/views in dataset
- Table: collection of columns
  - Columnar storage
  - Views are virtual tables defined by a SQL query
  - Tables can be external (e.g. on Cloud Storage)
- Job: long-running action/query that is run in BigQuery on your behalf (asynchronously)
  - Can be cancelled
BigQuery Jobs
- Jobs (queries) can scale up to thousands of CPUs across many nodes, but the process is completely invisible to the end user.
- Storage and compute are separated, connected by a petabit-scale network.
- 4 types of jobs:
  - Load
  - Export
  - Query
  - Copy
- Query job priority:
  - Interactive: Default. Queries are executed as soon as possible. Results are always saved to either a temporary or permanent table.
  - Batch: Queries are executed when idle resources are available. If not executed within 24 hours, the priority is changed to interactive.
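A sketch of requesting each priority from the bq CLI (dataset and query are placeholders); `--batch` is the flag that requests batch priority:

```sh
# Interactive (default) priority
bq query --nouse_legacy_sql 'SELECT COUNT(*) FROM mydataset.mytable'

# Batch priority: runs when idle slots are available
bq query --batch --nouse_legacy_sql 'SELECT COUNT(*) FROM mydataset.mytable'
```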
Table Storage
- Capacitor columnar data format
  - Each column in a separate, compressed, encrypted file that is replicated 3+ times
  - No indexes, keys or partitions required
  - For immutable, massive datasets
- Separates records into column values, stores each value on a different storage volume.
- Traditional RDBMS stores whole record on one volume.
- Extremely fast read performance, poor write (update) performance - BigQuery does not update existing records.
- Not transactional.
- Table schemas specified at creation of table or at data load.
BigQuery Capacitor
Capacitor storage system
- Proprietary columnar data storage that supports semi-structured data (nested and repeated fields).
- Each value stored together with a repetition level and a definition level.
- Repetition level: the level of the nesting in the field path at which the repetition is happening.
- Definition level: how many optional/repeated fields in the field path have been defined.
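The repetition/definition-level encoding comes from Dremel, the model Capacitor builds on. A simplified sketch of striping one column path (assuming every intermediate field in the path is repeated and the leaf is required within its immediate parent; Capacitor's real encoding is proprietary):

```python
def stripe_column(record, path):
    """Emit (value, repetition_level, definition_level) tuples for one
    column path, Dremel-style. Intermediate fields are repeated (lists
    of dicts); the leaf is required within its immediate parent."""
    out = []

    def walk(node, depth, rep):
        field = path[depth]
        if depth == len(path) - 1:
            # Leaf reached: definition level = number of repeated ancestors defined
            out.append((node[field], rep, depth))
            return
        children = node.get(field, [])
        if not children:
            # Path ends early: only `depth` repeated fields were defined
            out.append((None, rep, depth))
            return
        for i, child in enumerate(children):
            # First child inherits the caller's repetition level; later
            # children repeat at this field, the (depth+1)-th repeated one
            walk(child, depth + 1, rep if i == 0 else depth + 1)

    walk(record, 0, 0)
    return out

# Record r1 from the Dremel paper; column path Name.Language.Code
r1 = {"Name": [
    {"Language": [{"Code": "en-us"}, {"Code": "en"}]},
    {},
    {"Language": [{"Code": "en-gb"}]},
]}
print(stripe_column(r1, ["Name", "Language", "Code"]))
# [('en-us', 0, 2), ('en', 2, 2), (None, 1, 1), ('en-gb', 1, 2)]
```

Reading the output: `'en'` has repetition level 2 because the repeat happened at `Language` (the second repeated field in the path), while `'en-gb'` has repetition level 1 because the repeat happened back at `Name`.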
IAM
- Control by project, dataset, table/view (table level since June 2020)
- Roles:
  - Basic: grant access at Project level
    - Owner, Editor, Viewer
  - Predefined:
    - Admin, User: Project level
    - Data Owner, Data Editor, Data Viewer: Table/View level
  - Custom: user managed
Denormalisation
- BigQuery performance optimised when data is denormalised appropriately
- Nested and repeated columns
- Maintain data relationships in an efficient manner
- RECORD (STRUCT) data type
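A sketch of a nested and repeated column using the RECORD (STRUCT) type, with UNNEST to flatten it at query time (table and field names are hypothetical):

```sql
CREATE TABLE mydataset.orders (
  order_id STRING,
  items ARRAY<STRUCT<sku STRING, qty INT64>>  -- repeated RECORD column
);

-- Flatten the repeated column with UNNEST
SELECT order_id, item.sku, item.qty
FROM mydataset.orders, UNNEST(items) AS item;
```

This keeps each order's items physically co-located with the order row, avoiding the join a normalised schema would need.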
Data Ingest Solutions: CLI, web UI, or API
Load and Read Data
Load:
- CSV
- JSON (Newline delimited)
- Avro - best for compressed files
- Parquet
- ORC
- Datastore backups/exports
Read:
- From external source:
- Cloud Storage
- Cloud Bigtable
- Google Drive
- Data format:
- CSV
- JSON (Newline delimited)
- Avro
- Parquet
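A sketch of a load job from the bq CLI (bucket, dataset and schema file are placeholders):

```sh
# Load newline-delimited JSON from Cloud Storage into a table
bq load --source_format=NEWLINE_DELIMITED_JSON \
  mydataset.mytable \
  gs://mybucket/data.json \
  ./schema.json
```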
Three ways of loading data into BigQuery
External Sources
Use cases:
- Load and clean data in one pass from external, then write to BigQuery
- Small amounts of frequently changing data to join with other tables
Limitations
- No guarantee of consistency
- Lower query performance
- Cannot use TableDataList API method
- Cannot run export jobs on external data
- Cannot reference in wildcard table query
- Cannot query Parquet or ORC formats
- Query results not cached
- Limited to 4 concurrent queries
- Cannot import directly from Cloud SQL, need to build a pipeline for that
https://cloud.google.com/bigquery/external-data-sources
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
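The federated-query route to Cloud SQL described in the second link looks roughly like this (the connection ID is an assumption and must be created beforehand):

```sql
-- Run a query inside Cloud SQL and consume the result in BigQuery
SELECT id, name
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT id, name FROM customers;');
```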
Data Transfer Service
- Imports data into BigQuery from Google SaaS applications, chiefly advertising products:
- Google AdWords
- DoubleClick
- YouTube reports
https://cloud.google.com/bigquery-transfer/docs/introduction
Exporting Tables
- Can only export to Cloud Storage
- Can copy table to another BigQuery dataset
- Export formats: CSV, JSON, Avro
- Can export multiple tables with command line
- Can only export up to 1 GB per file, but can split into multiple files using wildcards
- Use Dataflow to read data from BigQuery instead of manually exporting it if you need scheduled exports.
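A sketch of a wildcard export with the bq CLI (names are placeholders); the `*` lets BigQuery shard the output into multiple files when the table exceeds the 1 GB per-file limit:

```sh
# Export a table to sharded CSV files in Cloud Storage
bq extract --destination_format=CSV \
  mydataset.mytable \
  'gs://mybucket/exports/mytable-*.csv'
```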
Schema Modification
BigQuery natively supports the following schema modifications:
- Adding columns to a schema definition
- Relaxing a column’s mode from REQUIRED to NULLABLE
It is valid to create a table without defining an initial schema and to add a schema definition to the table at a later time.
All other schema modifications are unsupported and require manual workarounds, including:
- Changing a column’s name
- Changing a column’s data type
- Changing a column’s mode (aside from relaxing REQUIRED columns to NULLABLE)
- Deleting a column
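The two natively supported modifications can be expressed in BigQuery DDL (table and column names are hypothetical):

```sql
-- Add a new NULLABLE column
ALTER TABLE mydataset.mytable
  ADD COLUMN notes STRING;

-- Relax a REQUIRED column to NULLABLE
ALTER TABLE mydataset.mytable
  ALTER COLUMN customer_id DROP NOT NULL;
```

Everything else in the unsupported list requires a workaround such as recreating the table from a query result.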