BigQuery Flashcards
What is BigQuery?
- Serverless, highly-scalable, low-cost enterprise data warehouse
- No infrasturcture to manage, focus on analyzing data and being productive
- Allows capture and analysis of data in real-time using powerful streaming ingestion
- Free up to 1TB of data analyzed per month and 10GB data stored BigQuery is Google’s fully managed, petabyte scale, low cost analytics data warehouse.
BigQuery is NoOps—there is no infrastructure to manage and you don’t need a database administrator—so you can focus on analyzing data to find meaningful insights, use familiar SQL, and take advantage of our pay-as-you-go model.
How do the standard data-warehouse concepts map to BigQuery?
1) Data warehouse: BigQuery service replaces infrastructure, serves as collective home for all analytical data in an organization
2) Data mart: Datasets are collections of tables divided along business lines or anlystical domain, each is tied to a GCP projects
3) Data lake: Data may be in Cloud Storage or Google Drive, or transactional data in Cloud Bigtable. Big query can define a schema and issue queries directly on external data as federated data sources.
4) Tables and views: function the same as traditional data warehouse
5) Grants: Google Cloud Identity and Access Management (IAM) used to grant permission to perform specific actions in BigQuery
What are ‘datasets’ in BigQuery?
1) Organizes tables into units called datasets
2) One project can have multiple datasets, which in turn can have multiple tables
3) To reference a table in code or sql, goes through the construct ‘project.dataset.table’
How does provisioning and system sizing work in BigQuery?
- BigQuery allocates storage and query resources dynamically based on usage patterns
- Storage resources are allocated as you consume them and deallocated as you remove data or drop tables
- Query resources are allocated according to the query type and complexity. Each query uses a number of ‘slots’, which are units of computation that use a certain amount of CPU and RAM. Default is 2,000 slots for query operations.
How is storage managed in BigQuery?
- BigQuery stores data in a proprietary columnar format called ‘Capacitator’, designed to evolve and optimize query execution.
- Data is stored on Google’s distributed file-system, called Colossus, which ensures durability via storing redundant chunks on multiple physical disks.
- Data is also replicated to multiple data-centers. - Federated data sources allows you to use BigQuery on data outside BigQuery sorage, such as data stored in Cloud Storage, Drive or Bigtable These are not optimized for BigQuery, so may perform slower.
How does BigQuery maintenance impact me?
- fully managed, updates and maintenance are taken care of for you, updates require no downtime or impacts performance
What backup and recovery features does BigQuery provide?
- Maintains complete 7-day history of changes against tables, allows for querying a point-in-time snapshot of your data - Easily revert changes without performing recovery from backups - Note: Explicit table deletions flush history after 2-days
What should you consider when ‘Managing Worfklows’ with BigQuery?
- Organizing datasets
- Granting permissions via Cloud IAM and roles
- Onboarding, greatly accelerated productivity, centralized view of assets, file collaboration
- Managing workloads and concurrency:
- BigQuery limits the maximum incoming requests on a per-project basis based on policy and quota limitations
- Offers two types of query priorities:
- interactive - query is executed as soon as possible, counts toward query quotas
- batch - are queued and executed as soon as idle resources are available, usually within a few minutes
- Monitoring and auditing: Monitor with Stackdriver based on BigQuery metrics. BigQuery auto-creates audit logs of user actions, exportable to another BigQuery dataset for data analysis.
What should you consider when ‘Managing Data’ for BigQuery?
- schema design
- denormalization
- partitioning
- loading data into BigQuery
How does BigQuery support for queries?
- supports standard SQL queries and is compatible with ANSI SQL 2011
- SQL support extended to support nested and repeatable fields
- Support for interactive and automated queries, optimization, external resources, user-defined functions and query sharing
What is the difference between BigQuery and Bigtable?
The difference is basically this:
BigQuery is a query Engine for datasets that don’t change much, or change by appending. It’s a great choice when your queries require a “table scan” or the need to look across the entire database. Think sums, averages, counts, groupings. BigQuery is what you use when you have collected a large amount of data, and need to ask questions about it.
BigTable is a database. It is designed to be the foundation for a large, scaleable application. Use BigTable when you are making any kind of app that needs to read and write data, and scale is a potential issue.
What is a BigQuery statement made up of?
A series of tokens, which include:
- Identifiers
- names for db objects
- must start with a letter or underscore
- subsequent chars can be letters. numbers or underscores
- Quoted Identifiers
- enclosed with backticks, may contain any character
- uses same escaping as Literals
- Literals
- A constant value of a built-in data type
- Not all data types can be expressed as literals
- String and byte literals must be quoted with single, double or triple double-quotes OR single-quotes
- Keywords
- backticks must be used for any reserved keywords
- Keywords or case INsensitive
- Operators
- special characters or keywords which do manipulations and return results
- All operators return NULL when one of the operands is NULL
- All operators will throw an error if the computation result overflows
- Examples: . [] - ~ * / + - << >> & ^ = < > <== etc…
- Special Characters
- Single-line comments use # or –
- Multi-line comments use /* and */
What are the standard sql BigQuery data types?
- Integer: no decimal points; INT64, 8 bytes
- Floating point: FLOAT64, 8 bytes of double precision
- Boolean: keywords TRUE or FALSE
- String: variable length character data, UTF-8 encoded
- Bytes: variable length binary data;
- Date: logical calendar date, no timezone
- Datetime: a point in time, civil time (what you would see on watch)
- Time: just time, independent of date
- Timestamp: absolute point in time, microsecond precision, no timezone, derfault is UTC
- Array: Array<t> Ordered list of zero or more elements of any non-ARRAY type.; types are declared using angle brackets <>, complex types, may not contain NULL 'elements', arrays of arrays are not allowed; a NULL array response is translated to empty but are two distinct values</t>
- Struct: STRUCT<t> Container of ordered fields each with a type (required) and field name (optional).; types declared using angle brackets <> and may be complex types</t>
What are the ways to construct an Array in BigQuery?
- Array literals: SELECT [1,2,3] from numbers;
- Array keyword: SELECT Array<float64>[1,2,3] as floats;
</float64><ul>
<li>not required to declare for most data types like INT64 or STRING</li>
</ul></float64> - Empty Array of a specific type: Array<t>[] pr just [] and BigQuery will attempt to infer it from the data. If it can not, it will default to INT64.</t>
- Generated values: GENERATE_ARRAY function: SELECT GENERATE_ARRAY(11,32,2) as odds; the last param is a step value, which can be negative to go in descending order. There is also GENERATE_DATE_ARRAY that takes an date range and interval params.