ACG Notes Flashcards
3 V’s of Big Data
- ) Volume – Scale of data being handled by systems (can it be handled by a single server?)
- ) Velocity – The speed at which data is being processed
- ) Variety – The diversity of data sources, formats, and quality
What is a Data Warehouse?
- ) Data Warehouse
a. Structured and/or processed
b. Ready to use
c. Rigid structures – hard to change, may not be the most up to date either
What is a data lake?
- ) Data Lake
a. Raw and/or unstructured
b. Ready to analyze – more up to date but requires more advanced tools to query
c. Flexible – no structure is enforced
4 Stages of a Data Pipeline
- ) Ingestion
- ) Storage
- ) Processing
a. ETL – Data is taken from a source, manipulated to fit the destination
b. ELT – data is loaded into a data lake and transformations can take place later
c. Common transformations: Formatting / Labeling / Filtering / Validating (a minimal sketch follows this list)
- ) Visualization
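A minimal Python sketch of these transformation types; the record fields and labels are made up purely for illustration:

```python
# Hypothetical raw records arriving from an ingestion step.
raw_records = [
    {"user": " Alice ", "amount": "42.50", "country": "US"},
    {"user": "", "amount": "n/a", "country": "CA"},
]

def format_record(rec):
    # Formatting: normalize whitespace.
    return {"user": rec["user"].strip(), "amount": rec["amount"], "country": rec["country"]}

def is_valid(rec):
    # Validating: drop records with missing users or non-numeric amounts.
    try:
        float(rec["amount"])
    except ValueError:
        return False
    return bool(rec["user"])

def label_record(rec):
    # Labeling: tag each record with a region label.
    rec["region_label"] = "north_america" if rec["country"] in ("US", "CA") else "other"
    return rec

formatted = [format_record(r) for r in raw_records]
clean = [label_record(r) for r in formatted if is_valid(r)]  # Filtering via is_valid
print(clean)
```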
Cloud Storage
o Unstructured object storage
o Regional, dual-region, or multi-region
o Standard, Nearline, or Coldline storage classes
o Storage event triggers (Pub/Sub)
o Usually the first step in a cloud data pipeline
Cloud Bigtable
o Petabyte-scale NoSQL database
o High-throughput and scalability
o Wide column key/value data
o Time-series, transactional, IoT data
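A minimal sketch of the wide-column key/value model using the google-cloud-bigtable Python client; the project, instance, table, and column family names here are hypothetical:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("iot-instance")
table = instance.table("sensor-readings")

# Wide-column model: the row key encodes device and timestamp.
row_key = b"device-1234#20240101T120000"
row = table.direct_row(row_key)
row.set_cell("metrics", "temperature", b"21.7")
row.commit()

# Read the row back by key.
read = table.read_row(row_key)
cell = read.cells["metrics"][b"temperature"][0]
print(cell.value)
```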
Cloud BigQuery
o Petabyte-scale analytics DW
o Fast SQL queries across large datasets
o Foundations for BI and AI
o Useful public datasets
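A short sketch of a SQL query against one of the public datasets using the google-cloud-bigquery Python client (assumes default credentials are configured):

```python
from google.cloud import bigquery

client = bigquery.Client()  # project and credentials taken from the environment

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# Run the query and iterate over the result rows.
for row in client.query(query).result():
    print(row.name, row.total)
```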
Cloud Spanner
o Global SQL-based relational database
o Horizontal scalability and HA
o Strong consistency
o Not cheap to run; often used for financial transactions
Cloud SQL
o Managed MySQL, PostgreSQL, and SQL Server instances
o Built-in backups, replicas, and failover
o Does not scale horizontally, but does scale vertically
Cloud Firestore
o Fully managed NoSQL document database
o Large collections of small JSON documents
o Realtime database with mobile SDKs
o Strong consistency
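A minimal sketch of storing and reading a small JSON-style document with the google-cloud-firestore Python client; the collection and document IDs are hypothetical:

```python
from google.cloud import firestore

db = firestore.Client()

# Store a small document (e.g., a shopping cart).
db.collection("carts").document("user-123").set(
    {"items": ["sku-1", "sku-2"], "total": 19.99}
)

# Strongly consistent read of the same document.
snapshot = db.collection("carts").document("user-123").get()
print(snapshot.to_dict())
```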
Cloud Memorystore
o Managed Redis instances
o In-memory DB, cache, or message broker
o Built-in HA
o Vertically scalable by increasing the amount of RAM
Cloud Storage (GCS) at a high level
o Fully managed object storage
o For unstructured data: Images, videos, etc.
o Access via API or programmatic SDK
o Multiple storage classes
o Instant access in all classes, also has lifecycle management
o Secure and durable (HA and maximum durability)
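A minimal sketch of programmatic access with the google-cloud-storage Python client; the bucket and object names are hypothetical:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-unstructured-data")

# Upload an image as an object.
blob = bucket.blob("images/cat.png")
blob.upload_from_filename("cat.png")

# Download it again later.
blob.download_to_filename("/tmp/cat.png")
```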
GCS Concepts, what is GCS, where can buckets be?
o A bucket is a logical container for an object
o Buckets exist within projects (named within a global namespace)
o Buckets can be:
o Regional $
o Dual-regional $$ (HA)
o Multi-regional $$$ (replicated across multiple regions within a large geographic area; highest availability and low latency for geo-distributed access)
4 GCS Storage Classes
o Standard
o Nearline
o Coldline
o Archive
Describe standard GCS Storage Class
$0.02 per GB
99.99% regional availability
>99.99% availability in multi and dual-regions
Describe Nearline GCS Storage Class
30 day minimum storage
$0.01 per GB up / down
99.9% regional availability
99.95% availability in multi and dual regions
Describe Coldline GCS Storage Class
90 day minimum storage
$0.004 per GB stored
$0.02 per GB up/down
99.9% regional availability
99.95% availability in multi and dual regions
Describe Archive GCS Storage Class
365 day minimum storage
$0.0012 per GB stored
$0.05 per GB up/down
99.9% regional availability
99.95% availability in multi and dual regions
Objects in cloud storage (encryption, changes)
o Encrypted in flight and at rest
o Objects are immutable; to change one you must overwrite it (an atomic operation)
o Objects can be versioned
name the 5 “advanced” features of GCS
o Parallel uploads of a single object
o Integrity checking – pre-calculate an MD5 hash, which is compared to the one Google calculates
o Transcoding for compression
o Requester Pays option, if desired
o Pub/Sub notifications
New files are common triggers for a data pipeline
What is Cloud Transfer Service?
o Transfers from a source to a sink (a bucket); supported sources: S3, HTTP, Cloud Storage
o Transfers can be filtered based on names/dates
Schedule it for one run or periodically (can delete in source or destination after transfer is confirmed)
What is BigQuery Data Transfer Service?
o Automates data transfer to BigQuery
o Data is loaded on a regular basis
o Backfill can recover from gaps or outages
o Supported sources: Cloud Storage, Merchant Center, Google Play, S3, Teradata, Redshift
What is a transfer appliance?
A physical, rackable storage device for offline data transfer, available in 100 TB and 480 TB versions
What are the top 3 features of Cloud SQL?
o Managed SQL instances (creation, replication, backups, patches, updates)
o Multiple DB engines (MySQL, PostgreSQL, SQL Server)
o Scalability – vertically to 64 cores and 416 GB of RAM, HA options are available
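A minimal sketch of querying a PostgreSQL Cloud SQL instance with psycopg2, assuming something like the Cloud SQL Auth Proxy is running locally; the database name, table, and credentials are hypothetical:

```python
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1",   # local proxy endpoint (assumption)
    port=5432,
    dbname="orders",
    user="app_user",
    password="app_password",
)

# Run a parameterized query inside a transaction.
with conn, conn.cursor() as cur:
    cur.execute("SELECT id, status FROM orders WHERE status = %s LIMIT 10", ("shipped",))
    for order_id, status in cur.fetchall():
        print(order_id, status)

conn.close()
```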
Describe regional configuration of Cloud Spanner
o Regional Replication:
3 read-write replicas
Every mutation requires a write quorum
This is different from traditional HA in that it’s a read AND a write replica in each zone.
Regional Cloud Spanner best practices
o Design a performant schema
o Spread reads/writes around the database, avoid write hot spots
o Co-locate compute workloads in the same region
o Provision nodes to keep average CPU utilization under 65%
Multi-regional Cloud Spanner benefits
Five 9s SLA – 99.999%
Reduce latency with distributed data
External consistency
• Concurrency control for transactions; this guarantees transactions are executed in a globally consistent, serial order, even across the globe
Multi-regional Cloud Spanner best practices:
Design a performant schema to avoid hotspots
Co-locate write-heavy compute workloads in the same region as the leader
Spread critical workloads across two regions
Provision nodes to keep average CPU under 45%
Cloud Spanner data model
• Data model:
o Relational database tables
o Strongly typed (you must conform to a strict schema)
o Parent-child relationships, declared with primary keys, which create interleaved tables
Cloud Spanner transactions
• Transactions:
o Locking read-write
o Read-only
o Partitioned DML
o Regular transactions using ANSI SQL best practices
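A minimal sketch of a read-only snapshot query and a locking read-write transaction with the google-cloud-spanner Python client; the instance, database, table, and column names are hypothetical:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("prod-instance").database("finance-db")

# Read-only: strongly consistent snapshot query.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts LIMIT 5")
    for account_id, balance in rows:
        print(account_id, balance)

# Locking read-write transaction, executed with external consistency.
def transfer(transaction):
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance - 10 WHERE AccountId = 'A1'"
    )
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + 10 WHERE AccountId = 'A2'"
    )

database.run_in_transaction(transfer)
```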
Top 2 features of Cloud MemoryStore
o Fully managed Redis instance
o 2 tiers:
Basic tier – make sure your app can withstand full data flush
Standard tier – adds cross zone replication and automatic failover
Benefits of managed Redis (MemoryStore)
o No need to provision VMs
o Scale instances with minimal impact
o Private IPs and IAM
o Automatic replication and failover
Cloud MemoryStore use cases
o Session cache – store logins or shopping carts
o Message queue – loosely couple micro services
o Pub/sub pattern – Redis supports publish/subscribe, but also consider Cloud Pub/Sub for this
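A minimal session-cache sketch with the redis-py client, assuming a reachable Memorystore private IP; the IP address and key names are hypothetical:

```python
import redis

# Connect to the Memorystore instance over its private IP (assumption).
r = redis.Redis(host="10.0.0.3", port=6379)

# Cache a shopping cart for 30 minutes.
r.setex("session:user-123:cart", 1800, "sku-1,sku-2")

# Later, fetch it back (returns None if it has expired).
cart = r.get("session:user-123:cart")
print(cart)
```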
Storage options: Low latency vs. Warehouse
Low latency (use Cloud Bigtable)
- petabyte scale
- single-key rows
- Time series or IoT data
Warehouse (use BigQuery)
- Petabyte scale
- Analytics warehouse
- SQL queries
Storage options: Horizontal vs. Vertical scaling
Horizontal scaling (Cloud Spanner)
- ANSI SQL
- Global replication
- High Availability and consistency
Vertical Scaling (use Cloud SQL)
- MySQL or PostgreSQL
- Managed service
- High availability
Storage options: NoSQL vs Key/Value
NoSQL (use Cloud Firestore)
- Fully managed document database
- Strong consistency
- Mobile SDKs and offline data
Key/Value (use Cloud Memorystore)
- Managed Redis instances
- Standard Redis functionality (in-memory cache and data structures)
What is MapReduce
A distributed implementation of the map and reduce programming model; provides a common interface for programming these operations while abstracting away the systems management
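A single-machine Python sketch of the map/shuffle/reduce shape of the computation; frameworks like Hadoop distribute exactly this pattern across a cluster:

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```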
4 Core modules of Hadoop & HDFS
Hadoop Common – base files
Hadoop Distributed File System (HDFS) – distributed fault tolerant file system
Hadoop YARN – resource management / job scheduling
Hadoop MapReduce – Hadoop's own implementation of the MapReduce programming model
Apache Pig
language for analyzing large datasets, essentially an abstraction for MapReduce
o High level framework for running MapReduce jobs on Hadoop clusters
Apache Spark
general purpose cluster-computing framework
o Has largely replaced Hadoop MapReduce; can run on top of Hadoop (YARN/HDFS) and is much faster
Hadoop stores data in blocks on disk before, during, and after computation
Spark stores data in memory, enabling parallel operations on that data
Hadoop vs. Spark
Hadoop
- Slow disk storage
- High latency
- Used for: slow, reliable batch processing
Spark
- Fast memory storage
- Low latency
- Stream processing
- 100x faster in-memory
- 10x faster on disk
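A minimal PySpark sketch of the same word-count idea, where intermediate data stays in memory; the input path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Parallel map/reduce over an in-memory RDD of lines.
counts = (
    spark.sparkContext.textFile("gs://my-bucket/input/*.txt")
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```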
Apache Kafka
distributed streaming platform, designed for high-throughput and low-latency pub/sub stream of records
o Handles >800 billion messages per day at LinkedIn
Kafka vs. Pub/Sub
Kafka
- Guaranteed message ordering
- Tuneable message retention
- Polling (Pull) subscriptions only
- Unmanaged
Pub/Sub
- No message ordering guarantee
- 7 day maximum message retention
- Pull or Push subscriptions
- Managed
Top 2 Benefits of Pub/Sub
o Global messaging and event ingestion
o Serverless and fully managed, processes up to 500 million messages per second
Top 4 features of Pub/Sub
o Multiple pub/sub patterns, one to many, many to one, and many to many
o At least once delivery is guaranteed
o Can process messages in real time or in batch; retries use exponential backoff
o Integrates with Cloud Dataflow
Pub/Sub use cases
o Distributing workloads
o Asynchronous workflows – order processing: ordering, packaging, shipping
o Distributing Event Notifications
o Distributed Logging
o Device Data Streaming
2 types of delivery method for pub/sub subscriptions:
o Pull (default) – ad-hoc requests; messages must be acknowledged, or they will remain at the top of the queue and you won't get the next message
o Push – sends new messages to an endpoint; must be HTTPS with a valid certificate
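A minimal sketch of publishing and pull-based consumption with the google-cloud-pubsub Python client; the project, topic, and subscription IDs are hypothetical:

```python
from google.cloud import pubsub_v1

project_id = "my-project"

# Publish a message to a topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "orders")
publisher.publish(topic_path, b"order-123 shipped").result()

# Pull messages and acknowledge them so they are not redelivered.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "orders-sub")
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})

ack_ids = []
for received in response.received_messages:
    print(received.message.data)
    ack_ids.append(received.ack_id)

if ack_ids:
    subscriber.acknowledge(request={"subscription": subscription_path, "ack_ids": ack_ids})
```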
Pub/Sub integration facts
o Fully supported by Cloud Dataflow
o Client libraries for popular languages (e.g., Python)
o Cloud Functions can be triggered by events
o Cloud Run to be the receiver of a push sub
o IoT Core
Pub/Sub delivery model
o You may receive a message more than once in a single subscription
o Message Retention Duration (default 7 days) – undelivered messages are deleted
Pub/Sub lifecycle – when does a sub expire?
If there are no pulls or pushes, subscriptions expire after 31 days
Standard pub/sub model limitations
o Acknowledged messages are no longer available to subscribers
o Every message must be processed by a subscription
Pub/Sub: Seek
o Seek – you can rewind the clock and retrieve old messages up to the retention window; you can also use this to seek to a point in the future
Useful in case of an outage most commonly
Pub/Sub: Snapshot
o Snapshot – save the current state of the queue, this enables replay
Useful if you are deploying new code and are unsure how it will behave: take a snapshot, move forward, then seek back to the snapshot to replay the messages
Pub/Sub: Ordering messages
o Ordering messages – Use timestamps when final order matters; order still isn't guaranteed, but you do have a record of when each message occurred.
If you absolutely must guarantee order, consider an alternative system
Pub/Sub: Access Control
o Use service accts for authorization, granting per-topic or per-subscription permissions
Grant limited access to publish or consume messages
Define: Cloud Dataflow
• Cloud Dataflow – fully managed, serverless ETL tool, using Apache Beam.
o Supports:
SQL, Java, and Python
Real-time and batch processing
Define: Pipeline Lifecycle
• You can run pipelines on your local machine; this is the preferred way to fix bugs
• Pipeline design considerations:
o Location of data
o Input data structure and format
o Transformation objectives
o Output data structure and location
Dataflow: ParDo
• ParDo – defines the distributed operation/transformation to be performed on the PCollection of data. These can be user-defined or pre-defined functions.
Dataflow: PCollections (Characteristics)
o Data types – may be of any data type but must all be of the same type. The SDK includes built-in encoding
o Access – individual access to elements is not supported, transforms are performed on all
o Immutable – cannot be changed once created
o Boundedness – a PCollection may be bounded (fixed size) or unbounded (streaming); there is no limit to the number of elements it can contain
o Timestamp – associated with every element of the collection, assigned by the source on creation
Dataflow: Core Beam transforms (6)
o ParDo – generic parallel processing transforms
o GroupByKey – processing collections of KVP’s
o CoGroupByKey – used when combining multiple key collections, performs relational join
o Combine – requires you to provide a function containing the combining logic; multiple pre-built functions are available (sum, min, max, ...) – see the sketch after this list
o Flatten – Multiple collections become one
o Partition – how the elements of the PCollection are split up
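A minimal Apache Beam (Python SDK) sketch exercising ParDo, GroupByKey, and a pre-built combine; it runs locally with the DirectRunner and the sample data is made up:

```python
import apache_beam as beam

class ExtractWords(beam.DoFn):
    # ParDo: a user-defined per-element transform emitting (word, 1) pairs.
    def process(self, line):
        for word in line.split():
            yield (word, 1)

with beam.Pipeline() as p:
    pairs = (
        p
        | "Create" >> beam.Create(["to be or not to be"])
        | "ExtractWords" >> beam.ParDo(ExtractWords())
    )

    # GroupByKey: collects values per key -> (word, [1, 1, ...])
    grouped = pairs | "GroupByKey" >> beam.GroupByKey()
    grouped | "PrintGrouped" >> beam.Map(print)

    # Combine: pre-built sum combiner per key -> (word, count)
    counts = pairs | "CombinePerKey" >> beam.CombinePerKey(sum)
    counts | "PrintCounts" >> beam.Map(print)
```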
Dataflow Security Mechanisms
o Only users with permission can submit pipelines
o Any temp data during execution is encrypted
o Any communication between workers happens on a private network
o Access to telemetry or metrics is controlled by project permissions
Describe GCP Service Account usage
Cloud Dataflow service uses the Dataflow Service Account
Account is automatically created on flow creation
Manipulates job resources on your behalf
Assumes the "Cloud Dataflow service agent" role
Read/write access to project resources (recommended not to change this)
Worker instances will use the Controller service account
Used for metadata operations (ex: determine size of file on storage)
You can also use a user-managed controller service account, enabling fine-grained access control
Dataflow: Regional Endpoints
• Regional Endpoints – specifying a regional endpoint means all the worker instances will stay in that region; this is best for:
o Security and compliance
o Data locality
o Resiliency
What use case does Dataflow address
o Used for migrating MapReduce jobs to Cloud Dataflow
Define: Cloud Dataflow SQL
o Develop and run Cloud Dataflow jobs from the BigQuery web UI