GCP Data Engineer Quick Terms Flashcards

1
Q

3 V’s of Big Data

A

1. Volume – scale of data being handled by systems (can it be handled by a single server?)
2. Velocity – speed at which it's being processed
3. Variety – the diversity of data sources, formats, and quality

2
Q

What is a Data Warehouse?

A

Data Warehouse:
- Structured and/or processed
- Ready to use
- Rigid structure – hard to change, and may not be the most up to date

3
Q

What is a data lake?

A

Data Lake:
- Raw and/or unstructured
- Ready to analyze – more up to date, but requires more advanced tools to query
- Flexible – no structure is enforced

4
Q

4 Stages of a Data Pipeline

A

1. Ingestion
2. Storage
3. Processing
   - ETL – data is taken from a source and manipulated to fit the destination
   - ELT – data is loaded into a data lake and transformations can take place later
   - Common transformations: formatting / labeling / filtering / validating
4. Visualization

5
Q

Cloud Storage

A

- Unstructured object storage
- Regional, dual-region, or multi-region
- Standard, Nearline, Coldline, or Archive storage classes
- Storage event triggers (Pub/Sub) – usually the first step in a cloud data pipeline

6
Q

Cloud Bigtable

A

- Petabyte-scale NoSQL database
- High throughput and scalability
- Wide-column key/value data
- Time-series, transactional, and IoT data

7
Q

Cloud BigQuery

A

- Petabyte-scale analytics data warehouse
- Fast SQL queries across large datasets
- Foundation for BI and AI
- Useful public datasets

8
Q

Cloud Spanner

A

- Global SQL-based relational database
- Horizontal scalability and HA
- Strong consistency
- Not cheap to run; commonly used for financial transactions

9
Q

Cloud SQL

A

- Managed MySQL, PostgreSQL, and SQL Server instances
- Built-in backups, replicas, and failover
- Scales vertically, not horizontally

10
Q

Cloud Firestore

A

- Fully managed NoSQL document database (see the client sketch below)
- Large collections of small JSON documents
- Realtime database with mobile SDKs
- Strong consistency
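
A minimal sketch of writing and reading a document with the google-cloud-firestore Python client; the collection and document IDs are made up for illustration.

    from google.cloud import firestore

    db = firestore.Client()

    # Write a small JSON-like document (collection/doc IDs are hypothetical)
    doc_ref = db.collection("users").document("user123")
    doc_ref.set({"name": "Ada", "cart": ["sku-1", "sku-2"]})

    # Read it back
    print(doc_ref.get().to_dict())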

11
Q

Cloud Memorystore

A

- Managed Redis instances
- In-memory DB, cache, or message broker
- Built-in HA
- Vertically scalable by increasing the amount of RAM

12
Q

Cloud Storage (GCS) at a high level

A

- Fully managed object storage
- For unstructured data: images, videos, etc.
- Access via API or programmatic SDK (see the sketch below)
- Multiple storage classes
- Instant access in all classes; also has lifecycle management
- Secure and durable (HA and maximum durability)
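
A short sketch (Python, google-cloud-storage) of programmatic access plus a lifecycle rule that ages objects into colder classes; the bucket and object names are placeholders, and the lifecycle helper methods are assumed from the current client library.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-example-bucket")          # hypothetical bucket

    # Upload an object via the SDK
    blob = bucket.blob("raw/events-2024-01-01.json")
    blob.upload_from_filename("events-2024-01-01.json")

    # Lifecycle management: demote objects to colder classes as they age
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
    bucket.patch()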

13
Q

GCS Concepts, what is GCS, where can buckets be?

A

- A bucket is a logical container for objects
- Buckets exist within projects (named within a global namespace)
- Buckets can be:
  - Regional $
  - Dual-region $$ (HA)
  - Multi-region $$$ (spans data centers across a large geographic area; lowest latency for widely distributed access)

14
Q

4 GCS Storage Classes

A

- Standard
- Nearline
- Coldline
- Archive

15
Q

Describe standard GCS Storage Class

A

- $0.02 per GB stored
- 99.99% regional availability
- >99.99% availability in multi- and dual-regions

16
Q

Describe Nearline GCS Storage Class

A

- 30-day minimum storage
- $0.01 per GB up/down
- 99.9% regional availability
- 99.95% availability in multi- and dual-regions

17
Q

Describe Coldline GCS Storage Class

A

- 90-day minimum storage
- $0.004 per GB stored
- $0.02 per GB up/down
- 99.9% regional availability
- 99.95% availability in multi- and dual-regions

18
Q

Describe Archive GCS Storage Class

A

- 365-day minimum storage
- $0.0012 per GB stored
- $0.05 per GB up/down
- 99.9% regional availability
- 99.95% availability in multi- and dual-regions

19
Q

Objects in cloud storage (encryption, changes)

A

- Encrypted in flight and at rest
- Objects are immutable – to change one you must overwrite it (an atomic operation)
- Objects can be versioned

20
Q

name the 5 “advanced” features of GCS

A

- Parallel uploads of a single object
- Integrity checking – pre-calculate an MD5 hash and compare it to the one Google calculates
- Transcoding for compression
- Requester pays, if desired
- Pub/Sub notifications – new file commits can trigger a data pipeline

21
Q

What is Cloud Transfer Service?

A

- Transfers from a source to a sink (bucket); supported sources: S3, HTTP, GCP Storage
- Transfers can be filtered based on names/dates
- Schedule a one-time or periodic run (can delete in the source or destination after the transfer is confirmed)

22
Q

What is BigQuery Data Transfer Service?

A

- Automates data transfer to BigQuery
- Data is loaded on a regular basis
- Backfills can recover from gaps or outages
- Supported sources: Cloud Storage, Merchant Center, Google Play, S3, Teradata, Redshift

23
Q

What is a transfer appliance?

A

physical rack storage device, 100 TB and 480 TB versions

24
Q

What are the top 3 features of Cloud SQL?

A

- Managed SQL instances (creation, replication, backups, patches, updates)
- Multiple DB engines (MySQL, PostgreSQL, SQL Server)
- Scalability – vertical, up to 64 cores and 416 GB of RAM; HA options are available

25
Q

Describe regional configuration of Cloud Spanner

A

- Regional replication: 3 read-write replicas
- Every mutation requires a write quorum
- This differs from traditional HA in that there is a read AND write replica in each zone

26
Q

Regional Cloud Spanner best practices

A

- Design a performant schema
- Spread reads/writes around the database; avoid write hot spots
- Co-locate compute workloads in the same region
- Provision nodes to keep average CPU utilization under 65%

27
Q

Multi-regional Cloud Spanner benefits

A

- Five 9s SLA – 99.999%
- Reduced latency with distributed data
- External consistency – concurrency control for transactions that guarantees they are executed sequentially, even across the globe

28
Q

Multi-regional Cloud Spanner best practices

A

- Design a performant schema to avoid hotspots
- Co-locate write-heavy compute workloads in the same region as the leader
- Spread critical workloads across two regions
- Provision nodes to keep average CPU utilization under 45%

29
Q

Cloud Spanner data model

A

Data model:
- Relational database tables
- Strongly typed (you must conform to a strict schema)
- Parent-child relationships declared with primary keys, creating interleaved tables

30
Q

Cloud Spanner transactions

A

Transactions:
- Locking read-write
- Read-only
- Partitioned DML
- Regular transactions using ANSI SQL best practices

31
Q

Top 2 features of Cloud MemoryStore

A

- Fully managed Redis instance
- 2 tiers:
  - Basic tier – make sure your app can withstand a full data flush
  - Standard tier – adds cross-zone replication and automatic failover

32
Q

Benefits of managed Redis (MemoryStore)

A

- No need to provision VMs
- Scale instances with minimal impact
- Private IPs and IAM
- Automatic replication and failover

33
Q

Cloud MemoryStore use cases

A

- Session cache – store logins or shopping carts (see the sketch below)
- Message queue – loosely couple microservices
- Pub/sub – message-queue patterns, but also look at Cloud Pub/Sub
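
A minimal session-cache sketch using the standard redis-py client against a Memorystore instance; the private IP and key names are assumptions.

    import redis

    # Memorystore exposes a private IP inside your VPC (address is hypothetical)
    r = redis.Redis(host="10.0.0.3", port=6379)

    # Cache a shopping cart for one hour
    r.setex("session:user123", 3600, "cart=sku-1,sku-2")
    print(r.get("session:user123"))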

34
Q

Storage options: Low latency vs. Warehouse

A

Low latency (use Cloud Bigtable):
- Petabyte scale
- Single-key rows
- Time-series or IoT data

Warehouse (use BigQuery):
- Petabyte scale
- Analytics warehouse
- SQL queries

35
Q

Storage options: Horizontal vs. Vertical scaling

A

Horizontal scaling (use Cloud Spanner):
- ANSI SQL
- Global replication
- High availability and consistency

Vertical scaling (use Cloud SQL):
- MySQL or PostgreSQL
- Managed service
- High availability

36
Q

Storage options: NoSQL vs Key/Value

A

NoSQL (Firestore):
- Fully managed document database
- Strong consistency
- Mobile SDKs and offline data

Key/Value (Memorystore):
- Managed Redis instances
- Does what Redis does

37
Q

What is MapReduce

A

A distributed implementation of the map and reduce programming model; a common interface for programming these operations while abstracting away all of the systems management.

38
Q

4 Core modules of Hadoop & HDFS

A

- Hadoop Common – base files
- Hadoop Distributed File System (HDFS) – distributed, fault-tolerant file system
- Hadoop YARN – resource management / job scheduling
- Hadoop MapReduce – Hadoop's own implementation of MapReduce

39
Q

Apache Pig

A

A language for analyzing large datasets; essentially an abstraction over MapReduce.
- High-level framework for running MapReduce jobs on Hadoop clusters

40
Q

Apache Spark

A

General-purpose cluster-computing framework.
- Has largely replaced Hadoop's MapReduce (and can run on Hadoop) while being much faster
- Hadoop stores data in blocks on disk before, during, and after computation
- Spark stores data in memory, enabling parallel operations on that data

41
Q

Hadoop vs. Spark

A

Hadoop:
- Slow disk storage
- High latency
- Used for slow, reliable batch processing

Spark:
- Fast in-memory storage
- Low latency
- Stream processing
- 100x faster in memory, 10x faster on disk

42
Q

Apache Kafka

A

Distributed streaming platform designed for high-throughput, low-latency pub/sub streams of records.
- Handles >800 billion messages per day at LinkedIn

43
Q

Kafka vs. Pub/Sub

A

Kafka:
- Guaranteed message ordering
- Tunable message retention
- Polling (pull) subscriptions only
- Unmanaged

Pub/Sub:
- No message ordering guarantee
- 7-day maximum message retention
- Pull or push subscriptions
- Managed

44
Q

Top 2 Benefits of Pub/Sub

A

- Global messaging and event ingestion (see the publish sketch below)
- Serverless and fully managed; processes up to 500 million messages per second
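
A minimal publish sketch with the google-cloud-pubsub Python client; the project and topic names are placeholders.

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "events")   # hypothetical names

    # Publish a message with an attribute; result() returns the server-assigned ID
    future = publisher.publish(topic_path, b'{"order_id": 42}', source="web")
    print(future.result())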

45
Q

Top 4 features of Pub/Sub

A

- Multiple pub/sub patterns: one-to-many, many-to-one, and many-to-many
- At-least-once delivery is guaranteed
- Can process messages in real time or in batch, with exponential backoff
- Integrates with Cloud Dataflow

46
Q

Pub/Sub use cases

A

- Distributing workloads
- Asynchronous workflows – e.g. order processing: order, packaging, shipping
- Distributing event notifications
- Distributed logging
- Device data streaming

47
Q

2 types of delivery method for pub/sub subscriptions:

A

- Pull (default) – ad-hoc requests; messages must be acknowledged or they will remain at the top of the queue and you won't get the next message (see the sketch below)
- Push – sends new messages to an endpoint, which must be HTTPS with a valid certificate
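
A sketch of a pull subscriber that acknowledges what it receives, assuming the google-cloud-pubsub client and made-up project/subscription names.

    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "events-sub")  # hypothetical

    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
    for received in response.received_messages:
        print(received.message.data)

    # Acknowledge, otherwise the messages are redelivered after the ack deadline
    ack_ids = [received.ack_id for received in response.received_messages]
    if ack_ids:
        subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})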

48
Q

Pub/Sub integration facts

A

- Fully supported by Cloud Dataflow
- Client libraries for popular languages (e.g. Python)
- Cloud Functions can be triggered by events
- Cloud Run can be the receiver of a push subscription
- IoT Core

49
Q

Pub/Sub delivery model

A

- You may receive a message more than once in a single subscription
- Message retention duration (default 7 days) – undelivered messages are deleted after this period

50
Q

Pub/Sub lifecycle – when does a sub expire?

A

If there are no pulls or pushes, subscriptions expire after 31 days.

51
Q

Standard pub/sub model limitations

A

- Acknowledged messages are no longer available to subscribers
- Every message must be processed by a subscription

52
Q

Pub/Sub: Seek

A

Seek – you can rewind the clock and retrieve old messages up to the retention window; you can also seek to a point in the future. Most commonly useful in case of an outage.

53
Q

Pub/Sub: Snapshot

A

Snapshot – save the current state of the queue, enabling replay. Useful when deploying new code and you're not sure what it will do: take a snapshot before rolling forward, then seek back to the snapshot to replay if needed (see the sketch below).
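
A sketch of the snapshot-then-seek pattern described above, assuming the SubscriberClient's create_snapshot and seek methods and placeholder resource names.

    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "events-sub")       # hypothetical
    snap_path = subscriber.snapshot_path("my-project", "pre-deploy-snapshot")

    # Save the subscription's current acknowledgement state before a risky deploy
    subscriber.create_snapshot(request={"name": snap_path, "subscription": sub_path})

    # If the new code misbehaves, rewind and replay from the snapshot
    subscriber.seek(request={"subscription": sub_path, "snapshot": snap_path})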

54
Q

Pub/Sub: Ordering messages

A

Ordering messages – use timestamps when final order matters; order still isn't guaranteed, but you have a record of the time. If you absolutely must guarantee order, consider an alternative system.

55
Q

Pub/Sub: Access Control

A

- Use service accounts for authorization, granting per-topic or per-subscription permissions
- Grant limited access to publish or consume messages

56
Q

Define: Cloud Dataflow

A

Cloud Dataflow – fully managed, serverless ETL tool based on Apache Beam.
- Supports SQL, Java, and Python
- Real-time and batch processing

57
Q

Define: Pipeline Lifecycle

A

- You can run pipelines on your local machine; this is the preferred way to fix bugs
- Pipeline design considerations:
  - Location of data
  - Input data structure and format
  - Transformation objectives
  - Output data structure and location

58
Q

Dataflow: ParDo

A

ParDo – defines the distributed operation/transformation to be performed on a PCollection of data. These can be user-defined or pre-defined functions (see the sketch below).
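
A minimal Apache Beam (Python SDK) sketch of a user-defined DoFn applied with ParDo; the element structure and step names are invented for illustration.

    import apache_beam as beam

    class ExtractAmount(beam.DoFn):
        """User-defined function applied to every element of the PCollection."""
        def process(self, element):
            # element is assumed to look like {"user": "a", "amount": 10}
            yield element["amount"]

    with beam.Pipeline() as p:
        (p
         | "Create" >> beam.Create([{"user": "a", "amount": 10},
                                    {"user": "b", "amount": 25}])
         | "ExtractAmount" >> beam.ParDo(ExtractAmount())
         | "Print" >> beam.Map(print))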

59
Q

Dataflow: PCollections (Characteristics)

A

- Data types – elements may be of any data type, but must all be of the same type; the SDK includes built-in encodings
- Access – individual access to elements is not supported; transforms are performed on the whole collection
- Immutability – cannot be changed once created
- Boundedness – no limit to the number of elements a PCollection can contain
- Timestamps – associated with every element of the collection, assigned on creation

60
Q

Dataflow: Core Beam transforms (6)

A

- ParDo – generic parallel processing transform
- GroupByKey – processes collections of key/value pairs
- CoGroupByKey – combines multiple keyed collections; performs a relational join
- Combine – requires a function providing the combining logic; multiple pre-built functions are available (sum, min, max, ...)
- Flatten – multiple collections become one
- Partition – splits the elements of a PCollection into multiple output collections

61
Q

Dataflow Security Mechanisms

A

- Only users with permission can submit pipelines
- Any temp data created during execution is encrypted
- Communication between workers happens on a private network
- Access to telemetry and metrics is controlled by project permissions

62
Q

Describe GCP Service Account usage

A

- The Cloud Dataflow service uses the Dataflow service account
  - Created automatically when the job is created
  - Manipulates job resources on your behalf
  - Assumes the Cloud Dataflow service agent role
  - Read/write access to project resources (recommended not to change this)
- Worker instances use the controller service account
  - Used for metadata operations (e.g. determining the size of a file in Cloud Storage)
  - A user-managed controller service account can also be used, enabling fine-grained access control

63
Q

Dataflow: Regional Endpoints

A

Regional endpoints – specifying a regional endpoint means all worker instances stay in that region. Best for:
- Security and compliance
- Data locality
- Resiliency

64
Q

What use case does Dataflow address

A

Migrating MapReduce jobs to Cloud Dataflow.

65
Q

Define: Cloud Dataflow SQL

A

Develop and run Cloud Dataflow jobs from the BigQuery web UI.

66
Q

What service does Cloud Dataflow integrate with and what are the benefits?

A

Integrates with Apache Beam SQL.
- Apache Beam SQL:
  - Can query bounded and unbounded PCollections
  - Queries are converted to SQL transformations
- Cloud Dataflow SQL benefits:
  - Join streams with BigQuery tables
  - Query streams or static datasets
  - Write output to BigQuery for analysis and visualization

67
Q

What type of client is Dataflow for? (on prem to cloud migration)

A

"Go with the flow" – Dataflow is the ideal solution for customers already using Apache Beam.
- Batch == Dataproc & Spark
- Streaming == Beam and Dataflow

68
Q

Pipelines and PCollections

A

- Pipeline – represents the complete set of stages required to read, transform, and write data using the Apache Beam SDK
- PCollection – represents a multi-element dataset that is processed by the pipeline

69
Q

ParDo and DoFn

A

- ParDo – core parallel processing function of Apache Beam; transforms elements of an input PCollection into an output PCollection and can invoke UDFs
- DoFn – template used to create user-defined functions that are referenced by a ParDo

70
Q

Dataflow windowing

A

Allows streaming data to be grouped into finite collections according to time-based or session-based windows. Useful when you need to impose ordering or constraints on Pub/Sub data (see the sketch below).
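
A sketch of fixed windowing in the Beam Python SDK; the in-memory source and attached timestamps are stand-ins for what would normally arrive from Pub/Sub.

    import time
    import apache_beam as beam
    from apache_beam import window

    with beam.Pipeline() as p:
        (p
         | beam.Create([("sensor-1", 3), ("sensor-1", 5), ("sensor-2", 7)])
         # Attach event timestamps (a streaming source would provide these)
         | beam.Map(lambda kv: window.TimestampedValue(kv, time.time()))
         | "Fixed60s" >> beam.WindowInto(window.FixedWindows(60))
         | beam.CombinePerKey(sum)
         | beam.Map(print))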

71
Q

Dataflow Watermarking

A

Watermark – indicates when Dataflow expects all data in a window to have arrived. Data that arrives with a timestamp inside the window but after the watermark is considered late; policies decide what happens to it.

72
Q

Dataflow vs. Cloud Composer

A

- Dataflow is normally the preferred option for data pipelines
- Composer may sometimes be used for ad-hoc orchestration or to provide manual control of the Dataflow pipelines themselves

73
Q

Dataflow Triggers

A

Triggers determine when to emit aggregated results as data arrives.
- For bounded data, results are emitted after all the input has been processed
- For unbounded data, results are emitted when the watermark passes the end of the window, indicating that the system believes all input data for that window has been processed

74
Q

Define: BigQuery

A

Petabyte-scale, serverless, highly scalable cloud enterprise data warehouse.

75
Q

BigQuery Key Features

A

- Highly available – automatic data replication
- Supports standard SQL – ANSI compliant
- Supports federated data – connects to several external sources
- Automatic backups – automatically replicates data and keeps a 7-day history of changes
- Governance and security – fine-grained IAM
- Separation of storage and compute – ACID compliant, stateless compute

76
Q

BigQuery Data Management Architecture

A

Project → Dataset – a dataset is a container for tables/views; think of it like a database.
- Native table – standard table, where data is held in BigQuery storage
- External table – backed by storage outside of BigQuery
- View (virtual table) – defined by a SQL query

77
Q

BigQuery data ingestion (2 types of sources)

A

- Real-time events – generally streamed using Pub/Sub, then processed with Cloud Dataflow and pushed to BigQuery
- Batch sources – push files to Cloud Storage, then have Cloud Dataflow pick up the data and load it into BigQuery

78
Q

BigQuery: Job

A

An action that is run in BigQuery on your behalf, asynchronously.
- 4 types: Load / Export / Query / Copy

79
Q

BigQuery supported import formats

A

Supported import formats: CSV, JSON, Avro, Parquet, ORC, and Datastore/Firestore exports.

80
Q

BigQuery: Views

A

- Control access to data
- Reduce query complexity
- Can be used to construct logical tables
- Enable authorized views – users can have access to different subsets of rows

81
Q

BigQuery: Limitations of views

A

- Cannot export data from a view
- Cannot use the JSON API to retrieve data from a view
- No UDFs
- Limited to 1,000 authorized views per dataset

82
Q

BigQuery: Supported external data sources

A

Supports Bigtable, Cloud Storage, and Google Drive.
- Use cases:
  - Load and clean data in one pass
  - Small, frequently changing data joined with other tables

83
Q

BigQuery External data sources; limitations

A

Limitations: No guarantee of consistency  Lower query performance Cannot run export jobs on external data Cannot query Parquet or ORC formats Results not cached Limited to 4 concurrent queries

84
Q

BigQuery: 2 methods of partitioning

A

- Ingestion-time partitioned tables
- Partitioned tables

85
Q

BigQuery: Ingestion time partition tables

A

- Partitioned by load or arrival date
- Data is automatically loaded into date-based (daily) partitions
- Tables include the pseudo-column _PARTITIONTIME
- Use _PARTITIONTIME in the WHERE clause to limit the partitions scanned (see the query sketch below)
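
A sketch of limiting the partitions scanned with the _PARTITIONTIME pseudo-column, using the google-cloud-bigquery client; the project, dataset, and table names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT user_id, event_type
        FROM `my-project.my_dataset.events`              -- hypothetical table
        WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01')
                                 AND TIMESTAMP('2024-01-07')
    """
    for row in client.query(sql).result():
        print(row.user_id, row.event_type)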

86
Q

BigQuery: Partitioned tables

A

- Partitioning is based on a specific TIMESTAMP or DATE column
- Data is partitioned based on the value supplied in the partitioning column
- 2 additional partitions: __NULL__ and __UNPARTITIONED__
- Use the partitioning column in queries

87
Q

BigQuery: Why do we care about partitioned tables?

A

- Improved query performance – less data is read/processed
- Cost control – you pay for all the data processed by a query

88
Q

BigQuery: Clustering

A

- Like creating an index on the table
- Supported for both types of partitioning; unsupported for non-partitioned tables
- Create a cluster key on frequently accessed columns (order is important – it's the exact order you'll access the data in; see the DDL sketch below)
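
A sketch of creating a partitioned, clustered table with standard-SQL DDL submitted through the Python client; the table and column names are invented.

    from google.cloud import bigquery

    client = bigquery.Client()
    ddl = """
        CREATE TABLE `my-project.my_dataset.orders_clustered`   -- hypothetical table
        PARTITION BY DATE(order_ts)
        CLUSTER BY customer_id, product_id                      -- order matters
        AS SELECT * FROM `my-project.my_dataset.orders`
    """
    client.query(ddl).result()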

89
Q

BigQuery: Clustering Limitations

A

- Clustering columns can only be specified at table creation and cannot be changed
- Clustering columns must be top-level, non-repeated columns
- Only supported for partitioned tables
- One to four clustering columns can be specified

90
Q

BigQuery: Querying clustered tables

A

- Filter on clustered columns in the order they were specified
- Avoid using clustered columns in complex filter expressions
- Avoid comparing clustered columns to other columns

91
Q

BigQuery: Slots

A

Slots – units of computational capacity required to execute SQL queries.
- Play a role in pricing and resource allocation
- Determined by query size and query complexity (amount of information shuffled)
- Automatically managed

92
Q

BigQuery: 3 main topics of best practices

A

- Controlling costs
- Query performance
- Optimizing storage

93
Q

BigQuery: How to control costs

A

- Avoid SELECT * (columns are stored separately)
- Use preview options to sample data
- Price queries before executing them (pricing is per byte; there's a calculator, and a dry run works too – see the sketch below)
- Using LIMIT does not affect cost – the full columns are still read and processed
- View costs using a dashboard and query audit logs
- Partition by date and filter on the partition in your queries
- Use streaming inserts with caution – they are costly; prefer bulk loads
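
One way to price a query before running it is a dry run, sketched here with the google-cloud-bigquery client and a placeholder table.

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(
        "SELECT user_id FROM `my-project.my_dataset.events` "   # hypothetical table
        "WHERE _PARTITIONTIME = TIMESTAMP('2024-01-01')",
        job_config=job_config,
    )
    print(f"This query would process {job.total_bytes_processed} bytes")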

94
Q

BigQuery: Query Performance: Input Data and Data Sources

A

- Prune partitioned queries (don't query partitions you don't need)
- Denormalize data whenever possible
- Use external data sources appropriately
- Avoid wildcard tables
  - A wildcard table represents a union of all the tables that match the wildcard expression

95
Q

BigQuery: Query Performance: Query computation

A

- Avoid repeatedly transforming data via SQL queries
- Avoid JavaScript UDFs
- Order query operations to maximize performance
- Optimize JOIN patterns

96
Q

BigQuery: Query Performance: SQL anti-patterns

A

- Avoid self-joins
- Avoid data skew
- Avoid unbalanced joins
- Avoid joins that generate more outputs than inputs
- Avoid DML statements that update or insert single rows

97
Q

BigQuery: Query Performance: Optimizing Storage

A

- Use expiration settings (tables are automatically deleted after expiration)
- Take advantage of long-term storage
  - Lower monthly charges apply to data in tables or partitions that have not been modified for 90 days

98
Q

BigQuery: 3 Types of Roles relating to BQ

A

- Primitive – defined at the project level; 3 types: Owner, Editor, Viewer
- Predefined – granular access defined at the service level, managed by GCP; recommended over primitive roles
- Custom – user managed

99
Q

Cloud Data Loss Prevention (Cloud DLP) API

A

- Fully managed service
- Identifies and protects sensitive data at scale
- De-identifies data using masking, tokenization, date shifting, and more

100
Q

Stackdriver

A

Stackdriver – GCP's monitoring and logging suite (think a fancier CloudWatch); you can build dashboards from it.
- Supports metrics from various services, including BigQuery

101
Q

4 ML Models

A

- Linear regression
- Binary logistic regression
- Multi-class logistic regression
- K-means clustering (most recent addition; see the sketch below)
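
Assuming these refer to the BigQuery ML model types, here is a sketch of training a k-means model with a CREATE MODEL statement; the dataset, model, and column names are made up.

    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        CREATE OR REPLACE MODEL `my_dataset.customer_segments`   -- hypothetical model
        OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
        SELECT total_spend, order_count
        FROM `my-project.my_dataset.customer_stats`
    """
    client.query(sql).result()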

102
Q

Dataproc Benefits

A

Dataproc – managed cluster service for Spark and Hadoop. Benefits:
- Cluster actions complete in 90 seconds
- Pay-per-second, with a 1-minute minimum
- Scale up/down or turn off at will

103
Q

Using Dataproc

A

- Submit Hadoop/Spark jobs
- (Optionally) enable autoscaling to cope with the load
- Output to GCP services (GCS, BigQuery, Bigtable)
- Monitor with Stackdriver – fully integrated logs

104
Q

Dataproc Cluster Types

A

- Single node cluster – limited to the capacity of a single VM and cannot autoscale
- Standard cluster – type and size can be customized
- High availability cluster

105
Q

When should you not use autoscaling with Dataproc?

A

- High availability clusters
- Single node clusters (not permitted)
- HDFS – make sure you have enough primary workers to store all data so you don't lose data when scaling down
- Spark Structured Streaming
- Idle clusters – just delete the idle cluster and create a new one for the next job

106
Q

Define: Dataproc Cloud Storage Connector

A

Cloud Storage Connector – run Dataproc jobs against GCS instead of HDFS.
- Cheaper than persistent disk, and you get all the GCS benefits
- Decouples storage from the cluster: decommission the cluster when finished, with no data loss

107
Q

What on-prem technologies can be replaced by Dataproc?

A

A great choice for migrating Hadoop and Spark workloads into GCP.

108
Q

What are the benefits of Dataproc?

A

Ease of scaling, the option to use GCS instead of HDFS, and connectors to other GCP services (BigQuery / Bigtable).

109
Q

Define: Hadoop

A

It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

110
Q

Define: Spark

A

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

111
Q

Define: Zookeeper

A

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

112
Q

Define: Hive

A

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

113
Q

Define: Tez

A

Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data.

114
Q

Define: MapReduce

A

A MapReduce program is composed of a map procedure, which performs filtering and sorting, and a reduce method, which performs a summary operation.

115
Q

Define: BigTable

A

Bigtable – managed wide-column NoSQL database (key/value pairs), designed for high throughput at low latency (e.g. 10,000 reads/sec with ~6 ms response).
- Scalable and highly available
- Originally developed internally at Google for web indexing
- HBase is an open-source implementation of the Bigtable design, maintained as an Apache project
- Bigtable supports the HBase client library for Java

116
Q

Where are BigTable “tables” stored?

A

Cloud Bigtable tables (the only index you get is the row key):
- Blocks of contiguous rows are sharded into tablets
- Tablets are stored in Google Colossus – all splitting, merging, and rebalancing happens automatically

117
Q

What are the typical use cases of BigTable?

A

(Large amounts of small data)
- Marketing & financial data (stock prices, currency exchange rates)
- Time series & IoT

118
Q

What are the alternatives to BigTable?

A

- SQL support, OLTP – Cloud SQL
- OLAP – BigQuery
- NoSQL documents – Firestore
- In-memory key/value – Memorystore
- Realtime DB – Firebase

119
Q

How many clusters can be ran per BigTable instance and where do they exist?

A

- Instances can run up to 4 clusters
- Each cluster exists in a single zone
- Production instances allow up to:
  - 30 nodes per project
  - 1,000 tables per instance

120
Q

BigTable individual cell/row data limit

A

Individual cells should be no larger than 10 MB (including history), and no row should be larger than 100 MB.

121
Q

BigTable garbage collection policies

A

Expiry policies define garbage collection:
- Expire based on age
- Expire based on number of versions

122
Q

BigTable Query Planning

A

Think about what sort of questions you will ask of the database. Scans are the most expensive operation (they take the longest).

123
Q

What is field promotion in BigTable?

A

Move data that would normally be in a column and combine it with the row key for querying (see the sketch below). Never put a timestamp at the start of a row key – this makes it impossible to balance the cluster.
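
A sketch of field promotion with the google-cloud-bigtable client: the device ID is promoted into the row key and the timestamp goes at the end, never the start; the instance, table, and column family names are placeholders.

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("iot-instance").table("sensor-readings")  # hypothetical

    # Promoted field (device ID) first, timestamp last in the row key
    row = table.direct_row(b"device-4711#2024-01-01T12:00:00")
    row.set_cell("metrics", "temperature", b"21.5")
    row.commit()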

124
Q

What BigTable row keys should be avoided?

A

- Domain names
- Sequential numbers (a really bad idea – writes will always land at the end)
- Frequently updated identifiers (repeatedly updating the same row is not as performant as writing new rows)
- Hashed values – obscures meaningful row ranges without really helping even distribution

125
Q

How to design BigTable for performance

A

- Store related entities in adjacent rows; balance reads/writes
- Balanced access patterns enable linear scaling of performance

126
Q

BigTable: How to store time series data

A

Use tall and narrow tables:
- Use new rows instead of versioned cells
- Logically separate event data into different tables where possible
- Don't reinvent the wheel – several proven time-series schemas already exist

127
Q

BigTable: How to avoid hotspots

A

- Consider field promotion into the key
- Salting
- Use the Key Visualizer to find hotspots

128
Q

BigTable data model replication

A

- Eventually consistent data model
- Used for:
  - Availability and failover
  - Application isolation
  - Global presence

129
Q

BigTable Autoscaling, how-to and what to expect

A

- Stackdriver metrics can be used for programmatic scaling – it is not a built-in feature
- Rebalancing tablets takes time; performance may not improve for ~20 minutes
- Adding nodes does not fix a bad schema or a hot node

130
Q

Performance in BigTable good/bad

A

Good:
- Optimized schema and row key design
- Large datasets
- Correct row and column sizing

Bad:
- Datasets smaller than 300 GB
- Short-lived data

131
Q

When do you choose BigTable?

A

If migrating from an on-prem environment, look for HBase. Also consider Bigtable where it beats BigQuery due to the nature of the data – time-series or latency-sensitive information.

132
Q

What are common causes of poor performance in BigTable?

A

Under-resourced clusters, bad schema design, and poorly chosen row keys.

133
Q

BigTable design: Tall vs. Wide

A

Wide tables store multiple columns for a given row key, where the query pattern is likely to require all the information about a single entity. Tall tables suit time-series or graph data and often have only a single column.

134
Q

Define: Datalab

A

Jupyter notebooks that can interact with GCP services

135
Q

Define: Data Studio

A

Free visualization tool for creating Dashboards and reports

136
Q

Define: Cloud Composer

A

Fully managed workflow service built on top of Apache Airflow – a task orchestration system intended to create workflows of varying complexity.
- Written in Python and highly extensible
- Central management and scheduling tool
- Extensive CLI and web UI for managing workflows

137
Q

Dataflow vs Composer

A

- Dataflow is specifically for batch and stream data processing using Beam
- Composer is a task orchestrator built on Python
- Orchestrating Dataflow pipelines with Composer is a common pattern

138
Q

Cloud Composer architecture

A

- Each environment is an isolated installation of Airflow and its component parts
- You can have multiple environments per project, but each environment is independent
- You write DAGs in Python for the scheduler to pick up; this is where you define the order and configuration settings for the workflows

139
Q

ML: 3 major categories/learning options

A

- Pre-trained models – no model training or knowledge of ML required
- Reusable models – model training required, minimal knowledge of ML
- Build your own – deep knowledge of ML required, lots of model training

140
Q

ML: Cloud Vision API

A

- Identifies objects within images (see the sketch below)
- Able to perform face detection
- Can read printed and handwritten text
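
A label-detection sketch with the google-cloud-vision client; the GCS image URI is a placeholder.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    image = vision.Image()
    image.source.image_uri = "gs://my-bucket/photo.jpg"   # hypothetical object

    response = client.label_detection(image=image)
    for label in response.label_annotations:
        print(label.description, label.score)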

141
Q

ML: Cloud Video Intelligence API

A

Identifies objects, places, and actions in videos, streamed or stored.

142
Q

ML: Cloud Translation API

A

Translates between more than 100 different languages.

143
Q

ML: Cloud Text-to-Speech API

A

Converts text to human speech, with 180+ voices across 30+ languages.

144
Q

ML: Natural Language API

A

Performs sentiment analysis, entity analysis, content classification, and more.

145
Q

Define: Cloud AutoML

A

Cloud AutoML – train your own custom models to solve specific problems; a suite of ML products to facilitate training of custom ML models:
- Vision
- Video Intelligence
- Natural Language
- Translation
- Tables

146
Q

ML Supervised Learning

A

Train the model using data that is labeled; the model learns to infer the label from the feature values.

147
Q

ML Unsupervised Learning

A

The model is used to uncover structure within the dataset itself. Example: uncover personas within customer data – it takes similar information and groups it together.

148
Q

ML: Top 3 model types

A

- Regression – predict a real number, e.g. the value of a house
- Classification – predict the class from a specified set, with a probability score
- Clustering – group elements into clusters based on how similar they are

149
Q

ML: Overfitting

A

Overfitting – a common challenge that has to be overcome when training ML models. An overfit model is not generalized: it does not fit unknown data well because it is trained too closely to the training dataset.

150
Q

ML: How to deal with overfitting

A

- Increase training data size
- Feature selection – include more, or reduce the number of, features
- Early stopping – don't run too many iterations on the training data
- Cross-validation – split the training data into much smaller sets ("folds") that are then used to tune the model
  - K-fold cross-validation (see the sketch below)
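
A tiny k-fold cross-validation sketch using scikit-learn (one possible tool for this); the toy arrays stand in for real features and labels.

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(20).reshape(10, 2)   # toy feature matrix
    y = np.arange(10)                  # toy labels

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
        # Train on the larger split, validate on the held-out fold
        print(f"fold {fold}: train={train_idx}, validate={val_idx}")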

151
Q

Name some examples of hyperparameters

A

- Batch size
- Training epochs – number of times the full set of training data is run through
- Number of hidden layers in a neural network
- Regularization type
- Regularization rate
- Learning rate

152
Q

Name the 2 types of hyperparameters

A

- Model hyperparameters – relate directly to the model that is selected
- Algorithm hyperparameters – relate to the training of the model

153
Q

Define: Keras

A

Open-source neural network library; a high-level API for fast experimentation, supported in TensorFlow's core library.

154
Q

Define: TensorFlow

A

Google’s open source, end-to-end, ML framework

155
Q

Define: Tensor

A

Tensors represent the flow of information in a neural network

156
Q

What is the AI Hub?

A

- Facilitates sharing of AI resources
- Hosted repository of plug-and-play AI components
- End-to-end pipelines
- Standard algorithms to solve common problems

157
Q

ML: Vision AI, 2 modes

A

- Synchronous mode – responses are returned immediately (online processing)
- Asynchronous mode – results are only returned once processing is completed (offline)

158
Q

ML: Vision AI, Detection Modes

A

- Face detection – includes emotional analysis
- Image property detection – identify image properties (e.g. dominant colors)
- Label detection – identify and detect objects, locations, activities, animal species, products, etc.

159
Q

Define: Dialogflow

A

Dialogflow – natural language interaction platform.
- Used in mobile and web applications, devices, and bots
- Analyzes text or audio inputs
- Responds using text or speech

160
Q

ML: Cloud speech-to-text usage: Synchronous Recognition

A
- REST and gRPC
- Returns a result after all input audio has been processed
- Limited to audio of one minute or less (see the sketch below)
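
A synchronous recognition sketch with the google-cloud-speech client; the audio URI is a placeholder and must point to a clip of one minute or less.

    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(language_code="en-US")
    audio = speech.RecognitionAudio(uri="gs://my-bucket/short-clip.flac")  # hypothetical

    response = client.recognize(config=config, audio=audio)   # blocks until processed
    for result in response.results:
        print(result.alternatives[0].transcript)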
161
Q

ML: Cloud speech-to-text usage: Asynchronous Recognition

A
- REST and gRPC
- Initiates a long-running operation
- Use the operation to poll for results
162
Q

ML: Cloud speech-to-text usage: Streaming Recognition

A
- gRPC only
- Audio data is provided within a gRPC bi-directional stream
- Results are produced while the audio is being captured
163
Q

Define: gRPC

A

gRPC (gRPC Remote Procedure Calls) is an open-source remote procedure call (RPC) system initially developed at Google in 2015 as the next generation of its internal RPC infrastructure, Stubby. It provides features such as authentication, bidirectional streaming and flow control, blocking or non-blocking bindings, and cancellation and timeouts. It generates cross-platform client and server bindings for many languages. The most common usage scenarios include connecting services in a microservices-style architecture, or connecting mobile device clients to backend services.