ACG Notes Flashcards
3 V’s of Big Data
- ) Volume – Scale of data being handled by systems (can it be handled by a single server?)
- ) Velocity – The speed at which data is being processed
- ) Variety – The diversity of data sources, formats, and quality
What is a Data Warehouse?
- ) Data Warehouse
a. Structured and/or processed
b. Ready to use
c. Rigid structures – hard to change, may not be the most up to date either
What is a data lake?
- ) Data Lake
a. Raw and/or unstructured
b. Ready to analyze – more up to date but requires more advanced tools to query
c. Flexible – no structure is enforced
4 Stages of a Data Pipeline
- ) Ingestion
- ) Storage
- ) Processing
a. ETL – Data is taken from a source, manipulated to fit the destination
b. ELT – data is loaded into a data lake and transformations can take place later
c. Common transformations: Formatting / Labeling / Filtering / Validating (a minimal sketch follows this list)
- ) Visualization
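A minimal Python sketch of these transformation types; the record fields and labels are made up purely for illustration:

```python
# Hypothetical raw records arriving from an ingestion step.
raw_records = [
    {"user": " Alice ", "amount": "42.50", "country": "US"},
    {"user": "", "amount": "n/a", "country": "CA"},
]

def format_record(rec):
    # Formatting: normalize whitespace.
    return {"user": rec["user"].strip(), "amount": rec["amount"], "country": rec["country"]}

def is_valid(rec):
    # Validating: drop records with missing users or non-numeric amounts.
    try:
        float(rec["amount"])
    except ValueError:
        return False
    return bool(rec["user"])

def label_record(rec):
    # Labeling: tag each record with a region label.
    rec["region_label"] = "north_america" if rec["country"] in ("US", "CA") else "other"
    return rec

formatted = [format_record(r) for r in raw_records]
clean = [label_record(r) for r in formatted if is_valid(r)]  # Filtering via is_valid
print(clean)
```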
Cloud Storage
o Unstructured object storage
o Regional, dual-region, or multi-region
o Standard, Nearline, or Coldline storage classes
o Storage event triggers (Pub/Sub)
o Usually the first step in a cloud data pipeline
Cloud Bigtable
o Petabyte-scale NoSQL database
o High-throughput and scalability
o Wide column key/value data
o Time-series, transactional, IoT data
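A minimal sketch of the wide-column key/value model using the google-cloud-bigtable Python client; the project, instance, table, and column family names here are hypothetical:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("iot-instance")
table = instance.table("sensor-readings")

# Wide-column model: the row key encodes device and timestamp.
row_key = b"device-1234#20240101T120000"
row = table.direct_row(row_key)
row.set_cell("metrics", "temperature", b"21.7")
row.commit()

# Read the row back by key.
read = table.read_row(row_key)
cell = read.cells["metrics"][b"temperature"][0]
print(cell.value)
```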
Cloud BigQuery
o Petabyte-scale analytics DW
o Fast SQL queries across large datasets
o Foundations for BI and AI
o Useful public datasets
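A short sketch of a SQL query against one of the public datasets using the google-cloud-bigquery Python client (assumes default credentials are configured):

```python
from google.cloud import bigquery

client = bigquery.Client()  # project and credentials taken from the environment

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# Run the query and iterate over the result rows.
for row in client.query(query).result():
    print(row.name, row.total)
```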
Cloud Spanner
o Global SQL-based relational database
o Horizontal scalability and HA
o Strong consistency
o Not cheap to run; often used for financial transactions
Cloud SQL
o Managed MySQL, PostgreSQL, and SQL Server instances
o Built-in backups, replicas, and failover
o Does not scale horizontally, but does scale vertically
Cloud Firestore
o Fully managed NoSQL document database
o Large collections of small JSON documents
o Realtime database with mobile SDKs
o Strong consistency
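A minimal sketch of storing and reading a small JSON-style document with the google-cloud-firestore Python client; the collection and document IDs are hypothetical:

```python
from google.cloud import firestore

db = firestore.Client()

# Store a small document (e.g., a shopping cart).
db.collection("carts").document("user-123").set(
    {"items": ["sku-1", "sku-2"], "total": 19.99}
)

# Strongly consistent read of the same document.
snapshot = db.collection("carts").document("user-123").get()
print(snapshot.to_dict())
```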
Cloud Memorystore
o Managed Redis instances
o In-memory DB, cache, or message broker
o Built-in HA
o Vertically scalable by increasing the amount of RAM
Cloud Storage (GCS) at a high level
o Fully managed object storage
o For unstructured data: Images, videos, etc.
o Access via API or programmatic SDK
o Multiple storage classes
o Instant access in all classes, also has lifecycle management
o Secure and durable (HA and maximum durability)
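A minimal sketch of programmatic access with the google-cloud-storage Python client; the bucket and object names are hypothetical:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-unstructured-data")

# Upload an image as an object.
blob = bucket.blob("images/cat.png")
blob.upload_from_filename("cat.png")

# Download it again later.
blob.download_to_filename("/tmp/cat.png")
```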
GCS Concepts, what is GCS, where can buckets be?
o A bucket is a logical container for an object
o Buckets exist within projects (named within a global namespace)
o Buckets can be:
o Regional $
o Dual-regional $$ (HA)
o Multi-regional $$$ (replicated across multiple regions within a large geographic area; highest availability and low latency for geo-distributed access)
4 GCS Storage Classes
o Standard
o Nearline
o Coldline
o Archive
Describe standard GCS Storage Class
$0.02 per GB
99.99% regional availability
>99.99% availability in multi and dual-regions
Describe Nearline GCS Storage Class
30 day minimum storage
$0.01 per GB up / down
99.9% regional availability
99.95% availability in multi and dual regions
Describe Coldline GCS Storage Class
90 day minimum storage
$0.004 per GB stored
$0.02 per GB up/down
99.9% regional availability
99.95% availability in multi and dual regions
Describe Archive GCS Storage Class
365 day minimum storage
$0.0012 per GB stored
$0.05 per GB up/down
99.9% regional availability
99.95% availability in multi and dual regions
Objects in cloud storage (encryption, changes)
o Encrypted in flight and at rest
o Objects are immutable; to change one you must overwrite it (an atomic operation)
o Objects can be versioned
name the 5 “advanced” features of GCS
o Parallel uploads of a single object
o Integrity checking – pre-calculate an MD5 hash, which is compared to the one Google calculates
o Transcoding for compression
o Requester Pays option, if desired
o Pub/Sub notifications
New files are common triggers for a data pipeline
What is Cloud Transfer Service?
o Transfers from a source to a sink (a bucket); supported sources: S3, HTTP, Cloud Storage
o Transfers can be filtered based on names/dates
Schedule it for one run or periodically (can delete in source or destination after transfer is confirmed)
What is BigQuery Data Transfer Service?
o Automates data transfer to BigQuery
o Data is loaded on a regular basis
o Backfill can recover from gaps or outages
o Supported sources: Cloud Storage, Merchant Center, Google Play, S3, Teradata, Redshift
What is a transfer appliance?
A physical, rackable storage device for offline data transfer, available in 100 TB and 480 TB versions
What are the top 3 features of Cloud SQL?
o Managed SQL instances (creation, replication, backups, patches, updates)
o Multiple DB engines (MySQL, PostgreSQL, SQL Server)
o Scalability – vertically to 64 cores and 416 GB of RAM, HA options are available
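A minimal sketch of querying a PostgreSQL Cloud SQL instance with psycopg2, assuming something like the Cloud SQL Auth Proxy is running locally; the database name, table, and credentials are hypothetical:

```python
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1",   # local proxy endpoint (assumption)
    port=5432,
    dbname="orders",
    user="app_user",
    password="app_password",
)

# Run a parameterized query inside a transaction.
with conn, conn.cursor() as cur:
    cur.execute("SELECT id, status FROM orders WHERE status = %s LIMIT 10", ("shipped",))
    for order_id, status in cur.fetchall():
        print(order_id, status)

conn.close()
```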
Describe regional configuration of Cloud Spanner
o Regional Replication:
3 read-write replicas
Every mutation requires a write quorum
This is different from traditional HA in that it’s a read AND a write replica in each zone.
Regional Cloud Spanner best practices
o Design a performant schema
o Spread reads/writes around the database, avoid write hot spots
o Co-locate compute workloads in the same region
o Provision nodes to keep average CPU utilization under 65%
Multi-regional Cloud Spanner benefits
Five 9s SLA – 99.999%
Reduce latency with distributed data
External consistency
• Concurrency control for transactions; this guarantees transactions are executed in a globally consistent, serial order, even across the globe
Multi-regional Cloud Spanner best practices:
Design a performant schema to avoid hotspots
Co-locate write-heavy compute workloads in the same region as the leader
Spread critical workloads across two regions
Provision nodes to keep average CPU under 45%
Cloud Spanner data model
• Data model:
o Relational database tables
o Strongly typed (you must conform to a strict schema)
o Parent-child relationships, declared with primary keys, which create interleaved tables
Cloud Spanner transactions
• Transactions:
o Locking read-write
o Read-only
o Partitioned DML
o Regular transactions using ANSI SQL best practices
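A minimal sketch of a read-only snapshot query and a locking read-write transaction with the google-cloud-spanner Python client; the instance, database, table, and column names are hypothetical:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("prod-instance").database("finance-db")

# Read-only: strongly consistent snapshot query.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts LIMIT 5")
    for account_id, balance in rows:
        print(account_id, balance)

# Locking read-write transaction, executed with external consistency.
def transfer(transaction):
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance - 10 WHERE AccountId = 'A1'"
    )
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + 10 WHERE AccountId = 'A2'"
    )

database.run_in_transaction(transfer)
```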
Top 2 features of Cloud MemoryStore
o Fully managed Redis instance
o 2 tiers:
Basic tier – make sure your app can withstand full data flush
Standard tier – adds cross zone replication and automatic failover
Benefits of managed Redis (MemoryStore)
o No need to provision VMs
o Scale instances with minimal impact
o Private IPs and IAM
o Automatic replication and failover
Cloud MemoryStore use cases
o Session cache – store logins or shopping carts
o Message queue – loosely couple micro services
o Pub/sub pattern – Redis supports publish/subscribe, but also consider Cloud Pub/Sub for this
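A minimal session-cache sketch with the redis-py client, assuming a reachable Memorystore private IP; the IP address and key names are hypothetical:

```python
import redis

# Connect to the Memorystore instance over its private IP (assumption).
r = redis.Redis(host="10.0.0.3", port=6379)

# Cache a shopping cart for 30 minutes.
r.setex("session:user-123:cart", 1800, "sku-1,sku-2")

# Later, fetch it back (returns None if it has expired).
cart = r.get("session:user-123:cart")
print(cart)
```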
Storage options: Low latency vs. Warehouse
Low latency (use Cloud Bigtable)
- petabyte scale
- single-key rows
- Time series or IoT data
Warehouse (use BigQuery)
- Petabyte scale
- Analytics warehouse
- SQL queries
Storage options: Horizontal vs. Vertical scaling
Horizontal scaling (Cloud Spanner)
- ANSI SQL
- Global replication
- High Availability and consistency
Vertical Scaling (use Cloud SQL)
- MySQL or PostgreSQL
- Managed service
- High availability
Storage options: NoSQL vs Key/Value
NoSQL (use Cloud Firestore)
- Fully managed document database
- Strong consistency
- Mobile SDKs and offline data
Key/Value (use Cloud Memorystore)
- Managed Redis instances
- Standard Redis functionality (in-memory cache and data structures)
What is MapReduce
A distributed implementation of the map and reduce programming model; provides a common interface for programming these operations while abstracting away the systems management
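A single-machine Python sketch of the map/shuffle/reduce shape of the computation; frameworks like Hadoop distribute exactly this pattern across a cluster:

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```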
4 Core modules of Hadoop & HDFS
Hadoop Common – base files
Hadoop Distributed File System (HDFS) – distributed fault tolerant file system
Hadoop YARN – resource management / job scheduling
Hadoop MapReduce – Hadoop's own implementation of the MapReduce programming model
Apache Pig
language for analyzing large datasets, essentially an abstraction for MapReduce
o High level framework for running MapReduce jobs on Hadoop clusters
Apache Spark
general purpose cluster-computing framework
o Has largely replaced Hadoop MapReduce; can run on top of Hadoop (YARN/HDFS) and is much faster
Hadoop stores data in blocks on disk before, during, and after computation
Spark stores data in memory, enabling parallel operations on that data
Hadoop vs. Spark
Hadoop
- Slow disk storage
- High latency
- Used for: slow, reliable batch processing
Spark
- Fast memory storage
- Low latency
- Stream processing
- 100x faster in-memory
- 10x faster on disk
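A minimal PySpark sketch of the same word-count idea, where intermediate data stays in memory; the input path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Parallel map/reduce over an in-memory RDD of lines.
counts = (
    spark.sparkContext.textFile("gs://my-bucket/input/*.txt")
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```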
Apache Kafka
distributed streaming platform, designed for high-throughput and low-latency pub/sub stream of records
o Handles >800 billion messages per day at LinkedIn
Kafka vs. Pub/Sub
Kafka
- Guaranteed message ordering
- Tuneable message retention
- Polling (Pull) subscriptions only
- Unmanaged
Pub/Sub
- No message ordering guarantee
- 7 day maximum message retention
- Pull or Push subscriptions
- Managed
Top 2 Benefits of Pub/Sub
o Global messaging and event ingestion
o Serverless and fully managed, processes up to 500 million messages per second
Top 4 features of Pub/Sub
o Multiple pub/sub patterns, one to many, many to one, and many to many
o At least once delivery is guaranteed
o Can process messages in real time or in batch; retries use exponential backoff
o Integrates with Cloud Dataflow
Pub/Sub use cases
o Distributing workloads
o Asynchronous workflows – order processing: ordering, packaging, shipping
o Distributing Event Notifications
o Distributed Logging
o Device Data Streaming
2 types of delivery method for pub/sub subscriptions:
o Pull (default) – ad-hoc requests; messages must be acknowledged, or they will remain at the top of the queue and you won't get the next message
o Push – sends new messages to an endpoint; must be HTTPS with a valid certificate
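A minimal sketch of publishing and pull-based consumption with the google-cloud-pubsub Python client; the project, topic, and subscription IDs are hypothetical:

```python
from google.cloud import pubsub_v1

project_id = "my-project"

# Publish a message to a topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "orders")
publisher.publish(topic_path, b"order-123 shipped").result()

# Pull messages and acknowledge them so they are not redelivered.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "orders-sub")
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})

ack_ids = []
for received in response.received_messages:
    print(received.message.data)
    ack_ids.append(received.ack_id)

if ack_ids:
    subscriber.acknowledge(request={"subscription": subscription_path, "ack_ids": ack_ids})
```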
Pub/Sub integration facts
o Fully supported by Cloud Dataflow
o Client libraries for popular languages (e.g., Python)
o Cloud Functions can be triggered by events
o Cloud Run to be the receiver of a push sub
o IoT Core
Pub/Sub delivery model
o You may receive a message more than once in a single subscription
o Message Retention Duration (default 7 days) – undelivered messages are deleted
Pub/Sub lifecycle – when does a sub expire?
If there are no pulls or pushes, subscriptions expire after 31 days
Standard pub/sub model limitations
o Acknowledged messages are no longer available to subscribers
o Every message must be processed by a subscription
Pub/Sub: Seek
o Seek – you can rewind the clock and retrieve old messages up to the retention window; you can also use this to seek to a point in the future
Useful in case of an outage most commonly
Pub/Sub: Snapshot
o Snapshot – save the current state of the queue, this enables replay
Useful if you are deploying new code and are unsure how it will behave: take a snapshot, move forward, then seek back to the snapshot to replay the messages
Pub/Sub: Ordering messages
o Ordering messages – Use timestamps when final order matters; order still isn't guaranteed, but you do have a record of when each message occurred.
If you absolutely must guarantee order, consider an alternative system
Pub/Sub: Access Control
o Use service accts for authorization, granting per-topic or per-subscription permissions
Grant limited access to publish or consume messages
Define: Cloud Dataflow
• Cloud Dataflow – fully managed, serverless ETL tool, using Apache Beam.
o Supports:
SQL, Java, and Python
Real-time and batch processing
Define: Pipeline Lifecycle
• You can run pipelines on your local machine; this is the preferred way to fix bugs
• Pipeline design considerations:
o Location of data
o Input data structure and format
o Transformation objectives
o Output data structure and location
Dataflow: ParDo
• ParDo – defines the distributed operation/transformation to be performed on the PCollection of data. These can be user-defined or pre-defined functions.
Dataflow: PCollections (Characteristics)
o Data types – may be of any data type but must all be of the same type. The SDK includes built-in encoding
o Access – individual access to elements is not supported, transforms are performed on all
o Immutable – cannot be changed once created
o Boundedness – a PCollection may be bounded (fixed size) or unbounded (streaming); there is no limit to the number of elements it can contain
o Timestamp – associated with every element of the collection, assigned by the source on creation
Dataflow: Core Beam transforms (6)
o ParDo – generic parallel processing transforms
o GroupByKey – processing collections of KVP’s
o CoGroupByKey – used when combining multiple key collections, performs relational join
o Combine – requires you to provide a function containing the combining logic; multiple pre-built functions are available (sum, min, max, ...) – see the sketch after this list
o Flatten – Multiple collections become one
o Partition – how the elements of the PCollection are split up
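A minimal Apache Beam (Python SDK) sketch exercising ParDo, GroupByKey, and a pre-built combine; it runs locally with the DirectRunner and the sample data is made up:

```python
import apache_beam as beam

class ExtractWords(beam.DoFn):
    # ParDo: a user-defined per-element transform emitting (word, 1) pairs.
    def process(self, line):
        for word in line.split():
            yield (word, 1)

with beam.Pipeline() as p:
    pairs = (
        p
        | "Create" >> beam.Create(["to be or not to be"])
        | "ExtractWords" >> beam.ParDo(ExtractWords())
    )

    # GroupByKey: collects values per key -> (word, [1, 1, ...])
    grouped = pairs | "GroupByKey" >> beam.GroupByKey()
    grouped | "PrintGrouped" >> beam.Map(print)

    # Combine: pre-built sum combiner per key -> (word, count)
    counts = pairs | "CombinePerKey" >> beam.CombinePerKey(sum)
    counts | "PrintCounts" >> beam.Map(print)
```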
Dataflow Security Mechanisms
o Only users with permission can submit pipelines
o Any temp data during execution is encrypted
o Any communication between workers happens on a private network
o Access to telemetry or metrics is controlled by project permissions
Describe GCP Service Account usage
Cloud Dataflow service uses the Dataflow Service Account
Account is automatically created on flow creation
Manipulates job resources on your behalf
Assumes the "Cloud Dataflow service agent" role
Read/write access to project resources (recommended not to change this)
Worker instances will use the Controller service account
Used for metadata operations (ex: determine size of file on storage)
You can also use a user-managed controller service account, enabling fine-grained access control
Dataflow: Regional Endpoints
• Regional Endpoints – specifying a regional endpoint means all the worker instances will stay in that region; this is best for:
o Security and compliance
o Data locality
o Resiliency
What use case does Dataflow address
o Used for migrating MapReduce jobs to Cloud Dataflow
Define: Cloud Dataflow SQL
o Develop and run Cloud Dataflow jobs from the BigQuery web UI