Big Data Flashcards
What defines Big Data (3V)
Volume, Velocity, Variety
What is Volume
The scale of information being handled by a data processing system
What is Velocity
The speed at which data is being processed: ingested, analyzed, and visualized
What is Variety
The diversity of data sources, formats, and quality
Data Warehouses
- Structured or Processed: Data is organized, may have been transformed, and is stored in a structured way
- Ready to use: Data exists in the warehouse for a defined purpose, and in a format where it is ready to be consumed
- Rigid: Data may be easier to understand, but less up-to-date. Structures are hard to change
Data Lakes
- Raw or Unstructured: The data lake contains all raw unprocessed data, before any kind of transformation or organization
- Ready to analyze: Data is more up to date, but may require more advanced tools for analysis
- Flexible: No structure is enforced, so new types of data can be added at any time
OLTP
- High volume of short transactions
- Fast queries
- High integrity
MODIFY DATA
OLAP
- Low volume of long-running queries
- Aggregated historical data
QUERY DATA
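A minimal sketch of the OLTP/OLAP contrast, using Python's built-in sqlite3 purely as a stand-in (real OLTP and OLAP workloads usually run on separate, specialised systems). The `orders` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP style: many short transactions that MODIFY data, each touching few rows.
def place_order(customer, amount):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )

for customer, amount in [("ann", 10.0), ("bob", 25.0), ("ann", 5.0)]:
    place_order(customer, amount)

# OLAP style: one longer query that only READS and aggregates historical data.
def revenue_by_customer():
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return dict(rows.fetchall())

totals = revenue_by_customer()
```

Here `place_order` stands for the "modify data" side and `revenue_by_customer` for the "query data" side.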
Stages of a Data Pipeline
- Ingestion
- Storage
- Processing
- Visualization
Data ingestion Technical Challenges
- choose the correct compute and storage options. Otherwise, a solution can be too expensive or too slow
- data should have value
- security of data
Common data transformations
- formatting
- labeling
- filtering
- validating
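The four transformations above can be sketched on a handful of made-up records. Field names, the date format, and the "large" threshold are all assumptions for illustration.

```python
from datetime import datetime

raw_events = [
    {"user": " Ann ", "ts": "2021-03-01", "amount": "12.50"},
    {"user": "bob", "ts": "2021-03-02", "amount": "-3"},
    {"user": "", "ts": "2021-03-03", "amount": "7"},
    {"user": "cy", "ts": "not-a-date", "amount": "150"},
]

def validate(event):
    """Validating: reject records with a missing user or an unparseable field."""
    if not event["user"].strip():
        return False
    try:
        datetime.strptime(event["ts"], "%Y-%m-%d")
        float(event["amount"])
    except ValueError:
        return False
    return True

def format_event(event):
    """Formatting: normalise whitespace, case, and types."""
    return {
        "user": event["user"].strip().lower(),
        "ts": datetime.strptime(event["ts"], "%Y-%m-%d").date(),
        "amount": float(event["amount"]),
    }

def keep(event):
    """Filtering: drop records that are valid but not of interest."""
    return event["amount"] > 0

def label(event):
    """Labeling: attach a derived category (the 100 threshold is arbitrary)."""
    return {**event, "size": "large" if event["amount"] >= 100 else "small"}

cleaned = [
    label(e)
    for e in (format_event(r) for r in raw_events if validate(r))
    if keep(e)
]
```

Only the first record survives: the second is filtered out (non-positive amount) and the last two fail validation.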
Stages of Data Modeling
- Conceptual. What are the entities in my data? What are their attributes and relationships?
- Logical
- Physical
Google Cloud Storage (GCS)
- Fully managed object storage
For unstructured data: images, videos. Access via API or programmatic SDKs
- Multiple storage classes
Instant access in all classes. Lifecycle management for objects and buckets
- Secure and durable
Secure access control. High availability and maximum durability
Google Cloud Storage concepts (buckets)
- a bucket is a logical container for objects
- buckets exist within projects
- bucket names exist within a global namespace
- a bucket can be:
  - regional
  - dual-regional
  - multi-regional
Storage classes in GCS
- Standard
- Nearline
- Coldline
- Archive
Standard storage class in GCS
minimum storage: -
storage fee (per GB per month): $0.02
retrieval fee: -
regional availability: 99.99%
multi and dual reg.: > 99.99%
Nearline storage class in GCS
minimum storage: 30 days
storage fee (per GB per month): $0.01
retrieval fee (per GB): $0.01
regional availability: 99.9%
multi and dual reg.: 99.95%
Coldline storage class in GCS
minimum storage: 90 days
storage fee (per GB per month): $0.004
retrieval fee (per GB): $0.02
regional availability: 99.9%
multi and dual reg.: 99.95%
Archive storage class in GCS
minimum storage: 365 days
storage fee (per GB per month): $0.0012
retrieval fee (per GB): $0.05
regional availability: 99.9%
multi and dual reg.: 99.95%
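The trade-off between the four storage classes (cheaper storage, dearer retrieval) can be worked through with the figures from the cards. This is a rough sketch: it assumes the storage fee is charged per GB per month and the retrieval fee per GB read, and it ignores operation charges, network charges, and minimum storage durations. Actual prices vary by region and over time.

```python
# Fees copied from the cards above (USD per GB).
CLASSES = {
    "standard": {"storage": 0.02, "retrieval": 0.0},
    "nearline": {"storage": 0.01, "retrieval": 0.01},
    "coldline": {"storage": 0.004, "retrieval": 0.02},
    "archive": {"storage": 0.0012, "retrieval": 0.05},
}

def monthly_cost(storage_class, stored_gb, retrieved_gb):
    """Storage plus retrieval cost for one month, in USD."""
    fees = CLASSES[storage_class]
    return stored_gb * fees["storage"] + retrieved_gb * fees["retrieval"]

def cheapest_class(stored_gb, retrieved_gb):
    """Storage class with the lowest monthly cost for this access pattern."""
    return min(CLASSES, key=lambda c: monthly_cost(c, stored_gb, retrieved_gb))
```

For 1000 GB stored, the cheapest class under these assumptions shifts from Archive (nothing read) to Coldline (100 GB read per month) to Standard (2000 GB read per month).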
Objects in Google Cloud Storage
- Objects are stored as opaque data
- Objects are immutable
- Overwrites are atomic
- Objects can be versioned (optionally)
Accessing Buckets and Objects
- Google Cloud Console
- HTTP API
- SDKs
- gsutil (command line tool)
Advanced features of Google Cloud Storage
- Parallel uploads of composite objects
- Integrity checking
- Transcoding
- Requester Pays
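Integrity checking boils down to comparing a checksum computed locally with the one recorded for the stored object. Cloud Storage reports CRC32C and MD5 checksums, base64-encoded; the sketch below uses MD5 only because it ships with Python's standard library. The helper names are hypothetical and no GCS call is made.

```python
import base64
import hashlib

def md5_base64(data: bytes) -> str:
    """Return the MD5 digest of data, base64-encoded."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

def is_intact(data: bytes, expected_md5: str) -> bool:
    """Compare a locally computed checksum with the recorded one."""
    return md5_base64(data) == expected_md5

payload = b"flashcard contents"
recorded = md5_base64(payload)                   # checksum kept with the object
ok = is_intact(payload, recorded)                # unchanged bytes
corrupted = is_intact(payload + b"!", recorded)  # altered bytes
```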
Google Cloud Storage Costs
- operation charges
- network charges
- data retrieval charges
Google Cloud Storage Lifecycle management
- apply a lifecycle configuration to a bucket
- GCS periodically checks the configuration
- matching rules are applied to objects
- rules can delete objects or set storage classes
The lifecycle management configuration file is a JSON file
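As an illustration of what such a JSON file can look like, the sketch below builds a two-rule configuration in Python and serialises it. The layout follows the documented lifecycle format as best understood here (a `rule` list of `action` / `condition` pairs); the age thresholds are arbitrary examples.

```python
import json

lifecycle_config = {
    "lifecycle": {
        "rule": [
            {
                # Move objects to Nearline once they are 30 days old.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": 30},
            },
            {
                # Delete objects once they are 365 days old.
                "action": {"type": "Delete"},
                "condition": {"age": 365},
            },
        ]
    }
}

config_json = json.dumps(lifecycle_config, indent=2)
```

Saved to a file, a configuration like this would typically be applied with `gsutil lifecycle set <file> gs://<bucket>`.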
Security and Access Control in GCS
- IAM for bulk access to buckets
- ACLs (access control lists) for granular access to buckets
- Signed URLs for temporary access
- Signed policy documents
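The idea behind a signed URL (temporary access without an account) can be shown with a toy scheme: the URL carries an expiry time plus a signature over the path and expiry, and the server rejects it if either check fails. This is not the signing algorithm GCS actually uses; the secret, the host name, and the query parameter names are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key; never hard-code real secrets

def _signature(path: str, expires: int) -> str:
    message = f"{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()

def sign_url(path: str, now: int, ttl_seconds: int) -> str:
    """Return a URL that is valid until now + ttl_seconds."""
    expires = now + ttl_seconds
    query = urlencode(
        {"expires": expires, "signature": _signature(path, expires)}
    )
    return f"https://storage.example.com{path}?{query}"

def verify_url(url: str, now: int) -> bool:
    """Accept the URL only if it is unexpired and the signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    return hmac.compare_digest(signature, _signature(parsed.path, expires))

url = sign_url("/my-bucket/report.csv", now=1_000, ttl_seconds=600)
```

Timestamps are passed in explicitly so the behaviour is easy to check: the URL verifies at `now=1_200`, fails after expiry, and fails if the path is altered.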
Amazon analog of Google Cloud SQL
Amazon RDS
Google Cloud SQL
- Managed SQL instances
Automate instance and database creation, replication, backups, patches and updates
- Multiple database engines
MySQL 5.6 and 5.7, PostgreSQL 9.6 or 11, SQL Server in beta
- Scalability and availability
Vertically scale to 64 cores and 416 GB RAM. Live migration and less configuration
Google Cloud Firestore replaces…
Cloud DataStore
Amazon analog of Google Cloud Firestore
Amazon DynamoDB
Google Cloud Firestore
- Fully managed NoSQL database
Serverless autoscaling NoSQL document store. Integrated with GCP and Firebase
- Realtime DB with mobile SDK
Android and iOS client libraries, frameworks for all popular programming languages
- Scalability and consistency
Horizontal autoscaling and strong consistency, with support for ACID transactions
Firestore Data Model
- it’s a document store
- a document is just some JSON data
- documents are bundled together into collections
- documents can contain nested subcollections
- documents can hold references to other documents
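The collection / document / subcollection nesting can be pictured as paths whose segments alternate between collection names and document IDs. The sketch below models that with a plain dictionary; it does not use the Firestore client library, and the `users` / `orders` names are made up.

```python
def kind_of(path: str) -> str:
    """Paths alternate collection/document/...; an even length is a document."""
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError("empty path")
    return "document" if len(segments) % 2 == 0 else "collection"

# Hypothetical data: a "users" collection whose documents can each have
# an "orders" subcollection. Each document is map-like (JSON-style) data.
store = {
    "users/ann": {"name": "Ann", "premium": True},
    "users/ann/orders/o1": {"total": 12.5, "buyer": "users/ann"},  # reference
    "users/bob": {"name": "Bob", "premium": False},
}

def documents_in(collection_path: str) -> dict:
    """Return the documents directly inside a collection (not deeper ones)."""
    if kind_of(collection_path) != "collection":
        raise ValueError(f"{collection_path!r} is not a collection path")
    prefix = collection_path + "/"
    return {
        path: doc
        for path, doc in store.items()
        if path.startswith(prefix) and "/" not in path[len(prefix):]
    }
```

Listing `users` returns Ann and Bob but not Ann's order, which lives one level down in the `users/ann/orders` subcollection.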
Firestore supported datatypes
- String, integer, boolean, float, null
- Bytes, date and time, geographical point
- Array and map
- Reference (to document)
Indexes in Cloud Firestore
- Automatic single-field indexes
- Index exemption
- Composite indexes
Google Cloud Spanner
- Managed SQL-compliant DB
SQL (ANSI 2011) schemas and queries with ACID transactions
- Horizontally scalable
Strong consistency across rows and regions, from 1 to 1000s of nodes
- Highly available
Automatic global replication, no planned downtime and 99.999% SLA
CAP theorem
A distributed system can guarantee only 2 of 3:
- Consistency
- Availability
- Partition tolerance
Google Cloud Spanner is (CAP)
CP system. Sometimes it sacrifices availability for consistency