Analysis Flashcards
Amazon Machine Learning
- Provides visualization tools and wizards to make creating a model easy
- Fully managed
- Outdated now (deprecated; no longer available to new users)
Amazon Machine Learning Cost Model
- Charged for compute time
Amazon Machine Learning Promises
- No downtime
- Up to 100GB training data
- Up to 5 simultaneous jobs
Amazon Machine Learning Anti Pattern
- Terabyte-scale data
- Unsupported learning tasks
- sequence prediction
- unsupervised clustering
- deep learning
AWS SageMaker
- Build, Train and Deploy models
- TensorFlow, Apache MXNet
- GPU accelerated deep learning
- Scaling effectively unlimited
- hyperparameter tuning jobs
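As a concrete illustration, here is a minimal boto3 sketch of launching a SageMaker training job on a GPU instance. The job name, role ARN, image URI, and S3 paths are placeholders, not values from these notes.

```python
# Hedged sketch: start a SageMaker training job via boto3 (placeholder names/ARNs).
import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="demo-training-job",                         # hypothetical name
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",      # placeholder role
    AlgorithmSpecification={
        "TrainingImage": "<ecr-training-image-uri>",             # placeholder image URI
        "TrainingInputMode": "File",
    },
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",                    # placeholder bucket
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.p3.2xlarge",                         # GPU instance for deep learning
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```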
AWS SageMaker Security
- Code stored in “ML storage volumes”
- All artifacts encrypted in transit and at rest
- API and console secured by SSL
- KMS integration for SageMaker notebook, training jobs, endpoints
Deep Learning on EC2 / EMR
- EMR supports Apache MXNet and GPU Instance types
- Appropriate instance types for deep learning
- P3 : up to 8 Tesla V100 GPUs
- P2 : up to 16 K80 GPUs
- G3 : up to 4 M60 GPUs
- Deep Learning AMIs
AWS Data Pipeline
- Manages task dependencies
- Retries and notifies on failures
- Highly available
- Destination : S3, RDS, DynamoDB, Redshift, EMR
Kinesis Data Analytics
- Fully managed and serverless
- Transform and analyze streaming data in real time with Apache Flink
- Reference tables provide an inexpensive way to join streaming data for quick lookups
- Uses Flink under the hood
- Flink is a framework for processing data streams
- Kinesis Data Analytics integrates Flink with AWS
- Use Cases : Continuous metric generation, responsive real-time analytics, etc
- 1 KPU = 1 vCPU and 4 GB of memory
Kinesis Data Analytics + Lambda
- Post processing
- aggregating rows, translating to different formats, transforming and enriching data
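A hedged sketch of what such a post-processing Lambda might look like. The record shape used here (a list of records with a recordId and base64-encoded data, answered with a result of "Ok") is an assumption modeled on the Firehose-style transformation contract; verify the actual event format for your application.

```python
# Hypothetical Lambda handler that decodes, enriches, and re-encodes streaming records.
import base64
import json

def handler(event, context):
    output = []
    for record in event.get("records", []):
        # Decode the base64 payload into a dict (assumes JSON records)
        payload = json.loads(base64.b64decode(record["data"]))

        # Example enrichment / format translation (hypothetical field)
        payload["processed"] = True

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```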
Kinesis Data Analytics Use Cases
- Streaming ETL
- Continuous metric generation
- Responsive real-time analytics
RANDOM_CUT_FOREST
- SQL function used for anomaly detection on numeric columns in a stream
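A sketch of the kind of SQL a Kinesis Data Analytics (SQL) application runs to score anomalies with RANDOM_CUT_FOREST, kept here as a Python constant for reference. The stream and column names are placeholders.

```python
# Hypothetical in-application SQL: score a numeric column and emit ANOMALY_SCORE.
ANOMALY_SQL = """
CREATE OR REPLACE STREAM "ANOMALY_STREAM" ("price" DOUBLE, "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
  INSERT INTO "ANOMALY_STREAM"
  SELECT STREAM "price", "ANOMALY_SCORE"
  FROM TABLE(RANDOM_CUT_FOREST(
      CURSOR(SELECT STREAM "price" FROM "SOURCE_SQL_STREAM_001")));
"""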
Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
- A fork of Elasticsearch and Kibana
- A search engine
- Fully managed
- Scale up and down without downtime
OpenSearch Use Cases
- Full text search
- Log analytics
- Application monitoring
- Security analytics
- Clickstream analytics
OpenSearch Concepts
- Documents
- docs are hashed to a particular shard
- Indices
- An index is split into primary shards, each with replicas (e.g., 2)
- Applications should round-robin requests amongst nodes
- Write requests are routed to primary shard, then replicated
- Read requests are routed to primary or any replicas
OpenSearch Options
- Dedicated master node(s)
- Choice of count and instance types
- Domains
- Zone Awareness
OpenSearch Hot / UltraWarm / Cold Storage
- Standard data nodes use “hot” storage
- instance stores or EBS volumes
- UltraWarm (“warm”) storage uses S3 + caching
- Cold storage
- uses S3
- must have dedicated master nodes and UltraWarm enabled too
- Data may be migrated between different storage types
OpenSearch Index State Management
- Automates index management policies
- Example
- delete old indices after a period of time
- move indices from hot -> UltraWarm -> cold storage over time
- Automate index snapshots
- ISM policies are run every 30-48 minutes
- Index rollups
- periodically roll up old data into summarized indices
- saves storage costs
- new index may have fewer fields, coarser time buckets
- Index transforms
- create a different view to analyze data differently
- groupings and aggregations
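A hedged sketch of an ISM policy, written as a Python dict, that keeps indices "hot" for 30 days and then deletes them. The field names follow the ISM policy schema as I understand it; verify against your OpenSearch version before using.

```python
# Hypothetical ISM policy: hot for 30 days, then delete.
ism_policy = {
    "policy": {
        "description": "Delete indices older than 30 days",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "30d"}}
                ],
            },
            {
                "name": "delete",
                "actions": [{"delete": {}}],
                "transitions": [],
            },
        ],
    }
}
```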
OpenSearch Cross Cluster Replication
- replicate indices / mappings / metadata across domains
- replicate data geographically for better latency
- “follower” index pulls data from “leader” index
- With cross-cluster replication, we index data to a leader index and OpenSearch replicates that data to one or more read-only follower indices
- “remote reindex” allows copying indices from one cluster to another on demand
OpenSearch Stability
- 3 dedicated master nodes is best
- avoids “split brain”
- do not run out of disk space
- minimum storage requirement is roughly : source data * (1 + number of replicas) * 1.45 (see the sizing sketch after this list)
- Choosing the number of shards
- Choosing instance types
- at least 3 nodes
- mostly about storage requirements
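A quick worked example of the minimum-storage rule of thumb above, with a hypothetical 100 GB of source data and one replica.

```python
# Rough minimum-storage estimate: source data * (1 + number of replicas) * 1.45
source_data_gb = 100      # hypothetical amount of source data
num_replicas = 1

min_storage_gb = source_data_gb * (1 + num_replicas) * 1.45
print(f"~{min_storage_gb:.0f} GB of storage needed")   # ~290 GB
```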
OpenSearch Security
- resource-based policies
- identity based policies
- VPC
- Cognito
OpenSearch Anti Pattern
- OLTP
- ad-hoc data querying
- OpenSearch is primarily for search and analytics
OpenSearch Performance
- memory pressure in the JVM can result if
- unbalanced shard allocations across nodes
- too many shards in a cluster
- Fewer shards can yield better performance if JVMMemoryPressure errors are encountered
- delete old or unused indices
Amazon Athena
- serverless
- interactive SQL query service for data in S3
- Presto under the hood
- Supports many data formats
- CSV, JSON, ORC, Parquet, Avro
- unstructured, semi-structured or structured
Amazon Athena Use Cases
- ad-hoc queries of web logs
- querying staging data before loading into Redshift
- analyze CloudTrail / CloudFront / VPC logs in S3
- integration with Jupyter, Zeppelin, RStudio, QuickSight and other visualization tools
Athena Workgroups
- can organize users / teams / apps / workloads into WORKGROUPS
- can control query access and track costs by Workgroups
- Each workgroup has its own
- query history
- data limits
- IAM policies
- encryption settings
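A minimal boto3 sketch of creating a workgroup with its own result location and a per-query scan limit. The workgroup name, output bucket, and the 1 GB cutoff are placeholders.

```python
# Hedged sketch: create an Athena workgroup with a per-query bytes-scanned cap.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="analytics-team",                                        # hypothetical name
    Description="Workgroup for the analytics team",
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
        "BytesScannedCutoffPerQuery": 1024 ** 3,                  # cap each query at ~1 GB scanned
    },
)
```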
Athena Cost Model
- Pay as you go
- $5 per TB scanned
- successful or cancelled queries count; failed queries do not count
- No charge for DDL (CREATE/ALTER/DROP etc)
- Save lots of money by using columnar formats
- ORC, Parquet
- save 30-90% and get better performance
Athena Security
- Transport Layer Security (TLS) encrypts in-transit between Athena and S3
Athena Anti Pattern
- Highly formatted reports / visualization
- QuickSight better
- ETL
- use Glue instead
Athena Optimized Performance
- Use columnar data (ORC, Parquet)
- a small number of large files performs better than a large number of small files
- Use partitions
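One common way to apply all three tips at once is a CTAS query that converts raw data into partitioned Parquet. The sketch below runs such a query through boto3; database, table, column, and bucket names are placeholders.

```python
# Hedged sketch: convert a raw table into partitioned Parquet with Athena CTAS.
import boto3

athena = boto3.client("athena")

CTAS = """
CREATE TABLE logs_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/logs-parquet/',
  partitioned_by = ARRAY['year']
) AS
SELECT request_id, status, bytes_sent, year   -- partition column must come last
FROM logs_raw;
"""

athena.start_query_execution(
    QueryString=CTAS,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```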
Athena ACID transactions
- Powered by Apache Iceberg
- Just add ‘table_type’ = ‘ICEBERG’ in the CREATE TABLE statement (see the sketch below)
- concurrent users can safely make row-level modifications
- compatible with EMR, Spark, anything that supports the Iceberg format
- removes need for custom record locking
- time travel operations
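A hedged sketch of the DDL for an Athena Iceberg (ACID) table, kept as a Python constant. Database, table, column, and bucket names are placeholders.

```python
# Hypothetical Athena DDL for an Iceberg-backed ACID table.
ICEBERG_DDL = """
CREATE TABLE my_database.orders_iceberg (
  order_id    string,
  order_total double,
  order_date  date
)
PARTITIONED BY (order_date)
LOCATION 's3://my-bucket/orders-iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
"""
```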
Amazon Redshift
- Fully managed, petabyte scale data warehouse
- Designed for OLAP not OLTP
- Cost effective
- SQL, ODBC, JDBC interfaces
- Scale up or down on demand
- Built in replication and backups
- Monitoring via CloudWatch / CloudTrail
- Query exabytes of unstructured data in S3 without loading
- limitless concurrency
- Horizontal scaling
- Separate compute and storage resources
- Wide variety of data formats
- Supports gzip and Snappy compression
Redshift Use Cases
- Accelerate analytics workloads
- Unified data warehouse and data lake
- Data warehouse modernization
- Analyze global sales data
- Store historical stock trade data
- Analyze ad impressions and clicks
- Aggregate gaming data
- Analyze social trends
Redshift Performance
- Massively Parallel Processing
- Columnar Data Storage
- Column Compression
Redshift Durability
- Replication within cluster
- Backup to S3 (asynchronously replicated to another region)
- Automated snapshots
- Failed drives / nodes automatically replaced
- However, limited to a single availability zone
Redshift Scaling
- vertical and horizontal scaling on demand
- during scaling
- a new cluster is created while your old one remains available for reads
- CNAME is flipped to the new cluster (a few minutes of downtime)
- data moved in parallel to new compute nodes
- concurrency scaling
- automatically adds cluster capacity to handle increase in concurrent read queries
- support virtually unlimited concurrent users and queries
Redshift Distribution Styles
- AUTO (Redshift figures it out based on size of data)
- EVEN (rows distributed across slices in round-robin)
- KEY (rows distributed based on one column)
- ALL (entire table is copied to every node)
Redshift Sort Key
- rows are stored on disk in sorted order based on the column you designate as a sort key
- like an index
- makes for fast range queries
- choosing a sort key
- single vs compound vs interleaved
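A minimal sketch of Redshift DDL that combines a KEY distribution style with a compound sort key; the table and column names are placeholders chosen for illustration.

```python
# Hypothetical Redshift DDL showing DISTSTYLE/DISTKEY and a compound SORTKEY.
CREATE_SALES = """
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  sale_date   DATE,
  amount      DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)          -- rows with the same customer land on the same slice
COMPOUND SORTKEY (sale_date);  -- fast range queries on the date column
"""
```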
Redshift Importing Exporting Data
- COPY command
- parallelized and efficient
- from S3, EMR, DynamoDB, or a remote host
- S3 loads need an IAM role (or access keys); an optional manifest file can list the exact files to load
- UNLOAD command
- unload from a table into files in S3
Redshift COPY Command
- Use COPY to load large amounts of data from outside of Redshift
- If your data is already in Redshift in another table,
- use INSERT INTO … SELECT
- or CREATE TABLE AS
- COPY can decrypt data as it is loaded from S3
- hardware-accelerated SSL used to keep it fast
- gzip, lzop and bzip2 compression supported to speed it up further
- automatic compression option
- analyzes the data and figures out the optimal compression scheme for storing it
- Special Use Case : narrow tables (lots of rows, few columns)
- load with a single COPY transaction if possible
- otherwise hidden metadata columns consume too much space
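Hedged sketches of a COPY (loading gzip-compressed CSV from S3 via a manifest) and a matching UNLOAD (exporting query results back to S3 as Parquet). Bucket names, the IAM role ARN, and table names are placeholders.

```python
# Hypothetical Redshift COPY and UNLOAD statements, kept as Python constants.
COPY_CMD = """
COPY sales
FROM 's3://my-bucket/sales/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST
GZIP
FORMAT AS CSV;
"""

UNLOAD_CMD = """
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2023-01-01''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
PARQUET;
"""
```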
Redshift DBLINK
- Connect Redshift to PostgreSQL
- Good way to copy and sync data between PostgreSQL and Redshift
Redshift Workload Management
- Prioritize short, fast queries vs long, slow queries
- Creates up to 8 queues
- default 5 queues with even memory allocation
- configuring query queue
- priority
- concurrency scaling mode
- user groups
- query groups
- query monitoring rules
Redshift Manual Workload Management
- One default queue with concurrency level of 5 (5 queries at once)
- Superuser queue with concurrency level 1
- Define up to 8 queues, up to concurrency level 50
Redshift Short Query Acceleration (SQA)
- Prioritize short-running queries over long running ones
- Short queries run in a dedicated space, won’t wait in queue behind long queries
- Can be used in place of WLM queues for short queries
- can configure how many seconds counts as “short”
Redshift Resizing Clusters
- Elastic Resize
- quickly add or remove nodes of same type
- cluster is down for a few mins
- Classic Resize
- change node type or number of nodes
- cluster is read-only for hours to days
- Snapshot, restore, resize
- used to keep cluster available during a classic resize
Redshift VACUUM
- recovers space from deleted rows
- VACUUM FULL
- Sorts the specified table and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations
- VACUUM DELETE ONLY
- Reclaims disk space without sorting
- VACUUM SORT ONLY
- Sorts the specified table without reclaiming disk space
- VACUUM REINDEX
- Analyzes the distribution of values in the sort key columns, then performs a full VACUUM
Redshift New Features
- RA3 nodes with managed storage
- enable independent scaling of compute and storage
- SSD-based
- Redshift Data Lake Export
- unload Redshift query results to S3 in Apache Parquet format
- Parquet is 2x faster to unload and consumes up to 6x less storage
- Spatial data types
Redshift AQUA
- Advanced query accelerator
- pushes reduction and aggregation queries closer to the data
- up to 10x faster, no extra cost, no code changes
- benefits from high-bandwidth connection to s3
Redshift Anti Pattern
- small data sets
- OLTP
- unstructured data
- BLOB data
Redshift Security
- Using a Hardware Security Module (HSM)
- must use a client and server certificate to configure a trusted connection between Redshift and HSM
Redshift Serverless
- Automatic scaling and provisioning for your workload
- Optimizes costs and performance
- Uses ML to maintain performance across variable and sporadic workloads
- Easy to spin up dev and test environments
- Easy ad-hoc business analysis
Redshift Monitoring
- Monitoring views
- SYS_QUERY_HISTORY
- SYS_LOAD_HISTORY
- SYS_SERVERLESS_USAGE
- CloudWatch logs
- CloudWatch metrics
Amazon RDS
- Hosted relational database
- Aurora, MySQL, PostgreSQL, Oracle, etc
- Not for big data
RDS ACID
- Atomicity
- Consistency
- Isolation
- Durability
Amazon Aurora
- MySQL and PostgreSQL compatible
- up to 5x faster than MySQL, 3x faster than PostgreSQL
- 1/10 the cost of commercial database
- Up to 64TB per database instance
- Up to 15 read replicas
- Continuous backup to s3
- Replication across availability zones
- Automatic scaling with Aurora Serverless
Aurora Security
- VPC
- Encryption at rest : KMS
- Encryption in flight : SSL
Amazon QuickSight
- Business analytics service
- allows all users to
- build visualizations
- perform ad-hoc analysis
- quickly get business insights from data
- serverless
QuickSight SPICE
- Data sets are imported into SPICE
- super-fast, parallel, in-memory calculation engine
- uses columnar storage, in-memory processing, and machine code generation
- accelerates interactive queries on large data sets
- each user gets 10GB of SPICE
- highly available and durable
- scales to hundreds of thousands of users
QuickSight Use Cases
- Interactive ad-hoc exploration / visualization of data
- dashboards and KPIs
- Analyze / visualize data from
- logs in s3
- on-premise databases
- AWS (RDS, Redshift, Athena, S3)
- SaaS applications such as Salesforce
QuickSight Anti Pattern
- highly formatted canned reports
- ETL
QuickSight Security
- VPC
- Multi-Factor Authentication
- Row-level security
- Column-level security (Enterprise edition only)
QuickSight + Redshift Security
- By default QuickSight can only access data stored in the same region as the one QuickSight itself is running in
- Problem : QuickSight in region A, Redshift in region B
- Solution : create a new security group with an inbound rule authorizing access from the IP range of QuickSight servers in that region
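A hedged boto3 sketch of that solution: authorize inbound access to the Redshift port from the QuickSight IP range of the relevant region. The security group ID and CIDR are placeholders; look up the actual QuickSight IP range for your region.

```python
# Hypothetical: open Redshift's port (5439) to a QuickSight IP range.
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",                 # security group attached to the Redshift cluster
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,                           # default Redshift port
        "ToPort": 5439,
        "IpRanges": [{
            "CidrIp": "203.0.113.0/27",             # placeholder QuickSight IP range
            "Description": "QuickSight access",
        }],
    }],
)
```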
QuickSight User Management
- Users defined via IAM or email signup
- Active Directory connector with QuickSight Enterprise Edition
QuickSight Pricing
- Annual Subscription
- Standard : $9 / month / user
- Enterprise $18 / month / user
- Extra SPICE capacity
- $0.25 (Standard) / $0.38 (Enterprise) per GB per user per month
QuickSight Dashboards
- read only snapshots of an analysis
- can share with others with QuickSight access
- can share even more widely with embedded dashboards
- embed within an application
QuickSight Machine Learning Insights
- ML powered anomaly detection
- ML powered forecasting
- Autonarratives