Analysis Flashcards
1
Q
Amazon Machine Learning
A
- Provides visualization tools and wizards to make creating a model easy
- Fully managed
- Outdated now (no longer available to new users; superseded by SageMaker)
2
Q
Amazon Machine Learning Cost Model
A
- Pay-as-you-go: charged for compute time (data analysis and model building) and per prediction
3
Q
Amazon Machine Learning Promises
A
- No downtime
- Up to 100 GB of training data
- Up to 5 simultaneous jobs
4
Q
Amazon Machine Learning Anti Pattern
A
- Terabyte-scale data
- Unsupported learning tasks
  - sequence prediction
  - unsupervised clustering
  - deep learning
5
Q
AWS SageMaker
A
- Build, train, and deploy models
- TensorFlow, Apache MXNet
- GPU-accelerated deep learning
- Scaling is effectively unlimited
- Hyperparameter tuning jobs
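A minimal sketch (not from the card) of the build/train/deploy flow with the SageMaker Python SDK; the role ARN, S3 paths, and the choice of built-in algorithm are placeholder assumptions.

```python
# Minimal sketch: train and deploy a built-in algorithm with the SageMaker
# Python SDK. Role ARN, bucket names, and hyperparameters are placeholders.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

# Resolve the container image for a built-in algorithm (XGBoost here)
image_uri = image_uris.retrieve("xgboost", region=session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Train on data already staged in S3, then deploy a real-time endpoint
estimator.fit({"train": "s3://my-bucket/train/"})
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```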
6
Q
AWS SageMaker Security
A
- Code stored in “ML storage volumes”
- All artifacts encrypted in transit and at rest
- API and console secured by SSL
- KMS integration for SageMaker notebook, training jobs, endpoints
7
Q
Deep Learning on EC2 / EMR
A
- EMR supports Apache MXNet and GPU instance types
- Appropriate instance types for deep learning
  - P3: 8 Tesla V100 GPUs
  - P2: 16 K80 GPUs
  - G3: 4 M60 GPUs
- Deep Learning AMIs
8
Q
AWS Data Pipeline
A
- Manages task dependencies
- Retries and notifies on failures
- Highly available
- Destinations: S3, RDS, DynamoDB, Redshift, EMR
9
Q
Kinesis Data Analytics
A
- Fully managed and serverless
- Transform and analyze streaming data in real time with Apache Flink
- Reference tables provide an inexpensive way to join streaming data against static data for quick lookups
- Uses Flink under the hood
  - Flink is a framework for processing data streams
  - Kinesis Data Analytics integrates Flink with AWS
- Use cases: continuous metric generation, responsive real-time analytics, etc.
- 1 KPU (Kinesis Processing Unit) = 1 vCPU and 4 GB of memory
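A hedged boto3 sketch of creating a Flink-based application; it assumes the Flink code is already packaged as a ZIP in S3, and the role, bucket, and application name are placeholders.

```python
# Minimal sketch: create a Flink-based Kinesis Data Analytics application
# via the kinesisanalyticsv2 API. All names/ARNs are placeholders.
import boto3

kda = boto3.client("kinesisanalyticsv2")

kda.create_application(
    ApplicationName="my-flink-app",                                    # placeholder
    RuntimeEnvironment="FLINK-1_15",
    ServiceExecutionRole="arn:aws:iam::123456789012:role/KdaAppRole",  # placeholder
    ApplicationConfiguration={
        "ApplicationCodeConfiguration": {
            "CodeContent": {
                "S3ContentLocation": {
                    "BucketARN": "arn:aws:s3:::my-code-bucket",        # placeholder
                    "FileKey": "flink-app.zip",
                }
            },
            "CodeContentType": "ZIPFILE",
        },
        "FlinkApplicationConfiguration": {
            "ParallelismConfiguration": {
                "ConfigurationType": "CUSTOM",
                "Parallelism": 2,          # each KPU = 1 vCPU + 4 GB memory
                "ParallelismPerKPU": 1,
                "AutoScalingEnabled": True,
            }
        },
    },
)
```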
10
Q
Kinesis Data Analytics + Lambda
A
- Post-processing
  - aggregating rows, translating to different formats, transforming and enriching data
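A minimal sketch of a post-processing Lambda, assuming the "Lambda as output" record contract (base64-encoded data in, per-record "Ok"/"DeliveryFailed" acknowledgements out); the enrichment logic is just illustrative.

```python
# Minimal sketch of a Kinesis Data Analytics output Lambda: each record's
# data arrives base64-encoded, and every recordId must be acknowledged.
import base64
import json

def handler(event, context):
    results = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Post-processing examples: translate format, enrich, aggregate, etc.
        enriched = {**payload, "processed": True}
        print(json.dumps(enriched))  # e.g. forward to another service here

        results.append({"recordId": record["recordId"], "result": "Ok"})
    return {"records": results}
```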
11
Q
Kinesis Data Analytics Use Cases
A
- Streaming ETL
- Continuous metric generation
- Responsive analysis
12
Q
RANDOM_CUT_FOREST
A
- SQL function used for anomaly detection on numeric columns in a stream
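The rough shape of such a query in a Kinesis Data Analytics SQL application (stream and column names are placeholders; kept as a Python string for consistency with the other snippets).

```python
# Illustrative RANDOM_CUT_FOREST query: emits an ANOMALY_SCORE per row for a
# numeric column in the input stream. Stream/column names are placeholders.
ANOMALY_DETECTION_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("metric" DOUBLE, "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM "metric", ANOMALY_SCORE
  FROM TABLE(
    RANDOM_CUT_FOREST(CURSOR(SELECT STREAM "metric" FROM "SOURCE_SQL_STREAM_001"))
  );
"""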
13
Q
Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
A
- A fork of Elasticsearch and Kibana
- A search engine
- Fully managed
- Scale up and down without downtime
14
Q
OpenSearch Use Cases
A
- Full text search
- Log analytics
- Application monitoring
- Security analytics
- Clickstream analytics
15
Q
OpenSearch Concepts
A
- Documents
  - docs are hashed to a particular shard
- Indices
  - an index has a primary shard and 2 replicas
- Applications should make requests round-robin amongst nodes
- Write requests are routed to the primary shard, then replicated
- Read requests are routed to the primary shard or to any replica
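A minimal sketch of the document/index model via the OpenSearch REST API using `requests`; the endpoint and credentials are placeholders (a real domain typically needs SigV4-signed requests or fine-grained access control credentials).

```python
# Index (write) a document, then search (read) it back over the REST API.
import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                  # placeholder

# Write: the document is hashed to a primary shard, then replicated
requests.put(
    f"{ENDPOINT}/movies/_doc/1",
    json={"title": "Interstellar", "year": 2014},
    auth=AUTH,
)

# Read: served by the primary shard or any replica
resp = requests.get(
    f"{ENDPOINT}/movies/_search",
    json={"query": {"match": {"title": "interstellar"}}},
    auth=AUTH,
)
print(resp.json()["hits"]["total"])
```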
16
Q
OpenSearch Options
A
- Dedicated master node(s)
  - choice of count and instance types
- Domains
- Zone Awareness
17
Q
OpenSearch Storage Tiers: Hot, UltraWarm, Cold
A
- Standard data uses “hot” storage
  - instance stores or EBS volumes
- UltraWarm (“warm”) storage uses S3 + caching
- Cold storage
  - also uses S3
  - requires a dedicated master node and UltraWarm enabled
- Data may be migrated between the different storage tiers
18
Q
OpenSearch Index State Management
A
- Automates index management policies
- Examples
  - delete old indices after a period of time
  - move indices from hot -> UltraWarm -> cold storage over time
  - automate index snapshots
- ISM policies are run every 30 to 48 minutes (random jitter, so they don’t all run at once)
- Index rollups
  - periodically roll up old data into summarized indices
  - saves storage costs
  - the new index may have fewer fields and coarser time buckets
- Index transforms
  - create a different view to analyze data differently
  - groupings and aggregations
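An illustrative ISM policy, assuming the Index State Management plugin’s `_plugins/_ism/policies` endpoint; the states, thresholds, and endpoint/credentials are placeholder examples.

```python
# Example ISM policy: move indices to warm storage after 7 days, delete
# after 90 days. Endpoint, auth, and thresholds are placeholders.
import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                  # placeholder

policy = {
    "policy": {
        "description": "Age out old indices",
        "default_state": "hot",
        "states": [
            {"name": "hot",
             "actions": [],
             "transitions": [{"state_name": "warm",
                              "conditions": {"min_index_age": "7d"}}]},
            {"name": "warm",
             "actions": [{"warm_migration": {}}],  # migrate to UltraWarm
             "transitions": [{"state_name": "delete",
                              "conditions": {"min_index_age": "90d"}}]},
            {"name": "delete",
             "actions": [{"delete": {}}],
             "transitions": []},
        ],
    }
}

requests.put(f"{ENDPOINT}/_plugins/_ism/policies/age-out-policy", json=policy, auth=AUTH)
```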
19
Q
OpenSearch Cross Cluster Replication
A
- replicate indices / mappings / metadata across domains
- replicate data geographically for better latency
- “follower” index pulls data from “leader” index
- With cross-cluster replication, you index data to the leader index and OpenSearch replicates it to one or more read-only follower indices
- “remote reindex” allows copying indices from one cluster to another on demand
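A hedged sketch of starting replication with the cross-cluster replication plugin API, run against the follower domain; it assumes a cross-cluster connection named "my-connection" already exists, and the endpoint, index names, and roles are placeholders.

```python
# Start pulling a leader index into a read-only follower index.
import requests

FOLLOWER_ENDPOINT = "https://follower-domain.us-west-2.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                                 # placeholder

requests.put(
    f"{FOLLOWER_ENDPOINT}/_plugins/_replication/follower-index/_start",
    json={
        "leader_alias": "my-connection",   # name of the cross-cluster connection
        "leader_index": "leader-index",
        "use_roles": {
            "leader_cluster_role": "all_access",
            "follower_cluster_role": "all_access",
        },
    },
    auth=AUTH,
)
```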
20
Q
OpenSearch Stability
A
- 3 dedicated master nodes is best
  - avoids “split brain”
- Do not run out of disk space
  - minimum storage requirement is roughly: source data * (1 + number of replicas) * 1.45
- Choosing the number of shards
- Choosing instance types
  - at least 3 nodes
  - mostly about storage requirements
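A worked example of the card’s storage rule of thumb; the input sizes are just illustrative numbers.

```python
# Rule of thumb from the card: source data * (1 + replicas) * 1.45
def minimum_storage_gb(source_data_gb: float, num_replicas: int) -> float:
    """Rough minimum storage needed across the cluster, in GB."""
    return source_data_gb * (1 + num_replicas) * 1.45

# e.g. 100 GB of source data with 1 replica -> roughly 290 GB of storage
print(minimum_storage_gb(100, 1))  # 290.0
```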
21
Q
OpenSearch Security
A
- Resource-based policies
- Identity-based policies
- VPC support
- Cognito integration
22
Q
OpenSearch Anti Pattern
A
- OLTP
- ad-hoc data querying
- OpenSearch is primarily for search and analytics
23
Q
OpenSearch Performance
A
- Memory pressure in the JVM can result from
  - unbalanced shard allocations across nodes
  - too many shards in a cluster
- Fewer shards can yield better performance if JVMMemoryPressure errors are encountered
- Delete old or unused indices
24
Q
Amazon Athena
A
- Serverless
- Interactive query service for data in S3 (SQL)
- Presto under the hood
- Supports many data formats
  - CSV, JSON, ORC, Parquet, Avro
  - unstructured, semi-structured, or structured data
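A minimal boto3 sketch of running an ad-hoc query against data in S3; the database, table, and result bucket are placeholders.

```python
# Run an Athena query, poll until it finishes, then print the result rows.
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "my_logs_db"},                   # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```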
25
Q
Amazon Athena Use Cases
A
- Ad-hoc queries of web logs
- Querying staging data before loading into Redshift
- Analyzing CloudTrail / CloudFront / VPC Flow Logs in S3
- Integration with Jupyter, Zeppelin, RStudio, QuickSight, and other visualization tools
26
Q
Athena Workgroups
A
- Can organize users / teams / apps / workloads into workgroups
- Can control query access and track costs by workgroup
- Each workgroup has its own
  - query history
  - data limits
  - IAM policies
  - encryption settings
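A minimal boto3 sketch of creating a workgroup with its own result location, encryption settings, and a per-query data-scanned limit; the names and buckets are placeholders.

```python
# Create a workgroup, then route a query to it.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="analytics-team",                                      # placeholder
    Description="Workgroup for the analytics team",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://analytics-team-results/",   # placeholder
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
        "BytesScannedCutoffPerQuery": 10 * 1024**3,  # 10 GB per-query limit
    },
)

athena.start_query_execution(
    QueryString="SELECT 1",
    WorkGroup="analytics-team",
    QueryExecutionContext={"Database": "my_logs_db"},  # placeholder
)
```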