Analysis Flashcards
Amazon Machine Learning
- Provides visualization tools and wizards to make creating a model easy
- Fully managed
- Outdated now (deprecated; no longer available to new users)
Amazon Machine Learning Cost Model
- Charged for compute time
Amazon Machine Learning Promises
- No downtime
- Up to 100GB training data
- Up to 5 simultaneous jobs
Amazon Machine Learning Anti Pattern
- Terabyte-scale data
- Unsupported learning tasks
- sequence prediction
- unsupervised clustering
- deep learning
AWS SageMaker
- Build, Train and Deploy models
- TensorFlow, Apache MXNet
- GPU accelerated deep learning
- Scaling effectively unlimited
- hyperparameter tuning jobs
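As a concrete illustration, here is a minimal boto3 sketch of launching a SageMaker training job on a GPU instance. The job name, role ARN, image URI, and S3 paths are placeholders, not values from these notes.

```python
# Hedged sketch: start a SageMaker training job via boto3 (placeholder names/ARNs).
import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="demo-training-job",                         # hypothetical name
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",      # placeholder role
    AlgorithmSpecification={
        "TrainingImage": "<ecr-training-image-uri>",             # placeholder image URI
        "TrainingInputMode": "File",
    },
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",                    # placeholder bucket
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.p3.2xlarge",                         # GPU instance for deep learning
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```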
AWS SageMaker Security
- Code stored in “ML storage volumes”
- All artifacts encrypted in transit and at rest
- API and console secured by SSL
- KMS integration for SageMaker notebook, training jobs, endpoints
Deep Learning on EC2 / EMR
- EMR supports Apache MXNet and GPU Instance types
- Appropriate instance types for deep learning
- P3 : up to 8 Tesla V100 GPUs
- P2 : up to 16 K80 GPUs
- G3 : up to 4 M60 GPUs
- Deep Learning AMIs
AWS Data Pipeline
- Manages task dependencies
- Retries and notifies on failures
- Highly available
- Destination : S3, RDS, DynamoDB, Redshift, EMR
Kinesis Data Analytics
- Fully managed and serverless
- Transform and analyze streaming data in real time with Apache Flink
- Reference tables provide an inexpensive way to join streaming data for quick lookups
- Uses Flink under the hood
- Flink is a framework for processing data streams
- Kinesis Data Analytics integrates Flink with AWS
- Use Cases : Continuous metric generation, responsive real-time analytics, etc
- 1 KPU = 1 vCPU and 4 GB of memory
Kinesis Data Analytics + Lambda
- Post processing
- aggregating rows, translating to different formats, transforming and enriching data
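A hedged sketch of what such a post-processing Lambda might look like. The record shape used here (a list of records with a recordId and base64-encoded data, answered with a result of "Ok") is an assumption modeled on the Firehose-style transformation contract; verify the actual event format for your application.

```python
# Hypothetical Lambda handler that decodes, enriches, and re-encodes streaming records.
import base64
import json

def handler(event, context):
    output = []
    for record in event.get("records", []):
        # Decode the base64 payload into a dict (assumes JSON records)
        payload = json.loads(base64.b64decode(record["data"]))

        # Example enrichment / format translation (hypothetical field)
        payload["processed"] = True

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```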
Kinesis Data Analytics Use Cases
- Streaming ETL
- Continuous metric generation
- Responsive real-time analytics
RANDOM_CUT_FOREST
- SQL function used for anomaly detection on numeric columns in a stream
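A sketch of the kind of SQL a Kinesis Data Analytics (SQL) application runs to score anomalies with RANDOM_CUT_FOREST, kept here as a Python constant for reference. The stream and column names are placeholders.

```python
# Hypothetical in-application SQL: score a numeric column and emit ANOMALY_SCORE.
ANOMALY_SQL = """
CREATE OR REPLACE STREAM "ANOMALY_STREAM" ("price" DOUBLE, "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
  INSERT INTO "ANOMALY_STREAM"
  SELECT STREAM "price", "ANOMALY_SCORE"
  FROM TABLE(RANDOM_CUT_FOREST(
      CURSOR(SELECT STREAM "price" FROM "SOURCE_SQL_STREAM_001")));
"""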
Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
- A fork of Elasticsearch and Kibana
- A search engine
- Fully managed
- Scale up and down without downtime
OpenSearch Use Cases
- Full text search
- Log analytics
- Application monitoring
- Security analytics
- Clickstream analytics
OpenSearch Concepts
- Documents
- docs are hashed to a particular shard
- Indices
- An index is split into primary shards, each with replicas (e.g., 2)
- Applications should round-robin requests amongst nodes
- Write requests are routed to primary shard, then replicated
- Read requests are routed to primary or any replicas
OpenSearch Options
- Dedicated master node(s)
- Choice of count and instance types
- Domains
- Zone Awareness
OpenSearch Hot / UltraWarm / Cold Storage
- Standard data nodes use “hot” storage
- instance stores or EBS volumes
- UltraWarm (“warm”) storage uses S3 + caching
- Cold storage
- uses S3
- must have dedicated master nodes and UltraWarm enabled too
- Data may be migrated between different storage types
OpenSearch Index State Management
- Automates index management policies
- Example
- delete old indices after a period of time
- move indices from hot -> UltraWarm -> cold storage over time
- Automate index snapshots
- ISM policies are run every 30-48 minutes
- Index rollups
- periodically roll up old data into summarized indices
- saves storage costs
- new index may have fewer fields, coarser time buckets
- Index transforms
- create a different view to analyze data differently
- groupings and aggregations
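A hedged sketch of an ISM policy, written as a Python dict, that keeps indices "hot" for 30 days and then deletes them. The field names follow the ISM policy schema as I understand it; verify against your OpenSearch version before using.

```python
# Hypothetical ISM policy: hot for 30 days, then delete.
ism_policy = {
    "policy": {
        "description": "Delete indices older than 30 days",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "30d"}}
                ],
            },
            {
                "name": "delete",
                "actions": [{"delete": {}}],
                "transitions": [],
            },
        ],
    }
}
```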
OpenSearch Cross Cluster Replication
- replicate indices / mappings / metadata across domains
- replicate data geographically for better latency
- “follower” index pulls data from “leader” index
- With cross-cluster replication, we index data to a leader index and OpenSearch replicates that data to one or more read-only follower indices
- “remote reindex” allows copying indices from one cluster to another on demand
OpenSearch Stability
- 3 dedicated master nodes is best
- avoids “split brain”
- do not run out of disk space
- minimum storage requirement is roughly : source data * (1 + number of replicas) * 1.45 (see the sizing sketch after this list)
- Choosing the number of shards
- Choosing instance types
- at least 3 nodes
- mostly about storage requirements
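A quick worked example of the minimum-storage rule of thumb above, with a hypothetical 100 GB of source data and one replica.

```python
# Rough minimum-storage estimate: source data * (1 + number of replicas) * 1.45
source_data_gb = 100      # hypothetical amount of source data
num_replicas = 1

min_storage_gb = source_data_gb * (1 + num_replicas) * 1.45
print(f"~{min_storage_gb:.0f} GB of storage needed")   # ~290 GB
```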
OpenSearch Security
- resource-based policies
- identity based policies
- VPC
- Cognito
OpenSearch Anti Pattern
- OLTP
- ad-hoc data querying
- OpenSearch is primarily for search and analytics
OpenSearch Performance
- memory pressure in the JVM can result if
- unbalanced shard allocations across nodes
- too many shards in a cluster
- Fewer shards can yield better performance if JVMMemoryPressure errors are encountered
- delete old or unused indices
Amazon Athena
- serverless
- interactive SQL query service for data in S3
- Presto under the hood
- Supports many data formats
- CSV, JSON, ORC, Parquet, Avro
- unstructured, semi-structured or structured
Amazon Athena Use Cases
- ad-hoc queries of web logs
- querying staging data before loading into Redshift
- analyze CloudTrail / CloudFront / VPC logs in S3
- integration with Jupyter, Zeppelin, RStudio, QuickSight and other visualization tools
Athena Workgroups
- can organize users / teams / apps / workloads into WORKGROUPS
- can control query access and track costs by Workgroups
- Each workgroup has its own
- query history
- data limits
- IAM policies
- encryption settings
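A minimal boto3 sketch of creating a workgroup with its own result location and a per-query scan limit. The workgroup name, output bucket, and the 1 GB cutoff are placeholders.

```python
# Hedged sketch: create an Athena workgroup with a per-query bytes-scanned cap.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="analytics-team",                                        # hypothetical name
    Description="Workgroup for the analytics team",
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
        "BytesScannedCutoffPerQuery": 1024 ** 3,                  # cap each query at ~1 GB scanned
    },
)
```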
Athena Cost Model
- Pay as you go
- $5 per TB scanned
- successful or cancelled queries count; failed queries do not count
- No charge for DDL (CREATE/ALTER/DROP etc)
- Save lots of money by using columnar formats
- ORC, Parquet
- save 30-90% and get better performance
Athena Security
- Transport Layer Security (TLS) encrypts in-transit between Athena and S3
Athena Anti Pattern
- Highly formatted reports / visualization
- QuickSight better
- ETL
- use Glue instead
Athena Optimized Performance
- Use columnar data (ORC, Parquet)
- a small number of large files performs better than a large number of small files
- Use partitions
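One common way to apply all three tips at once is a CTAS query that converts raw data into partitioned Parquet. The sketch below runs such a query through boto3; database, table, column, and bucket names are placeholders.

```python
# Hedged sketch: convert a raw table into partitioned Parquet with Athena CTAS.
import boto3

athena = boto3.client("athena")

CTAS = """
CREATE TABLE logs_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/logs-parquet/',
  partitioned_by = ARRAY['year']
) AS
SELECT request_id, status, bytes_sent, year   -- partition column must come last
FROM logs_raw;
"""

athena.start_query_execution(
    QueryString=CTAS,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```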
Athena ACID transactions
- Powered by Apache Iceberg
- Just add ‘table_type’ = ‘ICEBERG’ in the CREATE TABLE statement (see the sketch below)
- concurrent users can safely make row-level modifications
- compatible with EMR, Spark, anything that supports the Iceberg format
- removes need for custom record locking
- time travel operations
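A hedged sketch of the DDL for an Athena Iceberg (ACID) table, kept as a Python constant. Database, table, column, and bucket names are placeholders.

```python
# Hypothetical Athena DDL for an Iceberg-backed ACID table.
ICEBERG_DDL = """
CREATE TABLE my_database.orders_iceberg (
  order_id    string,
  order_total double,
  order_date  date
)
PARTITIONED BY (order_date)
LOCATION 's3://my-bucket/orders-iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
"""
```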
Amazon Redshift
- Fully managed, petabyte scale data warehouse
- Designed for OLAP not OLTP
- Cost effective
- SQL, ODBC, JDBC interfaces
- Scale up or down on demand
- Built in replication and backups
- Monitoring via CloudWatch / CloudTrail
- Query exabytes of unstructured data in S3 without loading
- limitless concurrency
- Horizontal scaling
- Separate compute and storage resources
- Wide variety of data formats
- Supports gzip and Snappy compression
Redshift Use Cases
- Accelerate analytics workloads
- Unified data warehouse and data lake
- Data warehouse modernization
- Analyze global sales data
- Store historical stock trade data
- Analyze ad impressions and clicks
- Aggregate gaming data
- Analyze social trends
Redshift Performance
- Massively Parallel Processing
- Columnar Data Storage
- Column Compression
Redshift Durability
- Replication within cluster
- Backup to S3 (asynchronously replicated to another region)
- Automated snapshots
- Failed drives / nodes automatically replaced
- However, limited to a single availability zone
Redshift Scaling
- vertical and horizontal scaling on demand
- during scaling
- a new cluster is created while your old one remains available for reads
- CNAME is flipped to the new cluster (a few minutes of downtime)
- data moved in parallel to new compute nodes
- concurrency scaling
- automatically adds cluster capacity to handle increase in concurrent read queries
- support virtually unlimited concurrent users and queries
Redshift Distribution Styles
- AUTO (Redshift figures it out based on size of data)
- EVEN (rows distributed across slices in round-robin)
- KEY (rows distributed based on one column)
- ALL (entire table is copied to every node)
Redshift Sort Key
- rows are stored on disk in sorted order based on the column you designate as a sort key
- like an index
- makes for fast range queries
- choosing a sort key
- single vs compound vs interleaved
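A minimal sketch of Redshift DDL that combines a KEY distribution style with a compound sort key; the table and column names are placeholders chosen for illustration.

```python
# Hypothetical Redshift DDL showing DISTSTYLE/DISTKEY and a compound SORTKEY.
CREATE_SALES = """
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  sale_date   DATE,
  amount      DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)          -- rows with the same customer land on the same slice
COMPOUND SORTKEY (sale_date);  -- fast range queries on the date column
"""
```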
Redshift Importing Exporting Data
- COPY command
- parallelized and efficient
- from S3, EMR, DynamoDB, or a remote host
- S3 loads need an IAM role (or access keys); an optional manifest file can list the exact files to load
- UNLOAD command
- unload from a table into files in S3
Redshift COPY Command
- Use COPY to load large amounts of data from outside of Redshift
- If your data is already in Redshift in another table,
- use INSERT INTO … SELECT
- or CREATE TABLE AS
- COPY can decrypt data as it is loaded from S3
- hardware-accelerated SSL used to keep it fast
- gzip, lzop and bzip2 compression supported to speed it up further
- automatic compression option
- analyzes the data and figures out the optimal compression scheme for storing it
- Special Use Case : narrow tables (lots of rows, few columns)
- load with a single COPY transaction if possible
- otherwise hidden metadata columns consume too much space
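Hedged sketches of a COPY (loading gzip-compressed CSV from S3 via a manifest) and a matching UNLOAD (exporting query results back to S3 as Parquet). Bucket names, the IAM role ARN, and table names are placeholders.

```python
# Hypothetical Redshift COPY and UNLOAD statements, kept as Python constants.
COPY_CMD = """
COPY sales
FROM 's3://my-bucket/sales/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST
GZIP
FORMAT AS CSV;
"""

UNLOAD_CMD = """
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2023-01-01''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
PARQUET;
"""
```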
Redshift DBLINK
- Connect Redshift to PostgreSQL
- Good way to copy and sync data between PostgreSQL and Redshift
Redshift Workload Management
- Prioritize short, fast queries vs long, slow queries
- Creates up to 8 queues
- default 5 queues with even memory allocation
- configuring query queue
- priority
- concurrency scaling mode
- user groups
- query groups
- query monitoring rules
Redshift Manual Workload Management
- One default queue with concurrency level of 5 (5 queries at once)
- Superuser queue with concurrency level 1
- Define up to 8 queues, up to concurrency level 50
Redshift Short Query Acceleration (SQA)
- Prioritize short-running queries over long running ones
- Short queries run in a dedicated space, won’t wait in queue behind long queries
- Can be used in place of WLM queues for short queries
- can configure how many seconds counts as “short”
Redshift Resizing Clusters
- Elastic Resize
- quickly add or remove nodes of same type
- cluster is down for a few mins
- Classic Resize
- change node type or number of nodes
- cluster is read-only for hours to days
- Snapshot, restore, resize
- used to keep cluster available during a classic resize
Redshift VACUUM
- recovers space from deleted rows
- VACUUM FULL
- Sorts the specified table and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations
- VACUUM DELETE ONLY
- Reclaims disk space without sorting
- VACUUM SORT ONLY
- Sorts the specified table without reclaiming disk space
- VACUUM REINDEX
- Analyzes the distribution of values in the sort key columns, then performs a full VACUUM
Redshift New Features
- RA3 nodes with managed storage
- enable independent scaling of compute and storage
- SSD-based
- Redshift Data Lake Export
- unload Redshift query results to S3 in Apache Parquet format
- Parquet is 2x faster to unload and consumes up to 6x less storage
- Spatial data types
Redshift AQUA
- Advanced query accelerator
- pushes reduction and aggregation queries closer to the data
- up to 10x faster, no extra cost, no code changes
- benefits from high-bandwidth connection to s3
Redshift Anti Pattern
- small data sets
- OLTP
- unstructured data
- BLOB data
Redshift Security
- Using a Hardware Security Module (HSM)
- must use a client and server certificate to configure a trusted connection between Redshift and HSM
Redshift Serverless
- Automatic scaling and provisioning for your workload
- Optimizes costs and performance
- Uses ML to maintain performance across variable and sporadic workloads
- Easy to spin up dev and test environments
- Easy ad-hoc business analysis
Redshift Monitoring
- Monitoring views
- SYS_QUERY_HISTORY
- SYS_LOAD_HISTORY
- SYS_SERVERLESS_USAGE
- CloudWatch logs
- CloudWatch metrics
Amazon RDS
- Hosted relational database
- Aurora, MySQL, PostgreSQL, Oracle, etc
- Not for big data
RDS ACID
- Atomicity
- Consistency
- Isolation
- Durability
Amazon Aurora
- MySQL and PostgreSQL compatible
- up to 5x faster than MySQL, 3x faster than PostgreSQL
- 1/10 the cost of commercial database
- Up to 64TB per database instance
- Up to 15 read replicas
- Continuous backup to s3
- Replication across availability zones
- Automatic scaling with Aurora Serverless
Aurora Security
- VPC
- Encryption at rest : KMS
- Encryption in flight : SSL
Amazon QuickSight
- Business analytics service
- allows all users to
- build visualizations
- perform ad-hoc analysis
- quickly get business insights from data
- serverless
QuickSight SPICE
- Data sets are imported into SPICE
- super-fast, parallel, in-memory calculation engine
- uses columnar storage, in-memory processing, and machine code generation
- accelerates interactive queries on large data sets
- each user gets 10GB of SPICE
- highly available and durable
- scales to hundreds of thousands of users
QuickSight Use Cases
- Interactive ad-hoc exploration / visualization of data
- dashboards and KPIs
- Analyze / visualize data from
- logs in s3
- on-premise databases
- AWS (RDS, Redshift, Athena, S3)
- SaaS applications such as Salesforce
QuickSight Anti Pattern
- highly formatted canned reports
- ETL
QuickSight Security
- VPC
- Multi-Factor Authentication
- Row-level security
- Column-level security (Enterprise edition only)
QuickSight + Redshift Security
- By default QuickSight can only access data stored in the same region as the one QuickSight itself is running in
- Problem : QuickSight in region A, Redshift in region B
- Solution : create a new security group with an inbound rule authorizing access from the IP range of QuickSight servers in that region
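A hedged boto3 sketch of that solution: authorize inbound access to the Redshift port from the QuickSight IP range of the relevant region. The security group ID and CIDR are placeholders; look up the actual QuickSight IP range for your region.

```python
# Hypothetical: open Redshift's port (5439) to a QuickSight IP range.
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",                 # security group attached to the Redshift cluster
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,                           # default Redshift port
        "ToPort": 5439,
        "IpRanges": [{
            "CidrIp": "203.0.113.0/27",             # placeholder QuickSight IP range
            "Description": "QuickSight access",
        }],
    }],
)
```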
QuickSight User Management
- Users defined via IAM or email signup
- Active Directory connector with QuickSight Enterprise Edition
QuickSight Pricing
- Annual Subscription
- Standard : $9 / month / user
- Enterprise $18 / month / user
- Extra SPICE capacity
- $0.25 (Standard) / $0.38 (Enterprise) per GB per user per month
QuickSight Dashboards
- read only snapshots of an analysis
- can share with others with QuickSight access
- can share even more widely with embedded dashboards
- embed within an application
QuickSight Machine Learning Insights
- ML powered anomaly detection
- ML powered forecasting
- Autonarratives