Tutorial Dojo Flashcards
AWS Data Exchange
-3rd-party datasets in S3
-accessed via the GetDataSet API
Redshift concurrency scaling
and workload management
-handles concurrent users and unpredictable spikes, e.g. BI workloads
-WLM can set query priority
-manual WLM supports up to 8 queues, with a max of 50 concurrency slots across all queues
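The limits above can be sanity-checked against a sketch of a manual WLM config (queue names are hypothetical), in the shape of the JSON you would set via the wlm_json_configuration cluster parameter:

```python
# Sketch of a manual Redshift WLM configuration (queue names hypothetical),
# matching the JSON passed to the wlm_json_configuration cluster parameter.
wlm_config = [
    {"name": "bi_dashboards", "query_group": ["bi"], "query_concurrency": 15},
    {"name": "etl_jobs", "user_group": ["etl_users"], "query_concurrency": 10},
    {"name": "default", "query_concurrency": 5},
]

# Manual WLM caps: at most 8 queues, at most 50 slots summed across all queues.
total_slots = sum(q["query_concurrency"] for q in wlm_config)
```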
Athena workgroups
-organize and manage queries
-can use Apache Spark for analytics
-security and access control
AWS DataSync
-moves data from on-prem storage to AWS storage services like S3 or EFS
S3 event notification
-event type=ObjectCreated for example
-can trigger a Lambda function based on prefix/suffix filters, e.g. suffix .csv
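The notification flow above can be sketched as a minimal Lambda handler (bucket/key names hypothetical). S3 can already filter by suffix before invoking, so the in-code check is just defensive:

```python
import urllib.parse

def handler(event, context):
    """Minimal sketch of a Lambda invoked by an S3 ObjectCreated notification."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # object keys arrive URL-encoded in the event payload
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(".csv"):          # defensive double-check of the suffix filter
            processed.append((bucket, key))
    return processed

# Shape of a (trimmed) S3 notification event for local testing:
sample_event = {"Records": [{"s3": {"bucket": {"name": "my-bucket"},
                                    "object": {"key": "data/2024%2F01/report.csv"}}}]}
```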
AWS Glue for Ray
-Glue job type for scaling AI/Python workloads and native Python libraries with Ray
-Ray datasets are based on Apache Arrow
Amazon Managed Service for Apache Flink
-for real-time and time-series analysis
-sliding windows: fixed-length intervals that can overlap
Object lambda
-adds your code to GET requests, enabling real-time transformation as data is retrieved
-on the fly
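The on-the-fly idea can be sketched as the transformation an Object Lambda function would apply; the redaction function below is hypothetical, and the actual WriteGetObjectResponse call is left in comments so the sketch runs without AWS access:

```python
import re

def redact_emails(body: str) -> str:
    """Hypothetical on-the-fly transformation: mask anything that looks like
    an email address before the caller ever sees the object."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", body)

# In the real handler, roughly (boto3, sketched only):
#   original = fetch event["getObjectContext"]["inputS3Url"]
#   s3.write_get_object_response(
#       Body=redact_emails(original),
#       RequestRoute=event["getObjectContext"]["outputRoute"],
#       RequestToken=event["getObjectContext"]["outputToken"])
```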
AWS Graviton instances
-custom AWS-designed processors for the best price-performance for workloads
Lambda provisioned concurrency
-pre-initializes execution environments so functions scale without cold-start latency
Redshift data sharing
-share read access across clusters, workgroups, accounts, regions
-live data
Step Functions troubleshooting
-state machine fails to start at a step? check the state machine's IAM role
Glue’s sensitive data detection feature
-automatically recognizes PII and can redact it
S3 VPC gateway endpoint
-add a route in the route table with the endpoint as target for traffic destined to S3
Athena federated query
-connectors using lambda
-query NoSQL, SQL, Timestream, etc.
Kinesis real-time reporting with Redshift (streaming ingestion)
-create an external schema for the data stream
-create a materialized view referencing the schema, with auto refresh
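The two steps above look roughly like the SQL below (schema, role ARN, and stream names are hypothetical), held here as strings:

```python
# Sketch of Redshift streaming ingestion from a Kinesis data stream.
create_schema = """
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';
"""

# AUTO REFRESH YES keeps the view continuously updated with live stream data.
create_view = """
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."my-click-stream";
"""
```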
STL_ALERT_EVENT_LOG
-Redshift system view that helps identify performance issues and suggested fixes
Glue resource policy
-think Finance and HR each running their own ETL and accessing only their own databases
S3 access point
-for multiple application access
-for cross-account access
-works with bucket policy
MSCK REPAIR TABLE
-Athena command to run when new data lands in new Hive-style partitions of an existing table
-makes new partitions visible but does not necessarily speed up performance
EFS and Lambda
-Lambda can mount EFS file systems seamlessly
Improve Kinesis performance when processing
-add shards
-configure the parallelization factor
-register the Lambda function as a consumer with enhanced fan-out
-use exponential backoff and retries
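The backoff idea in the last bullet can be sketched as a small helper (the function and its defaults are illustrative choices, not a Kinesis API):

```python
import random

def backoff_delays(base: float = 0.1, cap: float = 5.0, attempts: int = 6):
    """Exponential backoff with full jitter: the wait window doubles each
    retry, is capped, and a random draw de-synchronizes competing consumers."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```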
Glue catalog partition predicates (frame)
&
Push down predicate
-server-side filtering during frame creation (before data is even loaded)
-faster than client-side filtering, where data is first loaded into memory
-push_down_predicate is similar, but prunes partitions after listing them rather than server-side in the catalog
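Both options take the same SQL-like predicate string over partition columns; a tiny builder (helper name hypothetical) plus the Glue call, sketched in comments so it runs without Glue libraries:

```python
def partition_predicate(**parts) -> str:
    """Hypothetical helper: build a Glue partition predicate string,
    e.g. "year='2024' and month='01'", from keyword arguments."""
    return " and ".join(f"{col}='{val}'" for col, val in parts.items())

# In a Glue job, roughly:
#   frame = glueContext.create_dynamic_frame.from_catalog(
#       database="sales", table_name="orders",
#       additional_options={
#           "catalogPartitionPredicate": partition_predicate(year="2024", month="01")})
# or pass the same string as push_down_predicate=... instead.
```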
Transient EMR clusters
-think batch jobs
-cluster is created for the job and terminated when it finishes
SQS settings
DelaySeconds -how long before a message becomes visible in the queue
VisibilityTimeout -prevents a message from being received/processed multiple times
maxReceiveCount -number of times a message can be received before being moved to the dead-letter queue
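These settings map onto the queue attribute map passed at creation (queue/DLQ names hypothetical; note maxReceiveCount lives inside the RedrivePolicy JSON, and all SQS attribute values are strings):

```python
import json

# Sketch of the Attributes map for sqs.create_queue (boto3 call in comments).
queue_attributes = {
    "DelaySeconds": "30",        # message hidden for 30s after being sent
    "VisibilityTimeout": "120",  # consumer has 120s to process before redelivery
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-dlq",
        "maxReceiveCount": "5",  # after 5 receives, message moves to the DLQ
    }),
}
# boto3: sqs.create_queue(QueueName="my-queue", Attributes=queue_attributes)

redrive = json.loads(queue_attributes["RedrivePolicy"])
```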
Athena notebooks
-interactive python coding environment
-execute Spark code interactively, with visualizations
CloudWatch container insights
-for microservices and container apps
OpenSearch storage tiers
-hot = fastest access, most expensive
-UltraWarm = less frequently accessed, cheaper
-cold = infrequent access; can be attached back to UltraWarm when needed
Stored procedures and Aurora
-a procedure in Aurora can trigger a Lambda, e.g. when a loan is approved
MSK Kafka ACLs
-think microservices
-control which apps can read/write which topics
CloudTrail data events vs management events
-data events = data-plane operations, e.g. S3 object-level PUTs
-management events = control-plane operations, e.g. deleting resources
Glue DataBrew masking techniques
-substitution = "aron" changed to "donny"
-probabilistic = different ciphertext each time for the same input
-nulling = replacing values with null / deleting them
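The three behaviors can be illustrated with toy stand-ins (these are not DataBrew's actual algorithms, just the property each technique describes):

```python
import secrets

def substitute(value: str, mapping: dict) -> str:
    """Substitution: deterministically swap a value for a stand-in."""
    return mapping.get(value, value)

def probabilistic_encrypt(value: str) -> bytes:
    """Probabilistic: a fresh random nonce makes the ciphertext differ on
    every call, even for the same input (sketched as nonce + XOR)."""
    nonce = secrets.token_bytes(len(value))
    return nonce + bytes(b ^ n for b, n in zip(value.encode(), nonce))

def null_out(value: str):
    """Nulling: drop the value entirely."""
    return None
```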
Athena Partition projection
-improves query performance by computing partitions from config instead of metastore lookups
-good when data is already partitioned and keeps growing
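Projection is configured through table properties; a sketch for a date-partitioned table (bucket, path, and the dt column are hypothetical):

```python
# Sketch of Athena partition-projection TBLPROPERTIES for a table
# partitioned by a "dt" date column; Athena computes partition values
# from the range/format instead of reading them from the metastore.
projection_properties = {
    "projection.enabled": "true",
    "projection.dt.type": "date",
    "projection.dt.range": "2023-01-01,NOW",
    "projection.dt.format": "yyyy-MM-dd",
    "storage.location.template": "s3://my-bucket/logs/dt=${dt}/",
}
```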
Redshift distribution styles
-EVEN = rows spread evenly across nodes. Good when there are no joins / no clear dist key
-KEY = rows with the same key stored together. Good for columns frequently filtered or joined on
-ALL = full copy on each node. Best for small, static tables
-AUTO = Redshift chooses and may change the style over time; use when unclear
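Example DDL for three of the styles (table and column names hypothetical), held as strings:

```python
# KEY: co-locate rows that join on customer_id.
ddl_key = ("CREATE TABLE sales (order_id INT, customer_id INT) "
           "DISTSTYLE KEY DISTKEY (customer_id);")

# ALL: small, static lookup table copied to every node.
ddl_all = ("CREATE TABLE country_codes (code CHAR(2), name VARCHAR(64)) "
           "DISTSTYLE ALL;")

# EVEN: no clear dist key, spread rows round-robin.
ddl_even = "CREATE TABLE raw_events (payload VARCHAR(1024)) DISTSTYLE EVEN;"
```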
Sagemaker canvas
-no-code visual canvas.
-simplifies whole process from cleaning to prediction
Redshift vacuum commands
VACUUM FULL -the default; same as plain VACUUM
VACUUM DELETE ONLY -reclaims disk space only; doesn't speed up queries
VACUUM REINDEX -analyzes interleaved sort keys, then performs a full vacuum
VACUUM SORT ONLY -sorts without reclaiming disk space. Use when rows are unsorted but space isn't an issue
Sagemaker workflows/lineage tracking
-save steps in workflow
-visually, think of the Step Functions editor
CloudWatch contributor insights & dynamodb
-view of dynamodb traffic trends
DynamoDB key cardinality
-when throttling, use a high-cardinality partition key so requests distribute more evenly
-fixes hot-partition issues
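One common way to raise cardinality is write sharding: append a deterministic suffix so writes for a hot key spread across N partitions (the helper, key format, and shard count are illustrative choices, not a DynamoDB API):

```python
import hashlib

def sharded_key(base_key: str, item_id: str, shards: int = 10) -> str:
    """Sketch of write sharding: derive a stable shard suffix from the item id
    so the same item always maps to the same shard, while writes for a hot
    base key spread across `shards` partitions."""
    suffix = int(hashlib.sha256(item_id.encode()).hexdigest(), 16) % shards
    return f"{base_key}#{suffix}"
```

Reads for the base key then fan out over the N suffixes (e.g. parallel queries for `2024-06-01#0` … `2024-06-01#9`).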
RDS performance insights
-dashboard that gathers database load and performance metrics