Analytics Flashcards

Question

Can AWS Glue clean and transform streaming data in-flight?

Answer 1

It is a visual interface for ETL workflows

Answer 2

In the Glue Studio Monitoring console

Answer 3

It evaluates the data based on rules that you set. It uses the Data Quality Definition Language (DQDL) for custom rules.

Answer 4

A visual preparation tool for transforming data.

Answer 5

S3, data warehouse, database

Answer 6

It is a saved set of transformations that can be applied to any dataset

Answer 7

You can use custom SQL to create datasets

Answer 8

Yes, but only with customer master keys KMS SSE-C

Answer 9

Substitution; shuffling; deterministic encryption (DETERMINISTIC_ENCRYPT); probabilistic encryption; null out or delete; mask out; hash.

Answer 10

It only fires an event when a specific number of events or seconds within a time period are exceeded.

Answer 11

It makes it easy to set up a secure data lake in days.

Answer 12

Anything that you can do in Glue. It is built on Glue.

Answer 13

Athena, Redshift, and EMR

Answer 14

Yes. The recipient must be a data lake administrator. You can leverage AWS RAM for this as well.

Answer 15

They support ACID transactions across multiple tables. This cannot be changed once enabled. Also works with Kinesis streaming data.

Answer 16

Automatic compaction

Answer 17

Granular row and column level access

Answer 18

SAML or external AWS accounts

Answer 19

Databases, tables, or columns

Answer 20

They provide column, row, or cell level security. This is done when granting SELECT permissions on tables.

Answer 21

A query service for your data in S3. All data stays in S3.

Answer 22

ORC, Parquet, and Avro

Answer 23

They organize users, teams, and applications into groups. You can control access and track costs by workgroup. They integrate with IAM, CloudWatch, and SNS.

Answer 24

Yes. You can limit how much data is returned.

Answer 25

Yes. Only failed queries are not billable.

Answer 26

Use a columnar format such as ORC or Parquet. You will scan less data.

Answer 27

Yes. A small number of large files performs better than a large number of small files.

Answer 28

Run MSCK REPAIR TABLE

Answer 29

You can recover recently deleted data with a SELECT statement.

Answer 30

Run the OPTIMIZE table REWRITE DATA USING BIN_PACK command, optionally with a WHERE clause (for example, WHERE catalog = N).

Answer 31

Database and table level.

Answer 32

Yes, using MLlib.

Answer 33

Yes. It integrates with Kinesis and Kafka

Answer 34

Yes, using CTAS and the format attribute.

Answer 35

It just keeps appending to a table and you query by using windows of time.

Answer 36

A managed Hadoop framework that runs on EC2. Includes Spark, HBase, Presto, Flink, Hive, and more.

Answer 37

Browser based development in a notebook.

Answer 38

Master, Core, and Task node

Answer 39

On the core nodes in HDFS

Answer 40

Use spot instances for task nodes since they do not persist data.

Answer 41

One that terminates once all the steps are complete. Good for cost savings.

Answer 42

One that must be manually terminated. A good use of reserved instances for cost savings.

Answer 43

When the cluster is launched.

Answer 44

By connecting to the master node or submitting jobs via ordered steps in the console.

Answer 45

The AWS Data Pipeline

Answer 46

Allows you to access S3 as if it were HDFS.

Answer 47

It is used for S3 consistency and uses DynamoDB to track it. As of 2021, S3 itself is strongly consistent.

Answer 48

Yes. It will be ephemeral though. Can only be attached when launching the cluster.

Answer 49

By the hour

Answer 50

You can add task nodes on the fly as long as you do not also need to increase storage capacity.

Answer 51

Resize the cluster core nodes.

Answer 52

Adds core nodes and then task nodes up to the max units specified. It also scales down to your configured value.

Answer 53

Spot nodes (task and then core)

Answer 54

Yes. Without configuring this, EMR will calculate the value on its own.

Answer 55

API calls. This is not automatic!!

Answer 56

Yes. It can run alongside other applications.

Answer 57

A partition key and a data blob (up to 1 MB).

Answer 58

1 MB per second or 1,000 messages per second per shard.
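
As a rough sketch of how the per-shard write limits above translate into a shard count, the snippet below sizes a stream from an expected ingest rate. The function name and the sample workload are hypothetical, and 1 MB is treated as 10**6 bytes for simplicity.

```python
import math

# Per-shard write limits quoted in the answer above.
MAX_BYTES_PER_SEC = 1_000_000    # 1 MB per second (assumed to mean 10**6 bytes)
MAX_RECORDS_PER_SEC = 1_000      # 1,000 messages per second


def shards_needed(records_per_sec: int, avg_record_bytes: int) -> int:
    """Return the minimum shard count that satisfies both write limits."""
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / MAX_BYTES_PER_SEC)
    by_count = math.ceil(records_per_sec / MAX_RECORDS_PER_SEC)
    # Whichever limit is tighter decides the count; a stream has at least one shard.
    return max(1, by_bytes, by_count)


# Hypothetical workload: 5,000 records per second at 500 bytes each.
print(shards_needed(5_000, 500))
```

Here the record-count limit is the binding one: the byte rate alone would need 3 shards, but 5,000 records per second needs 5.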

Answer 59

A partition key, sequence number, and a data blob (up to 1 MB).

Answer 60

Shared mode: 2 MB per second per shard, shared across all consumers. Enhanced mode: 2 MB per second per shard per consumer.

Answer 61

Between 1 and 365 days

Answer 62

You choose the number of shards and scale manually.

Answer 63

Automatically scales based on observed throughput. The default capacity is 4 MB/s or 4,000 records per second.

Answer 64

Using PutRecords for batching.

Answer 65

Low throughput, higher latency.

Answer 66

CloudWatch, AWS IoT, and Kinesis Data Analytics

Answer 67

Synchronous and asynchronous

Answer 68

RecordMaxBufferedTime

Answer 69

10MB or up to 10000 records

Answer 70

It marks your progress.

Answer 71

You need to increase the WCU of DynamoDB

Answer 72

It sends data to S3, DynamoDB, Redshift, OpenSearch, etc. It lives on an EC2 instance. It is largely deprecated.

Answer 73

It uses HTTP/2 to push to consumers.

Answer 74

Less than 70ms

Answer 75

When there is a low number of consumers; you can tolerate 200 ms latency; cost effective.

Answer 76

When you have multiple consumer applications for the same stream; low latency; higher cost.

Answer 77

20, but you can submit a service request to increase it.

Answer 78

Two new shards are created. The old shard will go away when the data expires.

Answer 79

One shard is created. The old shards will go away when the data expires.

Answer 80

Resharding can cause this. Make sure you read entirely from the parent shard before reading from the new child shards. This is built into the KCL.
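
The ordering rule in the answer above (finish the parent before moving on to what resharding created) can be sketched as a small dependency walk. This is only an illustration of the idea, not how the KCL implements it; the `read_order` function and the shard names are hypothetical.

```python
def read_order(parents_of: dict) -> list:
    """Return shard IDs ordered so that every parent precedes its children."""
    order, done = [], set()

    def visit(shard_id):
        if shard_id in done:
            return
        for parent in parents_of.get(shard_id, []):
            visit(parent)            # drain the parent(s) first
        done.add(shard_id)
        order.append(shard_id)

    for shard_id in sorted(parents_of):
        visit(shard_id)
    return order


# Hypothetical lineage: shard "old" was split into shards "a" and "b".
lineage = {"a": ["old"], "b": ["old"], "old": []}
print(read_order(lineage))
```

A merged shard would simply list two parents, and both would be visited before it.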

Answer 81

One. This is a problem when you have thousands of shards.

Answer 82

Network Timeouts. Use unique IDs to deduplicate records on the consumer side.

Answer 83

A worker terminates unexpectedly; a worker instance is added or removed; shards are merged or split; the application is deployed.

Answer 84

Make your application idempotent. Handle duplicates at the final destination.
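
A minimal sketch of an idempotent consumer, assuming each record carries a producer-assigned unique ID. The `id` and `payload` field names are hypothetical, and the in-memory set stands in for whatever durable store a real application would use.

```python
from typing import Iterable


def process_once(records: Iterable[dict], seen_ids: set, sink: list) -> None:
    """Apply each record at most once, keyed by its unique ID."""
    for record in records:
        record_id = record["id"]     # hypothetical field holding the unique ID
        if record_id in seen_ids:
            continue                 # duplicate from a producer retry or a consumer replay
        sink.append(record["payload"])
        seen_ids.add(record_id)


seen, out = set(), []
batch = [
    {"id": "a1", "payload": 10},
    {"id": "a2", "payload": 20},
    {"id": "a1", "payload": 10},     # duplicate of the first record
]
process_once(batch, seen, out)
process_once(batch, seen, out)       # replaying the whole batch changes nothing
print(out)
```

Because replaying a batch leaves the output unchanged, the consumer can safely re-read records after a restart.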

Answer 85

With a Lambda in Kinesis Data Firehose

Answer 86

Yes. It loads to S3 first and then issues a COPY command.

Answer 87

Yes as long as there is an HTTP endpoint

Answer 88

Yes. All data or only failed data can be backed up to S3; delivery to the destination happens in batch writes.

Answer 89

60 seconds

Answer 90

Limited, but yes. JSON to Parquet or ORC, but only for S3. Other conversions are done using Lambda.

Answer 91

Yes, using GZIP, ZIP, or Snappy.

Answer 92

The buffer size and buffer time. Whichever limit is hit first.
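
The "whichever limit is hit first" rule can be illustrated with a small buffer that flushes on either condition. This is a simplified sketch with hypothetical names: it only checks the limits when a record arrives, whereas a real delivery stream would also flush on a timer.

```python
class Buffer:
    """Collect records and flush when the size or the time limit is reached."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None        # time the first buffered record arrived

    def add(self, record: bytes, now: float):
        """Add a record at time `now`; return the flushed batch, or None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        size_hit = self.size >= self.max_bytes
        time_hit = now - self.opened_at >= self.max_seconds
        if size_hit or time_hit:     # whichever limit is hit first
            return self.flush()
        return None

    def flush(self):
        batch = self.records
        self.records, self.size, self.opened_at = [], 0, None
        return batch


buf = Buffer(max_bytes=10, max_seconds=60)
first = buf.add(b"abcd", now=0.0)        # under both limits, nothing flushed
second = buf.add(b"efghij", now=1.0)     # size limit reached
third = buf.add(b"x", now=100.0)         # new buffer opens at t=100
fourth = buf.add(b"y", now=161.0)        # time limit reached
```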

Answer 93

Buffer size is a few MB. Buffer time is 1 minute.

Answer 94

Kinesis streams with a Lambda to send the data to OpenSearch

Answer 95

A subscription filter allows you to connect to other AWS services like Lambda, Data Streams, etc.

Answer 96

Yes. This can be used to encrypt, translate to another format, aggregate rows, etc.

Answer 97

DynamoDB, Aurora, SNS, SQS, CloudWatch

Answer 98

Managed Service for Apache Flink

Answer 99

Streaming ETL; continuous metric generation; responsive analytics.

Answer 100

It analyzes the schema in real time.

Answer 101

It detects anomalies in your data.

Answer 102

Amazon Managed Streaming for Apache Kafka (MSK). An alternative to Kinesis.

Answer 103

10 MB. This is much larger than the 1 MB limit in Kinesis Data Streams.

Answer 104

Yes. It uses EBS volumes and is more flexible than Kinesis Data Streams.

Answer 105

Yes. This can be done using: mutual TLS and Kafka ACLs; IAM access control; SASL/SCRAM and Kafka ACLs.

Answer 106

It allows you to connect to other AWS services for delivery such as S3, Redshift, OpenSearch, etc.

Answer 107

Used to be known as Elasticsearch. Petabyte-scale analysis and reporting.

Answer 108

Full-text search; log analytics; application monitoring; security analytics.

Answer 109

They define the schema and mapping shared by documents.

Answer 110

An index. It contains inverted indices that let you search across everything within it at once.

Answer 111

They are split into shards and documents are hashed to a particular shard. Shards can be on different nodes in a cluster.
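
The idea that documents are hashed to a particular shard can be sketched in a few lines. CRC32 is used here purely as a stand-in hash for illustration; it is an assumption, not the hash function OpenSearch itself uses, and `shard_for` is a hypothetical name.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number using a stable hash."""
    # CRC32 is an illustrative stand-in, not the hash OpenSearch uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# The same ID always lands on the same shard, so reads know where to look.
placements = {doc_id: shard_for(doc_id, 5) for doc_id in ["doc-1", "doc-2", "doc-3"]}
print(placements)
```

Because the result depends on the shard count, changing the number of shards would move documents around, which is one reason the count is normally chosen up front.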

Answer 112

Yes, using replicas.

Answer 113

It is essentially the cluster.

Answer 114

Snapshot to S3

Answer 115

Both. It also supports request signing and IP based policies.

Answer 116

Using Cognito / SAML, Reverse Proxy, SSH, VPC Direct Connect, or a VPN

Answer 117

Hot storage. This is an instance store or EBS volume.

Answer 118

It uses S3 and caching; best for indices with few writes (log data / immutable data); slower performance; requires a dedicated master node.

Answer 119

Uses S3; best for periodic research or forensic analysis on older data; must have a dedicated master node; UltraWarm must also be enabled.

Answer 120

Automates index management policies: automates snapshots; deletes indices over a period of time; moves indices from hot to cold over time; reduces replica count.

Answer 121

Every 30 - 48 minutes

Answer 122

They roll up old data into summarized indices. New index may have fewer fields. Good to save on storage.
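
A minimal sketch of what a rollup does conceptually: raw documents are grouped into time buckets and replaced by summary documents with fewer fields. The field names and the hourly bucket are hypothetical, and this is plain Python rather than an OpenSearch rollup job.

```python
from collections import defaultdict


def roll_up(docs, bucket_seconds=3600):
    """Summarize raw documents into one document per (host, time bucket)."""
    buckets = defaultdict(lambda: {"count": 0, "bytes_sum": 0})
    for doc in docs:
        bucket_start = doc["ts"] - doc["ts"] % bucket_seconds
        summary = buckets[(doc["host"], bucket_start)]
        summary["count"] += 1
        summary["bytes_sum"] += doc["bytes"]
    return [
        {"host": host, "bucket_start": start, **values}
        for (host, start), values in sorted(buckets.items())
    ]


raw = [
    {"ts": 10, "host": "web-1", "bytes": 100, "path": "/a"},
    {"ts": 20, "host": "web-1", "bytes": 300, "path": "/b"},
    {"ts": 3700, "host": "web-1", "bytes": 50, "path": "/a"},
]
summary = roll_up(raw)
print(summary)
```

Three raw documents become two summary documents, and the `path` field is dropped, which is where the storage saving comes from.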

Answer 123

Like rollups, but purpose is to create a different view to analyze the data differently.

Answer 124

It pulls from the leader index to replicate data.

Answer 125

Remote Reindex

Answer 126

Have three

Answer 127

Delete old or unused indices.

Answer 128

On-Demand autoscaling

Answer 129

search or time series

Answer 130

Redshift, Aurora, RDS, Athena, OpenSearch, IoT Analytics, your own database, raw files like CSV, Excel, log files, etc.

Answer 131

Very light ETL.

Answer 132

Your datasets get imported into SPICE. Each user gets 10 GB of SPICE. It accelerates large queries.

Answer 133

It times out.

Answer 134

Ad-hoc exploration and visualization Dashboards and KPIs

Answer 135

Yes. Row level security is available in standard, but column level security is only available in the enterprise edition.

Answer 136

You need to make sure QuickSight can access your data. You need to create IAM policies that restrict what data in S3 users can see.

Answer 137

No. QuickSight can only access Redshift data in the same region.

Answer 138

Use an inbound security group rule to allow access to Redshift from the QuickSight IP range.

Answer 139

Enterprise Edition

Answer 140

Use private subnets and peering connections. Route tables will tie it together. This can be used for cross-account access using Transit Gateway.

Answer 141

Enterprise edition

Answer 142

No. Enterprise edition allows you to use KMS.

Answer 143

An NLP interface on top of QuickSight

Answer 144

Yes. It is billed by additional GB of SPICE needed.

Answer 145

Yes, using the JavaScript SDK.

Answer 146

Domain Whitelisting

Answer 147

Anomaly detection. Forecasting: seasonality and trends over time; imputes missing values. Autonarratives: a story of your data in paragraph format. Suggested insights: helps decide which feature is right for your dataset.

(201 cards)