Domain 1 - Collection Flashcards

1
Q

Kinesis - with KPL PutRecords(), what happens if a single record fails?

A

The KPL PutRecords() operation often sends multiple records to the stream per request. If a single record fails, it is automatically added back to the KPL buffer and retried. The failure of one record does not impact the processing of other records in the request.
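A minimal sketch of the equivalent behavior using the low-level API (the KPL does this for you internally): send a batch with PutRecords and re-queue only the entries that failed. The stream name and record contents below are hypothetical.

```python
import boto3

kinesis = boto3.client("kinesis")

def put_batch_with_retry(records, stream_name="my-stream", max_attempts=3):
    """Send a batch; re-send only the records whose result has an ErrorCode."""
    for _ in range(max_attempts):
        resp = kinesis.put_records(StreamName=stream_name, Records=records)
        if resp["FailedRecordCount"] == 0:
            return
        # Results come back in request order, so pair them up and keep failures.
        records = [rec for rec, res in zip(records, resp["Records"])
                   if "ErrorCode" in res]
    raise RuntimeError(f"{len(records)} records still failing after retries")

put_batch_with_retry([
    {"Data": b"hello", "PartitionKey": "user-1"},
    {"Data": b"world", "PartitionKey": "user-2"},
])
```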

2
Q

You are sending many small 100-byte data records and would like to ensure you can use Kinesis to receive your data. What should you use to ensure optimal throughput, with asynchronous features?

1) Kinesis SDK
2) Kinesis Producer Library
3) Kinesis Client Library
4) Kinesis Connector Library
5) Kinesis Agent

A

2) KPL - through batching (collection and aggregation), we can achieve maximum throughput using the KPL. The KPL also supports an asynchronous API.
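For intuition, a much-simplified sketch of aggregation: pack many small records into one Kinesis record so a 1 MB/s shard is not capped at 1,000 tiny records per second. The real KPL uses a protobuf-based aggregation format that the KCL de-aggregates automatically; this length-prefixed packing is only illustrative.

```python
import struct

def aggregate(small_records, max_size=1024 * 1024):
    """Pack byte strings into blobs no larger than max_size (1 MB record limit)."""
    blobs, current = [], b""
    for rec in small_records:
        entry = struct.pack(">I", len(rec)) + rec  # 4-byte length prefix
        if len(current) + len(entry) > max_size:
            blobs.append(current)
            current = b""
        current += entry
    if current:
        blobs.append(current)
    return blobs

blobs = aggregate([b"x" * 100 for _ in range(10_000)])
print(len(blobs), "aggregated record(s) instead of 10,000 puts")  # -> 1
```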

3
Q

You would like to collect log files in bulk from your Linux servers running on premises. You need a built-in retry mechanism and monitoring through CloudWatch. Logs should end up in Kinesis. What will help you accomplish this?

1) Kinesis SDK
2) KPL
3) Kinesis Agent
4) Direct Connect

A

3) Kinesis Agent - it runs on Linux servers, tails log files, retries on failure, and publishes its metrics to CloudWatch.

4
Q

You would like to perform batch compression before sending data to Kinesis, in order to maximize the throughput. What should you use?

1) Kinesis SDK
2) KPL’s Compression Feature
3) KPL + Implement Compression Yourself

A

3) Compression must be implemented by the end user - the KPL has no built-in compression feature.
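A minimal sketch of user-side compression, assuming a hypothetical stream name: gzip the payload before handing it to Kinesis, and decompress in the consumer.

```python
import gzip
import json
import boto3

kinesis = boto3.client("kinesis")

payload = json.dumps({"sensor": "t-1", "readings": [21.3] * 500}).encode()
compressed = gzip.compress(payload)  # repetitive JSON compresses very well

kinesis.put_record(
    StreamName="my-stream",
    Data=compressed,
    PartitionKey="sensor-t-1",
)
# Consumer side: gzip.decompress(record["Data"]) restores the original bytes.
```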

5
Q

You have 10 consumer applications consuming concurrently from one shard in classic mode, issuing GetRecords() calls. What is the average latency for consuming these records for each application?

1) 70ms
2) 200ms
3) 1 sec
4) 2 sec

A

4) 2 sec - a shard supports up to 5 GetRecords API calls per second in total, so with 10 applications sharing the shard, each application must wait 2 seconds before it can issue its next call.
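A sketch of a classic shared-throughput consumer loop; the 2-second sleep is what keeps 10 such applications within the shard's combined limit of 5 GetRecords calls per second. Stream name and shard ID are placeholders.

```python
import time
import boto3

kinesis = boto3.client("kinesis")

iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["Data"])
    iterator = resp["NextShardIterator"]
    time.sleep(2)  # 10 consumers x 1 call / 2 s = 5 calls/s, the shard limit
```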

6
Q

You have 10 consumer applications consuming concurrently from one shard in enhanced fan-out mode. What is the average latency for consuming these records for each application?

1) 70ms
2) 200ms
3) 1 sec
4) 2 sec

A

1) 70ms - in enhanced fan-out mode, each consumer receives its own 2 MB/s of throughput per shard and sees an average latency of about 70 ms.

Note: enhanced fan-out has a limit of 20 registered consumers per stream.
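A sketch of an enhanced fan-out consumer (stream ARN, consumer name, and shard ID are placeholders). Each registered consumer gets its own 2 MB/s pipe per shard, and records are pushed over HTTP/2.

```python
import boto3

kinesis = boto3.client("kinesis")
stream_arn = "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"

consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn, ConsumerName="app-1"
)["Consumer"]
# In practice, wait until the consumer is ACTIVE (describe_stream_consumer)
# before subscribing; a subscription also expires after 5 minutes and must
# be renewed.
subscription = kinesis.subscribe_to_shard(
    ConsumerARN=consumer["ConsumerARN"],
    ShardId="shardId-000000000000",
    StartingPosition={"Type": "LATEST"},
)
for event in subscription["EventStream"]:
    for record in event["SubscribeToShardEvent"]["Records"]:
        print(record["Data"])
```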

7
Q

You are consuming from a Kinesis stream with 10 shards that receives on average 8 MB/s of data from various producers using the KPL. You are therefore using the KCL to consume these records, and you observe through CloudWatch metrics that the throughput is only 2 MB/s, so your application is lagging. What is the most likely root cause of this issue?

1) You need to split shards some more
2) There is a hot partition
3) CloudWatch is displaying the average throughput, not the aggregated one
4) Your DynamoDB is under-provisioned

A

4) The KCL uses DynamoDB for checkpointing. Because the table is under-provisioned, checkpointing cannot keep up, which lowers the throughput of your KCL-based application. Make sure to increase the RCU/WCU, as in the sketch below.

10 shards = 10 MB/s ingest (1 MB/s per shard) and 20 MB/s egress (2 MB/s per shard) capacity, well above the observed load, so 1) and 2) cannot be the answers.
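A sketch of the fix, assuming the lease table uses provisioned capacity; the KCL names the table after your application, so the table name below is hypothetical.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Raise the throughput of the KCL lease/checkpoint table so checkpointing
# is no longer throttled.
dynamodb.update_table(
    TableName="my-kcl-application",
    ProvisionedThroughput={
        "ReadCapacityUnits": 50,   # lease reads
        "WriteCapacityUnits": 50,  # checkpoint writes
    },
)
```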

8
Q

Which of the following statements is wrong?

1) Spark Streaming can write to Kinesis Data Streams
2) Spark Streaming can read from Kinesis Data Firehose
3) Spark Streaming can read from Kinesis Data Streams

A

2) Firehose is a delivery service that writes to fixed destinations (S3, Redshift, Elasticsearch, Splunk); it is not a source Spark Streaming can read from.

I took a guess on this one, since Spark Streaming was never mentioned when we talked about KDF.

9
Q

You are looking to decouple jobs and ensure data is deleted after being processed. Which technology would you choose?

1) Kinesis Data Streams
2) Kinesis Data Firehose
3) SQS

A

3) SQS

The key hint is "decouple jobs" - and SQS messages are explicitly deleted once processed, whereas Kinesis retains records for the full retention period.
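A sketch of the SQS consume-then-delete pattern that gives you exactly this behavior (queue URL and handler are hypothetical):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-jobs"

def process(body):
    print("processing", body)  # stand-in for the real job logic

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                           WaitTimeSeconds=20)  # long polling
for msg in resp.get("Messages", []):
    process(msg["Body"])
    # Deleting the message is what removes the data after processing.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```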

10
Q

Which protocol is not supported by the IoT Device Gateway?

1) MQTT
2) Websockets
3) HTTP 1.1
4) FTP

A

4) FTP

I took a guess. HTTP 1.1 can be a trap - the Device Gateway does support MQTT, WebSockets, and HTTP 1.1, but not FTP.

11
Q

You would like to control the target temperature of your room using an IoT thing thermostat. How can you change its state for target temperature even in the case it’s temporarily offline?

1) Send a message to the IoT broker every 10 s until it is acknowledged by the IoT thing
2) Use a rule action that triggers when the device comes back online
3) Change the state of the device shadow
4) Change its metadata in the thing registry

A

3) That's precisely the purpose of the device shadow, which gets synchronized with the device when it comes back online.
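A sketch of setting the shadow's desired state (thing name and attribute are hypothetical). AWS IoT persists the desired state even while the thermostat is offline and synchronizes it when the device reconnects.

```python
import json
import boto3

iot_data = boto3.client("iot-data")

iot_data.update_thing_shadow(
    thingName="room-thermostat",
    payload=json.dumps({
        "state": {"desired": {"target_temperature": 21.5}}
    }),
)
# Once back online, the device applies the change and sets the matching
# "reported" state in the shadow.
```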

12
Q

You have set up Direct Connect at one location to ensure your traffic into AWS goes over a private network. You would like to set up a failover connection that is as reliable and as redundant as possible, as you cannot afford to be down for too long. What backup connection do you recommend?

1) Another Direct Connect Setup
2) Site to site VPN
3) Client side VPN
4) Snowball Connection

A

2) Site-to-Site VPN - although this is not as private as another Direct Connect setup, it is more redundant as a failover because it leverages the public internet's many redundant paths. It is the correct answer here.
1) Another Direct Connect connection is more private than a Site-to-Site VPN, but as a backup it is less redundant, since it does not benefit from the public internet's path diversity. It is the wrong answer here, though a valid approach overall for a highly available Direct Connect setup.

13
Q

You would like to transfer data into AWS in less than two days from now. What should you use?

1) Set up Dx
2) Use Public Internet
3) Use AWS Snowball
4) Use AWS Snowmobile

A

2) Public Internet
1) Dx - When you create a public virtual interface, it can take up to 72 hours for AWS to review and approve your request.
3) and 4) obviously take too long

14
Q

From which sources can the input for Kinesis Analytics be obtained?

1) MySQL and Kinesis Data Streams
2) DynamoDB and Kinesis Firehose Delivery Streams
3) Kinesis data streams and Kinesis Firehose delivery streams
4) Kinesis data streams and DynamoDB

A

3) Kinesis Analytics can only take its input from Kinesis, but both data streams and Firehose delivery streams are supported.

15
Q

After real-time analysis has been performed on the input source, where may you send the processed data for further processing?

1) Amazon S3
2) Redshift
3) Athena
4) Kinesis data stream or Firehose
5) All above

A

4) Kinesis Analytics can have a Kinesis data stream, a Kinesis Firehose delivery stream, or Lambda as its destination.

16
Q

If a record arrives late to your application during stream processing, what happens to it?

1) The record is written to the error stream
2) The record is processed with later timestamp
3) The record is discarded entirely
4) None of the above

A

1)

17
Q

You have heard from your AWS consultant that Amazon Kinesis Data Analytics elastically scales the application to accommodate its data throughput. What, though, is the default capacity of the processing application in terms of memory?

1) 48GB
2) 12GB
3) 24GB
4) 32GB

A

4) 32GB
Kinesis Data Analytics provisions capacity in the form of Kinesis Processing Units (KPUs). A single KPU provides 4 GB of memory plus corresponding compute and networking. The default limit is 8 KPUs per application, hence 8 × 4 GB = 32 GB.

18
Q

You have configured Kinesis Data Analytics and have been streaming the source data to the application. You have also configured the destination correctly. However, even after waiting for a while, you are not seeing any data arrive at the destination. What might be a possible cause?

1) Issue with IAM role
2) Mismatched name for the output stream
3) Destination service is currently not available
4) Any of the above

A

4) Any of the above

19
Q

Which are supported ways to import data into your Amazon ES domain?

1) Directly from a RDS instance
2) Via Kinesis, Logstash, and Elasticsearch API
3) Via Kinesis, SQS, and Beats
4) Via SQS, Firehose, and Logstash

A

2) Kinesis, DynamoDB, Logstash/Beats, and Elasticsearch's native API all offer means to import data into Amazon ES.

20
Q

What can you do to prevent data loss due to nodes within your ES domain failing?

1) Raise a ticket with AWS Support
2) Use the Multi-AZ balancing feature
3) Maintain snapshot of the Elasticsearch domain
4) Enable serverless mode in Amazon ES

A

3) Amazon ES creates daily snapshots to S3 by default, and you can create them more often if you wish.

You should also keep index replicas.

21
Q

You are going to set up an Amazon ES cluster and have it configured in your VPC. You want your customers outside your VPC to visualize the logs reaching ES using Kibana. How can this be achieved?

1) Use a reverse proxy
2) Use a VPN
3) Use VPC Direct Connect
4) Any of the above

A

4)

22
Q

What are two columnar data formats supported by Athena?

1) GZIP and LZO
2) AVRO and Parquet
3) AVRO and ORC
4) Parquet and ORC

A

4) Parquet and ORC

GZIP and LZO are compression algorithms

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project

23
Q

Your organization is querying JSON data stored in S3 using Athena, and wishes to reduce costs and improve performance with Athena. What steps might you take?

1) Reduce the server count for Athena
2) Convert the data from JSON to ORC format and then analyze the ORC data with Athena
3) Convert the data from JSON to AVRO format and then analyze the AVRO data with Athena
4) Enable slow logging for Athena

A

2) Using columnar formats such as ORC and Parquet can reduce costs by 30-90% while improving performance at the same time.
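One common way to do the conversion without standing up an ETL cluster is an Athena CTAS query; a sketch with hypothetical table, database, and bucket names (ORC works the same way via format = 'ORC'):

```python
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="""
        CREATE TABLE logs_parquet
        WITH (format = 'PARQUET',
              external_location = 's3://my-bucket/logs-parquet/')
        AS SELECT * FROM logs_json
    """,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```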

24
Q

When using Athena, you are charged separately for using the AWS Glue Data Catalog. True or False?

A

True

25
Q

Which of the following statements is NOT TRUE regarding Athena pricing?

1) Athena charges you for cancelled queries
2) Athena charges you for failed queries
3) You get charged less when you use a columnar format
4) Athena is priced per query and charges based on the amount of data scanned by the query

A

2) You are not charged for DDL statements or for failed queries.

26
Q

You are working for a data warehouse company that uses an Amazon Redshift cluster. It is required that VPC Flow Logs be used to monitor all COPY and UNLOAD traffic of the cluster that moves in and out of the VPC. Which of the following helps you in this regard?

1) By using Redshift Spectrum
2) By enabling Enhanced VPC routing on the Amazon Redshift cluster
3) By using Redshift WLM
4) By enabling audit logging in the Redshift cluster

A

2) Enhanced VPC routing for Redshift forces all COPY and UNLOAD traffic between your cluster and your data repositories through your Amazon VPC.
Otherwise, that traffic is routed over the Internet.
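A sketch of enabling it on an existing cluster (cluster identifier hypothetical); once enabled, COPY/UNLOAD traffic stays inside the VPC and therefore appears in VPC Flow Logs.

```python
import boto3

redshift = boto3.client("redshift")

redshift.modify_cluster(
    ClusterIdentifier="my-warehouse",
    EnhancedVpcRouting=True,  # route COPY/UNLOAD through the VPC
)
```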

27
Q

You are working for a data warehousing company that has large datasets (20 TB of structured data and 20 TB of unstructured data). They are planning to host this data in AWS, with the unstructured data stored on S3. At first they plan simply to migrate the data to AWS and use it for basic analytics, and are not worried about performance. Which of the following options fulfills their requirements?

1) ds2.xlarge
2) ds2.8xlarge
3) dc2.xlarge
4) dc2.8xlarge

A

1) ds2.xlarge

This is a strange question, since DS2 is a legacy node type. DS2 offers HDD storage and comes in two sizes: ds2.xlarge (2 TB/node) and ds2.8xlarge (16 TB/node).
The hint is that it is just 40 TB of data; ds2.8xlarge would be overkill in this case.

28
Q

Which of the following services allows you to directly run SQL queries against exabytes of unstructured data in Amazon S3?
1) Athena
2) Redshift Spectrum
3) ElastiCache
4) RDS

A

2) Redshift Spectrum - it lets your Redshift cluster run SQL directly against exabytes of data in S3 without loading it first.

29
Q
#KDS
What is a shard?
A

For KDS, streams are divided into shards.

A shard is a unit of capacity: each shard provides 1 MB/s of write throughput (up to 1,000 records per second) and 2 MB/s of read throughput.
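A quick sizing sketch built on those per-shard limits (plus the 1,000 records/s write limit):

```python
import math

def shards_needed(write_mb_s, read_mb_s, records_s):
    return max(
        math.ceil(write_mb_s / 1),    # 1 MB/s write per shard
        math.ceil(read_mb_s / 2),     # 2 MB/s read per shard (classic mode)
        math.ceil(records_s / 1000),  # 1,000 records/s write per shard
    )

print(shards_needed(write_mb_s=8, read_mb_s=16, records_s=5000))  # -> 8
```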