Analysis Flashcards

1
Q

From which sources can the input for Kinesis Data Analytics be obtained?

  • MySQL and Kinesis Data Streams
  • DynamoDB and Kinesis Firehose delivery streams
  • Kinesis data streams and Kinesis Firehose delivery streams
  • Kinesis Data Streams and DynamoDB
A

Kinesis data streams and Kinesis Firehose delivery streams (Kinesis Data Analytics can only read input from Kinesis, but both data streams and Firehose delivery streams are supported.)

2
Q

After real-time analysis has been performed on the input source, where may you send the processed data for further processing?

A

A Kinesis data stream or a Kinesis Firehose delivery stream (While you might in turn connect S3 or Redshift to your Kinesis Analytics output stream, Kinesis Analytics must have a stream as its input, and a stream or a Lambda function as its output.)

3
Q

If a record arrives late to your application during stream processing, what happens to it?

A

The record is written to the error stream

4
Q

You have heard from your AWS consultant that Amazon Kinesis Data Analytics elastically scales the application to accommodate the data throughput. What, though, is the default memory capacity of the processing application?

A

32 GB (Kinesis Data Analytics provisions capacity in the form of Kinesis Processing Units (KPUs). A single KPU provides 4 GB of memory plus corresponding compute and networking. The default limit is eight KPUs per application, so 8 KPU × 4 GB = 32 GB.)

5
Q

You have configured a Kinesis Data Analytics application and have been streaming the source data to it. You have also configured the destination correctly. However, even after waiting for a while, you are not seeing any data arrive at the destination. What might be a possible cause?

  • Issue with IAM role
  • Mismatched name for the output stream
  • Destination service is currently unavailable
  • Any of the above
A

Any of the above

6
Q

How can you ensure maximum security for your Amazon ES cluster?

  • Bind with a VPC
  • Use security groups
  • Use IAM policies
  • Use access policies defined at Elasticsearch domain creation
  • All of the above
A

All of the above

7
Q

As recommended by AWS, you are going to use dedicated master nodes to increase cluster stability. As a user, what can you configure for the master nodes?

  • The count and instance types of the master nodes
  • The EBS volume associated with the node
  • The upper limit of network traffic / bandwidth
  • All of the above
A

The count and instance types of the master nodes

8
Q

Which are supported ways to import data into your Amazon ES domain?

  • Directly from an RDS instance
  • Via Kinesis, Logstash, and Elasticsearch’s APIs
  • Via Kinesis, SQS, and Beats
  • Via SQS, Firehose, and Logstash
A

Via Kinesis, Logstash, and Elasticsearch’s APIs

9
Q

What can you do to prevent data loss due to nodes within your ES domain failing?

A

Maintain snapshots of the Elasticsearch Service domain (Amazon ES creates daily snapshots to S3 by default, and you can create them more often if you wish.)

10
Q

You are going to set up an Amazon ES cluster configured inside your VPC. You want your customers outside the VPC to visualize the logs reaching the cluster using Kibana. How can this be achieved?

  • Use a reverse proxy
  • Use a VPN
  • Use VPC
  • Use VPC Direct Connect
  • Any of the above
A

Any of the above

11
Q

As a Big Data analyst, you need to query/analyze data from a set of CSV files stored in S3. Which of the following serverless services helps you with this?

  • AWS Glacier
  • AWS EMR
  • AWS Athena
  • AWS Redshift
A

AWS Athena
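
A minimal boto3 sketch of running an Athena query over CSV files in S3 (the database, table, and bucket names below are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# Kick off a query against a table whose underlying data is CSV files in S3.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "my_analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
print(response["QueryExecutionId"])

# Athena runs asynchronously: poll get_query_execution() until the state is
# SUCCEEDED, then fetch rows with get_query_results() or read the CSV output
# written to the OutputLocation above.
```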

12
Q

What are two columnar data formats supported by Athena?

A

Parquet and ORC

13
Q

Your organization is querying JSON data stored in S3 using Athena, and wishes to reduce costs and improve performance with Athena. What steps might you take?

A

Convert the data from JSON to ORC format, and analyze the ORC data with Athena
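
One common way to do this conversion without standing up an EMR cluster is an Athena CTAS (CREATE TABLE AS SELECT) query; below is a hedged sketch with hypothetical table and bucket names:

```python
import boto3

# Rewrite a JSON-backed table as ORC. ORC is columnar and compressed, so
# subsequent queries scan (and are billed for) far less data than raw JSON.
ctas = """
CREATE TABLE web_logs_orc
WITH (format = 'ORC', external_location = 's3://my-bucket/web-logs-orc/')
AS SELECT * FROM web_logs_json
"""

boto3.client("athena").start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
```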

14
Q

When using Athena, you are charged separately for using the AWS Glue Data Catalog. True or False?

A

True

15
Q

Which of the following statements is NOT TRUE regarding Athena pricing?

  • Amazon Athena charges you for cancelled queries
  • Amazon Athena charges you for failed queries
  • You will get charged less when using a columnar format
  • Amazon Athena is priced per query and charges based on the amount of data scanned by the query
A

Amazon Athena charges you for failed queries

16
Q

You are working as a Big Data Analyst at a data warehousing company. The company uses Redshift clusters for data analytics. For auditing and compliance purposes, you need to monitor API calls to the Redshift cluster and also keep the data secured.

Which of the following services helps in this regard?

  • CloudTrail logs
  • CloudWatch logs
  • Redshift Spectrum
  • Amazon MQ
A

CloudTrail logs

17
Q

You are working as a Big Data Analyst at a financial enterprise that has a large dataset needing columnar storage to reduce disk I/O. The data must also be queried quickly to generate reports. Which of the following services is best suited for this scenario?

  • DynamoDB
  • RDS
  • Athena
  • Redshift
A

Redshift

18
Q

You are working for a data warehousing company that uses an Amazon Redshift cluster. It is required that VPC flow logs are used to monitor all COPY and UNLOAD traffic of the cluster that moves in and out of the VPC. Which of the following helps you in this regard?

  • By using
  • By enabling Enhanced VPC routing on the Amazon Redshift cluster
  • By using Redshift WLM
  • By enabling audit logging in the Redshift cluster
A

By enabling Enhanced VPC routing on the Amazon Redshift cluster

19
Q

You are working for a data warehousing company that has large datasets (20 TB of structured data and 20 TB of unstructured data). They are planning to host this data in AWS, with the unstructured data stored on S3. At first they plan to migrate the data to AWS and use it for basic analytics, and they are not worried about performance. Which of the following options fulfills their requirement?

  • node type ds2.xlarge
  • node type ds2.8xlarge
  • node type dc2.8xlarge
  • node type dc2.xlarge
A

node type ds2.xlarge (Since they are not worried about performance, storage (ds) is more important than computing power (dc), and expensive 8xlarge instances aren’t necessary.)

20
Q

Which of the following services allows you to directly run SQL queries against exabytes of unstructured data in Amazon S3?

  • Athena
  • Redshift Spectrum
  • ElastiCache
  • RDS
A

Redshift Spectrum
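
For reference, a sketch of how Redshift Spectrum is typically wired up: register an external schema backed by the Glue Data Catalog, then query the S3-resident tables alongside local ones. The role ARN, database, and table names are placeholders, and the statements would be run on the cluster (for example from a SQL client or via the Redshift Data API):

```python
# Placeholder identifiers throughout; run these statements against the cluster.
create_external_schema = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# External (S3-resident) tables are then queried like any local table, with
# the data scanning pushed out to the Spectrum layer.
spectrum_query = """
SELECT event_date, COUNT(*) AS events
FROM spectrum.clickstream
GROUP BY event_date;
"""
```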

21
Q

How many concurrent queries can you run on a Redshift cluster?

A

50

22
Q

What are some of the benefits and use cases of columnar databases? (Choose 2)

  • They’re ideal for ‘needle in a haystack’ queries.
  • Compression, as it helps with performance and provides a lower total cost of ownership.
  • They’re ideal for small amounts of data.
  • They store binary objects quite well.
  • They are ideal for Online Analytical Processing (OLAP).
A
  • Compression, as it helps with performance and provides a lower total cost of ownership.
  • They are ideal for Online Analytical Processing (OLAP).

(Compression algorithms supported in Redshift help with performance and also help reduce the amount of data stored in a Redshift cluster, which helps lower the total cost of ownership.)

23
Q

In your current data warehouse, BI analysts consistently join two tables: the customer table and the orders table. The column they JOIN on (and common to both tables) is called customer_id. Both tables are very large, over 1 billion rows. Besides being in charge of migrating the data, you are also responsible for designing the tables in Redshift. Which distribution style would you choose to achieve the best performance when the BI analysts run queries that JOIN the customer table and orders table using customer_id?

A

Key

(The KEY distribution style will help achieve the best performance in this case. In Redshift, rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.)
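
To illustrate, a sketch of the two table definitions (columns abbreviated and hypothetical) distributed on the joining column:

```python
# Redshift DDL to run on the cluster. Both tables use customer_id as the
# DISTKEY, so rows with matching values are stored on the same node slice and
# the JOIN avoids redistributing data across the cluster.
create_tables = """
CREATE TABLE customer (
    customer_id BIGINT,
    name        VARCHAR(256)
)
DISTSTYLE KEY
DISTKEY (customer_id);

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_total DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id);
"""
```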

24
Q

What is the most effective way to merge data into an existing table?

  • Execute an UPSERT.
  • Use a staging table to replace existing rows or update specific rows.
  • UNLOAD data from Redshift into S3, use EMR to ‘merge’ new data files with the unloaded data files, and copy the data into Redshift.
  • Connect the source table and the target Redshift table via a replication tool and run direct INSERTS, UPDATES into the target Redshift table.
A

Use a staging table to replace existing rows or update specific rows.

(You can efficiently update and insert new data by loading your data into a staging table first. Redshift does not support an UPSERT. To merge data into an existing table, you can perform a merge operation by loading your data into a staging table and then joining the staging table with your target table for an UPDATE statement and an INSERT statement.)
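
A concrete sketch of the staging-table merge pattern (table and column names are hypothetical, and the staging table is assumed to have already been loaded, e.g. with COPY):

```python
# SQL for the merge, run as a single transaction on the Redshift cluster.
merge_sql = """
BEGIN;

-- Update rows that already exist in the target table from the staging table.
UPDATE sales
SET    amount = stage_sales.amount
FROM   stage_sales
WHERE  sales.sale_id = stage_sales.sale_id;

-- Insert staging rows that are not yet present in the target table.
INSERT INTO sales
SELECT s.*
FROM   stage_sales s
LEFT JOIN sales t ON s.sale_id = t.sale_id
WHERE  t.sale_id IS NULL;

END;
"""
```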

25
Q

What is a fast way to load data into Redshift?

  • By using single-line INSERTS.
  • By using the COPY command.
  • By using multi-line INSERTS.
  • By restoring backup data files into Redshift.
A

By using the COPY command.

(The COPY command is a fast way to load data into Redshift. Single-line INSERTS are slow due to the columnar nature of Redshift. Multi-line INSERTS are better than Single-line INSERTS, but are still not an efficient way to load a large amount of data into Redshift. As Redshift is a managed service, there are no backup data files accessible for file restores.)
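
A minimal sketch of a COPY load from S3, here issued through the Redshift Data API with boto3 (cluster, database, role, bucket, and table names are all hypothetical):

```python
import boto3

# COPY loads the S3 files in parallel across the cluster's slices, which is
# why it is much faster than INSERT statements.
copy_sql = """
COPY sales
FROM 's3://my-bucket/sales/2023/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
```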

26
Q

Which of the following are characteristics of Supervised Learning? (Choose 2)

  • Small amount of data is required to process
  • No algorithm needed to train computational processing
  • Known desired output
  • Data lacks categorization
  • Labeled data
A
  • Known desired output
  • Labeled data

27
Q

Which of the following AWS services directly integrate with Redshift using the COPY command? (Choose 3)

  • Kinesis Streams
  • DynamoDB
  • Machine Learning
  • Data Pipeline
  • S3
  • EMR/EC2 instances
A
  • S3
  • EMR/EC2 instances
  • DynamoDB

28
Q

Which of the following is not a function of the Redshift manifest?

  • To load files from different buckets.
  • To load required files only
  • To automatically check files in S3 for data issues.
  • To load files that have a different prefix.
A

To automatically check files in S3 for data issues.

(A manifest is used to ensure that the COPY command loads required files only. It is also used if you want to load files from different buckets, as well as load files that do not share the same prefix.)
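
For illustration, a sketch of what a manifest looks like and how COPY references it (bucket, key, and table names are hypothetical):

```python
import json

# The manifest enumerates exactly which files COPY should load; the files may
# live in different buckets and need not share a prefix. "mandatory": true
# makes COPY fail if a listed file is missing (it does not validate contents).
manifest = {
    "entries": [
        {"url": "s3://bucket-a/2023/day1.csv", "mandatory": True},
        {"url": "s3://bucket-b/archive/day2-extra.csv", "mandatory": True},
    ]
}
print(json.dumps(manifest, indent=2))  # upload this JSON to S3, e.g. sales.manifest

# COPY with the MANIFEST option loads only the files listed in the manifest.
copy_sql = """
COPY sales
FROM 's3://my-bucket/manifests/sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
MANIFEST
"""
```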

29
Q

Your analytics team runs large, long-running queries in an automated fashion throughout the day. The results of these large queries are then used to make business decisions. However, the analytics team also runs small queries manually on an ad-hoc basis. How can you ensure that the large queries do not take up all the resources, preventing the smaller ad-hoc queries from running?

A

Create a query user group for small queries based on the analysts’ Redshift user IDs, and create a second query group for the large, long-running queries.

(Redshift supports workload management (WLM), which enables users to create query groups to assign to query queues so that short, fast-running queries won’t get stuck in queues behind long-running queries. Further information: http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html)
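
As a sketch of how the routing works once WLM queues are configured: a session labels itself with SET query_group, and its queries are sent to the queue associated with that label (the group names below are hypothetical):

```python
# SQL run at the start of each session, before the queries themselves.

# Analysts' ad-hoc sessions route to the queue configured for 'short_queries'.
adhoc_session_sql = """
SET query_group TO 'short_queries';
-- ... small ad-hoc SELECTs run here, in the short-query queue ...
RESET query_group;
"""

# Automated reporting jobs route to the queue configured for 'long_queries'.
batch_session_sql = """
SET query_group TO 'long_queries';
-- ... large, long-running SELECTs run here, in their own queue ...
RESET query_group;
"""
```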

30
Q

The Area Under the Curve (AUC) for your model is shown to be 0.5. What does this signify? (Choose 2)

  • Lower AUC numbers would increase confidence.
  • The model is no more accurate than flipping a coin.
  • The AUC provides no value.
  • There is little confidence beyond a guess.
A
  • The model is no more accurate than flipping a coin.
  • There is little confidence beyond a guess.

(A lower AUC means lower predictive accuracy; an AUC of 0.5 is only as accurate as a random guess. AUC values well below 0.5 may indicate a problem with the data.)
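
A quick way to see why 0.5 means "coin flip": score purely random predictions against random labels and the AUC lands near 0.5 (a sketch using scikit-learn):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(seed=42)
y_true = rng.integers(0, 2, size=10_000)   # actual binary labels
y_score = rng.random(10_000)               # "predictions" drawn at random

# A model with no real signal scores ~0.5: no better than flipping a coin.
print(round(roc_auc_score(y_true, y_score), 3))
```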

31
Q

True or False: When you use the UNLOAD command in Redshift to write data to S3, it automatically creates files using Amazon S3 server-side encryption with AWS-managed encryption keys (SSE-S3).

A

True

(If you want to ensure files are automatically encrypted on S3 with server-side encryption, no special action is needed. The UNLOAD command automatically creates files using Amazon S3 server-side encryption with AWS-managed encryption keys (SSE-S3).)
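
For reference, a sketch of an UNLOAD; with no encryption options specified, the files it writes to S3 are SSE-S3 encrypted by default (query, bucket, and role ARN are hypothetical):

```python
# SQL run on the Redshift cluster; the output files land in S3 with SSE-S3
# server-side encryption applied automatically.
unload_sql = """
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2023-01-01''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
FORMAT AS CSV
"""
```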

32
Q

Which two types of machine learning are routinely encountered? (Choose 2)

  • Transcoded Learning
  • Unsupervised Learning
  • Supervised Learning
  • Hypervised Learning
A
  • Unsupervised Learning
  • Supervised Learning

(Supervised Learning is the machine learning task of inferring a function from labeled training data. The training data consists of a set of training examples. Unsupervised Learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data (a classification or categorization is not included in the observations). The remaining answers do not pertain to machine learning.)

33
Q

True or False: Redshift is recommended for transactional processing.

A

False

(Redshift is specifically designed for online analytic processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets.)

34
Q

You are trying to predict whether a customer will buy your product. Which machine learning model would help you make this prediction?

  • Binary Classification Model
  • Numeric Prediction Model
  • Multiclass Classification Model
  • Regression Model
A

Binary Classification Model

(You would use the Binary Classification Model to predict a binary outcome. You are looking for a ‘Yes’ or ‘No’ outcome.)

35
Q

You have a table in your Redshift cluster, and the data in this table changes infrequently. The table has fewer than 15 million rows and does not JOIN any other tables. Which distribution style would you select for this table?

A

ALL

(The ALL distribution type is appropriate for tables that change infrequently (tables that are not updated frequently or extensively). With this distribution style, the entire table is distributed to every node.)
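
A short sketch (hypothetical table) of declaring the ALL distribution style:

```python
# Redshift DDL: DISTSTYLE ALL stores a full copy of the table on every node,
# which suits small tables that change infrequently.
create_dim_sql = """
CREATE TABLE dim_region (
    region_id   INTEGER,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;
"""
```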

36
Q

You are trying to predict a numeric value from inventory/retail data that your company has. Which machine learning model would you use to do this?

A

Regression Model

(The output of a Regression Model is a numeric value for the model’s prediction of the target. For example, it might predict how many units of a product will sell, or it might predict the value of a house.)

37
Q

What does the F1 score represent?

A

The quality of the model

(The F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). Its range is 0 to 1, and a larger value indicates better predictive accuracy.)

38
Q

True or False: Defining primary keys and foreign keys is an important part of Redshift design because it helps maintain data integrity.

A

False

(Redshift does not enforce primary key and foreign key constraints. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.)

39
Q

An administrator decides to use the Amazon Machine Learning service to classify social media posts that mention your company into two categories: posts that require a response and posts that do not. The training dataset of 10,000 posts contains the details of each post, including the timestamp, author, and full text of the post. You are missing the target labels that are required for training. Which two options will create valid target label data?

  • Ask the social media handling team to review each post and provide the label.
  • Use the sentiment analysis NLP library to determine whether a post requires a response.
  • Use the Amazon Mechanical Turk web service to publish Human Intelligence Tasks that ask Turk workers to label the posts.
  • Using the a priori probability distribution of the two classes, use Monte-Carlo simulation to generate the labels.
A
  • Ask the social media handling team to review each post and provide the label.
  • Use the Amazon Mechanical Turk web service to publish Human Intelligence Tasks that ask Turk workers to label the posts.

(You need accurate data to train the service and get accurate results from future data. The sentiment-analysis and Monte-Carlo options would generate the labels automatically rather than from verified ground truth, and would therefore significantly increase the possible error rate. It is extremely important to have a very low error rate (if any!) in your training set, and therefore human-validated or assured labels are essential.)