All Flashcards
Q1
A data engineer in a manufacturing company (note: "manufacturing" appears only once in this set) is designing a data processing platform
that receives a large volume of unstructured data. The data engineer must populate
a well-structured star schema in Amazon
Redshift.
What is the most efficient architecture strategy for this purpose?
A. Transform the unstructured data using Amazon EMR and generate CSV data. COPY
the CSV data into the analysis schema within Redshift.
B. Load the unstructured data into Redshift, and use string parsing functions to
extract structured data for inserting into the analysis schema.
C. When the data is saved to Amazon S3, use S3 Event Notifications and AWS Lambda
to transform the file contents. Insert the data into the analysis schema on Redshift.
D. Normalize the data using an AWS Marketplace ETL tool, persist the results to
Amazon S3, and use AWS Lambda to INSERT the data into Redshift.
A. Transform the unstructured data using Amazon EMR and generate CSV data. COPY
the CSV data into the analysis schema within Redshift.
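Illustrative sketch (added note, not part of the original card): one way the COPY in option A could be issued from Python with boto3 through the Redshift Data API. The cluster, database, user, IAM role, and S3 path below are assumed placeholders.

import boto3

# Assumed placeholder names -- substitute the real cluster, database, and S3 location.
redshift_data = boto3.client("redshift-data", region_name="us-east-1")

copy_sql = """
    COPY analysis.fact_orders
    FROM 's3://example-bucket/emr-output/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    GZIP;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
print("Statement id:", response["Id"])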
Q2
A new algorithm has been written in Python to identify SPAM e-mails. The algorithm
analyzes the free text (note: "analyze free text" appears only once in this set) contained within a sample set of 1 million e-mails stored on
Amazon S3. The algorithm must be scaled across a production dataset of 5 PB, which
also resides in Amazon S3 storage.
Which AWS service strategy is best for this use case?
A. Copy the data into Amazon ElastiCache to perform text analysis on the in-memory
data and export the results of the model into Amazon Machine Learning.
B. Use Amazon EMR to parallelize the text analysis tasks across the cluster using a
streaming program step.
C. Use Amazon Elasticsearch Service to store the text and then use the Python
Elasticsearch Client to run analysis against the text index.
D. Initiate a Python job from AWS Data Pipeline to run directly against the Amazon S3
text files.
B. Use Amazon EMR to parallelize the text analysis tasks across the cluster using a
streaming program step.
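Illustrative sketch (added note): adding a Hadoop streaming step to an existing EMR cluster so the Python SPAM classifier runs as the mapper across the S3 data set. The cluster id, script location, and input/output paths are assumptions.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Assumed cluster id and S3 paths for the mapper script, input e-mails, and output.
response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",
    Steps=[
        {
            "Name": "spam-text-analysis",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://example-bucket/code/spam_mapper.py",
                    "-mapper", "spam_mapper.py",
                    "-reducer", "aggregate",   # built-in aggregate reducer
                    "-input", "s3://example-bucket/emails/",
                    "-output", "s3://example-bucket/results/spam/",
                ],
            },
        }
    ],
)
print("Step ids:", response["StepIds"])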
Q3
A data engineer chooses Amazon DynamoDB as a data store for a regulated
application. This application must be submitted to regulators for review. The data
engineer needs to provide a control framework that lists the security controls from
the process to follow to add new users down to the physical controls of the data
center, including items like security guards and cameras.
How should this control mapping be achieved using AWS?
A. Request AWS third-party audit reports and/or the AWS quality addendum and
map the AWS responsibilities to the controls that must be provided.
B. Request data center Temporary Auditor access to an AWS data center to verify the
control mapping.
C. Request relevant SLAs and security guidelines for Amazon DynamoDB and define
these guidelines within the applications architecture to map to the control
framework.
D. Request Amazon DynamoDB system architecture designs to determine how to
map the AWS responsibilities to the control that must be provided.
A. Request AWS third-party audit reports and/or the AWS quality addendum and
map the AWS responsibilities to the controls that must be provided.
Q4
An administrator needs to design a distribution strategy for a star schema in a
Redshift cluster. The administrator needs to determine the optimal distribution style
for the tables in the Redshift schema.
In which three circumstances would choosing Key-based distribution be most
appropriate? (Select three.)
A. When the administrator needs to optimize a large, slowly changing dimension
table.
B. When the administrator needs to reduce cross-node traffic.
C. When the administrator needs to optimize the fact table for parity with the
number of slices.
D. When the administrator needs to balance data distribution and collocation of data.
E. When the administrator needs to take advantage of data locality on a local node
for joins and aggregates.
B. When the administrator needs to reduce cross-node traffic.
D. When the administrator needs to balance data distribution and collocation of data.
E. When the administrator needs to take advantage of data locality on a local node
for joins and aggregates.
Q5
Company A operates in Country X. Company A maintains a large dataset of historical
purchase orders that contains personal data of their customers in the form of full
names and telephone numbers. The dataset consists of 5 text files, 1TB each.
Currently the dataset resides on-premises due to legal requirements of storing
personal data in-country. The research and development department needs to run a
clustering algorithm on the dataset and wants to use Elastic Map Reduce service in
the closest AWS region. Due to geographic distance, the minimum latency between
the on-premises system and the closest AWS region is 200 ms.
Which option allows Company A to do clustering in the AWS Cloud and meet the
legal requirement of maintaining personal data in-country?
A. Anonymize the personal data portions of the dataset and transfer the data files
into Amazon S3 in the AWS region. Have the EMR cluster read the dataset using
EMRFS.
B. Establish a Direct Connect link between the on-premises system and the AWS
region to reduce latency. Have the EMR cluster read the data directly from the
on-premises storage system over Direct Connect.
C. Encrypt the data files according to encryption standards of Country X and store
them on AWS region in Amazon S3. Have the EMR cluster read the dataset using
EMRFS.
D. Use AWS Import/Export Snowball device to securely transfer the data to the AWS
region and copy the files onto an EBS volume. Have the EMR cluster read the dataset
using EMRFS.
A. Anonymize the personal data portions of the dataset and transfer the data files
into Amazon S3 in the AWS region. Have the EMR cluster read the dataset using
EMRFS.
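Illustrative sketch (added note): one way the anonymization in option A could be done before the files leave the country -- hash the name and phone columns, then upload the result to S3. The column positions, salt, and bucket name are assumptions.

import csv
import hashlib
import boto3

s3 = boto3.client("s3")

def anonymize_file(in_path, out_path):
    """Replace the full-name and telephone columns (assumed to be columns 1 and 2)
    with salted SHA-256 digests so no personal data leaves the country."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        for row in reader:
            row[1] = hashlib.sha256(("pepper-" + row[1]).encode()).hexdigest()
            row[2] = hashlib.sha256(("pepper-" + row[2]).encode()).hexdigest()
            writer.writerow(row)

anonymize_file("orders_part1.csv", "orders_part1_anon.csv")
s3.upload_file("orders_part1_anon.csv", "example-research-bucket",
               "orders/orders_part1_anon.csv")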
Q6
An administrator needs to design a strategy for the schema in a Redshift cluster. The
administrator needs to determine the optimal distribution style for the tables in the
Redshift schema.
In which two circumstances would choosing EVEN distribution be most appropriate?
(Choose two.)
A. When the tables are highly denormalized and do NOT participate in frequent joins.
B. When data must be grouped based on a specific key on a defined slice.
C. When data transfer between nodes must be eliminated.
D. When a new table has been loaded and it is unclear how it will be joined to
dimension tables.
A. When the tables are highly denormalized and do NOT participate in frequent joins.
D. When a new table has been loaded and it is unclear how it will be joined to
dimension tables.
Q7
A large grocery distributor receives daily depletion reports from the field in the form
of gzip archives of CSV files uploaded to Amazon S3. The files range from 500MB to
5GB. These files are processed daily by an EMR job.
Recently it has been observed that the file sizes vary, and the EMR jobs take too long.
The distributor needs to tune and optimize the data processing workflow with this
limited information to improve the performance of the
EMR job.
Which recommendation should an administrator provide?
A. Reduce the HDFS block size to increase the number of task processors.
B. Use bzip2 or Snappy rather than gzip for the archives.
C. Decompress the gzip archives and store the data as CSV files.
D. Use Avro rather than gzip for the archives.
B. Use bzip2 or Snappy rather than gzip for the archives.
Q8
A web-hosting company is building a web analytics tool to capture clickstream data
from all of the websites hosted within its platform and to provide near-real-time
business intelligence. This entire system is built on
AWS services. The web-hosting company is interested in using Amazon Kinesis to
collect this data and perform sliding window analytics.
What is the most reliable and fault-tolerant technique to get each website to send
data to Amazon Kinesis with every click?
A. After receiving a request, each web server sends it to Amazon Kinesis using the
Amazon Kinesis PutRecord API. Use the sessionID as a partition key and set up a loop
to retry until a success response is received.
B. After receiving a request, each web server sends it to Amazon Kinesis using the
Amazon Kinesis Producer Library .addRecords method.
C. Each web server buffers the requests until the count reaches 500 and sends them
to Amazon Kinesis using the Amazon Kinesis PutRecord API.
D. After receiving a request, each web server sends it to Amazon Kinesis using the
Amazon Kinesis PutRecord API. Use the exponential back-off algorithm for retries
until a successful response is received.
A. After receiving a request, each web server sends it to Amazon Kinesis using the
Amazon Kinesis PutRecord API. Use the sessionID as a partition key and set up a loop
to retry until a success response is received.
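Illustrative sketch (added note): what answer A looks like on a producer in Python with boto3 -- PutRecord with the sessionID as partition key and a simple retry loop until a success response is received. The stream name and event shape are assumptions.

import json
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_click(session_id, click_event, stream_name="clickstream"):
    """Send one click, retrying until the put succeeds; the sessionID partition
    key keeps a session's clicks together on one shard."""
    while True:
        try:
            kinesis.put_record(
                StreamName=stream_name,
                Data=json.dumps(click_event).encode("utf-8"),
                PartitionKey=session_id,
            )
            return
        except ClientError:
            time.sleep(0.1)  # brief pause before retrying the failed put

send_click("session-1234", {"page": "/home", "ts": "2017-01-01T00:00:00Z"})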
Q9
A customer has an Amazon S3 bucket. Objects are uploaded simultaneously by a
cluster of servers from multiple streams of data. The customer maintains a catalog of
objects uploaded in Amazon S3 using an
Amazon DynamoDB table. This catalog has the following fields: StreamName,
TimeStamp, and ServerName, from which ObjectName can be obtained.
The customer needs to define the catalog to support querying for a given stream or
server within a defined time range.
Which DynamoDB table scheme is most efficient to support these queries?
A. Define a Primary Key with ServerName as Partition Key and TimeStamp as Sort Key.
Do NOT define a Local Secondary Index or Global Secondary Index.
B. Define a Primary Key with StreamName as Partition Key and TimeStamp followed
by ServerName as Sort Key. Define a Global Secondary Index with ServerName as
partition key and TimeStamp followed by StreamName.
C. Define a Primary Key with ServerName as Partition Key. Define a Local Secondary
Index with StreamName as Partition Key. Define a Global Secondary Index with
TimeStamp as Partition Key.
D. Define a Primary Key with ServerName as Partition Key. Define a Local Secondary
Index with TimeStamp as Partition Key. Define a Global Secondary Index with
StreamName as Partition Key and TimeStamp as Sort Key.
B. Define a Primary Key with StreamName as Partition Key and TimeStamp followed
by ServerName as Sort Key. Define a Global Secondary Index with ServerName as
partition key and TimeStamp followed by StreamName.
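Illustrative sketch (added note): a boto3 table definition matching answer B, with a composite sort key that concatenates TimeStamp and ServerName (for example "2017-01-01T00:00:00Z#server-42") and a GSI keyed the other way around. Attribute names and throughput values are assumptions.

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="ObjectCatalog",
    AttributeDefinitions=[
        {"AttributeName": "StreamName", "AttributeType": "S"},
        {"AttributeName": "TimeStampServer", "AttributeType": "S"},
        {"AttributeName": "ServerName", "AttributeType": "S"},
        {"AttributeName": "TimeStampStream", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "StreamName", "KeyType": "HASH"},
        {"AttributeName": "TimeStampServer", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "ByServer",
            "KeySchema": [
                {"AttributeName": "ServerName", "KeyType": "HASH"},
                {"AttributeName": "TimeStampStream", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

A Query on StreamName with a TimeStamp range prefix then serves the per-stream lookups, and the same pattern against the ByServer index serves the per-server lookups.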
Q10
A company has several teams of analysts. Each team of analysts has their own cluster.
The teams need to run
SQL queries using Hive, Spark-SQL, and Presto with Amazon EMR. The company
needs to enable a centralized metadata layer to expose the Amazon S3 objects as
tables to the analysts.
Which approach meets the requirement for a centralized metadata layer?
A. EMRFS consistent view with a common Amazon DynamoDB table
B. Bootstrap action to change the Hive Metastore to an Amazon RDS database
C. s3distcp with the outputManifest option to generate RDS DDL
D. Naming scheme support with automatic partition discovery from Amazon S3
A. EMRFS consistent view with a common Amazon DynamoDB table
Q11
An administrator needs to manage a large catalog of items from various external
sellers. The administrator needs to determine if the items should be identified as
minimally dangerous, dangerous, or highly dangerous based on their textual
descriptions. The administrator already has some items with the danger attribute,
but receives hundreds of new item descriptions every day without such classification.
The administrator has a system that captures dangerous goods reports from the
customer support team or from user feedback.
What is a cost-effective architecture to solve this issue?
A. Build a set of regular expression rules that are based on the existing examples, and
run them on the DynamoDB Streams as every new item description is added to the
system.
B. Build a Kinesis Streams process that captures and marks the relevant items in the
dangerous goods reports using a Lambda function once more than two reports have
been filed.
C. Build a machine learning model to properly classify dangerous goods and run it on
the DynamoDB Streams as every new item description is added to the system.
D. Build a machine learning model with binary classification for dangerous goods and
run it on the DynamoDB Streams as every new item description is added to the
system.
C. Build a machine learning model to properly classify dangerous goods and run it on
the DynamoDB Streams as every new item description is added to the system.
Q12
A company receives data sets coming from external providers on Amazon S3. Data
sets from different providers are dependent on one another. Data sets will arrive at
different times and in no particular order.
A data architect needs to design a solution that enables the company to do the
following:
✑ Rapidly perform cross data set analysis as soon as the data becomes available
✑ Manage dependencies between data sets that arrive at different times
Which architecture strategy offers a scalable and cost-effective solution that meets
these requirements?
A. Maintain data dependency information in Amazon RDS for MySQL. Use an AWS
Data Pipeline job to load an Amazon EMR Hive table based on task dependencies and
event notification triggers in Amazon S3.
B. Maintain data dependency information in an Amazon DynamoDB table. Use
Amazon SNS and event notifications to publish data to a fleet of Amazon EC2 workers.
Once the task dependencies have been resolved, process the data with Amazon
EMR.
C. Maintain data dependency information in an Amazon ElastiCache Redis cluster.
Use Amazon S3 event notifications to trigger an AWS Lambda function that maps the
S3 object to Redis. Once the task dependencies have been resolved, process the data
with Amazon EMR.
D. Maintain data dependency information in an Amazon DynamoDB table. Use
Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3
object to the task associated with it in DynamoDB. Once all task dependencies have
been resolved, process the data with Amazon EMR.
D. Maintain data dependency information in an Amazon DynamoDB table. Use
Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3
object to the task associated with it in DynamoDB. Once all task dependencies have
been resolved, process the data with Amazon EMR.
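Illustrative sketch (added note): the Lambda function from answer D, triggered by an S3 event notification, marking the arrived data set in DynamoDB and kicking off EMR once every dependency for the task has arrived. The table layout (TaskId key, pre-populated ExpectedSets), the task-id derivation from the object key, and the cluster id are all assumptions.

import urllib.parse
import boto3

dynamodb = boto3.resource("dynamodb")
emr = boto3.client("emr")
table = dynamodb.Table("DatasetDependencies")  # assumed dependency-tracking table

def lambda_handler(event, context):
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        task_id = key.split("/")[0]  # assumed: first prefix component names the task
        # Count this data set as arrived; ExpectedSets is assumed to be pre-populated.
        item = table.update_item(
            Key={"TaskId": task_id},
            UpdateExpression="ADD ArrivedSets :one SET LastObject = :key",
            ExpressionAttributeValues={":one": 1, ":key": key},
            ReturnValues="ALL_NEW",
        )["Attributes"]
        if item["ArrivedSets"] >= item["ExpectedSets"]:
            emr.add_job_flow_steps(
                JobFlowId="j-EXAMPLECLUSTER",
                Steps=[{
                    "Name": "cross-dataset-analysis-" + task_id,
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit",
                                 "s3://example-bucket/jobs/cross_analysis.py",
                                 task_id],
                    },
                }],
            )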
Q13
A media advertising company handles a large number of real-time messages sourced
from over 200 websites in real time. Processing latency must be kept low. Based on
calculations, a 60-shard Amazon Kinesis stream is more than sufficient to handle the
maximum data throughput, even with traffic spikes. The company also uses an
Amazon Kinesis Client Library (KCL) (note: KCL appears only in this question) application running on Amazon Elastic Compute
Cloud (EC2) managed by an Auto Scaling group. Amazon CloudWatch indicates an
average of 25% CPU and a modest level of network traffic across all running servers.
The company reports a 150% to 200% increase in latency of processing messages
from Amazon Kinesis during peak times. There are NO reports of delay from the sites
publishing to Amazon Kinesis.
What is the appropriate solution to address the latency?
A. Increase the number of shards in the Amazon Kinesis stream to 80 for greater
concurrency.
B. Increase the size of the Amazon EC2 instances to increase network throughput.
C. Increase the minimum number of instances in the Auto Scaling group.
D. Increase Amazon DynamoDB throughput on the checkpoint table.
D. Increase Amazon DynamoDB throughput on the checkpoint table.
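Illustrative sketch (added note): the fix in answer D is raising the provisioned throughput on the KCL lease/checkpoint table (the KCL creates a DynamoDB table named after the consumer application). The table name and capacity numbers are assumptions.

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.update_table(
    TableName="clickstream-kcl-app",   # assumed KCL application / table name
    ProvisionedThroughput={
        "ReadCapacityUnits": 200,      # assumed new capacity values
        "WriteCapacityUnits": 200,
    },
)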
Q14
A Redshift data warehouse (note: "Redshift data warehouse" appears only once in this set) has different user teams that need to query the same
table with very different query types. These user teams are experiencing poor
performance.
Which action improves performance for the user teams in this situation?
A. Create custom table views.
B. Add interleaved sort keys per team.
C. Maintain team-specific copies of the table.
D. Add support for workload management queue hopping.
B. Add interleaved sort keys per team.
Q15
A company operates an international business served from a single AWS region. The
company wants to expand into a new country (note: "new country" appears only in this question). The regulator for that country requires
the Data Architect to maintain a log of financial transactions in the country within 24
hours of the product transaction. The production application is latency insensitive.
The new country contains another AWS region.
What is the most cost-effective way to meet this requirement?
A. Use CloudFormation to replicate the production application to the new region.
B. Use Amazon CloudFront to serve application content locally in the country;
Amazon CloudFront logs will satisfy the requirement.
C. Continue to serve customers from the existing region while using Amazon Kinesis
to stream transaction data to the regulator.
D. Use Amazon S3 cross-region replication to copy and persist production transaction
logs to a bucket in the new country's region.
D. Use Amazon S3 cross-region replication to copy and persist production transaction
logs to a bucket in the new country's region.
Q16
An administrator needs to design the event log storage architecture for events from
mobile devices. The event data will be processed by an Amazon EMR cluster daily for
aggregated reporting and analytics before being archived.
How should the administrator recommend storing the log data?
A. Create an Amazon S3 bucket and write log data into folders by device. Execute the
EMR job on the device folders.
B. Create an Amazon DynamoDB table partitioned on the device and sorted on date,
write log data to table. Execute the EMR job on the Amazon DynamoDB table.
C. Create an Amazon S3 bucket and write data into folders by day. Execute the EMR
job on the daily folder.
D. Create an Amazon DynamoDB table partitioned on EventID, write log data to table.
Execute the EMR job on the table.
C. Create an Amazon S3 bucket and write data into folders by day. Execute the EMR
job on the daily folder.
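Illustrative sketch (added note): writing the device events under a per-day prefix, as in answer C, so the daily EMR job reads only that day's "folder". The bucket name and key layout are assumptions.

from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")

now = datetime.now(timezone.utc)
key = f"event-logs/{now:%Y/%m/%d}/device-upload-{now:%H%M%S}.json"  # one prefix per day
s3.put_object(Bucket="example-event-logs", Key=key, Body=b'{"event": "app_open"}')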
Q17
A data engineer wants to use an Amazon Elastic Map Reduce for an application. The
data engineer needs to make sure it complies with regulatory requirements. The
auditor must be able to confirm at any point which servers are running and which
network access controls are deployed.
Which action should the data engineer take to meet this requirement?
A. Provide the auditor IAM accounts with the SecurityAudit policy attached to their
group.
B. Provide the auditor with SSH keys for access to the Amazon EMR cluster.
C. Provide the auditor with CloudFormation templates.
D. Provide the auditor with access to AWS DirectConnect to use their existing tools.
A. Provide the auditor IAM accounts with the SecurityAudit policy attached to their
group.
(When you use SSH with AWS, you are connecting to an EC2 instance, which is a virtual server running in the cloud. When working with Amazon EMR, the most common use of SSH is to connect to the EC2 instance that is acting as the master node of the cluster.)
Q18
A social media customer has data from different data sources including RDS running
MySQL, Redshift, and
Hive on EMR. To support better analysis, the customer needs to be able to analyze
data from different data sources and to combine the results.
What is the most cost-effective solution to meet these requirements?
A. Load all data from a different database/warehouse to S3. Use Redshift COPY
command to copy data to Redshift for analysis.
B. Install Presto on the EMR cluster where Hive sits. Configure MySQL and
PostgreSQL connector to select from different data sources in a single query.
C. Spin up an Elasticsearch cluster. Load data from all three data sources and use
Kibana to analyze.
D. Write a program running on a separate EC2 instance to run queries to three
different systems. Aggregate the results after getting the responses from all three
systems.
B. Install Presto on the EMR cluster where Hive sits. Configure MySQL and
PostgreSQL connector to select from different data sources in a single query.
Q19
An Amazon EMR cluster using EMRFS has access to petabytes of data on Amazon S3,
originating from multiple unique data sources. The customer needs to query
common fields across some of the data sets to be able to perform interactive joins
and then display results quickly.
Which technology is most appropriate to enable this capability?
A. Presto
B. MicroStrategy
C. Pig
D. R Studio
A. Presto
(Presto is an open-source distributed SQL query engine optimized for low-latency, ad hoc data analysis. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources, including the Hadoop Distributed File System (HDFS) and Amazon S3.)
Q20
A game company needs to properly scale its game application, which is backed by
DynamoDB. Amazon
Redshift has the past two years of historical data. Game traffic varies throughout the
year based on various factors such as season, movie release, and holiday season. An
administrator needs to calculate how much read and write throughput should be
provisioned for DynamoDB table for each week in advance.
How should the administrator accomplish this task?
A. Feed the data into Amazon Machine Learning and build a regression model.
B. Feed the data into Spark MLlib and build a random forest model.
C. Feed the data into Apache Mahout and build a multi-classification model.
D. Feed the data into Amazon Machine Learning and build a binary classification
model.
A. Feed the data into Amazon Machine Learning and build a regression model.
Q21
A data engineer is about to perform a major upgrade to the DDL (schema definition statements) contained within an
Amazon Redshift cluster to support a new data warehouse application. The upgrade
scripts will include user permission updates, view and table structure changes as well
as additional loading and data manipulation tasks.
The data engineer must be able to restore the database to its existing state in the
event of issues.
Which action should be taken prior to performing this upgrade task?
A. Run an UNLOAD command for all data in the warehouse and save it to S3.
B. Create a manual snapshot of the Amazon Redshift cluster.
C. Make a copy of the automated snapshot on the Amazon Redshift cluster.
D. Call the waitForSnapshotAvailable command from either the AWS CLI or an AWS
SDK.
B. Create a manual snapshot of the Amazon Redshift cluster.
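Illustrative sketch (added note): taking the manual snapshot from answer B, and optionally waiting for it, before running the DDL upgrade scripts. The cluster and snapshot identifiers are assumptions.

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster_snapshot(
    SnapshotIdentifier="pre-ddl-upgrade-snapshot",   # assumed snapshot name
    ClusterIdentifier="example-cluster",             # assumed cluster name
)

# Wait until the snapshot is available before starting the upgrade scripts.
waiter = redshift.get_waiter("snapshot_available")
waiter.wait(SnapshotIdentifier="pre-ddl-upgrade-snapshot")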
Q22
A large oil and gas company needs to provide near real-time alerts when peak
thresholds are exceeded in its pipeline system. The company has developed a system
to capture pipeline metrics such as flow rate, pressure, and temperature using
millions of sensors. The sensors deliver to AWS IoT.
What is a cost-effective way to provide near real-time alerts on the pipeline metrics?
A. Create an AWS IoT rule to generate an Amazon SNS notification.
B. Store the data points in an Amazon DynamoDB table and poll it for peak metrics
data from an Amazon EC2 application.
C. Create an Amazon Machine Learning model and invoke it with AWS Lambda.
D. Use Amazon Kinesis Streams and a KCL-based application deployed on AWS Elastic
Beanstalk.
A. Create an AWS IoT rule to generate an Amazon SNS notification.
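Illustrative sketch (added note): an AWS IoT topic rule along the lines of answer A that publishes an SNS notification when a sensor reading crosses a threshold. The MQTT topic, threshold, SNS topic ARN, and IAM role ARN are assumptions.

import boto3

iot = boto3.client("iot", region_name="us-east-1")

iot.create_topic_rule(
    ruleName="pipeline_pressure_alert",
    topicRulePayload={
        # Assumed MQTT topic and threshold value.
        "sql": "SELECT * FROM 'pipeline/metrics' WHERE pressure > 300",
        "actions": [
            {
                "sns": {
                    "targetArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
                    "roleArn": "arn:aws:iam::123456789012:role/iot-sns-publish",
                    "messageFormat": "JSON",
                }
            }
        ],
        "ruleDisabled": False,
    },
)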
Q23
A company is using Amazon Machine Learning as part of a medical software
application. The application will predict the most likely blood type for a patient based
on a variety of other clinical tests that are available when blood type knowledge is
unavailable.
What is the appropriate model choice and target attribute combination for this
problem?
A. Multi-class classification model with a categorical target attribute.
B. Regression model with a numeric target attribute.
C. Binary Classification with a categorical target attribute.
D. K-Nearest Neighbors model with a multi-class target attribute.
A. Multi-class classification model with a categorical target attribute.
Q24
A data engineer is running a data warehouse (DWH) on a 25-node Redshift cluster for a SaaS service.
The data engineer needs to build a dashboard that will be used by customers. Five
big customers represent 80% of usage, and there is a long tail of dozens of smaller
customers. The data engineer has selected the dashboarding tool.
How should the data engineer make sure that the larger customer workloads do NOT
interfere with the smaller customer workloads?
A. Apply query filters based on customer-id that can NOT be changed by the user and
apply distribution keys on customer-id.
B. Place the largest customers into a single user group with a dedicated query queue
and place the rest of the customers into a different query queue.
C. Push aggregations into an RDS for Aurora instance. Connect the dashboard
application to Aurora rather than Redshift for faster queries.
D. Route the largest customers to a dedicated Redshift cluster. Raise the concurrency
of the multi-tenant Redshift cluster to accommodate the remaining customers.
B. Place the largest customers into a single user group with a dedicated query queue
and place the rest of the customers into a different query queue.
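Illustrative sketch (added note): answer B expressed as a Redshift workload management (WLM) configuration -- one queue bound to a user group holding the five large customers, and a second default queue for everyone else. The parameter group name, user group name, concurrency, and memory split are assumptions.

import json
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

wlm_config = [
    {"user_group": ["large_customers"], "query_concurrency": 5, "memory_percent_to_use": 60},
    {"query_concurrency": 10, "memory_percent_to_use": 40},  # default queue for the long tail
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="dashboard-wlm",            # assumed parameter group
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
            "ApplyType": "static",
        }
    ],
)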
Q25
An Amazon Kinesis stream needs to be encrypted.
Which approach should be used to accomplish this task?
A. Perform a client-side encryption of the data before it enters the Amazon Kinesis
stream on the producer.
B. Use a partition key to segment the data by MD5 hash function, which makes it
undecipherable while in transit.
C. Perform a client-side encryption of the data before it enters the Amazon Kinesis
stream on the consumer.
D. Use a shard to segment the data, which has built-in functionality to make it
indecipherable while in transit.
A. Perform a client-side encryption of the data before it enters the Amazon Kinesis
stream on the producer.
(Through the use of HTTPS, Amazon Kinesis Streams encrypts data in flight between clients, which protects against someone eavesdropping on records being transferred.)
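Illustrative sketch (added note): what client-side encryption on the producer (answer A) can look like. This uses the third-party cryptography package as a stand-in; in practice the key would come from AWS KMS or another key store shared with the consumers. The stream name, partition key, and payload are assumptions.

import boto3
from cryptography.fernet import Fernet  # third-party package, assumed available

kinesis = boto3.client("kinesis", region_name="us-east-1")

key = Fernet.generate_key()   # in practice, fetch/manage this key via KMS or similar
cipher = Fernet(key)

def put_encrypted(stream_name, partition_key, payload):
    """Encrypt the record body on the producer before it enters the stream."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=cipher.encrypt(payload),
        PartitionKey=partition_key,
    )

put_encrypted("transactions", "account-42", b'{"amount": 19.99}')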
Q26
An online photo album app has a key design feature to support multiple screens (e.g.,
desktop, mobile phone, and tablet) with high-quality displays. Multiple versions of
the image must be saved in different resolutions and layouts.
The image-processing Java program takes an average of five seconds per upload,
depending on the image size and format. Each image upload captures the following
image metadata: user, album, photo label, upload timestamp.
The app should support the following requirements:
✑ Hundreds of user image uploads per second
✑ Maximum image upload size of 10 MB
✑ Maximum image metadata size of 1 KB
✑ Image displayed in optimized resolution in all supported screens no later than one
minute after image upload
Which strategy should be used to meet these requirements?
A. Write images and metadata to Amazon Kinesis. Use a Kinesis Client Library (KCL)
application to run the image processing and save the image output to Amazon S3 and
metadata to the app repository DB.
B. Write image and metadata RDS with BLOB data type. Use AWS Data Pipeline to run
the image processing and save the image output to Amazon S3 and metadata to the
app repository DB.
C. Upload image with metadata to Amazon S3, use Lambda function to run the image
processing and save the images output to Amazon S3 and metadata to the app
repository DB.
D. Write image and metadata to Amazon Kinesis. Use Amazon Elastic MapReduce
(EMR) with Spark Streaming to run image processing and save the images output to
Amazon S3 and metadata to app repository DB.
C. Upload image with metadata to Amazon S3, use Lambda function to run the image
processing and save the images output to Amazon S3 and metadata to the app
repository DB.
(Kinesis is for real-time streaming and requires managing shards, and a single record is limited to 1 MB, while this question is about image upload and transformation with uploads of up to 10 MB. It is easier to use S3 for the image upload, Lambda for the transformation, and something like DynamoDB for the metadata. So the answer is C.)
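Illustrative sketch (added note): the shape of answer C as an S3-triggered Lambda function in Python -- read the uploaded image, write resized renditions back to S3, and record the metadata. The rendition bucket, metadata table, and the resize() stub are assumptions standing in for the real image-processing logic.

import urllib.parse
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("PhotoMetadata")  # assumed metadata store

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        for label, width in (("mobile", 640), ("tablet", 1280), ("desktop", 1920)):
            rendition = resize(original, width)  # placeholder for the real processing
            s3.put_object(Bucket="example-renditions", Key=label + "/" + key, Body=rendition)

        # Assumed: user/album/label/timestamp ride along as S3 object metadata.
        meta = s3.head_object(Bucket=bucket, Key=key)["Metadata"]
        table.put_item(Item={"PhotoKey": key, **meta})

def resize(image_bytes, width):
    """Stub: a real implementation would use Pillow or the existing Java program."""
    return image_bytes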
Q27
A customer needs to determine the optimal distribution strategy for the ORDERS fact
table in its Redshift schema. The ORDERS table has foreign key relationships with
multiple dimension tables in this schema.
How should the company determine the most appropriate distribution key for the
ORDERS table?
A. Identify the largest and most frequently joined dimension table and ensure that it
and the ORDERS table both have EVEN distribution.
B. Identify the largest dimension table and designate the key of this dimension table
as the distribution key of the ORDERS table.
C. Identify the smallest dimension table and designate the key of this dimension
table as the distribution key of the ORDERS table.
D. Identify the largest and the most frequently joined dimension table and designate
the key of this dimension table as the distribution key of the ORDERS table.
D. Identify the largest and the most frequently joined dimension table and designate
the key of this dimension table as the distribution key of the ORDERS table.
Q28
A customer is collecting clickstream data using Amazon Kinesis and is grouping the
events by IP address into
5-minute chunks stored in Amazon S3.
Many analysts in the company use Hive on Amazon EMR to analyze this data. Their
queries always reference a single IP address. Data must be optimized for querying
based on IP address using Hive running on Amazon
EMR.
What is the most efficient method to query the data with Hive?
A. Store an index of the files by IP address in the Amazon DynamoDB metadata store
for EMRFS.
B. Store the Amazon S3 objects with the following naming scheme:
bucket_name/source=ip_address/ year=yy/month=mm/day=dd/hour=hh/filename.
C. Store the data in an HBase table with the IP address as the row key.
D. Store the events for an IP address as a single file in Amazon S3 and add metadata
with keys: Hive_Partitioned_IPAddress.
B. Store the Amazon S3 objects with the following naming scheme:
bucket_name/source=ip_address/ year=yy/month=mm/day=dd/hour=hh/filename.
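Illustrative sketch (added note): building the Hive-partition-style key from answer B in Python before uploading a 5-minute chunk, so Hive queries filtered on one IP address scan only that source= partition. The bucket name and file layout are assumptions.

from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")

def chunk_key(ip_address, ts):
    # Hive picks these key=value path segments up as partition columns.
    return (f"clickstream/source={ip_address}/"
            f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
            f"events-{ts:%M}.json.gz")

key = chunk_key("203.0.113.7", datetime.now(timezone.utc))
s3.upload_file("chunk.json.gz", "example-clickstream-bucket", key)  # local 5-minute chunk file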
Q29
An online retailer is using Amazon DynamoDB to store data related to customer
transactions. The items in the table contain several string attributes describing the
transaction as well as a JSON attribute containing the shopping cart and other details
corresponding to the transaction. Average item size is 250KB, most of which is
associated with the JSON attribute. The average customer generates 3GB of data per
month.
Customers access the table to display their transaction history and review
transaction details as needed.
Ninety percent of the queries against the table are executed when building the
transaction history view, with the other 10% retrieving transaction details. The table
is partitioned on CustomerID and sorted on transaction date.
The client has very high read capacity provisioned for the table and experiences very
even utilization, but complains about the cost of Amazon DynamoDB compared to
other NoSQL solutions.
Which strategy will reduce the cost associated with the client's read queries while not
degrading quality?
A. Modify all database calls to use eventually consistent reads and advise customers
that transaction history may be one second out-of-date.
B. Change the primary table to partition on TransactionID, create a GSI partitioned on
customer and sorted on date, project small attributes into GSI, and then query GSI
for summary data and the primary table for JSON details.
C. Vertically partition the table, store base attributes on the primary table, and create
a foreign key reference to a secondary table containing the JSON data. Query the
primary table for summary data and the secondary table for JSON details.
D. Create an LSI sorted on date, project the JSON attribute into the index, and then
query the primary table for summary data and the LSI for JSON details.
A. Modify all database calls to use eventually consistent reads and advise customers
that transaction history may be one second out-of-date.
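Illustrative sketch (added note): what answer A amounts to in code -- the transaction-history Query issued with eventually consistent reads, which cost half the read capacity of strongly consistent reads. The table and key names are assumptions.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("CustomerTransactions")  # assumed table name

history = table.query(
    KeyConditionExpression=Key("CustomerID").eq("cust-0042"),
    ConsistentRead=False,  # eventually consistent read
)
print(len(history["Items"]), "transactions")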
Q31
An organization needs a data store to handle the following data types and access
patterns:
✑ Faceting
✑ Search
✑ Flexible schema (JSON) and fixed schema
✑ Noise word elimination
Which data store should the organization choose?
A. Amazon Relational Database Service (RDS)
B. Amazon Redshift
C. Amazon DynamoDB
D. Amazon Elasticsearch Service
D. Amazon Elasticsearch Service
Q32
A travel website needs to present a graphical quantitative summary of its daily
bookings to website visitors for marketing purposes. The website has millions of
visitors per day, but wants to control costs by implementing the least-expensive
solution for this visualization.
What is the most cost-effective solution?
A. Generate a static graph with a transient EMR cluster daily, and store it in Amazon
S3.
B. Generate a graph using MicroStrategy backed by a transient EMR cluster.
C. Implement a Jupyter front-end provided by a continuously running EMR cluster
leveraging spot instances for task nodes.
D. Implement a Zeppelin application that runs on a long-running EMR cluster.
A. Generate a static graph with a transient EMR cluster daily, and store it in Amazon
S3.
Q30
A company that manufactures and sells smart air conditioning units also offers
add-on services so that customers can see real-time dashboards in a mobile
application or a web browser. Each unit sends its sensor information in JSON format
every two seconds for processing and analysis. The company also needs to consume
this data to predict possible equipment problems before they occur. A few thousand
pre-purchased units will be delivered in the next couple of months. The company
expects high market growth in the next year and needs to handle a massive amount
of data and scale without interruption.
Which ingestion solution should the company use?
A. Write sensor data records to Amazon Kinesis Streams. Process the data using KCL
applications for the end-consumer dashboard and anomaly detection workflows.
B. Batch sensor data to Amazon Simple Storage Service (S3) every 15 minutes. Flow
the data downstream to the end-consumer dashboard and to the anomaly detection
application.
C. Write sensor data records to Amazon Kinesis Firehose with Amazon Simple Storage
Service (S3) as the destination. Consume the data with a KCL application for the
end-consumer dashboard and anomaly detection.
D. Write sensor data records to Amazon Relational Database Service (RDS). Build
both the end-consumer dashboard and anomaly detection application on top of
Amazon RDS.
A. Write sensor data records to Amazon Kinesis Streams. Process the data using KCL
applications for the end-consumer dashboard and anomaly detection workflows.