Data Analytics Flashcards
Abbr for ETL
Extract Transform Load
What is AWS alternative to Apache Kafka?
AWS Kinesis
How is a Kinesis data stream divided?
Shards
Kinesis Streams retention period
- default 24H
- up to 365 Days
Can multiple applications consume the same stream in Kinesis?
YES
How does billing work in Kinesis Data Streams?
per shard provisioned
Size of Data Blob in Kinesis Streams
up to 1MB
Kinesis Producer max write
1MB/s or 1000 messages/s PER SHARD
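Since the write limit applies per shard, a stream's shard count can be estimated from its target throughput. A minimal sketch of that math (the function name is mine; the 1MB/s and 1,000 records/s limits are the ones above):

```python
import math

def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate shard count from the per-shard write limits:
    1 MB/s and 1,000 records/s."""
    by_bytes = math.ceil(mb_per_sec / 1.0)          # 1 MB/s per shard
    by_records = math.ceil(records_per_sec / 1000)  # 1,000 records/s per shard
    return max(by_bytes, by_records, 1)

print(shards_needed(5, 2000))    # bytes dominate -> 5
print(shards_needed(0.5, 3500))  # record count dominates -> 4
```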
Which exception is returned if a producer exceeds the provisioned throughput?
ProvisionedThroughputExceededException
Two types of consumers in Kinesis Streams
- Consumer Classic
- Consumer Enhanced Fan-Out
What is Kinesis Agent?
Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams
What is hot shard in Kinesis Streams?
Some shards in your Kinesis data stream might receive more records than others. This can lead to throttling errors in the stream, resulting in overworked shards, also known as hot shards.
Potential solutions to ProvisionedThroughputExceededException
- retries with backoff
- increase shards (scaling)
- ensure the partition key is optimal
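The first option, retries with backoff, can be sketched as exponential backoff with jitter. This is a hand-rolled illustration, not the SDK's built-in retry logic; `RuntimeError` stands in for the real throttling exception:

```python
import random
import time

def put_with_backoff(put_record, max_retries=5, base_delay=0.1):
    """Call put_record, retrying with exponential backoff and full
    jitter when a throttling error is raised."""
    for attempt in range(max_retries):
        try:
            return put_record()
        except RuntimeError:  # stand-in for ProvisionedThroughputExceededException
            if attempt == max_retries - 1:
                raise
            # full jitter: sleep somewhere in [0, base_delay * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# demo: a producer that is throttled twice, then succeeds
calls = {"n": 0}
def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

print(put_with_backoff(flaky_put, base_delay=0.001))  # -> ok
```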
What is Kinesis Producer Library?
An easy-to-use, highly configurable C++/Java library that helps you write to a Kinesis data stream
Two types of API in KPL?
- Synchronous
- Asynchronous
What is the purpose of Batching in Kinesis Producer Library?
increase throughput and decrease cost
Kinesis Producer Library two types of batching
- Aggregation
- Collection
What might be the effect of increasing RecordMaxBufferTime in KPL?
- additional processing delay
- higher packing efficiencies and better performance
Can KPL be used if the application cannot tolerate additional delay?
No. The AWS SDK should be used instead
Shard Kinesis Consumer max throughput?
2MB/s
Shard Kinesis Producer max throughput?
1MB/s
When to use Enhanced Kinesis Fan Out Consumers?
- Multiple Consumer applications for the same stream
- Low latency requirement (70ms)
When to use Standard Kinesis Consumers?
- low number of consuming applications (1,2,3) for the same stream
- Can tolerate 200ms latency
- minimize cost
Default limit of consumers when using Enhanced Fan Out Kinesis Consumer
5
Can you perform multiple resharding operations at the same time?
No, only one operation is allowed at a time, and it takes a few seconds
AWS Kinesis Firehose destinations
- S3
- Redshift
- Opensearch
- HTTP Endpoint
What’s the minimum latency for non-full batch in Kinesis Firehose?
60s
Is Kinesis Firehose auto-scaled?
YES
Embedded data transformation format in Kinesis Firehose
JSON -> Parquet or ORC
Is compression supported by Kinesis firehose
Yes, when the target is S3
Kinesis Firehose payment schema
Pay only for the amount of data going through Firehose
Buffer flushing logic for the Kinesis Firehose
based on time and size rules
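The rule amounts to "deliver when either the buffer size hint or the buffer interval is hit, whichever comes first". A sketch with hypothetical parameter names (the defaults mirror a 5MB / 60s configuration; this is not the Firehose API itself):

```python
def should_flush(buffered_bytes, seconds_since_flush,
                 size_hint_mb=5, interval_s=60):
    """Firehose-style buffering: flush when the buffer reaches the size
    hint OR the buffer interval elapses, whichever comes first."""
    return (buffered_bytes >= size_hint_mb * 1024 * 1024
            or seconds_since_flush >= interval_s)

print(should_flush(6 * 1024 * 1024, 10))  # size hit -> True
print(should_flush(1024, 60))             # interval hit -> True
print(should_flush(1024, 10))             # neither -> False
```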
Where can CloudWatch Logs be streamed to?
- Kinesis Data Streams
- Kinesis Data Firehose
- AWS Lambda
Kinesis Firehose minimum buffer interval
60 seconds
Kinesis Firehose maximum buffer interval
900 seconds
Maximum write capacity on On-demand Kinesis Data Stream
200MB/s and 200,000 records/s
Maximum read capacity on On-demand Kinesis Data Stream
400MB/s per consumer (extra capacity in Enhanced fan out)
Command to restart the Kinesis Agent on Linux
sudo service aws-kinesis-agent restart
Processing capacity of SQS
1 message/s to 10,000 messages/s
How many messages can be in SQS queue?
No limit
What’s the latency of SQS
<10ms on publish and receive
max message size in SQS
256KB
How to send messages over 256KB in SQS
use SQS Extended Client (Java Library)
What can the content of an SQS message be?
XML, JSON, Unformatted text
Max size of Batch request in SQS
10 messages - max 256KB
Max transactions per second in Standard SQS queue
unlimited
Max transactions per second in SQS FIFO queue
300 messages/s (3,000 messages/s with batching)
Data retention period in SQS
1 minute to 14 days
SQS pricing model
- pay per API Request
- pay per network usage
What’s encrypted in SQS when using SSE?
body only, metadata is NOT encrypted
How many times can data be consumed on Kinesis Data Stream?
Many times
When are records deleted from SQS?
After consumption
When is data deleted from a Kinesis Data Stream?
After the retention period
Which AWS service allows replay of data?
Kinesis Data Streams
What is IoT Rules Engine?
It evaluates inbound messages published to AWS IoT, then transforms and delivers them to another device or a cloud service, based on business rules you define
What is IoT device shadow?
A Device Shadow is a persistent, virtual representation of a device that is managed by a thing resource you create in the AWS IoT registry
The purpose of Device Gateway in IoT
Entry point for IoT devices connecting to AWS
Protocols supported by IoT Device Gateway
MQTT, WebSockets, and HTTP 1.1
What is IoT Message Broker?
The Message Broker is a high throughput pub/sub message broker that securely transmits messages to and from all of your IoT devices and applications with low latency.
How are messages published in IoT Message Broker?
messages are published into topics
Which devices will receive Message Broker message in IoT?
all clients connected to the topic
Purpose of IoT Thing Registry
Organizes the resources associated with each device in the AWS Cloud
3 authentication methods for IoT
- Create X.509 certificate and load them securely into the Things
- AWS SigV4
- Custom tokens with Custom authorizers
How is a device shadow represented in IoT?
JSON document
What is IoT Greengrass?
AWS IoT Greengrass provides cloud-based management of application logic that runs on devices
What is DMS?
Database Migration Service - quickly and securely migrate databases to AWS, resilient, self healing
When to use SCT (Schema Conversion Tool) in database migration?
When migrating to different DB engine
Snowball Edge Storage Optimized capacity
80 TB
Snowball Edge Compute Optimized capacity
42 TB
AWS Snowcone capacity
8 TB
Which Snow service has DataSync agent pre-installed
Snowcone only
What is AWS OpsHub?
A software you install on your computer/laptop to manage your Snow Family Device
MSK encryption in-flight between brokers
TLS
MSK encryption in-flight between clients
TLS
MSK EBS encryption
KMS
Three MSK CloudWatch metric levels
- basic
- enhanced
- topic-level
Message size for MSK
1MB default, up to 10MB
Kafka topic unit
Partitions (each topic is divided into partitions)
Kafka scaling limitation
Partitions can only be added to a topic, never removed
In-flight encryption options for MSK
PLAINTEXT or TLS In-flight
What is multipart upload in S3?
Feature for uploading objects in parts; required for objects larger than 5GB
Max object size in S3
5TB
Three Glacier retrieval options
- expedited (1-5 mins)
- standard (3-5 hours)
- bulk (5-12 hours)
Amazon Glacier Deep Archive retrieval options
- standard (12h)
- bulk (48h)
Minimum storage duration for Glacier
90 days
Glacier Deep Archive minimum storage duration
180 days
Two types of replication in S3
- CRR - Cross Region Replication
- SRR - Same Region Replication
What is S3 Byte-Range Fetch
You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request
Which S3 feature can be used to retrieve partial data of file?
S3 Byte-Range Fetch
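Splitting an object into ranges for parallel GETs looks like this sketch (the helper is mine; each returned string is a valid HTTP `Range` header value and can be passed as the `Range` parameter of `get_object` in boto3):

```python
def byte_ranges(object_size: int, chunk_size: int):
    """Split an object into inclusive HTTP Range values for
    parallel S3 Byte-Range Fetch requests."""
    return [f"bytes={start}-{min(start + chunk_size, object_size) - 1}"
            for start in range(0, object_size, chunk_size)]

print(byte_ranges(10, 4))  # -> ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```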
DynamoDB maximum size of an item?
400KB
What does a Write Capacity Unit represent in DynamoDB?
one write/s for an item up to 1KB in size
The logic behind Eventually Consistent Read
If we read just after a write, it’s possible we’ll get unexpected response because of replication
The logic behind Strongly Consistent Read
If we read just after a write, we will get the correct data
What does a Read Capacity Unit represent in DynamoDB?
one strongly consistent read per second or
two eventually consistent reads per second, for an item up to 4KB in size
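Those two definitions turn into simple ceiling math when sizing a table. A sketch (the helper names are mine):

```python
import math

def wcu(item_kb: float, writes_per_sec: int) -> int:
    """One WCU = one write/s for an item up to 1KB."""
    return math.ceil(item_kb / 1) * writes_per_sec

def rcu(item_kb: float, reads_per_sec: int, strongly_consistent: bool = True) -> int:
    """One RCU = one strongly consistent read/s (or two eventually
    consistent reads/s) for an item up to 4KB."""
    units = math.ceil(item_kb / 4) * reads_per_sec
    return units if strongly_consistent else math.ceil(units / 2)

print(wcu(3, 5))          # ceil(3/1) * 5 -> 15
print(rcu(6, 10))         # ceil(6/4) * 10 -> 20
print(rcu(6, 10, False))  # eventually consistent is half -> 10
```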
Max per-partition RCU/WCU in DynamoDB
3,000 RCU / 1,000 WCU
Max DynamoDB partition size
10GB
Three ways of writing data in DynamoDB
- PutItem
- UpdateItem
- Conditional writes
Two ways of deleting data in DynamoDB
- DeleteItem
- DeleteTable
Max BatchWriteItem capacity in DynamoDB
- up to 25 PutItem/DeleteItem in one call
- up to 16MB of data
- up to 400KB of data per item
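Those limits mean a large write has to be chunked client-side. A sketch of the 25-item chunking (the helper name is mine):

```python
def write_batches(items, batch_size=25):
    """Chunk items into BatchWriteItem-sized groups (max 25 per call)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

batches = write_batches(list(range(60)))
print([len(b) for b in batches])  # -> [25, 25, 10]
```

In a real boto3 `batch_write_item` call, the response may also contain `UnprocessedItems`, which should be retried with backoff.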
Default read in DynamoDB
Eventually consistent
Max capacity of BatchGetItem in DynamoDB
- up to 100 items
- up to 16MB of data
On which fields does Query operate in DynamoDB?
Partition key and Sort key only
DynamoDB index that must be defined at the table creation time
LSI (Local Secondary Index)
Which DynamoDB index can be modified?
GSI (Global Secondary Index) only
For which index must RCU/WCU be defined?
GSI
What is DynamoDB DAX?
DynamoDB Accelerator - seamless cache, no application re-write
Default DynamoDB DAX cache TTL?
5 minutes
Max number of nodes in DynamoDB DAX cluster?
10 nodes
Retention time of DynamoDB Streams
up to 24H
What is DynamoDB Streams?
Captures a time-ordered sequence of item-level modifications in a DynamoDB table and durably stores the information for up to 24 hours
How to access DynamoDB without Internet?
VPC Endpoints
What are DynamoDB Global Tables?
multi-region, fully replicated, high performance tables
How can you migrate DynamoDB to RDS?
Use DMS (Database Migration Service)
How to store large objects in DynamoDB?
Store them in S3 and reference them in DynamoDB
Will Redis cache survive reboot?
Yes - by default
You would like to react in real-time to users de-activating their account and send them an email to try to bring them back. The best way of doing it is to…
Integrate a Lambda function with DynamoDB Streams
You would like to have DynamoDB automatically delete old data for you. What should you use?
TTL
You are looking to improve the performance of your RDS database by caching some of the most common rows and queries. Which technology do you recommend?
ElastiCache
How does the Glue Crawler extract partitions?
Extraction is based on how your S3 data is organized
What are the targets of Glue ETL?
- S3
- JDBC (RDS, Redshift)
- Glue Data Catalog
Which platform is Glue ETL running on?
Serverless Spark platform
Three ways of running Glue jobs
- time based schedules
- job bookmarks
- CloudWatch Events
Which Glue feature prevents reprocessing of old data?
Job Bookmark
Glue cost model
Billing by the minute for Crawler and ETL jobs
First million objects stored and accessed are free for the Glue Data Catalog
Development endpoint for developing ETL code charged by the minute
Does Glue ETL support streaming ETL?
yes, runs on Apache Spark Structured Streaming (serverless)
What is the simplest way to make sure the metadata under Glue Data Catalog is always up-to-date and in-sync with the underlying data without your intervention each time?
Schedule crawlers to run periodically
Which programming languages can be used to write ETL code for AWS Glue?
Python and Scala
Can you run existing ETL jobs with AWS Glue?
YES
Upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.
How can you be notified of the execution of AWS Glue jobs?
CloudWatch + SNS
What is AWS Glue Studio?
Visual interface for ETL workflows
What is AWS Glue DataBrew?
A visual data preparation tool
Three types of nodes in EMR
- master
- core
- task
What does HDFS stand for?
Hadoop Distributed File System
How are files stored in HDFS?
files are stored as blocks (128MB default size)
What is EMRFS in AWS?
The EMR File System (EMRFS) is an implementation of the Hadoop file system that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3
What happens when you manually detach an EBS volume in EMR?
EMR treats that as a failure and replaces it.
What local storage is suitable for in EMR?
buffers, caches, etc.
EMR charging schema
per hour plus EC2 charges
What happens when a core node fails in EMR?
EMR provisions a new node automatically
How to increase processing capacity but not HDFS capacity in EMR?
Add more task nodes
How to increase both processing and HDFS capacity in EMR?
Resize or add core nodes
Scale-Up strategy in EMR
first add core nodes, then task nodes, up to max units specified
Scale-Down strategy in EMR
- first removes task nodes, then core nodes, no further than minimum constraints
- spot nodes always removed before on-demand instances
What does YARN stand for?
Yet Another Resource Negotiator
What is Apache Spark?
Open-source distributed processing framework for big data
Which languages are supported by Apache Spark?
Java, Scala, Python and R
What is Apache Tez?
Apache Tez is an open-source framework for big data processing based on MapReduce technology
What is Apache Pig?
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java.
What is HBase?
non-relational, petabyte-scale database based on Google’s BigTable, on top of HDFS
What is Presto used for?
- it connects to many different “big data” databases and data stores at once, and queries across them
- interactive queries at petabyte scale
What’s under the hood of AWS Athena?
Presto
What is Apache Zeppelin used for?
Apache Zeppelin is a multi-purpose web-based notebook that brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark
What is Hue?
Graphical front-end for applications on EMR cluster
What’s the usage of Splunk?
operational tool - can be used to visualize EMR and S3 data using EMR Hadoop Cluster
What’s the usage of Flume?
Another way to stream data into cluster. Originally made to handle log aggregation.
What is MXNet?
Like TensorFlow, a library for building and accelerating neural networks.
What is S3DistCP?
Tool for copying large amounts of data between S3 and HDFS
Which Amazon EMR tool is used for querying multiple data stores at once?
Presto
When you delete your EMR cluster, what happens to the EBS volumes?
EMR will delete the volumes once the EMR cluster is terminated
What’s under the hood of Kinesis Data Analytics?
Apache Flink
Is Kinesis Analytics serverless?
YES
Is Kinesis Analytics scaled automatically?
YES
What is the usage of RANDOM_CUT_FOREST in Kinesis Analytics?
SQL function used for anomaly detection on numeric columns in a stream
As recommended by AWS, you are going to ensure you have dedicated master nodes for high performance. As a user, what can you configure for the master nodes?
The count and instance types of master nodes
Which are supported ways to import data into your Amazon ES domain?
- Kinesis
- Logstash
- Elasticsearch’s APIs
What can you do to prevent data loss due to nodes within your ES domain failing?
Elasticsearch snapshots
Athena cost model
Pay-as-you-go
- $5 per TB scanned
- Successful or cancelled queries count
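A back-of-the-envelope helper for that cost model (assuming the documented 10MB per-query minimum and binary TB; the function is mine, not an AWS API):

```python
def athena_query_cost(scanned_bytes: int, price_per_tb: float = 5.0) -> float:
    """$5 per TB scanned, with a 10MB minimum billed per query."""
    billed = max(scanned_bytes, 10 * 1024 * 1024)  # 10MB minimum
    return billed / 1024 ** 4 * price_per_tb       # binary TB assumed

print(athena_query_cost(1024 ** 4))  # 1 TB scanned -> 5.0
```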
What does OLAP stand for?
On-Line Analytical Processing
Is Redshift designed for OLAP or OLTP?
OLAP
Max number of Compute Nodes in Redshift
128 Nodes
What VACUUM command is used for in Redshift?
Recovers space from deleted rows and re-sorts rows
What is Redshift Elastic resize?
- quickly add or remove nodes of the same type
- cluster is down for a few minutes
- tries to keep connections open across the downtime
What is Redshift Classic resize?
- change node type and/or number of nodes
- cluster is read-only for hours to days
Max number of read replicas in AWS Aurora
15 read replicas
Max storage in Amazon Aurora
Up to 64TB per database instance
What is GraphQL used for?
GraphQL is designed to make APIs fast, flexible, and developer-friendly. It can even be deployed within an integrated development environment (IDE) known as GraphiQL. As an alternative to REST.
What is Amazon Kendra used for?
Amazon Kendra is a highly accurate intelligent search service that enables your users to search unstructured data using natural language. It returns specific answers to questions, giving users an experience that’s close to interacting with a human expert.
You have an S3 bucket that your entire organization can read. For security reasons you would like the data to sit encrypted there, and you would like to define a strategy in which users can only read the data they are allowed to decrypt, which may be a different subset of objects within the bucket for each user. How can you achieve that?
Use SSE-KMS to encrypt the files
SSE-KMS will allow you to use different KMS keys to encrypt the objects, and then you can grant users access to specific sets of KMS keys to give them access to the objects in S3 they should be able to decrypt
An application processes sensor data in real-time by publishing it to Kinesis Data Streams, which in turn sends data to an AWS Lambda function that processes it and feeds it to DynamoDB. During peak usage periods, it’s been observed that some data is lost. You’ve determined that you have sufficient capacity allocated for Kinesis shards and DynamoDB reads and writes. What might be TWO possible solutions to the problem?
- Increase your Lambda function’s timeout value
- Process data in smaller batches to avoid hitting Lambda’s timeout
As part of your application development, you would like your users to be able to get Row Level Security. The application is to be deployed on web servers and the users of the application should be able to use their amazon.com accounts. What do you recommend for the database and security?
Enable Web Identity federation. Use DynamoDB and reference ${www.amazon.com:user_id} in the attached IAM policy
What SSE security mechanisms are supported by EMR?
SSE-S3 – Amazon S3 manages keys for you.
SSE-KMS – You use an AWS KMS key to set up with policies suitable for Amazon EMR
Is SSE-C (customer-provided keys) available for use in EMR?
NO
Two EMR EBS encryption options
- EBS encryption - available only when you specify AWS Key Management Service as your key provider.
- LUKS encryption – If you choose to use LUKS encryption for Amazon EBS volumes, the LUKS encryption applies only to attached storage volumes, not to the root device volume.
You are working for an e-commerce website and that website uses an on-premise PostgreSQL database as its main OLTP engine. You would like to perform analytical queries on it, but the Solutions Architect recommended not doing it off of the main database. What do you recommend?
Use DMS to replicate the database to RDS
You are processing data using a long running EMR cluster and you would like to ensure that you can recover data in case an entire availability zone goes down, as well as process the data locally for the various Hive jobs you plan on running. What do you recommend to do this at a minimal cost?
Store the data in S3 and keep a warm copy in HDFS
A financial services company wishes to back up its encrypted data warehouse in Amazon Redshift daily to a different region. What is the simplest solution that preserves encryption in transit and at rest?
Configure Redshift to automatically copy snapshots to another region, using an AWS KMS customer master key in the destination region.
Does Redshift have cross region snapshots?
YES
A company wishes to copy 500GB of data from their Amazon Redshift cluster into an Amazon RDS PostgreSQL database, in order to have both columnar and row-based data stores available. The Redshift cluster will continue to receive large amounts of new data every day that must be kept in sync with the RDS database. What strategy would be most efficient?
Copy data using the dblink function into PostgreSQL tables
What is Ganglia?
Ganglia is the operational dashboard provided with EMR
Your company has data from a variety of sources, including Microsoft Excel spreadsheets stored in S3, log data stored in a S3 data lake, and structured data stored in Redshift. Which is the simplest solution for providing interactive dashboards that span this data?
Use Amazon Quicksight directly on top of the Excel, S3, and Redshift data.
As part of an effort to limit cost and keep the size of your DynamoDB table under control, your AWS account manager would like to ensure old data is deleted in DynamoDB after 1 month. How can you do so with as little maintenance as possible and without impacting the current read and write operations?
Enable DynamoDB TTL and add a TTL column
You are dealing with PII datasets and would like to leverage Kinesis Data Streams for your pub-sub solution. Regulators imposed the constraint that the data must be encrypted end-to-end using an internal key management system. What do you recommend?
Implement a custom encryption code in the Kinesis Producer Library (KPL)
A manager wishes to make a case for hiring more people in her department, by showing that the number of incoming tasks for her department have grown at a faster rate than other departments over the past year. Which type of graph in Amazon Quicksight would be best suited to illustrate this data?
Area line chart
You are looking to reduce the latency of your Big Data processing job that operates in Singapore but sources data from Virginia. The Big Data job must always operate against the latest version of the data. What do you recommend?
Enable S3 Cross Region Replication
You have an ETL process that collects data from different sources and 3rd-party providers and would like to ensure that data is loaded into Redshift once all the parts from all the providers related to one specific job have been gathered, a process that can take from one hour to one day. What is the least costly way of doing that?
Create an AWS Lambda function that responds to S3 upload events and checks whether all the parts are there before loading into Redshift
A financial services company has a large, secure data lake stored in Amazon S3. They wish to analyze this data using a variety of tools, including Apache Hive, Amazon Athena, Amazon Redshift, and Amazon QuickSight.
How should they connect their data and analysis tools in a way that minimizes costs and development work?
Run an AWS Glue Crawler on the data lake to populate a AWS Glue Data Catalog. Share the glue data catalog as a metadata repository between Athena, Redshift, Hive, and QuickSight
You are working for a data warehouse company that uses an Amazon Redshift cluster. For security reasons, it is required that VPC flow logs be analyzed by Athena to monitor all COPY and UNLOAD traffic of the cluster that moves in and out of the VPC. Which of the following helps you in this regard?
Use Enhanced VPC Routing
A hospital monitoring sensor data from heart monitors wishes to raise immediate alarms if an anomaly in any individual’s heart rate is detected.
Which architecture meets these requirements in a scalable manner?
Publish sensor data into a Kinesis data stream, and create a Kinesis Data Analytics application using RANDOM_CUT_FOREST to detect anomalies. When an anomaly is detected, use a Lambda function to route an alarm to Amazon SNS
A produce export company has multi-dimensional data for all of its shipments, such as the date, price, category, and destination of every shipment. A data analyst wishes to interactively explore this data, applying statistical functions to different rows and columns and sorting them in different ways.
Which QuickSight visualization would be best suited for this?
Pivot table
You are an online retailer and your website is a storefront for millions of products. You have recently run a big sale on one specific electronic item and you have encountered Provisioned Throughput Exceptions. You would like to ensure you can properly survive an upcoming sale that will be three times as big. What do you recommend?
DynamoDB DAX
Your daily Spark job runs against files created by a Kinesis Firehose pipeline in S3. Due to low throughput, you observe that each of the many files created by Kinesis Firehose is about 100KB. You would like to optimize your Spark job as best as possible to query the data efficiently. What do you recommend?
Consolidate files on a daily basis using DataPipeline
A data scientist wishes to develop a machine learning model to predict stock prices using Python in a Jupyter Notebook, and use a cluster on AWS to train and tune this model, and to vend predictions from it at large scale.
Which system allows you to do this?
Amazon SageMaker
You are tasked with using Hive on Elastic MapReduce to analyze data that is currently stored in a large relational database.
Which approach could meet this requirement?
Use Apache Sqoop on the EMR cluster to copy the data into HDFS
What is Sqoop?
Sqoop is an open-source system for transferring data between Hadoop and relational databases.
Your esports application hosted on AWS needs to process game results immediately in real time and later perform analytics on the same game results, in the order they arrived, at the end of business hours. Which AWS service will be the best fit for your needs?
Kinesis Data Streams
You wish to use Amazon Redshift Spectrum to analyze data in an Amazon S3 bucket that is in a different account than Redshift Spectrum.
How would you authorize access between Spectrum and S3 across accounts?
Add a policy to the S3 bucket allowing S3 GET and LIST operations for an IAM role for Spectrum on the Redshift account
You need to ETL streaming data from web server logs as it is streamed in, for analysis in Athena. Upon talking to the stakeholders, you’ve determined that the ETL does not strictly need to happen in real-time, but transforming the data within a minute is desirable.
What is a viable solution to this requirement?
Perform any initial ETL you can using Amazon Kinesis, store the data in S3, and trigger a Glue ETL job to complete the transformations needed.
An organization has a large body of web server logs stored on Amazon S3, and wishes to quickly analyze their data using Amazon Athena. Most queries are operational in nature, and are limited to a single day’s logs.
How should the log data be prepared to provide the most performant queries in Athena, and to minimize costs?
Convert the data into Apache Parquet format, compressed with Snappy, stored in a directory structure of year=XXXX/month=XX/day=XX/
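Building that directory structure programmatically is a one-liner. A sketch (the helper is mine) of the Hive-style prefix Athena can use for partition pruning:

```python
from datetime import date

def partition_prefix(d: date) -> str:
    """Hive-style partition prefix so Athena can prune to a single day."""
    return f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"

print(partition_prefix(date(2023, 4, 7)))  # -> year=2023/month=04/day=07/
```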
You are creating an EMR cluster that will process the data in several MapReduce steps. Currently you are working against the data in S3 using EMRFS, but the network costs are extremely high as the processes write back temporary data to S3 before reading it. You are tasked with optimizing the process and bringing the cost down, what should you do?
Add a preliminary step that will use a S3DistCp command
You are required to maintain a real-time replica of your Amazon Redshift data warehouse across multiple availability zones.
What is one approach toward accomplishing this?
Spin up separate redshift clusters in multiple availability zones, using Amazon Kinesis to simultaneously write data into each cluster. Use Route 53 to direct your analytics tools to the nearest cluster when querying your data.
You work for a gaming company and each game’s data is stored in DynamoDB tables. In order to provide a game search functionality to your users, you need to move that data over to ElasticSearch. How can you achieve it efficiently and as close to real time as possible?
Enable DynamoDB Streams and write a Lambda function
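The Lambda side of that pattern receives DynamoDB Streams events in the shape below. A sketch that extracts indexable documents (the `Records`/`eventName`/`NewImage` layout is the real stream event format; the Elasticsearch indexing call itself is left out):

```python
def extract_documents(event):
    """Pull new and updated items out of a DynamoDB Streams event,
    ready to be indexed into Elasticsearch."""
    docs = []
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # unwrap simple DynamoDB attribute values like {"S": "g1"}
            docs.append({k: next(iter(v.values())) for k, v in image.items()})
    return docs

event = {"Records": [
    {"eventName": "INSERT",
     "dynamodb": {"NewImage": {"game_id": {"S": "g1"}, "score": {"N": "42"}}}},
    {"eventName": "REMOVE", "dynamodb": {}},
]}
print(extract_documents(event))  # -> [{'game_id': 'g1', 'score': '42'}]
```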
A large news website needs to produce personalized recommendations for articles to its readers, by training a machine learning model on a daily basis using historical click data. The influx of this data is fairly constant, except during major elections when traffic to the site spikes considerably.
Which system would provide the most cost-effective and reliable solution?
Publish click data into Amazon S3 using Kinesis Firehose, and process the data nightly using Apache Spark and MLLib using spot instances in an EMR cluster. Publish the model’s results to DynamoDB for producing recommendations in real-time.
You are working for a bank and your company regularly uploads 100 MB files to Amazon S3, where they are analyzed by Athena. It has come to light that recently some of the uploads have been corrupted, making a critical big data job fail. Your company would like a stronger guarantee that uploads are done successfully and that the files have the same content on premises and on S3, at minimal cost. What do you recommend?
Use the S3 ETag and compare to the local MD5 hash
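For single-part uploads the S3 ETag is simply the hex MD5 of the object (multipart ETags differ: they include a part-count suffix), so the check can be done locally. A sketch:

```python
import hashlib

def upload_is_intact(local_content: bytes, s3_etag: str) -> bool:
    """Compare a local MD5 with the S3 ETag (valid for single-part
    uploads only; S3 returns the ETag wrapped in double quotes)."""
    return hashlib.md5(local_content).hexdigest() == s3_etag.strip('"')

print(upload_is_intact(b"hello", '"5d41402abc4b2a76b9719d911017c592"'))  # -> True
```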
Three modules of SageMaker
- Build
- Train
- Deploy
What limit, if any, is there to the size of your training dataset in Amazon Machine Learning by default?
100GB
Is there a limit to the size of the dataset that you can use for training models with Amazon SageMaker? If so, what is the limit?
No fixed limit
Does Kinesis Stream preserve client ordering?
YES
Can Kinesis Streams data be consumed in parallel?
YES
EMR deployment options
- EC2
- Amazon EKS
- AWS Outposts
Description of Task node in EMR?
A node with software components that only runs tasks and DOES NOT store data in HDFS.
Task node is optional
What is EMRFS file system in EMR?
Implementation of the Hadoop file system used for reading and writing regular files from Amazon EMR directly to Amazon S3.
What is HDFS in EMR?
Instance store and Amazon Elastic Block Store (Amazon EBS) volume storage is used for HDFS data and for buffers, caches, scratch data, and other temporary content that some applications may “spill” to the local file system
What is Apache Presto?
Presto, also known as PrestoDB, is an open source, distributed SQL query engine that enables fast analytic queries against data of any size.
What is EMR notebook?
Amazon EMR notebooks provide a managed analysis environment based on open-source Jupyter notebooks so that data scientists, analysts, and developers can prepare and visualize data, collaborate with peers, build applications, and perform interactive analysis.
How is Redshift traffic routed when Enhanced VPC Routing is not enabled?
Amazon Redshift routes traffic through the internet, including traffic to other services within the AWS network.
What is Apache Airflow?
Apache Airflow is an open-source task scheduler that can be installed on EC2 instances or bootstrapped on primary nodes
What is Amazon MWAA?
Amazon MWAA (Managed Workflows for Apache Airflow) is a managed service that reduces the burden of provisioning and ongoing maintenance of Airflow and offers seamless integration with CloudWatch for system metrics and logs. It offers a rich UI and troubleshooting tools and can be used to orchestrate jobs across hybrid environments.
Two cluster types used by Amazon EMR
- long-running
- transient
Use cases for long-running EMR cluster
- Spark Streaming or Flink
- online transaction processing (OLTP) workload like Apache HBase
What is an EMR transient cluster?
A cluster that shuts down automatically after all of its steps complete
What is Hive Metastore in AWS?
Apache Hive is an open-source data warehouse and analytics package that runs on top of an Apache Hadoop cluster. A Hive metastore contains a description of the table and the underlying data making up its foundation, including the partition names and data types.
Where is Hive Metastore information recorded by default?
In a MySQL database on the master node’s file system.
Patterns to deploy a Hive Metastore on Amazon EMR:
- AWS Glue Data Catalog
- external data store such as Amazon Relational Database Service (Amazon RDS) or Amazon Aurora
What is Apache Ranger?
Apache Ranger is an open-source project that provides authorization and audit capabilities for Hadoop and related big data applications like Apache Hive, Apache HBase, and Apache Kafka.
What is S3DistCp in Amazon EMR?
The primary data transfer utility used in Amazon EMR; an extension of the open-source Apache DistCp, optimized to work with Amazon S3.
Can extra EBS volumes be added to EMR cluster?
YES
Is AWS Glue using servers?
No. It’s a serverless service
What is AWS Redshift spectrum?
Redshift Spectrum is a feature of Amazon Redshift that allows you to query data stored on Amazon S3 directly and supports nested data types.
What is AWS QuickSight?
Amazon QuickSight allows everyone in your organization to understand your data by asking questions in natural language, exploring through interactive dashboards, or automatically looking for patterns and outliers powered by machine learning.
What is AWS Glue Data Catalog?
Persistent metadata store, you can use this managed service to store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.