Analytics | Amazon Kinesis Data Firehose Flashcards
What is Amazon Kinesis Data Firehose?
General
Amazon Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security.
What does Amazon Kinesis Data Firehose manage on my behalf?
General
Amazon Kinesis Data Firehose manages all underlying infrastructure, storage, networking, and configuration needed to capture and load your data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, or Splunk. You do not have to worry about provisioning, deploying, or maintaining hardware and software, or about writing any other application to manage this process. Firehose also scales elastically without requiring any intervention or associated developer overhead. Moreover, Amazon Kinesis Data Firehose synchronously replicates data across three facilities in an AWS Region, providing high availability and durability for the data as it is transported to the destinations.
How do I use Amazon Kinesis Data Firehose?
General
After you sign up for Amazon Web Services, you can start using Amazon Kinesis Data Firehose with the following steps (a minimal API sketch follows the list):
1. Create an Amazon Kinesis Data Firehose delivery stream through the Firehose Console or the CreateDeliveryStream operation. You can optionally configure an AWS Lambda function in your delivery stream to prepare and transform the raw data before loading the data.
2. Configure your data producers to continuously send data to your delivery stream using the Amazon Kinesis Agent or the Firehose API.
3. Firehose automatically and continuously loads your data to the destinations you specify.
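As an illustration of steps 1 and 2, the following boto3 (Python) sketch creates a delivery stream with an Amazon S3 destination and sends one record through the Firehose API. The stream name, bucket ARN, and role ARN are hypothetical placeholders:

    import boto3

    firehose = boto3.client('firehose')

    # Step 1: create a delivery stream that loads data into S3.
    firehose.create_delivery_stream(
        DeliveryStreamName='example-stream',  # placeholder name
        DeliveryStreamType='DirectPut',
        ExtendedS3DestinationConfiguration={
            'RoleARN': 'arn:aws:iam::123456789012:role/example-firehose-role',
            'BucketARN': 'arn:aws:s3:::example-bucket',
        },
    )

    # Step 2: a data producer sends records through the Firehose API.
    firehose.put_record(
        DeliveryStreamName='example-stream',
        Record={'Data': b'{"event": "click"}\n'},
    )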
What is a source?
General
A source is where your streaming data is continuously generated and captured. For example, a source can be a logging server running on Amazon EC2 instances, an application running on mobile devices, a sensor on an IoT device, or a Kinesis stream.
What are the limits of Amazon Kinesis Data Firehose?
Key Amazon Kinesis Data Firehose Concepts
For information about limits, see Amazon Kinesis Data Firehose Limits in the developer guide.
What is a delivery stream?
Key Amazon Kinesis Data Firehose Concepts
A delivery stream is the underlying entity of Amazon Kinesis Data Firehose. You use Firehose by creating a delivery stream and then sending data to it.
What is a record?
Key Amazon Kinesis Data Firehose Concepts
A record is the data of interest your data producer sends to a delivery stream. The maximum size of a record (before Base64-encoding) is 1000 KB.
What is a destination?
Creating Delivery Streams
A destination is the data store where your data will be delivered. Amazon Kinesis Data Firehose currently supports Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk as destinations.
How do I create a delivery stream?
Creating Delivery Streams
You can create an Amazon Kinesis Data Firehose delivery stream through the Firehose Console or the CreateDeliveryStream operation. For more information, see Creating a Delivery Stream.
What compression format can I use?
Creating Delivery Streams
Amazon Kinesis Data Firehose allows you to compress your data before delivering it to Amazon S3. The service currently supports GZIP, ZIP, and SNAPPY compression formats. Only GZIP is supported if the data is further loaded to Amazon Redshift.
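In the API, the compression format is a property of the destination configuration. The fragment below is a sketch with placeholder ARNs; accepted values for CompressionFormat include 'UNCOMPRESSED', 'GZIP', 'ZIP', and 'Snappy':

    # Fragment of an S3 destination configuration for CreateDeliveryStream.
    s3_config = {
        'RoleARN': 'arn:aws:iam::123456789012:role/example-firehose-role',
        'BucketARN': 'arn:aws:s3:::example-bucket',
        'CompressionFormat': 'GZIP',  # use GZIP if the data is loaded on to Redshift
    }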
How does compression work when I use the CloudWatch Logs subscription feature?
Creating Delivery Streams
You can use the CloudWatch Logs subscription feature to stream data from CloudWatch Logs to Kinesis Data Firehose. All log events from CloudWatch Logs are already compressed in gzip format, so you should keep Firehose’s compression configuration set to uncompressed to avoid double compression. For more information about the CloudWatch Logs subscription feature, see Subscription Filters with Amazon Kinesis Data Firehose in the Amazon CloudWatch Logs user guide.
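For reference, a subscription filter that streams a log group to a delivery stream can be set up with boto3; the log group name, filter name, and ARNs below are assumptions for illustration:

    import boto3

    logs = boto3.client('logs')
    logs.put_subscription_filter(
        logGroupName='/example/app-logs',
        filterName='to-firehose',
        filterPattern='',  # an empty pattern forwards every log event
        destinationArn='arn:aws:firehose:us-east-1:123456789012:deliverystream/example-stream',
        roleArn='arn:aws:iam::123456789012:role/example-cwl-to-firehose-role',
    )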
What kind of encryption can I use?
Creating Delivery Streams
Amazon Kinesis Data Firehose allows you to encrypt your data after it’s delivered to your Amazon S3 bucket. While creating your delivery stream, you can choose to encrypt your data with an AWS Key Management Service (KMS) key that you own. For more information about KMS, see AWS Key Management Service.
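In the API, this corresponds to the destination’s EncryptionConfiguration; the key ARN below is a placeholder for a KMS key you own:

    # Fragment of an S3 destination configuration enabling KMS encryption.
    s3_config = {
        'RoleARN': 'arn:aws:iam::123456789012:role/example-firehose-role',
        'BucketARN': 'arn:aws:s3:::example-bucket',
        'EncryptionConfiguration': {
            'KMSEncryptionConfig': {
                'AWSKMSKeyARN': 'arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555',
            },
        },
    }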
What is data transformation with Lambda?
Creating Delivery Streams
Firehose can invoke an AWS Lambda function to transform incoming data before delivering it to destinations. You can configure a new Lambda function using one of the Lambda blueprints we provide or choose an existing Lambda function.
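In the API, the transformation function is attached through the destination’s ProcessingConfiguration; the Lambda ARN below is a placeholder:

    # Fragment of a destination configuration wiring in a transform Lambda.
    processing_config = {
        'Enabled': True,
        'Processors': [{
            'Type': 'Lambda',
            'Parameters': [{
                'ParameterName': 'LambdaArn',
                'ParameterValue': 'arn:aws:lambda:us-east-1:123456789012:function:example-transform',
            }],
        }],
    }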
What is source record backup?
Creating Delivery Streams
If you use data transformation with Lambda, you can enable source record backup, and Amazon Kinesis Data Firehose will deliver the untransformed incoming data to a separate S3 bucket. You can specify an extra prefix to be added in front of the “YYYY/MM/DD/HH” UTC time prefix generated by Firehose.
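As a sketch, backup is enabled on the destination configuration; the bucket ARN, role ARN, and extra prefix below are placeholders:

    # Fragment enabling source record backup alongside Lambda transformation.
    s3_config = {
        # ... other destination settings ...
        'S3BackupMode': 'Enabled',
        'S3BackupConfiguration': {
            'RoleARN': 'arn:aws:iam::123456789012:role/example-firehose-role',
            'BucketARN': 'arn:aws:s3:::example-backup-bucket',
            'Prefix': 'raw/',  # added in front of the YYYY/MM/DD/HH UTC time prefix
        },
    }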
What is error logging?
Creating Delivery Streams
If you enable data transformation with Lambda, Firehose can log any Lambda invocation and data delivery errors to Amazon CloudWatch Logs so that you can view the specific error logs if Lambda invocation or data delivery fails. For more information, see Monitoring with Amazon CloudWatch Logs.
What is buffer size and buffer interval?
Creating Delivery Streams
Amazon Kinesis Data Firehose buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations. You can configure buffer size and buffer interval while creating your delivery stream. Buffer size is in MBs and ranges from 1 MB to 128 MB for the Amazon S3 destination and from 1 MB to 100 MB for the Amazon Elasticsearch Service destination. Buffer interval is in seconds and ranges from 60 seconds to 900 seconds. Please note that in circumstances where data delivery to the destination falls behind data writing to the delivery stream, Firehose raises the buffer size dynamically to catch up and make sure that all data is delivered to the destination.
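Both settings are expressed through the destination’s BufferingHints. The values below are illustrative; delivery is triggered by whichever threshold is reached first:

    # Fragment of a destination configuration.
    s3_config = {
        # ... other destination settings ...
        'BufferingHints': {
            'SizeInMBs': 5,            # 1-128 for an S3 destination
            'IntervalInSeconds': 120,  # 60-900
        },
    }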
How is buffer size applied if I choose to compress my data?
Creating Delivery Streams
Buffer size is applied before compression. As a result, if you choose to compress your data, the size of the objects within your Amazon S3 bucket can be smaller than the buffer size you specify.
What is the IAM role that I need to specify while creating a delivery stream?
Creating Delivery Streams
Amazon Kinesis Data Firehose assumes the IAM role you specify to access resources such as your Amazon S3 bucket and Amazon Elasticsearch domain. For more information, see Controlling Access with Amazon Kinesis Data Firehose in the Amazon Kinesis Data Firehose developer guide.
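The role must trust the Firehose service so that it can be assumed. A minimal trust policy, created here with boto3 under a placeholder role name:

    import boto3, json

    iam = boto3.client('iam')
    iam.create_role(
        RoleName='example-firehose-role',  # placeholder
        AssumeRolePolicyDocument=json.dumps({
            'Version': '2012-10-17',
            'Statement': [{
                'Effect': 'Allow',
                'Principal': {'Service': 'firehose.amazonaws.com'},
                'Action': 'sts:AssumeRole',
            }],
        }),
    )

Permissions policies granting access to the S3 bucket (and, if applicable, the Amazon Elasticsearch domain) are then attached to this role separately.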
What privilege is required for the Amazon Redshift user that I need to specify while creating a delivery stream?
Creating Delivery Streams
The Amazon Redshift user needs to have the Redshift INSERT privilege for copying data from your Amazon S3 bucket to your Redshift cluster.
What do I need to do if my Amazon Redshift cluster is within a VPC?
Creating Delivery Streams
If your Amazon Redshift cluster is within a VPC, you need to grant Amazon Kinesis Data Firehose access to your Redshift cluster by unblocking Firehose IP addresses from your VPC. Firehose currently uses one CIDR block for each available AWS Region: 52.70.63.192/27 for US East (N. Virginia), 52.89.255.224/27 for US West (Oregon), and 52.19.239.192/27 for EU (Ireland). For information about how to unblock IPs to your VPC, see Grant Firehose Access to an Amazon Redshift Destination in the Amazon Kinesis Data Firehose developer guide.
Why do I need to provide an Amazon S3 bucket while choosing Amazon Redshift as destination?
Creating Delivery Streams
For the Amazon Redshift destination, Amazon Kinesis Data Firehose delivers data to your Amazon S3 bucket first and then issues a Redshift COPY command to load the data from your S3 bucket to your Redshift cluster.
What is index rotation for Amazon Elasticsearch Service destination?
Creating Delivery Streams
Amazon Kinesis Data Firehose can rotate your Amazon Elasticsearch Service index based on a time duration. You can configure this time duration while creating your delivery stream. For more information, see Index Rotation for the Amazon ES Destination in the Amazon Kinesis Data Firehose developer guide.
Why do I need to provide an Amazon S3 bucket when choosing Amazon Elasticsearch Service as destination?
Creating Delivery Streams
When loading data into Amazon Elasticsearch Service, Amazon Kinesis Data Firehose can back up all of the data or only the data that failed to deliver. To take advantage of this feature and prevent any data loss, you need to provide a backup Amazon S3 bucket.
Can I change the configurations of my delivery stream after it’s created?
Preparing and Transforming Data in Amazon Kinesis Data Firehose
You can change the configuration of your delivery stream at any time after it’s created. You can do so by using the Firehose Console or the UpdateDestination operation. Your delivery stream remains in ACTIVE state while your configurations are updated and you can continue to send data to your delivery stream. The updated configurations normally take effect within a few minutes.
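As a sketch, an update through the API first reads the current version ID and destination ID from DescribeDeliveryStream; the stream name and new buffering hints below are illustrative:

    import boto3

    firehose = boto3.client('firehose')
    desc = firehose.describe_delivery_stream(
        DeliveryStreamName='example-stream'
    )['DeliveryStreamDescription']

    firehose.update_destination(
        DeliveryStreamName='example-stream',
        CurrentDeliveryStreamVersionId=desc['VersionId'],
        DestinationId=desc['Destinations'][0]['DestinationId'],
        ExtendedS3DestinationUpdate={
            'BufferingHints': {'SizeInMBs': 10, 'IntervalInSeconds': 300},
        },
    )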
How do I prepare and transform raw data in Amazon Kinesis Data Firehose?
Preparing and Transforming Data in Amazon Kinesis Data Firehose
Amazon Kinesis Data Firehose allows you to use an AWS Lambda function to prepare and transform incoming raw data in your delivery stream before loading it to destinations. You can configure an AWS Lambda function for data transformation when you create a new delivery stream or when you edit an existing delivery stream.
How do I return prepared and transformed data from my AWS Lambda function back to Amazon Kinesis Data Firehose?
Preparing and Transforming Data in Amazon Kinesis Data Firehose
All transformed records from Lambda must be returned to Firehose with the following three parameters; otherwise, Firehose will reject the records and treat them as data transformation failures. (A minimal handler sketch follows the list.)
recordId: Firehose passes a recordId along with each record to Lambda during the invocation. Each transformed record should be returned with the exact same recordId. Any mismatch between the original recordId and returned recordId will be treated as data transformation failure.
result: The status of the transformation result of each record. The following values are allowed for this parameter:
“Ok” if the record is transformed successfully as expected.
“Dropped” if your processing logic intentionally drops the record as expected.
“ProcessingFailed” if the record cannot be transformed as expected.
Firehose treats returned records with “Ok” and “Dropped” statuses as successfully processed records, and the ones with “ProcessingFailed” status as unsuccessfully processed records when it generates the SucceedProcessing.Records and SucceedProcessing.Bytes metrics.
data: The transformed data payload, after Base64 encoding.
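A minimal Python handler honoring this contract might look like the following; the transformation itself (uppercasing the payload) is purely illustrative:

    import base64

    def lambda_handler(event, context):
        output = []
        for record in event['records']:
            payload = base64.b64decode(record['data'])
            try:
                transformed = payload.decode('utf-8').upper().encode('utf-8')
                output.append({
                    'recordId': record['recordId'],  # must match the incoming recordId
                    'result': 'Ok',
                    'data': base64.b64encode(transformed).decode('utf-8'),
                })
            except Exception:
                output.append({
                    'recordId': record['recordId'],
                    'result': 'ProcessingFailed',
                    'data': record['data'],
                })
        return {'records': output}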
What Lambda blueprints are available for data preparation and transformation?
Preparing and Transforming Data in Amazon Kinesis Data Firehose
Firehose provides the following Lambda blueprints that you can use to create your Lambda function for data transformation:
General Firehose Processing: This blueprint contains the data transformation and status model described above. Use this blueprint for any custom transformation logic.
Apache Log to JSON: This blueprint parses and converts Apache log lines into JSON objects, with predefined JSON field names.
Apache Log to CSV: This blueprint parses and converts Apache log lines into CSV format.
Syslog to JSON: This blueprint parses and converts Syslog lines into JSON objects, with predefined JSON field names.
Syslog to CSV: This blueprint parses and converts Syslog lines into CSV format.
Can I keep a copy of all the raw data in my S3 bucket?
Adding Data to Delivery Streams
Yes, Firehose can back up all untransformed records to your S3 bucket concurrently while delivering transformed records to the destination. Source record backup can be enabled when you create or update your delivery stream.