Analytics | Amazon Kinesis Data Analytics Flashcards
What is Amazon Kinesis Data Analytics?
General
Amazon Kinesis Data Analytics is the easiest way to process and analyze real-time, streaming data. With Amazon Kinesis Data Analytics, you just use standard SQL to process your data streams, so you don’t have to learn any new programming languages. Simply point Kinesis Data Analytics at an incoming data stream, write your SQL queries, and specify where you want to load the results. Kinesis Data Analytics takes care of running your SQL queries continuously on data while it’s in transit and sending the results to the destinations.
What is real-time stream processing and why do I need it?
General
Data is coming at us at lightning speeds due to an explosive growth of real-time data sources. Whether it is log data coming from mobile and web applications, purchase data from ecommerce sites, or sensor data from IoT devices, it all delivers information that can help companies learn about what their customers, organization, and business are doing right now. By having visibility into this data as it arrives, you can monitor your business in real-time and quickly leverage new business opportunities – like making promotional offers to customers based on where they might be at a specific time, or monitoring social sentiment and changing customer attitudes to identify and act on new opportunities.
To take advantage of these opportunities, you need a different set of analytics tools for collecting and analyzing real-time streaming data than what has been available traditionally for static, stored data. With traditional analytics, you gather the information, store it in a database, and analyze it hours, days, or weeks later. Analyzing real-time data requires a different approach and different tools and services. Instead of running database queries on stored data, streaming analytics platforms process the data continuously before the data is stored in a database. Streaming data flows at an incredible rate that can vary up and down all the time. Streaming analytics platforms have to be able to process this data when it arrives, often at speeds of millions of events per hour.
What can I do with Kinesis Data Analytics?
General
You can use Kinesis Data Analytics in pretty much any use case where you are collecting data continuously in real-time and want to get information and insights in seconds or minutes rather than having to wait days or even weeks. In particular, Kinesis Data Analytics enables you to quickly build end-to-end stream processing applications for log analytics, clickstream analytics, Internet of Things (IoT), ad tech, gaming, and more. The three most common usage patterns are time-series analytics, real-time dashboards, and real-time alerts and notifications.
Generate Time-Series Analytics
Time-series analytics enables you to monitor and understand how your data is trending over time. With Kinesis Data Analytics, you can author SQL code that continuously generates time-series analytics over time windows. For example, you can build a live leaderboard for a mobile game by computing the top players every minute and then sending it to Amazon S3. Or, you can track the traffic to your website by calculating the number of unique website visitors every five minutes and then sending the processed results to Amazon Redshift.
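As a hedged sketch of this pattern (not taken verbatim from the service documentation), the application code below counts page views per page over one-minute tumbling windows; the page_url column is an assumed input field, and "SOURCE_SQL_STREAM_001" is the default name of the in-application input stream.

    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
        page_url   VARCHAR(256),
        view_count INTEGER);

    CREATE OR REPLACE PUMP "STREAM_PUMP" AS
      INSERT INTO "DESTINATION_SQL_STREAM"
      -- Count views per page over a one-minute tumbling window.
      SELECT STREAM page_url, COUNT(*) AS view_count
      FROM "SOURCE_SQL_STREAM_001"
      GROUP BY page_url,
               STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);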
Feed Real-Time Dashboards
You can build applications that compute query results and emit them to a live dashboard, enabling you to visualize the data in near real-time. For example, an application can continuously calculate business metrics such as the number of purchases from an e-commerce site, grouped by product category, and then send the results to Amazon Redshift for visualization with a business intelligence tool of your choice. Consider another example where an application processes log data, calculates the number of application errors, and then sends the results to Amazon Elasticsearch Service for visualization with Kibana.
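A hedged sketch of the dashboard pattern, assuming the input stream carries category and amount columns; the aggregated "DESTINATION_SQL_STREAM" would then be configured as the application output that feeds Amazon Redshift.

    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
        category    VARCHAR(64),
        total_sales DOUBLE);

    CREATE OR REPLACE PUMP "DASHBOARD_PUMP" AS
      INSERT INTO "DESTINATION_SQL_STREAM"
      -- Total purchase amount per product category, per one-minute window.
      SELECT STREAM category, SUM(amount) AS total_sales
      FROM "SOURCE_SQL_STREAM_001"
      GROUP BY category,
               FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE);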
Create Real-Time Alarms and Notifications
You can build applications that send real-time alarms or notifications when certain metrics reach predefined thresholds, or, in more advanced cases, when your application detects anomalies using the machine learning algorithm we provide. For example, an application can compute the availability or success rate of a customer-facing API over time, and then send results to Amazon CloudWatch. You can build another application to look for events that meet certain criteria, and then automatically notify the right customers using Kinesis Data Streams and Amazon Simple Notification Service (SNS).
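For the anomaly detection case, Kinesis Data Analytics exposes the RANDOM_CUT_FOREST SQL function, which assigns an anomaly score to each record. The sketch below follows the documented usage shape, with an assumed api_latency metric column; records whose score exceeds a threshold could then be routed onward for alerting.

    CREATE OR REPLACE STREAM "ANOMALY_STREAM" (
        api_latency   INTEGER,
        ANOMALY_SCORE DOUBLE);

    CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
      INSERT INTO "ANOMALY_STREAM"
      -- RANDOM_CUT_FOREST appends an anomaly score to every input record.
      SELECT STREAM api_latency, ANOMALY_SCORE
      FROM TABLE(RANDOM_CUT_FOREST(
               CURSOR(SELECT STREAM api_latency FROM "SOURCE_SQL_STREAM_001")));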
How do I get started with Kinesis Data Analytics?
General
Sign in to the Kinesis Data Analytics console and create a new stream processing application. You can also use the AWS CLI and AWS SDKs. You can build an end-to-end application in three simple steps: 1) configure incoming streaming data, 2) write your SQL queries, and 3) point to where you want the results loaded. Kinesis Data Analytics recognizes standard data formats such as JSON, CSV, and TSV, and automatically creates a baseline schema. You can refine this schema, or if your data is unstructured, you can define a new one using our intuitive schema editor. Then, the service applies the schema to the input stream and makes it look like a SQL table that is continually updated so that you can write standard SQL queries against it. You use our SQL editor to build your queries. The SQL editor comes with all the bells and whistles, including syntax checking and testing against live data. We also give you templates that provide the SQL code for anything from a simple stream filter to advanced anomaly detection and top-K analysis. Kinesis Data Analytics takes care of provisioning and elastically scaling all of the infrastructure to handle any data throughput. You don’t need to plan, provision, or manage infrastructure.
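For a sense of what the generated application code looks like, here is a minimal sketch along the lines of the continuous-filter template: a single SELECT that reads the default in-application input stream ("SOURCE_SQL_STREAM_001") and emits matching rows. The ticker_symbol, sector, and price columns are assumptions for this example.

    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
        ticker_symbol VARCHAR(4),
        sector        VARCHAR(16),
        price         REAL);

    CREATE OR REPLACE PUMP "STREAM_PUMP" AS
      INSERT INTO "DESTINATION_SQL_STREAM"
      -- Continuous filter: keep only records from one sector.
      SELECT STREAM ticker_symbol, sector, price
      FROM "SOURCE_SQL_STREAM_001"
      WHERE sector = 'TECHNOLOGY';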
What are the limits of Kinesis Data Analytics?
Key Kinesis Data Analytics Concepts
Kinesis Data Analytics elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. For detailed information on service limits, see Limits in the Amazon Kinesis Data Analytics Developer Guide.
What is a Kinesis Data Analytics application?
Key Kinesis Data Analytics Concepts
An application is the Kinesis Data Analytics entity that you work with. Kinesis Data Analytics applications continuously read and process streaming data in real-time. You write application code using SQL to process the incoming streaming data and produce output. Then, Kinesis Data Analytics writes the output to a configured destination.
Each application consists of three primary components:
Input – The streaming source for your application. In the input configuration, you map the streaming source to an in-application input stream. The in-application stream is like a continuously updating table upon which you can perform SELECT and INSERT SQL operations. Each input record has an associated schema, which is applied as part of inserting the record into the in-application stream.
Application code – A series of SQL statements that process input and produce output. In its simplest form, application code can be a single SQL statement that selects from a streaming input and inserts results into a streaming output. It can also be a series of SQL statements where the output of one feeds into the input of the next SQL statement. Further, you can write application code to split an input stream into multiple streams and then apply additional queries to process these separate streams.
Output – You can create one or more in-application streams to hold intermediate results. You can then optionally configure an application output to persist data from specific in-application streams to an external destination.
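To make the three components concrete, here is a hedged two-step sketch: the first pump filters the mapped input stream ("SOURCE_SQL_STREAM_001" is the default name) into an intermediate in-application stream, and the second aggregates that intermediate stream into the stream configured as the application output. The log_level and message columns are assumptions for this example.

    -- Step 1: filter raw records into an intermediate in-application stream.
    CREATE OR REPLACE STREAM "ERROR_STREAM" (
        log_level VARCHAR(10),
        message   VARCHAR(512));

    CREATE OR REPLACE PUMP "ERROR_PUMP" AS
      INSERT INTO "ERROR_STREAM"
      SELECT STREAM log_level, message
      FROM "SOURCE_SQL_STREAM_001"
      WHERE log_level = 'ERROR';

    -- Step 2: aggregate the intermediate stream into the configured output.
    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
        log_level   VARCHAR(10),
        error_count INTEGER);

    CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS
      INSERT INTO "DESTINATION_SQL_STREAM"
      SELECT STREAM log_level, COUNT(*) AS error_count
      FROM "ERROR_STREAM"
      GROUP BY log_level,
               STEP("ERROR_STREAM".ROWTIME BY INTERVAL '60' SECOND);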
What is an in-application stream?
Key Kinesis Data Analytics Concepts
An in-application stream is an entity that continuously stores data in your application so that you can perform SELECT and INSERT SQL operations on it. You interact with an in-application stream in the same way that you would a SQL table; however, a stream differs from a table in that its data is continuously updated. In your application code, you can create additional in-application streams to store intermediate query results. Finally, both your configured input and output are represented in your application as in-application streams.
What inputs are supported in a Kinesis Data Analytics application?
Key Kinesis Data Analytics Concepts
Kinesis Data Analytics supports two types of inputs: streaming data sources and reference data sources. A streaming data source is continuously generated data that is read into your application for processing. A reference data source is static data that your application uses to enrich data coming in from streaming sources. Each application can have no more than one streaming data source and no more than one reference data source. An application continuously reads and processes new data from its streaming data source, which can be an Amazon Kinesis data stream or an Amazon Kinesis Data Firehose delivery stream. An application reads its reference data source, an Amazon S3 object, in its entirety and uses it to enrich the streaming data source through SQL JOINs.
What is a reference data source?
Key Kinesis Data Analytics Concepts
A reference data source is static data that your application uses to enrich data coming in from streaming sources. You store reference data as an object in your S3 bucket. When the application starts, Kinesis Data Analytics reads the S3 object and creates an in-application SQL table to store the reference data. Your application code can then join it with an in-application stream. You can update the data in the SQL table by calling the UpdateApplication API.
What application code is supported?
Key Kinesis Data Analytics Concepts
Kinesis Data Analytics supports ANSI SQL with some extensions to the SQL standard that make it easier to work with streaming data. Additionally, Kinesis Data Analytics provides several machine learning algorithms that are exposed as SQL functions, including anomaly detection, approximate top-K, and approximate distinct items.
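As one illustration of these extensions, here is a hedged sketch using the approximate top-K function TOP_K_ITEMS_TUMBLING. The page_url column, the K of 10, and the 60-second window are example parameters, and the exact output schema of the function (mirrored here as the destination stream's columns) is defined in the Amazon Kinesis Data Analytics SQL Reference.

    CREATE OR REPLACE STREAM "TOP_PAGES_STREAM" (
        "page_url"             VARCHAR(256),
        "MOST_FREQUENT_VALUES" BIGINT);

    CREATE OR REPLACE PUMP "TOP_PAGES_PUMP" AS
      INSERT INTO "TOP_PAGES_STREAM"
      -- Emit the 10 most frequent page_url values for each 60-second window.
      SELECT STREAM *
      FROM TABLE(TOP_K_ITEMS_TUMBLING(
               CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001"),
               'page_url',  -- column to rank, passed as a string
               10,          -- number of most frequently occurring values to return
               60));        -- tumbling window size in seconds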
What destinations are supported?
Configuring Destinations
Kinesis Data Analytics supports up to four destinations per application. You can persist SQL results to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service (through an Amazon Kinesis Data Firehose delivery stream), and to Amazon Kinesis Data Streams. You can write to a destination not directly supported by Kinesis Data Analytics by sending SQL results to Amazon Kinesis Data Streams and leveraging its integration with AWS Lambda to send them to a destination of your choice.
How do I set up a streaming data source?
Configuring Input
A streaming data source can be an Amazon Kinesis data stream or an Amazon Kinesis Data Firehose delivery stream. Your Kinesis Data Analytics application continuously reads new data from streaming data sources as it arrives in real time. The data is made accessible in your SQL code through an in-application stream. An in-application stream acts like a SQL table because you can create, insert, and select from it. However, the difference is that an in-application stream is continuously updated with new data from the streaming data source.
You can use the AWS Management Console to add a streaming data source. You can learn more about sources in the Configuring Application Input section of the Kinesis Data Analytics Developer Guide.
How do I set up a reference data source?
Configuring Input
A reference data source can be an Amazon S3 object. Your Kinesis Data Analytics application reads the S3 object in its entirety when it starts running. The data is made accessible in your SQL code through a table. The most common use case for using a reference data source is to enrich the data coming from the streaming data source using a SQL JOIN.
Using the AWS CLI, you can add a reference data source by specifying the S3 bucket, object, IAM role, and associated schema. Kinesis Data Analytics loads this data when you start the application, and refreshes it each time you call the UpdateApplication API.
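A hedged sketch of that enrichment JOIN, assuming the reference data was configured with the in-application table name "COMPANY_REFERENCE" and contains ticker_symbol and company_name columns, while the streaming source carries ticker_symbol and price.

    CREATE OR REPLACE STREAM "ENRICHED_STREAM" (
        ticker_symbol VARCHAR(4),
        company_name  VARCHAR(64),
        price         REAL);

    CREATE OR REPLACE PUMP "ENRICH_PUMP" AS
      INSERT INTO "ENRICHED_STREAM"
      -- Enrich each streaming record with the S3-backed reference table.
      SELECT STREAM "SOURCE_SQL_STREAM_001".ticker_symbol,
                    "r".company_name,
                    "SOURCE_SQL_STREAM_001".price
      FROM "SOURCE_SQL_STREAM_001", "COMPANY_REFERENCE" AS "r"
      WHERE "SOURCE_SQL_STREAM_001".ticker_symbol = "r".ticker_symbol;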
What data formats are supported?
Configuring Input
Kinesis Data Analytics detects the schema and automatically parses UTF-8 encoded JSON and CSV records using the DiscoverInputSchema API. This schema is applied to the data read from the stream as part of the insertion into an in-application stream.
For other UTF-8 encoded data that does not use a delimiter, uses a different delimiter than CSV, or in cases where the discovery API did not fully discover the schema, you can define a schema using the interactive schema editor or use string manipulation functions to structure your data. For more information, see Using the Schema Discovery Feature and Related Editing in the Kinesis Data Analytics Developer Guide.
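As a hedged illustration of the string-manipulation route, suppose each record arrives as a single pipe-delimited VARCHAR column named log_line (an assumption for this example); standard string functions such as POSITION and SUBSTRING can then split it into typed columns.

    CREATE OR REPLACE STREAM "PARSED_STREAM" (
        log_level VARCHAR(10),
        message   VARCHAR(512));

    CREATE OR REPLACE PUMP "PARSE_PUMP" AS
      INSERT INTO "PARSED_STREAM"
      -- Split 'LEVEL|message' style records on the first pipe character.
      SELECT STREAM
          SUBSTRING(log_line FROM 1 FOR POSITION('|' IN log_line) - 1),
          SUBSTRING(log_line FROM POSITION('|' IN log_line) + 1)
      FROM "SOURCE_SQL_STREAM_001";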
How is my input stream exposed to my SQL code?
Authoring Application Code
Kinesis Data Analytics applies your specified schema and inserts your data into one or more in-application streams for streaming sources, and a single SQL table for reference sources. The default number of in-application streams satisfies most use cases. You should increase this number if you find that your application is not keeping up with the latest data in your source stream, as measured by the MillisBehindLatest CloudWatch metric. The number of in-application streams required depends on both the throughput of your source stream and the complexity of your queries. The parameter for specifying the number of in-application streams mapped to your source stream is called input parallelism.
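When input parallelism is greater than one, the source maps to several in-application streams ("SOURCE_SQL_STREAM_001", "SOURCE_SQL_STREAM_002", and so on), and your SQL code typically merges them back into a single stream before further processing. A hedged sketch, assuming an input parallelism of 2 and a single metric_value column:

    -- Merge two parallel in-application input streams into one combined stream.
    CREATE OR REPLACE STREAM "COMBINED_STREAM" (metric_value DOUBLE);

    CREATE OR REPLACE PUMP "COMBINED_PUMP_001" AS
      INSERT INTO "COMBINED_STREAM"
      SELECT STREAM metric_value FROM "SOURCE_SQL_STREAM_001";

    CREATE OR REPLACE PUMP "COMBINED_PUMP_002" AS
      INSERT INTO "COMBINED_STREAM"
      SELECT STREAM metric_value FROM "SOURCE_SQL_STREAM_002";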