Important stuff_AWS Certification Flashcards
SageMaker
Training data usually comes from S3, but SageMaker can also ingest from Athena, EMR, Redshift, and Amazon Keyspaces.
Cannot be deployed on an EMR cluster
Kinesis Firehose data conversion capabilities
Kinesis Firehose has the ability to convert JSON data to Parquet or ORC format on the fly.
Vanishing Gradient
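A “vanishing gradient” results from multiplying together many small derivatives of the sigmoid activation function across multiple layers. The sigmoid derivative is never larger than 0.25, so the product shrinks toward zero as depth grows. ReLU has a derivative of exactly 1 for positive inputs, so it avoids this problem.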
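A rough numeric illustration of the card above. This is a simplified sketch that ignores the weights and assumes every layer sees the same pre-activation value; the function names and the depth of 20 are made up for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when x == 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activation, layers):
    """Multiply the same local derivative across `layers` layers (weights ignored)."""
    product = 1.0
    for _ in range(layers):
        product *= grad_fn(pre_activation)
    return product

# Best case for sigmoid (x = 0, derivative 0.25) still collapses over 20 layers.
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, 20)
# ReLU with a positive input passes the gradient through unchanged.
relu_chain = chained_gradient(relu_grad, 1.0, 20)

print(f"sigmoid, 20 layers: {sigmoid_chain:.3e}")
print(f"relu,    20 layers: {relu_chain:.1f}")
```

Even in the sigmoid's best case the chained factor is 0.25 raised to the 20th power, while the ReLU factor stays at 1.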
QuickSight
Can read directly from S3.
Kinesis Data Firehose
- Fully Managed Service, no administration
- Near Real Time (minimum latency of 60 seconds for non-full batches)
- Data Ingestion into Redshift / Amazon S3 / Elasticsearch / Splunk, i.e. it loads data into these services
- Automatic scaling
- Supports many data formats
- Data Conversions from JSON to Parquet / ORC (only for S3); CSV must first be transformed to JSON, for example with Lambda
- Data Transformation through AWS Lambda (ex: CSV => JSON)
- Supports compression when target is Amazon S3 (GZIP, ZIP, and SNAPPY)
- Pay for the amount of data going through Firehose
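A minimal sketch of the Lambda data transformation bullet above (CSV => JSON). Firehose passes the function a batch of base64-encoded records and expects each one back with the same recordId, a result status, and base64-encoded data. The column names in COLUMNS are hypothetical and only there for illustration.

```python
import base64
import csv
import json

# Hypothetical CSV layout, assumed for this example only.
COLUMNS = ["sensor_id", "timestamp", "reading"]

def lambda_handler(event, context):
    """Turn each incoming CSV line into one JSON object per record."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        # csv.reader copes with quoted fields; an empty line yields [].
        fields = next(csv.reader([line]), [])
        if len(fields) != len(COLUMNS):
            # Hand the record back untouched and mark it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        payload = json.dumps(dict(zip(COLUMNS, fields))) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

All values stay strings here; casting the reading to a number would be a further step that depends on the actual schema.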
convert to LibSVM
Neither Glue ETL nor Kinesis Analytics can convert to LibSVM format, and scikit-learn is not a distributed solution.
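For reference, LibSVM is a sparse text format with one example per line: a label followed by index:value pairs for the non-zero features, with indices conventionally starting at 1. The helper below is a single-machine sketch that only shows what the target format looks like; it is not the distributed conversion the card is about, and the function name is made up.

```python
def to_libsvm_line(label, features):
    """Format one example as 'label index:value ...', skipping zero features."""
    pairs = [
        f"{index}:{value:g}"
        for index, value in enumerate(features, start=1)  # 1-based indices
        if value != 0
    ]
    return " ".join([f"{label:g}"] + pairs)

line = to_libsvm_line(1, [0.0, 2.5, 0.0, 1.0])
print(line)  # 1 2:2.5 4:1
```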
Binning
Binning is the process of converting numeric data into categorical data.
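A small sketch of binning, assuming made-up age boundaries and labels. Each numeric value is replaced by the label of the interval it falls into.

```python
from bisect import bisect_right

# Hypothetical bin edges and category names, chosen only for illustration.
AGE_EDGES = [18, 35, 65]
AGE_LABELS = ["minor", "young_adult", "adult", "senior"]

def bin_value(value, edges, labels):
    """Return the label of the bin containing `value`.

    With edges [18, 35, 65], the bins are <18, 18-34, 35-64, and 65+.
    """
    return labels[bisect_right(edges, value)]

ages = [5, 18, 34, 35, 70]
binned = [bin_value(age, AGE_EDGES, AGE_LABELS) for age in ages]
print(binned)
# ['minor', 'young_adult', 'young_adult', 'adult', 'senior']
```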
Kinesis Producer Library
The KPL is an easy-to-use, highly configurable library that helps you write to a Kinesis data stream. It acts as an intermediary between your producer application code and the Kinesis Data Streams API actions.
The KPL can help build high-performance producers.
Lambda Functions
Not meant to handle ETL jobs
Spark
Not meant for OLTP. Used for distributed analytics: batch processing jobs and transforming data as it comes in (Spark Streaming).