AWS Big Data Specialty Flashcards
Spark Patterns and Anti Patterns
Spark Patterns:
- High-performance, fast engine for processing large amounts of data (in-memory and on disk)
- Faster than running queries in Hive
- Run queries against live data
- Flexibility in terms of languages
Spark Anti Patterns:
- It is not designed for OLTP
- Not fit for batch processing
- Avoid large multi-user reporting environment with high concurrency
Kinesis Retention Periods
24 Hours to 7 Days
Default is 24 Hours
EMR Consistent View
EMRFS consistent view monitors Amazon S3 list consistency for objects written by or synced with EMRFS, delete consistency for objects deleted by EMRFS, and read-after-write consistency for new objects written by EMRFS.
You can configure additional settings for consistent view by providing them in
/home/hadoop/conf/emrfs-site.xml
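A minimal emrfs-site.xml fragment enabling consistent view might look like this (property names are from the EMRFS documentation; the values shown are the documented defaults):

```xml
<!-- /home/hadoop/conf/emrfs-site.xml — illustrative fragment -->
<configuration>
  <property>
    <name>fs.s3.consistent</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3.consistent.retryCount</name>
    <value>5</value>
  </property>
  <property>
    <name>fs.s3.consistent.retryPeriodSeconds</name>
    <value>10</value>
  </property>
  <property>
    <name>fs.s3.consistent.metadata.tableName</name>
    <value>EmrFSMetadata</value>
  </property>
</configuration>
```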
DynamoDB Max number of LSI
5
Kinesis Firehose Handling
- S3: retries delivery for up to 24 hours
- Redshift & Elasticsearch: configurable retry duration of 0-7200 seconds
Apache Hadoop Modules
- Hadoop Common
- HDFS
- YARN
- MapReduce
Impala
Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS).
Kinesis Consumers
Read data from streams:
- for further processing
- data store delivery
Kinesis Streams
Kinesis Streams:
- Receive data from the Producers
- Replicate data over multiple availability zones for durability
- Distribute data among the provisioned shards
EMR Data Compression Formats
Algorithm / Splittable? / Compression Ratio / Compress-Decompress Speed
- GZIP/No/High/Medium
- bzip2/Yes/Very High/Slow
- LZO/Yes/Low/Fast
- Snappy/No/Low/Very Fast
Presto - Patterns and Anti-Patterns
Presto Patterns:
- Query different types of data sources - Relational Database, Nosql, HIVE framework, kafka stream processing
- High concurrency
- In-memory processing
Presto Anti-patterns:
- Not fit for Batch Processing
- Not designed for OLTP
- Not fit for large join operations
KPL - Key Concepts
- Include library and use
- Can write to multiple Amazon Kinesis streams
- Error recovery built-in: Retry mechanisms
- Synchronous and asynchronous writing
- Multithreading
- Complement to the Amazon Kinesis Client Library (KCL)
- CloudWatch Integration –Records In/Out/Error
- Batches data records to increase payload size and improve throughput
- Aggregation – multiple data records sent in one transaction; increasing the number of records sent per API call
- Collection – takes multiple aggregated records from the previous step and sends them as one HTTP request; further optimizing the data transfer by reducing HTTP request overhead
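The aggregation-then-collection pipeline above can be sketched as a pure-Python simulation (this is not the real KPL, which is a native daemon; sizes and record counts here are illustrative):

```python
def aggregate(user_records, max_payload=1_048_576):
    """Pack many small user records into aggregated records of up to ~1 MB."""
    aggregated, current, size = [], [], 0
    for rec in user_records:
        if size + len(rec) > max_payload and current:
            aggregated.append(b"".join(current))  # flush a full aggregate
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        aggregated.append(b"".join(current))
    return aggregated

def collect(aggregated_records, max_batch=500):
    """Group aggregated records into PutRecords-sized batches (500 max)."""
    return [aggregated_records[i:i + max_batch]
            for i in range(0, len(aggregated_records), max_batch)]

# 2000 x 1 KB user records -> 2 aggregated records -> 1 HTTP request
user_records = [b"x" * 1024 for _ in range(2000)]
agg = aggregate(user_records)
batches = collect(agg)
```

The point of the sketch: thousands of user records collapse into a handful of aggregated records, which then travel in a single PutRecords call, cutting HTTP overhead.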
Resizing EMR Cluster
- Only task nodes can be resized up or down
- Only one master, cannot change that
- Core nodes can only be added
- Even with EMRFS, core nodes have HDFS for processing
- Add task nodes, task node groups when more processing is needed
Redshift Important Operations
Redshift important operations:
- Launch
- Resize
- Vacuum
- Backup & Restore
- Monitoring
DynamoDB Performance Metrics
1 Partition = 10 GB = 3000 RCU & 1000 WCU
RCU - 4KB/sec
WCU- 1 KB/sec
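The 4 KB read unit and 1 KB write unit above translate into a simple capacity calculation (a sketch; item sizes and rates are made up):

```python
import math

def required_rcu(item_size_bytes, strongly_consistent_reads_per_sec):
    # 1 RCU = one strongly consistent read/sec of an item up to 4 KB
    return math.ceil(item_size_bytes / 4096) * strongly_consistent_reads_per_sec

def required_wcu(item_size_bytes, writes_per_sec):
    # 1 WCU = one write/sec of an item up to 1 KB
    return math.ceil(item_size_bytes / 1024) * writes_per_sec

# 6 KB items read 10x/sec: ceil(6/4) = 2 RCU per read -> 20 RCU
rcu = required_rcu(6 * 1024, 10)
# 1.5 KB items written 5x/sec: ceil(1.5) = 2 WCU per write -> 10 WCU
wcu = required_wcu(1536, 5)
```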
DynamoDB Streams Configuration Views
- KEYS_ONLY
- NEW_IMAGE
- OLD_IMAGE
- NEW_AND_OLD IMAGES
KPL Use Cases
- High rate producers
- Record aggregation
Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Coordinates distributed processing
Regression Model
- To predict a numerical value
- RMSE number measures quality of a model
- Lower RMSE better predictions
- RMSE - Root-Mean-Square-Error
Use Cases
- Determine what your house is worth
- How many units of this product will sell?
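The RMSE metric above is easy to compute by hand (house prices here are hypothetical):

```python
import math

def rmse(actual, predicted):
    """Root-Mean-Square Error: lower means better predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

# House prices in $1000s; predictions off by 10 and 20 -> RMSE ~ 15.81
error = rmse([300, 450], [310, 430])
```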
Kinesis Agent
- Real-time Kinesis file mediation client written in Java
- Streams files/tails files
- Handles file rotation, check pointing and retry upon failure
- Multiple folders/files to multiple streams
- Transform data prior to streaming: SINGLELINE, CSVTOJSON, LOGTOJSON
- CloudWatch- BytesSent, RecordSendAttempts, RecordSendErrors, ServiceErrors
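An illustrative agent configuration tying the points above together (the stream name and log path are hypothetical; the key names follow the documented /etc/aws-kinesis/agent.json format):

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/app/*.log",
      "kinesisStream": "my-stream",
      "dataProcessingOptions": [
        { "optionName": "LOGTOJSON", "logFormat": "COMMONAPACHELOG" }
      ]
    }
  ]
}
```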
Kinesis Firehose Destination Data Delivery
- S3
- Elasticsearch
- Redshift
Machine Learning Algorithms
- Supervised Learning - trained on labeled data
  a. Classification - Is this transaction fraud?
  b. Regression - Customer lifetime value
- Unsupervised Learning - self-learning
  a. Clustering - Market segmentation
EMR Cluster sizing
- Master Node: m3.xlarge for < 50 nodes, m3.2xlarge for > 50 nodes
- Core Nodes - Replication Factor:
  - > 10 node cluster: 3
  - 4-9 node cluster: 2
  - 3 node cluster: 1
- HDFS Capacity Formula: Data Size = Total Storage / Replication Factor
Note: AWS recommends a smaller cluster of larger nodes
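The sizing rules above can be expressed directly (a sketch; the example node count and storage are made up):

```python
def replication_factor(core_nodes):
    # Guidance above: >10 node cluster -> 3, 4-9 -> 2, 3 or fewer -> 1
    if core_nodes > 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1

def hdfs_usable_capacity(total_storage_gb, core_nodes):
    # Data Size = Total Storage / Replication Factor
    return total_storage_gb / replication_factor(core_nodes)

# 30 core nodes with 12 TB of raw storage -> 4 TB of usable HDFS data
cap = hdfs_usable_capacity(12000, 30)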
DynamoDB Performance
DynamoDB Performance
- Partitions by throughput = Desired RCU/3000 + Desired WCU/1000
- Partitions by size = Data size in GB/10 GB
- Total partitions = the larger of the two
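The two formulas above combine into one calculation (the commonly cited rule is that DynamoDB allocates the larger of the two; example numbers are made up):

```python
import math

def partitions(desired_rcu, desired_wcu, data_size_gb):
    by_throughput = math.ceil(desired_rcu / 3000 + desired_wcu / 1000)
    by_size = math.ceil(data_size_gb / 10)
    # DynamoDB allocates whichever estimate is larger
    return max(by_throughput, by_size)

# 7500 RCU + 3000 WCU -> ceil(2.5 + 3.0) = 6; 42 GB -> ceil(4.2) = 5
p = partitions(desired_rcu=7500, desired_wcu=3000, data_size_gb=42)
```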
Quicksight Components
- Data Set
- SPICE - Super-fast, Parallel, In-memory Calculation Engine
- Capacity measured in GB; 10 GB per user
Kinesis Streams Important Points
- Data can be emitted to S3, DynamoDB, Elasticsearch and Redshift using the KCL
- Lambda functions can automatically read records from a Kinesis stream, process them, and send the records to S3, DynamoDB or Redshift
Difference between Kafka and Kinesis
In a nutshell, Kafka is a better option if:
- You have the in-house knowledge to maintain Kafka and ZooKeeper
- You need to process more than 1000s of events/s
- You don't want to integrate it with AWS services
Kinesis works best if:
- You don’t have the in-house knowledge to maintain Kafka
- You process 1000s of events/s at most
- You stream data into S3 or Redshift
- You don’t want to build a Kappa architecture
- Max payload size 1 MB
Big Data Visualization
A. Web-based notebooks
- Zeppelin
- Jupyter Notebook (IPython)
B. D3.js - Data-Driven Documents
IoT Limits
- Max 300 MQTT CONNECT requests per second
- Max 9000 publish requests per second
- 3000 in
- 6000 out
- Client connection payload limit 512KB/s
- Shadows deleted after 1 year if not updated or retrieved
- Max 1000 rules per AWS account
- Max 10 actions per rule
Getting data into Kinesis - Third Party Support
- Log4J Appender
- Flume
- Fluentd
Kinesis Producer Library (KPL)
- API
- multiple streams
- multithread (for multicore)
- synchronous and asynchronous
- complement to KCL (kinesis client library)
- cloudwatch - records in/out/error
Methods to load data into Firehose
- Kinesis Agent
- AWS SDK
Hue
Open-source web interface for analyzing data on EMR:
- Amazon S3 and HDFS browser
- Hive/Pig
- Oozie
- Metastore Manager
- Job browser and user management
Firehose Data Transformation
With the Firehose data transformation feature, you can now specify a Lambda function that can perform transformations directly on the stream, when you create a delivery stream.
When you enable Firehose data transformation, Firehose buffers incoming data and invokes the specified Lambda function with each buffered batch asynchronously.
To get you started, we provide the following Lambda blueprints, which you can adapt to suit your needs:
- Apache Log to JSON
- Apache Log to CSV
- Syslog to JSON
- Syslog to CSV
- General Firehose Processing
EMR HDFS Parameters
- Replication factor - 3 times
- Block Size: 64 MB - 256 MB
- Replication factor can be configured in hdfs-site.xml
- Block size and Replication factor set per file
ES Stability
3 master nodes
Redshift Data Loading - Data Format
- CSV
- Delimited
- Fixed Width
- JSON
- Avro
Tracking Amazon Kinesis Streams Application State
For each Amazon Kinesis Streams application, the KCL uses a unique Amazon DynamoDB table to keep track of the application’s state. If your Amazon Kinesis Streams application receives provisioned-throughput exceptions, you should increase the provisioned throughput for the DynamoDB table.
For example, if your Amazon Kinesis Streams application does frequent checkpointing or operates on a stream that is composed of many shards, you might need more throughput.
Redshift : CloudHSM Vs. KMS - Security
CloudHSM
- $16k/Year + $5K upfront
- Need to setup HA & Durability
- Single tenant
- Customer managed root of trust
- Symmetric & Asymmetric encryption
- International Common Criteria EAL4 and U.S. Government NIST FIPS 140.2
KMS
- Usage-based pricing
- Highly Available & Durable
- Multi-tenant
- AWS managed root of trust
- Symmetric encryption only
- Auditing
Machine Learning Summary
- Binary - AUC (Area Under the Curve) - true positives, true negatives, false positives and false negatives; the only model type that can be fine-tuned by adjusting the score threshold
- Multiclass - Confusion Matrix (correct predictions & incorrect predictions)
- Regression - RMSE (Root Mean Square Error) - the lower the RMSE, the better the predictions
Kinesis Streams - Load/Get Data Options
- Kinesis Producer Library - Producers
- Kinesis Client Library - KCL
- Kinesis Agent
- Kinesis REST API
Redshift - Vacuum - Best Practices
- Vacuum is I/O sensitive
- Perform Vacuum after bulk deletes, data loading or after updates
- Perform Vacuum during lower period of activity or during your maintenance windows
- Vacuum utility is not recommended for tables over 700GB
- You can avoid running Vacuum by:
  - Loading data in sort key order
  - Using time-series tables
WLM - Type of Groups
- User Group
- Query Group
Redshift Table Design - Constraints
Maintain data integrity
Types for constraints
- Primary Key
- Unique
- Not null/Null
- References
- Foreign Key
Except for Not Null/Null, constraints are informational only; Redshift does not enforce them
Hunk
Hunk is a web-based interactive data analytics
platform for rapidly exploring, analysing and
visualizing data in Hadoop and NoSQL data stores
Types of Analysis
- Pre-processing: filtering, transformations
- Basic Analytics: Simple counts, aggregates over windows
- Advanced Analytics: Detecting anomalies, event correlation
- Post-processing: Alerting, triggering, final filters
Jupyter Notebook
Jupyter is a web-based notebook for running Python,
R, Scala and other languages to process and visualize
data, perform statistical analysis, and train and run
machine learning models
Kinesis Streams - Kinesis Connectors available for
DynamoDB
S3
Elasticsearch
Redshift
Redshift Important System Tables
- STL_LOAD_ERRORS
- STL_LOADERROR_DETAIL
Apache Ranger
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include a centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component, for example, NameNode.
Redshift - Encryption in Transit
Redshift - Encryption in transit
- Create parameter group
- SSL Certificate
Machine Learning Use Cases
- Fraud Detection
- Customer Service
- Litigation/Legal
- Security
- Healthcare
- Sports
Kinesis Firehose Important Parameters
Buffer Size - 1 MB - 128 MB
Buffer Interval - 60-900 Seconds
Parameters for transformation:
- Record ID
- Result: Ok, Dropped & ProcessingFailed
- Data
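A minimal sketch of a transformation Lambda using those three parameters (the event/response shape is the documented Firehose record format; the upper-casing transform is purely illustrative):

```python
import base64

def handler(event, context):
    """Firehose transformation Lambda: returns recordId/result/data triples."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()          # stand-in transformation
        output.append({
            "recordId": record["recordId"],    # must echo the input recordId
            "result": "Ok",                    # or "Dropped"/"ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}

# Simulated invocation with one buffered record
event = {"records": [{"recordId": "1",
                      "data": base64.b64encode(b"hello").decode()}]}
result = handler(event, None)
```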
Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Redshift - Encryption at Rest
- KMS
- HSM (CloudHSM & On-prem HSM)
Note: Redshift encrypts data blocks, system metadata & snapshots
Tez
- Tez is an engine for processing complex Directed Acyclic Graphs (DAGs)
- It can be used in place of Hadoop MapReduce for running Pig and Hive
- Runs on top of YARN
SQS vs Kinesis Streams
SQS - Message Queue
Kinesis Streams - Real time processing
DynamoDB Performance Points
- Use GSI
- For uneven writes: use burst capacity, spread periodic batch writes, or an SQS-managed write buffer
- For uneven reads: use caching
Redshift features
- Petabyte scale data warehouse services
- OLAP & BI Use cases
- ANSI SQL Compliance
- Column Oriented
- MPP Architecture
- Node Types:
a. Dense Compute (DC1 and DC2)
b. Dense Storage (DS2)
Single AZ Implementation
Machine Learning Limits
Max observation size (target+attributes): 100KB
Max training data size: 100GB
Max batch predictions data size: 1TB
Max batch predictions data records: 100 million
Max columns in schema: 1000
Real-time prediction endpoint TPS: 200
Number of classes for multiclass ML models: 100
Kinesis: key features
Kinesis : Key Features
- Real time data streaming
- Ordered record delivery
- Replication to three availability zones
- de-coupled from consuming application
- replay data
- zero downtime scaling
- pay as you go
- parallel processing - multiple producers and consumers
EMR - S3DistCP
In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and that adds several useful features.
- Copy or move files without transformation
- Copy and change file compression on the fly
- Copy files incrementally
- Copy multiple folders in one job
- Aggregate files based on a pattern
- Upload files larger than 1 TB in size
- Submit an S3DistCp step to an EMR cluster
Spark Components
- Spark Core - Dispatch & Scheduling tasks
- Spark SQL - Execute low-latency interactive SQL query against structured data
- Spark Streaming - Stream processing of live data streams
- MLlib - Scalable Machine Learning Library
- GraphX - Graphs parallel computation
Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tuneable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
Redshift Cluster Resizing
- Creation of the destination cluster
- Source cluster restarts and enters read-only mode
- Reconnect to the source cluster to run queries in read-only mode
- Redshift starts copying data from the source to the target cluster
- Once the copy is over, Redshift updates the DNS endpoint to the target cluster
- The source cluster is decommissioned
SDK Use Cases
Low rate producers
Mobile apps
IoT devices
Web clients
Redshift Table Design - Distribution Style
- Even
  - Rows distributed across slices regardless of the values in any particular column
  - Default distribution style
- Key
  - Distributes data among slices based on the values in one column
  - Collocates matching rows in the same slice
  - Improves join performance
  - Use cases: joined tables, larger fact tables
- All
  - A copy of the entire table is stored on every node
  - Needs more space due to duplication
  - Use cases: static data, small tables, no common distribution key
Redshift - Slices Guidelines
No. of data files should be equal to no. of slices or multiple of the no. of slices
i.e. 4 slices = 4 files or 8 files
32 slices = 32 files or 64 files
File compression - gzip, lzop, bzip2
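The "one file per slice" guideline above amounts to splitting a load into equal pieces (a sketch; record counts are made up):

```python
def split_counts(total_records, slices):
    """Split a load into one file per slice so every slice does equal work."""
    base, extra = divmod(total_records, slices)
    return [base + (1 if i < extra else 0) for i in range(slices)]

# 4 slices -> 4 evenly sized files
files = split_counts(1000, 4)
```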
Glacier - Vault Lock Policy
Amazon Glacier Vault Lock allows you to easily deploy and enforce compliance controls for individual Amazon Glacier vaults with a vault lock policy.
Use Cases
- Time based retention
- Undeletable
Kinesis Analytics Use Cases
- Mobile app live monitoring
- Clickstream analytics
- Logs
- Metering records
- IoT data
EMR - Resizing Cluster
- Manually - Options
- Terminate at instance hour
- Terminate at task completion
- Autoscaling
How to decide between more nodes vs. bigger nodes generally - Amazon Redshift ?
Bigger nodes are better for long running queries
eg: fewer dc1.8xlarge better than more dc1.large
More nodes are better for short running queries
eg: more dc1.large better than fewer dc1.8xlarge
WLM Settings
- Reboot of the Redshift cluster is required to reflect changes to: User Group, Query Group, User Group Wildcard, Query Group Wildcard
- No reboot required for dynamic parameters: Concurrency, % of Memory, Timeout
What are the different types of KPL Batching?
- Collection - groups stream records and batches them to reduce HTTP requests
- Aggregation - combines multiple user records into a single stream record
Data lake
S3
Kinesis Stream - KPL Anti-pattern
When the producer application cannot tolerate the additional processing delay introduced by KPL buffering
DynamoDB - Integration with AWS Services
- Redshift - COPY command to transfer data
- EMR - Hive can read & write data from DynamoDB
- S3 - export & import to S3
- Data Pipeline - mediator to copy data to and from S3
- Lambda - event-based actions
- Kinesis Streams - streaming data
- EC2 instances - streaming data
HIVE Patterns
- Process and analyse logs
- Join very large tables
- Batch jobs
- Ad-hoc interactive queries
Difference between Supervised and Unsupervised Learning
Unsupervised Learning:
- Unlabeled Data
- No knowledge of output
- Self guided learning algorithm
- Aim: to figure out the data patterns and grouping
Supervised Learning:
- Labelled Data
- Desired outcome is known
- Providing the algorithm training data to learn from
- Aim: Predictive Analytics
Sqoop
Sqoop is a tool for data migration between Amazon
S3, Hadoop, HDFS, and RDBMS databases including
Redshift
- Parallel data transfer for faster export and ingestion
- Batched transfer, not meant for interactive queries
Redshift Data Model
Redshift Data Model
1. Star Schema - Consist of one or more fact tables referencing any number of dimension tables
- Fact table - consists of the measurements/metrics of a business process
- Dimension table - stores dimensions describing the objects in the fact table
EMR - Data at Rest Encryption
- EC2 cluster nodes (local disks):
  a. Open-source HDFS encryption
  b. LUKS encryption
- For EMRFS on S3:
  a. SSE-S3
  b. SSE-KMS
  c. CSE-KMS
  d. CSE-Custom
Quicksight Visualizations
Quicksight Visualizations
- 20 visuals per analysis
- Quicksight can determine most appropriate visual types for you
- Dimensions & Measures (fields)
Lambda Patterns
Lambda Patterns
- Real-time file processing
- Real-time stream processing
- Extract, transform, and load
- Replace cron
- Process AWS events
SQS Features
- 256 KB Messages
- Messages can be retained for 14 Days
- Two important Architectures
- SQS Priority Architecture
- Fanout Architecture
EMR Security
Controls:
- Security Groups
  a. Default & b. EMR-managed
- IAM Roles
  - Default role, EC2 default role & Autoscaling default role
- Private subnet
- Encryption at rest
- Encryption in transit
EMR Anti-Patterns
Small data sets – Amazon EMR is built for massive parallel processing; if your data set is small enough to run quickly on a single machine, in a single thread, the added overhead to map and reduce jobs may not be worth it for small data sets that can easily be processed in memory on a single system.
ACID transaction requirements – While there are ways to achieve ACID (atomicity, consistency, isolation, durability) or limited ACID on Hadoop, another database, such as Amazon RDS or relational database running on Amazon EC2, may be a better option for workloads with stringent requirements.
EMR File Formats
EMR File Formats
- Text
- Parquet
- ORC
- Sequence
- AVRO
Keep GZIP files in the 1-2 GB range
Avoid files smaller than 100 MB
S3DistCp can be used to copy data between S3 and HDFS, or vice versa
IoT Authentication
IoT Authentication:
- X.509 Certificate
- Cognito Identity
EMR Storage Options
- Instance Store
- EBS for HDFS
- EMRFS - S3
EMRFS & HDFS can be used together
Copy data from S3 to HDFS using S3DistCP
Data Pipeline Components
- Data Nodes
- Activities
- Preconditions
- Schedules
Redshift Table Design - Compression
- Automatic - Recommended by AWS
- Manual
Use ENCODE to specify a column's compression encoding
Kinesis Streams - Best Practices
- Start off with multiple shards
- Have multiple consumers for A/B testing without downtime
- Dump data to S3 when possible; it’s cheap and durable
- Use the same stream for data archival and analytics
- Lambda for transformations and processing
- Use logic in consumer if you need only-once delivery; keep state in DynamoDB
- Tag streams for cost segregation
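The "only-once delivery" practice above boils down to tracking processed IDs; here a Python set stands in for the DynamoDB state table a real consumer would use (all names are illustrative):

```python
processed = set()  # stand-in for a DynamoDB table keyed by sequence number

def process_once(record_id, payload, sink):
    """Apply a record exactly once, skipping re-deliveries."""
    if record_id in processed:   # already handled (Kinesis is at-least-once)
        return False
    sink.append(payload)
    processed.add(record_id)
    return True

sink = []
process_once("seq-1", "a", sink)
process_once("seq-1", "a", sink)   # duplicate delivery is skipped
process_once("seq-2", "b", sink)
```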
Redshift Deep Copy
A deep copy recreates and repopulates a table by using a bulk insert, which automatically
sorts the table. If a table has a large unsorted region, a deep copy is much faster than a
vacuum. The trade-off is that you cannot make concurrent updates during a deep copy
operation, which you can do during a vacuum.
Options:
1. To perform a deep copy using the original table DDL
2. To perform a deep copy using CREATE TABLE LIKE
3. To perform a deep copy by creating a temporary table and truncating the original
table
Note: The first method is preferred over the other two
Redshift - Encryption Keys Hierarchy
- Master Key
- Cluster Encryption Key
- Database Encryption Key
- Data Encryption Key
Redshift Data Loading - Manifest
- Load required files only
- Load files from different bucket
- Load files with different prefix
- JSON format
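A manifest is just a JSON document listing the exact files to load; a sketch with hypothetical bucket names (the "entries"/"url"/"mandatory" keys follow the documented COPY manifest format):

```python
import json

# Files can come from different buckets and prefixes, as noted above
manifest = {
    "entries": [
        {"url": "s3://my-bucket-a/2024/part-0000.gz", "mandatory": True},
        {"url": "s3://my-bucket-b/other/part-0001.gz", "mandatory": True},
    ]
}
manifest_json = json.dumps(manifest, indent=2)  # upload this next to the data
```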
Spark on EMR
- Spark framework replaces MapReduce framework
- Spark processing engine will be deployed in each node of cluster
- Spark SQL can interact with S3 or HDFS
Redshift WLM Features
Redshift WLM Features
- Manages separate queue for long running and short running queries
- Configure memory allocation to queues
- Improves performance & cost efficiency
Which data Ingestion Tool is similar to Kinesis?
Kafka
Kinesis - Producers
- Producers add data records to Kinesis streams
- A data record must contain:
- Name of the stream
- Partition Key
- Data Content
- Single data records can be added using the PutRecord API
- Multiple data records can be added at one time using the PutRecords API
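A sketch of building the PutRecords request body from the fields listed above (stream name and payloads are hypothetical; the actual boto3 call is commented out so the example stays self-contained):

```python
def build_put_records(stream_name, items):
    """items: (partition_key, data_bytes) pairs -> PutRecords request dict."""
    return {
        "StreamName": stream_name,
        "Records": [
            {"Data": data, "PartitionKey": partition_key}
            for partition_key, data in items
        ],
    }

request = build_put_records("my-stream",
                            [("user-1", b"click"), ("user-2", b"view")])
# With credentials configured, this request would be sent as:
# import boto3
# boto3.client("kinesis").put_records(**request)
```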
KCL - Features
Consumes and processes data from an Amazon Kinesis stream
KCL Libraries available for Java, Ruby, Node, Go, and a Multi-Lang Implementation with Native Python support
Creates a DynamoDB table (with the same name as your application) to manage state
Make sure you don’t have name conflicts with any existing DynamoDB table and your app name (same region)
Multiple KCLs can seamlessly work on the same or different streams
Checkpoints processed records
KCLs can load balance among each other
Automatically deal with stream scaling like shard splits and merges
Key performance indicators of the KCL like records processed (size, count, latency, age)
MillisBehindLatest - How far behind the KCL is
What is Mahout?
Mahout is a machine learning library with tools for
clustering, classification, and several types of
recommenders, including tools to calculate most similar
items or build item recommendations for users
Redshift Important Performance Metrics
Redshift Important Performance Metrics:
- Number of nodes, processors or slices
- Node Types
- Data Distribution
- Data Sort Order
- Dataset size
- Concurrent Operations
- Query Structure
- Code Compilation
EMR Important Web Interfaces
YARN ResourceManager http://master-public-dns-name:8088/
YARN NodeManager http://slave-public-dns-name:8042/
Hadoop HDFS NameNode http://master-public-dns-name:50070/
Hadoop HDFS DataNode http://slave-public-dns-name:50075/
Spark HistoryServer http://master-public-dns-name:18080/
What are the differences between LSI and GSI?
Global secondary index (GSI) - an index with a partition key and a sort key that can be different from those on the base table.
- A global secondary index is considered "global" because queries on the index can span all of the data in the base table, across all partitions.
- It can be created any time after table creation
- Does not share RCU & WCU with the table
LSI - an index that has the same partition key as the base table, but a different sort key.
- A local secondary index is "local" in the sense that every partition of a local secondary index is scoped to a base table partition that has the same partition key value.
- It can only be created at table creation
- Shares RCU & WCU with the table
Kinesis Streams - Shard Capacity
Kinesis Streams - Shard Capacity:
- 1 MB/sec Data Input
- 2 MB/sec Data Output
- 5 transactions/sec for read
- 1000 records/sec for writes
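The per-shard limits above give a simple sizing rule: a stream needs enough shards to cover its worst constraint (a sketch; the workload numbers are made up):

```python
import math

def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Size a stream from the per-shard capacities listed above."""
    return max(math.ceil(write_mb_per_sec / 1),    # 1 MB/sec data input
               math.ceil(records_per_sec / 1000),  # 1000 records/sec writes
               math.ceil(read_mb_per_sec / 2))     # 2 MB/sec data output

# 5 MB/s in, 3500 records/s, 12 MB/s out -> max(5, 4, 6) = 6 shards
n = shards_needed(write_mb_per_sec=5, records_per_sec=3500, read_mb_per_sec=12)
```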
Which API is used to add:
1) Single Data Records?
2) Multiple Data Records?
1) PutRecord for single data records
2) PutRecords for multiple data records
Redshift unloading data - Encryption Option
- SSE-S3
- SSE-KMS
- CSE-CMK
What is Kinesis Analytics?
- Amazon Kinesis Analytics enables users to run standard SQL queries over live streaming data
- Readily query Kinesis Streams/Firehose data and export the output to destinations like S3
EMR Use Cases
- Log Processing/Analytics
- ETL
- Clickstream Analytics
- Machine Learning
What is Spark RDD?
Resilient Distributed Dataset (RDD) is the core data abstraction of Spark
Binary Classification Model
- To predict binary outcome
- AUC (Area under curve) measures the prediction accuracy of model(0 to 1)
- Important parameters (Histogram, cut-off threshold)
Multiclass Classification Model
- To generate predictions for multiple classes
- F1 score measures quality of a model (0 to 1)
- Confusion Matrix is used
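The F1 score above is the harmonic mean of precision and recall, each of which can be read off a confusion matrix for one class (example counts are made up):

```python
def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall (0 to 1, higher is better)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One class of a confusion matrix: 8 correct, 2 false positives,
# 2 false negatives -> precision = recall = 0.8 -> F1 = 0.8
score = f1_from_counts(tp=8, fp=2, fn=2)
```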
HIVE on EMR Integration
S3
DynamoDB
What is Elasticsearch?
- A distributed, multitenant-capable full-text search engine
- HTTP web interface
- Can be integrated with Logstash & Kibana (the ELK stack)
  - Logstash - data collection & log-parsing engine
  - Kibana - open-source data visualization and exploration tool
EMR - Long Running vs Transient Cluster
Long Running -
- Cluster stays up & running for queries against HBASE
- Jobs on the cluster run frequently
Transient Cluster -
- Temporary cluster that shuts down after processing
- Good use case is Batch Job
Quicksight Visualization Types
- AutoGraph - Quicksight selects the most appropriate visual type
- Bar chart (vertical & horizontal) - e.g., gross sales by month; gross sales and net sales by month
- Line chart - measure of a dimension over a period of time
- Pivot table - a way to summarize data
- Scatter plot - two or three measures of a dimension
- Tree map - one to two measures for a dimension
- Pie chart - compare values for different dimensions
- Heat map - identify trends & outliers
- Story - create a narrative by presenting iterations of an analysis
- Dashboard - read-only snapshot of an analysis
Redshift - Cross Region Snapshots
- Cross-region copies of KMS-encrypted snapshots are supported for KMS-encrypted clusters
- Requires a snapshot copy grant in the destination region
Redshift Table Design - Key Factors
- Architecture
- Distribution Styles
- Sort Keys
- Compression
- Constraints
- Column Sizing
- Data Types
What is HCatalog?
HCatalog is a table storage manager for Hadoop. It can store data in any format and make it available to external systems like Hive and Pig. It can read and write files in many formats, like RCFile, CSV, JSON, SequenceFile, and ORC, or custom formats.
What is Redshift Vacuum?
Vacuum helps recover space and sort the table.
Vacuum options: Full, Sort Only, Delete Only
Note: when rows are updated or deleted, Redshift does not automatically free the space.
What is Pig?
Pig is an open source analytics package that runs on top of Hadoop. Pig is operated by Pig Latin, a SQL-like language which allows users to structure, summarize, and query data.
Elasticsearch Use Cases
- Logging & Analysis
- Distributed document store
- Realtime application monitoring
- Clickstream weblog ingestion
Redshift Table - Sort Keys
- Single
- Compound
- Interleaved
What is Zeppelin?
- Zeppelin is a web-based notebook that enables interactive data analytics
- Ingestion, discovery, analytics, visualization and collaboration
- Connectors for HDFS, HBase, Hive, Spark, Flink, PostgreSQL, Redshift, Elasticsearch