AWS Big Data Specialty Flashcards
Spark Patterns and Anti Patterns
Spark Patterns:
- High performance fast engine for processing large amounts of data (In-memory, Disk)
- Faster than running queries in Hive
- Run queries against live data
- Flexibility in terms of languages
Spark Anti Patterns:
- It is not designed for OLTP
- Not fit for batch processing
- Avoid large multi-user reporting environment with high concurrency
Kinesis Retention Periods
24 Hours to 7 Days
Default is 24 Hours
EMR Consistent View
EMRFS consistent view monitors Amazon S3 list consistency for objects written by or synced with EMRFS, delete consistency for objects deleted by EMRFS, and read-after-write consistency for new objects written by EMRFS.
You can configure additional settings for consistent view by providing them in /home/hadoop/conf/emrfs-site.xml
DynamoDB Max number of LSI
5
Kinesis Firehose Handling
- S3: retries delivery for up to 24 hours
- Redshift & Elasticsearch: retry duration of 0-7200 seconds
Apache Hadoop Modules
- Hadoop Common
- HDFS
- YARN
- MapReduce
Impala
Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS).
Kinesis Consumers
Read data from streams:
- for further processing
- data store delivery
Kinesis Streams
Kinesis Streams:
- Receive data from the Producers
- Replicate data over multiple availability zones for durability
- Distribute data among the provisioned shards
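A rough sketch of how records could be spread across shards. Kinesis hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The function name and the evenly split ranges are assumptions made for illustration; real shards can own uneven ranges after splits and merges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard a partition key would map to.

    Assumption: the hash space is split evenly across shard_count shards.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # min() keeps the top of the hash space inside the last shard
    # when shard_count does not divide the space exactly.
    return min(hash_value // range_size, shard_count - 1)


for key in ["device-1", "device-2", "device-3"]:
    print(key, "-> shard", shard_for_key(key, shard_count=4))
```

The same partition key always lands on the same shard, which is why a single hot key can overload one shard.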
EMR Data Compression Formats
Algorithm / Splittable / Compression Ratio / Compress-Decompress Speed
- GZIP / No / High / Medium
- bzip2 / Yes / Very High / Slow
- LZO / Yes / Low / Fast
- Snappy / No / Low / Very Fast
Presto - Patterns and Anti-Patterns
Presto Patterns:
- Query different types of data sources - relational databases, NoSQL, Hive, Kafka stream processing
- High concurrency
- In-memory processing
Presto Anti-patterns:
- Not fit for Batch Processing
- Not designed for OLTP
- Not fit for large join operations
KPL - Key Concepts
- Include library and use
- Can write to multiple Amazon Kinesis streams
- Error recovery built-in: Retry mechanisms
- Synchronous and asynchronous writing
- Multithreading
- Complement to the Amazon Kinesis Client Library (KCL)
- CloudWatch integration: records in/out/error
- Batches data records to increase payload size and improve throughput
- Aggregation: multiple data records are sent in one transaction, increasing the number of records sent per API call
- Collection: takes multiple aggregated records from the previous step and sends them as one HTTP request, further reducing HTTP request overhead
Resizing EMR Cluster
- Only task nodes can be resized up or down
- Only one master, cannot change that
- Core nodes can only be added
- Even with EMRFS, core nodes have HDFS for processing
- Add task nodes, task node groups when more processing is needed
Redshift Important Operations
Redshift important operations:
- Launch
- Resize
- Vacuum
- Backup & Restore
- Monitoring
DynamoDB Performance Metrics
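1 partition = up to 10 GB of data, 3000 RCU and 1000 WCU
1 RCU = one strongly consistent read per second for an item up to 4 KB
1 WCU = one write per second for an item up to 1 KB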
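A minimal sketch of how the 4 KB and 1 KB unit sizes turn into capacity numbers. The function names are made up for this card, and the half-price eventually consistent read is an assumption taken from general DynamoDB behavior, not from the card.

```python
import math

READ_UNIT_KB = 4   # one RCU covers a read of up to 4 KB per second
WRITE_UNIT_KB = 1  # one WCU covers a write of up to 1 KB per second


def required_rcu(item_size_kb, reads_per_second, strongly_consistent=True):
    """Estimate RCUs for items of a given size read at a given rate."""
    units_per_read = math.ceil(item_size_kb / READ_UNIT_KB)
    rcu = units_per_read * reads_per_second
    # Assumption: eventually consistent reads cost half as much.
    return rcu if strongly_consistent else math.ceil(rcu / 2)


def required_wcu(item_size_kb, writes_per_second):
    """Estimate WCUs for items of a given size written at a given rate."""
    units_per_write = math.ceil(item_size_kb / WRITE_UNIT_KB)
    return units_per_write * writes_per_second


print(required_rcu(6, 100))   # 6 KB items need 2 units each -> 200
print(required_wcu(2.5, 50))  # 2.5 KB items need 3 units each -> 150
```

Item sizes are rounded up to the next unit boundary, so a 6 KB item costs the same to read as an 8 KB item.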
DynamoDB Streams Configuration Views
- KEYS_ONLY
- NEW_IMAGE
- OLD_IMAGE
- NEW_AND_OLD_IMAGES
KPL Use Cases
- High rate producers
- Record aggregation
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Coordinates distributed processing
Regression Model
- To predict a numerical value
- RMSE number measures quality of a model
- Lower RMSE better predictions
- RMSE - Root-Mean-Square-Error
Use Cases
- Determine what your house is worth?
- How many units of a product will sell?
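To make the RMSE point concrete, here is a small sketch that computes it by hand. The house prices and the two sets of predictions are invented numbers, used only to show that the model with smaller errors gets the lower RMSE.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical house prices (in thousands) and two models' predictions.
actual = [200, 250, 300]
model_a = [210, 240, 310]  # each prediction is off by 10
model_b = [230, 220, 330]  # each prediction is off by 30

print(rmse(actual, model_a))  # 10.0
print(rmse(actual, model_b))  # 30.0
```

Model A has the lower RMSE, so it is the better predictor on this data.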
Kinesis Agent
- Real-time Kinesis file mediation client written in Java
- Streams files/tails files
- Handles file rotation, checkpointing and retry upon failure
- Multiple folders/files to multiple streams
- Transform data prior to streaming: SINGLELINE, CSVTOJSON, LOGTOJSON
- CloudWatch metrics: BytesSent, RecordSendAttempts, RecordSendErrors, ServiceErrors
Kinesis Firehose Destination Data Delivery
- S3
- Elasticsearch
- Redshift
Machine Learning Algorithms
- Supervised Learning - Trained
a. Classification - Is this transaction fraud?
b. Regression - Customer lifetime value
- Unsupervised Learning - Self Learning
a. Clustering - Market Segmentation
EMR Cluster sizing
- Master node: m3.xlarge for clusters of fewer than 50 nodes, m3.2xlarge for more than 50 nodes
- Core nodes: default replication factor is 3 for clusters of 10 or more nodes, 2 for 4-9 nodes, 1 for 3 or fewer nodes
HDFS capacity formula: Data Size = Total Storage / Replication Factor
Note: AWS recommends a smaller cluster of larger nodes
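The capacity formula can be sketched as below. The function names and the node counts in the example are hypothetical. The replication thresholds follow the card, and treating a cluster of exactly 10 nodes as replication factor 3 is an assumption.

```python
def default_replication_factor(core_nodes):
    """Default HDFS replication factor by cluster size (per the card)."""
    if core_nodes >= 10:  # assumption: exactly 10 nodes counts as "large"
        return 3
    if core_nodes >= 4:
        return 2
    return 1


def hdfs_data_capacity_gb(core_nodes, storage_per_node_gb, replication_factor=None):
    """Data Size = Total Storage / Replication Factor."""
    if replication_factor is None:
        replication_factor = default_replication_factor(core_nodes)
    total_storage_gb = core_nodes * storage_per_node_gb
    return total_storage_gb / replication_factor


# Hypothetical cluster: 12 core nodes with 500 GB of HDFS storage each.
print(hdfs_data_capacity_gb(12, 500))  # 6000 GB raw / 3 -> 2000.0
```

The estimate ignores space needed for intermediate job output, so real sizing would leave extra headroom.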
DynamoDB Performance
- Partitions by throughput = Desired RCU/3000 + Desired WCU/1000
- Partitions by size = Data size in GB/10 GB
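A small sketch that evaluates both formulas. Taking the larger of the two results and rounding up is an assumption based on the commonly cited rule of thumb. The card itself only gives the two formulas, and the function name is made up.

```python
import math


def estimate_partitions(desired_rcu, desired_wcu, data_size_gb):
    """Estimate the DynamoDB partition count from throughput and size."""
    by_throughput = desired_rcu / 3000 + desired_wcu / 1000
    by_size = data_size_gb / 10
    # Assumption: the table needs enough partitions to satisfy both limits.
    return max(math.ceil(by_throughput), math.ceil(by_size), 1)


# Size-bound table: 4 partitions by throughput, 5 by size.
print(estimate_partitions(6000, 2000, 50))  # 5
# Throughput-bound table: 3 partitions by throughput, 1 by size.
print(estimate_partitions(7500, 500, 8))    # 3
```

Provisioned throughput is divided across partitions, so a table that is large but lightly provisioned ends up with little capacity per partition.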
QuickSight Components
- Data Set
- SPICE - Super-fast, Parallel, In-memory Calculation Engine
- Measured in GB
- 10 GB per user
Kinesis Streams Important Points
- Data can be emitted to S3, DynamoDB, Elasticsearch and Redshift using KCL
- Lambda functions can automatically read records from a Kinesis stream, process them and send the records to S3, DynamoDB or Redshift
Difference between Kafka and Kinesis
In a nutshell, Kafka is a better option if:
- You have the in-house knowledge to maintain Kafka and ZooKeeper
- You need to process more than 1000s of events/s
- You don’t want to integrate it with AWS services
Kinesis works best if:
- You don’t have the in-house knowledge to maintain Kafka
- You process 1000s of events/s at most
- You stream data into S3 or Redshift
- You don’t want to build a Kappa architecture
- Max payload size 1 MB
Big Data Visualization
A. Web-based notebooks
- Zeppelin
- Jupyter Notebook - IPython
B. D3.js - Data-Driven Documents
IoT Limits
- Max 300 MQTT CONNECT requests per second
- Max 9000 publish requests per second (3000 inbound, 6000 outbound)
- Client connection payload limit 512KB/s
- Shadows deleted after 1 year if not updated or retrieved
- Max 1000 rules per AWS account
- Max 10 actions per rule
Getting data into Kinesis - Third Party Support
- Log4J Appender
- Flume
- Fluentd
Kinesis Producer Library (KPL)
- API
- Multiple streams
- Multithreaded (for multicore)
- Synchronous and asynchronous
- Complement to the KCL (Kinesis Client Library)
- CloudWatch - records in/out/error
Methods to load data into Firehose
- Kinesis Agent
- AWS SDK
Hue
Open source web interface for analyzing data in EMR
- Amazon S3 and HDFS browser
- Hive/Pig
- Oozie
- Metastore manager
- Job browser and user management
Firehose Data Transformation
With the Firehose data transformation feature, you can now specify a Lambda function that can perform transformations directly on the stream, when you create a delivery stream.
When you enable Firehose data transformation, Firehose buffers incoming data and invokes the specified Lambda function with each buffered batch asynchronously.
To get you started, we provide the following Lambda blueprints, which you can adapt to suit your needs:
- Apache Log to JSON
- Apache Log to CSV
- Syslog to JSON
- Syslog to CSV
- General Firehose Processing
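A minimal sketch of a transformation Lambda, assuming the usual Firehose contract: each incoming record carries a recordId and base64-encoded data, and each returned record echoes the recordId with a result of Ok, Dropped or ProcessingFailed. The "message" field and the upper-casing step are hypothetical stand-ins for a real transformation.

```python
import base64
import json


def lambda_handler(event, context):
    """Upper-case a hypothetical 'message' field in each buffered record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["message"] = str(payload.get("message", "")).upper()
            transformed = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(transformed).decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Record is not a JSON object: hand it back marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every recordId in the batch must appear in the response, which is why failed records are returned rather than skipped.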
EMR HDFS Parameters
- Replication factor - 3 times
- Block Size: 64 MB - 256 MB
- Replication factor can be configured in hdfs-site.xml
- Block size and Replication factor set per file
ES Stability
3 dedicated master nodes
Redshift Data Loading - Data Format
- CSV
- Delimited
- Fixed Width
- JSON
- Avro
Tracking Amazon Kinesis Streams Application State
For each Amazon Kinesis Streams application, the KCL uses a unique Amazon DynamoDB table to keep track of the application’s state. If your Amazon Kinesis Streams application receives provisioned-throughput exceptions, you should increase the provisioned throughput for the DynamoDB table.
For example, if your Amazon Kinesis Streams application does frequent checkpointing or operates on a stream that is composed of many shards, you might need more throughput.
Redshift : CloudHSM Vs. KMS - Security
CloudHSM
- $16k/Year + $5K upfront
- Need to setup HA & Durability
- Single tenant
- Customer managed root of trust
- Symmetric & Asymmetric encryption
- International Common Criteria EAL4 and U.S. Government NIST FIPS 140-2
KMS
- Usage-based pricing
- Highly Available & Durable
- Multi-tenant
- AWS managed root of trust
- Symmetric encryption only
- Auditing
Machine Learning Summary
- Binary - AUC (Area Under the Curve) - True Positives, True Negatives, False Positives and False Negatives. Only model that can be fine-tuned by adjusting the score threshold
- Multiclass - Confusion Matrix - (Correct Predictions & Incorrect Predictions)
- Regression - RMSE (Root Mean Square Error) - the lower the RMSE, the better the prediction
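The score threshold for a binary model can be illustrated with a small sketch. The scores and labels are invented, and the function name is hypothetical. The example shows how lowering the threshold trades false negatives for more positive predictions.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP/FP/TN/FN when scores at or above threshold predict positive."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            counts["TP"] += 1
        elif predicted_positive and label == 0:
            counts["FP"] += 1
        elif not predicted_positive and label == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical model scores and true labels (1 = positive class).
scores = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = [1, 1, 0, 1, 0]

print(confusion_counts(scores, labels, 0.5))  # {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
print(confusion_counts(scores, labels, 0.3))  # {'TP': 3, 'FP': 1, 'TN': 1, 'FN': 0}
```

Dropping the threshold from 0.5 to 0.3 recovers the missed positive here. On real data it would usually also add false positives.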
Kinesis Streams - Load/Get Data Options
- Kinesis Producer Library - Producers
- Kinesis Client Library - KCL
- Kinesis Agent
- Kinesis REST API
Redshift - Vacuum - Best Practices
- Vacuum is I/O intensive
- Perform Vacuum after bulk deletes, data loading or after updates
- Perform Vacuum during lower period of activity or during your maintenance windows
- Vacuum utility is not recommended for tables over 700GB
- Vacuum can be avoided by loading data in sort order and by using time series tables
WLM - Type of Groups
- User Group
- Query Group
Redshift Table Design - Constraints
Maintain data integrity
Types of constraints
- Primary Key
- Unique
- Not null/null
- References
- Foreign Key
Only NOT NULL/NULL is enforced; the other constraints are informational and are not enforced by Redshift
Hunk
Hunk is a web-based interactive data analytics platform for rapidly exploring, analysing and visualizing data in Hadoop and NoSQL data stores
Types of Analysis
- Pre-processing: filtering, transformations
- Basic Analytics: Simple counts, aggregates over windows
- Advanced Analytics: Detecting anomalies, event correlation
- Post-processing: Alerting, triggering, final filters
Jupyter Notebook
Jupyter is a web-based notebook for running Python, R, Scala and other languages to process and visualize data, perform statistical analysis, and train and run machine learning models
Kinesis Streams - Kinesis Connectors available for
- DynamoDB
- S3
- Elasticsearch
- Redshift
Redshift Important System Tables
- STL_LOAD_ERRORS
- STL_LOADERROR_DETAIL
Apache Ranger
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include a centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component, for example, NameNode.
Redshift - Encryption in Transit
- Create parameter group
- SSL Certificate