AWS Big Data Specialty Flashcards

1
Q

Spark Patterns and Anti Patterns

A

Spark Patterns:

  1. High-performance, fast engine for processing large amounts of data (in-memory and on disk)
  2. Faster than running queries in Hive
  3. Run queries against live data
  4. Flexibility in terms of languages

Spark Anti-Patterns:

  1. Not designed for OLTP
  2. Not a fit for batch processing
  3. Avoid large multi-user reporting environments with high concurrency
2
Q

Kinesis Retention Periods

A

24 Hours to 7 Days

Default is 24 Hours

3
Q

EMR Consistent View

A

EMRFS consistent view monitors Amazon S3 list consistency for objects written by or synced with EMRFS, delete consistency for objects deleted by EMRFS, and read-after-write consistency for new objects written by EMRFS.

You can configure additional settings for consistent view by setting them in
/home/hadoop/conf/emrfs-site.xml

4
Q

DynamoDB Max number of LSI

A

5

5
Q

Kinesis Firehose Handling

A
  1. S3 - Retries delivery for up to 24 hours
  2. Redshift & Elasticsearch - Retry duration of 0-7200 seconds

6
Q

Apache Hadoop Modules

A

Apache Hadoop Modules

  1. Hadoop Common
  2. HDFS
  3. YARN
  4. MapReduce
7
Q

Impala

A

Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS).

8
Q

Kinesis Consumers

A

Read data from streams:

  1. for further processing
  2. data store delivery
9
Q

Kinesis Streams

A

Kinesis Streams:

  • Receive data from the Producers
  • Replicate data over multiple availability zones for durability
  • Distribute data among the provisioned shards
10
Q

EMR Data Compression Formats

A

Algorithm / Splittable / Compression Ratio / Compress-Decompress Speed

  1. GZIP/No/High/Medium
  2. bzip2/Yes/Very High/Slow
  3. LZO/Yes/Low/Fast
  4. Snappy/No/Low/Very Fast
11
Q

Presto - Patterns and Anti-Patterns

A

Presto Patterns:

  1. Query different types of data sources - relational databases, NoSQL, the Hive framework, Kafka stream processing
  2. High concurrency
  3. In-memory processing

Presto Anti-patterns:

  1. Not fit for Batch Processing
  2. Not designed for OLTP
  3. Not fit for large join operations
12
Q

KPL - Key Concepts

A
  • Include library and use
  • Can write to multiple Amazon Kinesis streams
  • Error recovery built-in: Retry mechanisms
  • Synchronous and asynchronous writing
  • Multithreading
  • Complement to the Amazon Kinesis Client Library (KCL)
  • CloudWatch Integration - Records In/Out/Error
  • Batches data records to increase payload size and improve throughput
  • Aggregation – multiple data records sent in one transaction; increasing the numbers of records sent per API call
  • Collection – takes multiple aggregated records from the previous step and sends them as one HTTP request; further optimizing the data transfer by reducing HTTP request overhead
13
Q

Resizing EMR Cluster

A
  • Only task nodes can be resized up or down
  • Only one master, cannot change that
  • Core nodes can only be added
  • Even with EMRFS, core nodes have HDFS for processing
  • Add task nodes or task node groups when more processing is needed
14
Q

Redshift Important Operations

A

Redshift important operations:

  1. Launch
  2. Resize
  3. Vacuum
  4. Backup & Restore
  5. Monitoring
15
Q

DynamoDB Performance Metrics

A

1 Partition = 10 GB = 3000 RCU & 1000 WCU

RCU - 4 KB/sec
WCU - 1 KB/sec

16
Q

DynamoDB Streams Configuration Views

A
  1. KEYS_ONLY
  2. NEW_IMAGE
  3. OLD_IMAGE
  4. NEW_AND_OLD_IMAGES
17
Q

KPL Use Cases

A
  • High-rate producers
  • Record aggregation

18
Q

Zookeeper

A

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Coordinates distributed processing

19
Q

Regression Model

A
  • To predict a numerical value
  • RMSE (Root-Mean-Square Error) measures the quality of a model
  • A lower RMSE means better predictions

Use Cases

  1. Determine what your house is worth
  2. How many units of a product will sell
20
Q

Kinesis Agent

A
  1. Real-time file mediation client for Kinesis, written in Java
  2. Streams files / tails files
  3. Handles file rotation, checkpointing, and retry upon failure
  4. Multiple folders/files to multiple streams
  5. Transform data prior to streaming: SINGLELINE, CSVTOJSON, LOGTOJSON
  6. CloudWatch - BytesSent, RecordSendAttempts, RecordSendErrors, ServiceErrors
21
Q

Kinesis Firehose Destination Data Delivery

A
  1. S3
  2. Elasticsearch
  3. Redshift
22
Q

Machine Learning Algorithms

A
  1. Supervised Learning - Trained
    a. Classification - Is this transaction fraud?
    b. Regression - Customer life time value
  2. Unsupervised Learning - Self Learning
    a. Clustering - Market Segmentation
23
Q

EMR Cluster sizing

A
  1. Master Node -
    m3.xlarge for < 50 nodes, m3.2xlarge for > 50 nodes
  2. Core Nodes -
    Replication Factor
      10+ node cluster - 3
      4-9 node cluster - 2
      1-3 node cluster - 1

HDFS Capacity Formula:

Data Size = Total Storage / Replication Factor

Note: AWS recommends a smaller cluster of larger nodes
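The sizing rules above can be sketched as a quick helper (the numbers mirror the card; the function names are illustrative, not an AWS API):

```python
def default_replication_factor(core_nodes: int) -> int:
    # Default EMR replication factor by cluster size, per the card:
    # 10+ nodes -> 3, 4-9 nodes -> 2, 1-3 nodes -> 1.
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1

def hdfs_capacity_gb(core_nodes: int, storage_per_node_gb: float) -> float:
    # Data Size = Total Storage / Replication Factor
    total_storage = core_nodes * storage_per_node_gb
    return total_storage / default_replication_factor(core_nodes)
```

For example, 12 core nodes with 1,000 GB each replicate 3 ways, leaving 4,000 GB of usable HDFS capacity.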

24
Q

DynamoDB Performance

A

DynamoDB Performance

  1. partitions = Desired RCU/3000 + Desired WCU/1000
  2. partitions= Data size in GB/10 GB
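The two formulas above can be combined in a small sketch; taking the larger of the two counts is a common rule of thumb (the helper name is just for illustration):

```python
import math

def dynamodb_partitions(rcu: int, wcu: int, data_size_gb: float) -> int:
    # Throughput-based estimate: Desired RCU/3000 + Desired WCU/1000
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    # Size-based estimate: data size / 10 GB per partition
    by_size = math.ceil(data_size_gb / 10)
    return max(by_throughput, by_size)
```

For example, 7,500 RCU + 1,000 WCU needs ceil(2.5 + 1) = 4 partitions, which also covers 25 GB of data (3 partitions by size).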
25
Q

Quicksight Components

A
  1. Data Set
  2. SPICE - Super-fast, Parallel, In-memory Calculation Engine
    - Capacity measured in GB
    - 10 GB per user
26
Q

Kinesis Streams Important Points

A
  1. Data can be emitted to S3, DynamoDB, Elasticsearch and Redshift using the KCL
  2. Lambda functions can automatically read records from a Kinesis stream, process them, and send the records to S3, DynamoDB or Redshift
27
Q

Difference between Kafka and Kinesis

A

In a nutshell, Kafka is a better option if:

  • You have the in-house knowledge to maintain Kafka and ZooKeeper
  • You need to process more than 1000s of events/s
  • You don’t want to integrate it with AWS services

Kinesis works best if:

  • You don’t have the in-house knowledge to maintain Kafka
  • You process 1000s of events/s at most
  • You stream data into S3 or Redshift
  • You don’t want to build a Kappa architecture
  • Max payload size 1 MB
28
Q

Big Data Visualization

A

A. Web-based Notebooks

  1. Zeppelin
  2. Jupyter Notebook - IPython

B. D3.js - Data-Driven Documents

29
Q

IoT Limits

A
  • Max 300 MQTT CONNECT requests per second
  • Max 9000 publish requests per second
    • 3000 in
    • 6000 out
  • Client connection payload limit 512KB/s
  • Shadows deleted after 1 year if not updated or retrieved
  • Max 1000 rules per AWS account
  • Max 10 actions per rule
30
Q

Getting data into Kinesis - Third Party Support

A
  • Log4J Appender
  • Flume
  • Fluentd
31
Q

Kinesis Producer Library (KPL)

A
  • API
  • Multiple streams
  • Multithreaded (for multicore)
  • Synchronous and asynchronous
  • Complement to KCL (Kinesis Client Library)
  • CloudWatch - records in/out/error
32
Q

Methods to load data into Firehose

A
  1. Kinesis Agent
  2. AWS SDK
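With the SDK, a producer simply puts records on the delivery stream. A minimal sketch (the stream name and helper are hypothetical; `put_record` is the standard Firehose SDK call):

```python
import json

def make_firehose_record(payload: dict) -> dict:
    # Firehose delivers record data as-is; appending a newline keeps
    # the resulting S3 objects line-delimited when records are
    # concatenated into one object.
    return {"Data": (json.dumps(payload) + "\n").encode("utf-8")}

# With boto3 (assumed delivery stream name "my-delivery-stream"):
#   import boto3
#   firehose = boto3.client("firehose")
#   firehose.put_record(
#       DeliveryStreamName="my-delivery-stream",
#       Record=make_firehose_record({"event": "click"}),
#   )
```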

33
Q

Hue

A
Open source web interface for analyzing data in EMR
Amazon S3 and HDFS Browser
Hive/Pig
Oozie
Metastore Manager
Job browser and user management
34
Q

Firehose Data Transformation

A

With the Firehose data transformation feature, you can specify, when you create a delivery stream, a Lambda function that performs transformations directly on the stream.

When you enable Firehose data transformation, Firehose buffers incoming data and invokes the specified Lambda function with each buffered batch asynchronously.

To get you started, we provide the following Lambda blueprints, which you can adapt to suit your needs:

  • Apache Log to JSON
  • Apache Log to CSV
  • Syslog to JSON
  • Syslog to CSV
  • General Firehose Processing
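A transformation Lambda must echo each record's `recordId` and report a result of `Ok`, `Dropped`, or `ProcessingFailed`. A minimal sketch (the uppercase transform is a placeholder):

```python
import base64

def lambda_handler(event, context):
    """Minimal sketch of a Firehose transformation Lambda."""
    output = []
    for record in event["records"]:
        data = base64.b64decode(record["data"])
        transformed = data.upper()  # placeholder transformation
        output.append({
            "recordId": record["recordId"],   # must match the input record
            "result": "Ok",                   # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```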
35
Q

EMR HDFS Parameters

A
  1. Replication factor - 3 times
  2. Block Size: 64 MB - 256 MB
  3. Replication factor can be configured in hdfs-site.xml
  4. Block size and Replication factor set per file
36
Q

ES Stability

A

3 dedicated master nodes

37
Q

Redshift Data Loading - Data Format

A
  1. CSV
  2. Delimited
  3. Fixed Width
  4. JSON
  5. Avro
38
Q

Tracking Amazon Kinesis Streams Application State

A

For each Amazon Kinesis Streams application, the KCL uses a unique Amazon DynamoDB table to keep track of the application’s state. If your Amazon Kinesis Streams application receives provisioned-throughput exceptions, you should increase the provisioned throughput for the DynamoDB table.

For example, if your Amazon Kinesis Streams application does frequent checkpointing or operates on a stream that is composed of many shards, you might need more throughput.

40
Q

Redshift : CloudHSM Vs. KMS - Security

A

CloudHSM

  1. $16k/Year + $5K upfront
  2. Need to setup HA & Durability
  3. Single tenant
  4. Customer managed root of trust
  5. Symmetric & Asymmetric encryption
  6. International Common Criteria EAL4 and U.S. Government NIST FIPS 140-2

KMS

  1. Usage-based pricing
  2. Highly Available & Durable
  3. Multi-tenant
  4. AWS managed root of trust
  5. Symmetric encryption only
  6. Auditing
41
Q

Machine Learning Summary

A
  1. Binary - AUC (Area Under the Curve) - True Positives, True Negatives, False Positives and False Negatives
    Only model type that can be fine-tuned by adjusting the score threshold
  2. Multiclass - Confusion Matrix - (Correct Predictions & Incorrect Predictions)
  3. Regression - RMSE (Root Mean Square Error) - the lower the RMSE, the better the prediction
42
Q

Kinesis Streams - Load/Get Data Options

A
  1. Kinesis Producer Library - Producers
  2. Kinesis Client Library - KCL
  3. Kinesis Agent
  4. Kinesis REST API
43
Q

Redshift - Vacuum - Best Practices

A
  1. Vacuum is I/O sensitive
  2. Perform Vacuum after bulk deletes, data loading or after updates
  3. Perform Vacuum during lower period of activity or during your maintenance windows
  4. Vacuum utility is not recommended for tables over 700GB
  5. Vacuum can be skipped when you:
    Load data in sort order
    Use time-series tables
44
Q

WLM - Type of Groups

A
  1. User Group
  2. Query Group

45
Q

Redshift Table Design - Constraints

A

Maintain data integrity

Types for constraints

  1. Primary Key
  2. Unique
  3. Not null/null
  4. References
  5. Foreign Key

Redshift enforces only the Not null/null constraint; the other constraints are informational and not enforced

46
Q

Hunk

A

Hunk is a web-based interactive data analytics platform for rapidly exploring, analysing and visualizing data in Hadoop and NoSQL data stores

47
Q

Types of Analysis

A
  1. Pre-processing: filtering, transformations
  2. Basic Analytics: Simple counts, aggregates over windows
  3. Advanced Analytics: Detecting anomalies, event correlation
  4. Post-processing: Alerting, triggering, final filters
48
Q

Jupyter Notebook

A

Jupyter is a web-based notebook for running Python,
R, Scala and other languages to process and visualize
data, perform statistical analysis, and train and run
machine learning models

49
Q

Kinesis Streams - Kinesis Connectors available for

A

DynamoDB
S3
Elasticsearch
Redshift

50
Q

Redshift Important System Tables

A
  1. STL_LOAD_ERRORS
  2. STL_LOADERROR_DETAIL

51
Q

Apache Ranger

A

Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include a centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component, for example, NameNode.

52
Q

Redshift - Encryption in Transit

A

Redshift - Encryption in transit

  1. Create parameter group
  2. SSL Certificate
53
Q

Machine Learning Use Cases

A
  1. Fraud Detection
  2. Customer Service
  3. Litigation/Legal
  4. Security
  5. Healthcare
  6. Sports
54
Q

Kinesis Firehose Important Parameters

A

Buffer Size - 1 MB - 128 MB
Buffer Interval - 60-900 Seconds

Parameters for transformation:

  1. Record ID
  2. Result: Ok, Dropped & ProcessingFailed
  3. Data
55
Q

Oozie

A

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

56
Q

Redshift - Encryption at Rest

A
  1. KMS
  2. HSM (CloudHSM & On-prem HSM)

Note: In Redshift, encryption covers data blocks, system metadata & snapshots

57
Q

Tez

A
  • Tez is an engine for processing complex directed acyclic graphs (DAGs)
  • It can be used in place of Hadoop MapReduce for running Pig and Hive
  • Runs on top of YARN
58
Q

SQS vs Kinesis Streams

A

SQS - Message Queue

Kinesis Streams - Real time processing

59
Q

DynamoDB Performance Points

A
  1. Use GSIs
  2. Use burst capacity / spread periodic batch writes / SQS-managed write buffer - in case of uneven writes
  3. Use caching - in case of uneven reads
60
Q

Redshift features

A
  1. Petabyte-scale data warehouse service
  2. OLAP & BI Use cases
  3. ANSI SQL Compliance
  4. Column Oriented
  5. MPP Architecture
  6. Node Types:
    a. Dense Compute (DC1 and DC2)
    b. Dense Storage (DS2)

Single AZ Implementation

61
Q

Machine Learning Limits

A

Max observation size (target+attributes): 100KB
Max training data size: 100GB
Max batch predictions data size: 1TB
Max batch predictions data records: 100 million
Max columns in schema: 1000
Real-time prediction endpoint TPS: 200
Number of classes for multiclass ML models: 100

62
Q

Kinesis: key features

A

Kinesis : Key Features

  1. Real-time data streaming
  2. Ordered record delivery
  3. Replication across three Availability Zones
  4. Decoupled from consuming applications
  5. Replay data
  6. Zero-downtime scaling
  7. Pay as you go
  8. Parallel processing - multiple producers and consumers
63
Q

EMR - S3DistCP

A

In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and that adds several useful features.

  1. Copy or move files without transformation
  2. Copy and change file compression on the fly
  3. Copy files incrementally
  4. Copy multiple folders in one job
  5. Aggregate files based on a pattern
  6. Upload files larger than 1 TB in size
  7. Submit an S3DistCp step to an EMR cluster
64
Q

Spark Components

A
  1. Spark Core - Dispatch & Scheduling tasks
  2. Spark SQL - Execute low-latency interactive SQL query against structured data
  3. Spark Streaming - Stream processing of live data streams
  4. MLlib - Scalable machine learning library
  5. GraphX - Graph-parallel computation
65
Q

Flume

A

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tuneable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

66
Q

Redshift Cluster Resizing

A
  1. Creation of the destination cluster
  2. Source cluster restarts and enters read-only mode
  3. Reconnect to the source cluster to run queries in read-only mode
  4. Redshift starts copying data from the source to the target cluster
  5. Once the copy is complete, Redshift updates the DNS endpoint to the target cluster
  6. The source cluster is decommissioned
67
Q

SDK Use Cases

A

Low rate producers
Mobile apps
IoT devices
Web clients

68
Q

Redshift Table Design - Distribution Style

A
  1. Even
    Rows distributed across the slices regardless of the values in a particular column
    Default distribution style
  2. Key
    Rows distributed according to the values in one column
    Collocates matching rows on the same slice
    Improves join performance

    Use cases - joined tables, large fact tables

  3. All
    Copy of the entire table is stored on every node
    Needs more space due to duplication
    Use cases:
      - Static data
      - Small tables
      - No common distribution key
69
Q

Redshift - Slices Guidelines

A

No. of data files should be equal to no. of slices or multiple of the no. of slices

e.g. 4 slices = 4 files or 8 files
32 slices = 32 files or 64 files

File compression - gzip, lzop, bzip2
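The guideline above reduces to a simple divisibility check before a COPY (the helper name is illustrative):

```python
def copy_file_count_ok(num_files: int, num_slices: int) -> bool:
    # Files load in parallel, one file per slice at a time; a file
    # count that is a multiple of the slice count keeps every slice
    # equally busy.
    return num_files > 0 and num_files % num_slices == 0
```

For example, 64 files load evenly on 32 slices, but 30 files leave 2 of 4 slices idle on the last pass.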

70
Q

Glacier - Vault Lock Policy

A

Amazon Glacier Vault Lock allows you to easily deploy and enforce compliance controls for individual Amazon Glacier vaults with a vault lock policy.

Use Cases

  1. Time based retention
  2. Undeletable
71
Q

Kinesis Analytics Use Cases

A
  • Mobile app live monitoring
  • Clickstream analytics
  • Logs
  • Metering records
  • IoT data
72
Q

EMR - Resizing Cluster

A
  1. Manually - Options
    1. Terminate at instance hour
    2. Terminate at task completion
  2. Autoscaling
73
Q

How to decide between more nodes vs. bigger nodes generally - Amazon Redshift ?

A

Bigger nodes are better for long running queries
eg: fewer dc1.8xlarge better than more dc1.large

More nodes are better for short running queries
eg: more dc1.large better than fewer dc1.8xlarge

74
Q

WLM Settings

A
  1. Reboot of the Redshift cluster required to reflect changes to:
    User Group
    Query Group
    User Group Wildcard
    Query Group Wildcard
  2. No reboot required for dynamic parameters:
    Concurrency
    % of Memory
    Timeout

75
Q

What are the different types of KPL Batching?

A
  1. Collection - Batches multiple stream records into a single HTTP request (PutRecords) to reduce request overhead
  2. Aggregation - Combines multiple user records into a single stream record
76
Q

Data lake

A

S3

77
Q

Kinesis Stream - KPL Anti-pattern

A

When the producer application/use case can't incur the additional processing delay introduced by buffering

78
Q

DynamoDB - Integration with AWS Services

A
  1. Redshift - COPY command to transfer data
  2. EMR - Hive can read & write data from DynamoDB
  3. S3 - Export & import to/from S3
  4. Data Pipeline - Mediator to copy data to and from S3
  5. Lambda - Event-based actions
  6. Kinesis Streams - Streaming data
  7. EC2 instances - Streaming data
79
Q

HIVE Patterns

A
  1. Process and analyse logs
  2. Join very large tables
  3. Batch jobs
  4. Ad-hoc interactive queries
80
Q

Difference between Supervised and Unsupervised Learning

A

Unsupervised Learning:

  1. Unlabeled Data
  2. No knowledge of output
  3. Self guided learning algorithm
  4. Aim: to figure out the data patterns and grouping

Supervised Learning:

  1. Labelled Data
  2. Desired outcome is known
  3. Providing the algorithm training data to learn from
  4. Aim: Predictive Analytics
81
Q

Sqoop

A

Sqoop is a tool for data migration between Amazon S3, Hadoop, HDFS, and RDBMS databases including Redshift
- Parallel data transfer for faster export and ingestion
- Batched transfer, not meant for interactive queries

82
Q

Redshift Data Model

A

Redshift Data Model
1. Star Schema - Consists of one or more fact tables referencing any number of dimension tables

  • Fact Table - Contains the measurements/metrics/facts of a business process
  • Dimension Table - Stores dimensions that describe the objects in the fact table
83
Q

EMR - Data at Rest Encryption

A
  1. EC2 Cluster Nodes
    a. Open Source HDFS Encryption
    b. LUKS Encryption
  2. For EMRFS on S3
    a. SSE-S3
    b. SSE-KMS
    c. CSE-KMS
    d. CSE-Custom
84
Q

Quicksight Visualizations

A

Quicksight Visualizations

  • 20 visuals per analysis
  • Quicksight can determine the most appropriate visual type for you
  • Dimensions & Measures (fields)
85
Q

Lambda Patterns

A

Lambda Patterns

  • Real-time file processing
  • Real-time stream processing
  • Extract, transform, and load
  • Replace cron
  • Process AWS events
86
Q

SQS Features

A
  • 256 KB Messages
  • Messages can be retained for 14 Days
  • Two important Architectures
    1. SQS Priority Architecture
    2. Fanout Architecture
87
Q

EMR Security

A

Controls:

  1. Security Groups
    a. Default & b. EMR Managed
  2. IAM Roles
    • Default Role, EC2 Default Role & Autoscaling Default Role
  3. Private Subnet
  4. Encryption at Rest
  5. Encryption in transit
88
Q

EMR Anti-Patterns

A

Small data sets – Amazon EMR is built for massive parallel processing; if your data set is small enough to run quickly on a single machine, in a single thread, the added overhead to map and reduce jobs may not be worth it for small data sets that can easily be processed in memory on a single system.

ACID transaction requirements – While there are ways to achieve ACID (atomicity, consistency, isolation, durability) or limited ACID on Hadoop, another database, such as Amazon RDS or relational database running on Amazon EC2, may be a better option for workloads with stringent requirements.

89
Q

Kinesis Streams Features

A

Kinesis Streams Features:

  • Streams receive data from the Producers
  • Replicate data over multiple availability zones for durability
  • Distribute data among the provisioned shards
91
Q

EMR File Formats

A

EMR File Formats

  1. Text
  2. Parquet
  3. ORC
  4. Sequence
  5. AVRO

Keep GZIP files in the 1-2 GB range
Avoid small files (< 100 MB)
S3DistCp can be used to copy data between S3 and HDFS, or vice versa

92
Q

IoT Authentication

A

IoT Authentication:

  1. X.509 Certificate
  2. Cognito Identity
93
Q

EMR Storage Options

A
  1. Instance Store
  2. EBS for HDFS
  3. EMRFS - S3

EMRFS & HDFS can be used together
Copy data from S3 to HDFS using S3DistCP

94
Q

Data Pipeline Components

A
  1. Data Nodes
  2. Activities
  3. Preconditions
  4. Schedules
95
Q

Redshift Table Design - Compression

A
  1. Automatic - Recommended by AWS
  2. Manual
    Use “Encode” to compress column
96
Q

Kinesis Streams - Best Practices

A
  • Start off with multiple shards
  • Have multiple consumers for A/B testing without downtime
  • Dump data to S3 when possible; it’s cheap and durable
  • Use the same stream for data archival and analytics
  • Lambda for transformations and processing
  • Use logic in consumer if you need only-once delivery; keep state in DynamoDB
  • Tag streams for cost segregation
97
Q

Redshift Deep Copy

A

A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table. If a table has a large unsorted region, a deep copy is much faster than a vacuum. The trade-off is that you cannot make concurrent updates during a deep copy operation, which you can do during a vacuum.

Options:
1. Perform a deep copy using the original table DDL
2. Perform a deep copy using CREATE TABLE LIKE
3. Perform a deep copy by creating a temporary table and truncating the original table

Note: the 1st method is preferred over the other 2

98
Q

Redshift - Encryption Keys Hierarchy

A
  1. Master Key
  2. Cluster Encryption Key
  3. Database Encryption Key
  4. Data Encryption Key
99
Q

Redshift Data Loading - Manifest

A
  1. Load required files only
  2. Load files from different bucket
  3. Load files with different prefix
  4. JSON format
100
Q

Spark on EMR

A
  1. Spark framework replaces MapReduce framework
  2. Spark processing engine will be deployed in each node of cluster
  3. Spark SQL can interact with S3 or HDFS
101
Q

Redshift WLM Features

A

Redshift WLM Features

  1. Manages separate queues for long-running and short-running queries
  2. Configure memory allocation per queue
  3. Improves performance & reduces cost
102
Q

Which data Ingestion Tool is similar to Kinesis?

A

Kafka

103
Q

Kinesis - Producers

A
  • Producers add data records to Kinesis streams
  • A data record must contain:
    1. Name of the stream
    2. Partition Key
    3. Data Content
  • Single data records can be added using the PutRecord API
  • Multiple data records can be added at one time using the PutRecords API
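The three required pieces of a data record map directly onto the PutRecord call. A minimal sketch (the stream name, partition key, and helper are hypothetical; `put_record` is the standard Kinesis SDK call):

```python
def make_put_record_args(stream_name: str, partition_key: str, data: bytes) -> dict:
    # A data record needs: name of the stream, partition key
    # (determines which shard receives the record), and data content.
    return {
        "StreamName": stream_name,
        "PartitionKey": partition_key,
        "Data": data,
    }

# With boto3:
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(**make_put_record_args("my-stream", "user-42", b"payload"))
```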
104
Q

KCL - Features

A

  • Consumes and processes data from an Amazon Kinesis stream
  • KCL libraries available for Java, Ruby, Node, Go, and a multi-lang implementation with native Python support
  • Creates a DynamoDB table (with the same name as your application) to manage state
  • Make sure you don't have name conflicts between any existing DynamoDB table and your app name (same region)
  • Multiple KCLs can seamlessly work on the same or different streams
  • Checkpoints processed records
  • KCLs can load balance among each other
  • Automatically deals with stream scaling like shard splits and merges
  • Key performance indicators of the KCL: records processed (size, count, latency, age)
  • MillisBehindLatest - how far behind the KCL is

105
Q

What is Mahout?

A

Mahout is a machine learning library with tools for
clustering, classification, and several types of
recommenders, including tools to calculate most similar
items or build item recommendations for users

106
Q

Redshift Important Performance Metrics

A

Redshift Important Performance Metrics:

  1. Number of nodes, processors or slices
  2. Node Types
  3. Data Distribution
  4. Data Sort Order
  5. Dataset size
  6. Concurrent Operations
  7. Query Structure
  8. Code Compilation
107
Q

EMR Important Web Interfaces

A

YARN ResourceManager http://master-public-dns-name:8088/
YARN NodeManager http://slave-public-dns-name:8042/
Hadoop HDFS NameNode http://master-public-dns-name:50070/
Hadoop HDFS DataNode http://slave-public-dns-name:50075/
Spark HistoryServer http://master-public-dns-name:18080/

108
Q

What are the differences between LSI and GSI?

A

Global secondary index (GSI) - an index with a partition key and a sort key that can be different from those on the base table.

  • A global secondary index is considered "global" because queries on the index can span all of the data in the base table, across all partitions.
  • It can be created any time after table creation
  • Does not share RCU & WCU with the table

Local secondary index (LSI) - an index that has the same partition key as the base table, but a different sort key.

  • A local secondary index is "local" in the sense that every partition of a local secondary index is scoped to a base table partition that has the same partition key value.
  • It can only be created at table creation
  • Shares RCU & WCU with the table

109
Q

Kinesis Streams - Shard Capacity

A

Kinesis Streams - Shard Capacity:

  • 1 MB/sec Data Input
  • 2 MB/sec Data Output
  • 5 transactions/sec for read
  • 1000 records/sec for writes
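The per-shard limits above give the usual shard-count sizing rule: take the maximum over each constraint (the helper name is illustrative):

```python
import math

def shards_needed(in_mb_per_sec: float, out_mb_per_sec: float,
                  records_per_sec: float) -> int:
    # Per-shard limits from the card: 1 MB/s data input, 2 MB/s data
    # output, 1,000 records/s for writes. (The 5 read transactions/s
    # limit is shared by all consumers and rarely drives the count.)
    return max(
        math.ceil(in_mb_per_sec / 1.0),
        math.ceil(out_mb_per_sec / 2.0),
        math.ceil(records_per_sec / 1000.0),
    )
```

For example, 5 MB/s in, 6 MB/s out, and 2,500 records/s needs max(5, 3, 3) = 5 shards.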
110
Q

Which API is used to add:

1) Single Data Records?
2) Multiple Data Records?

A

1) PutRecord for single data records

2) PutRecords for multiple data records

112
Q

Redshift unloading data - Encryption Option

A
  1. SSE-S3
  2. SSE-KMS
  3. CSE-CMK
113
Q

What is Kinesis Analytics?

A
  1. Amazon Kinesis Analytics enables users to run standard SQL queries over live streaming data
  2. Readily query Kinesis Streams/Firehose data and export the output to destinations like S3
114
Q

EMR Use Cases

A
  1. Log Processing/Analytics
  2. ETL
  3. Clickstream Analytics
  4. Machine Learning
115
Q

What is Spark RDD?

A

Resilient Distributed Dataset (RDD) is the core abstraction of Spark

116
Q

Binary Classification Model

A
  • To predict a binary outcome
  • AUC (Area Under the Curve) measures the prediction accuracy of the model (0 to 1)
  • Important parameters (histogram, cut-off threshold)
118
Q

Multiclass Classification Model

A
  • To generate predictions for multiple classes
  • F1 score measures quality of a model (0 to 1)
  • Confusion Matrix is used
119
Q

HIVE on EMR Integration

A

S3

DynamoDB

120
Q

What is Elasticsearch?

A
  • It is a distributed, multi-tenant-capable full-text search engine
  • HTTP web interface
  • It can be integrated with Logstash & Kibana
    ELK Stack
    1. Logstash - Data collection & log-parsing engine
    2. Kibana - Open source data visualization and exploration tool
121
Q

EMR - Long Running vs Transient Cluster

A

Long Running -

  1. Cluster stays up & running for queries against HBASE
  2. Jobs on the cluster run frequently

Transient Cluster -

  1. Temporary cluster that shuts down after processing
  2. Good use case is Batch Job
122
Q

Quicksight Visualization Types

A
  1. AutoGraph
  2. Bar Chart - Vertical & Horizontal
  3. Line Charts
    - Gross sales by month
    - Gross sales and net sales by month
    - Measure of a dimension over a period of time
  4. Pivot Table
    - A way to summarize data
  5. Scatter Plot
    - Two or three measures of a dimension
  6. Tree Map
    - One to two measures for a dimension
  7. Pie Chart
    - Compare values for different dimensions
  8. Heat Map
    - Identify trends & outliers
  9. Story
    - Create a narrative by presenting iterations of an analysis
  10. Dashboard
    - Read-only snapshot of an analysis
123
Q

Redshift - Cross Region Snapshots

A

Cross Region KMS Encrypted Snapshots for KMS encrypted clusters

  1. Snapshot encrypted
124
Q

Redshift Table Design - Key Factors

A
  1. Architecture
  2. Distribution Styles
  3. Sort Keys
  4. Compression
  5. Constraints
  6. Column Sizing
  7. Data Types
125
Q

What is HCatalog?

A

HCatalog is a table storage manager for Hadoop.
It can store data in any format and make it available to external systems like Hive and Pig.
It can write files in many formats like RCFile, CSV, JSON, SequenceFile, and ORC, or custom formats.

126
Q

What is Redshift Vacuum?

A

Vacuum helps to recover space and sort the table.

Vacuum Options: Full, Sort, Delete
Note: When rows are updated or deleted, Redshift does not automatically free up the space

127
Q

What is Pig?

A

Pig is an open source analytics package that runs on top of Hadoop. Pig is operated by Pig Latin, a SQL-like language which allows users to structure, summarize, and query data.

128
Q

Elasticsearch Use Cases

A
  1. Logging & Analysis
  2. Distributed document store
  3. Realtime application monitoring
  4. Clickstream weblog ingestion
129
Q

Redshift Table - Sort Keys

A
  1. Single
  2. Compound
  3. Interleaved
130
Q

What is Zeppelin?

A
- Zeppelin is a web-based notebook that enables interactive data analytics
- Ingestion, discovery, analytics, visualization and collaboration
- Connectors for:
   HDFS/HBase/Hive/Spark
   Flink
   PostgreSQL/Redshift
   Elasticsearch