Data Warehousing Flashcards
Is Redshift good for ELT?
Yes
Can a Lambda function be triggered by AWS IoT?
Yes
Can a Lambda function be triggered by Kinesis?
Yes
Can Apache Spark notebooks run on EMR?
Yes
Can Apache Spark read from S3?
Yes
Can Apache Zeppelin be used to visualize data in Amazon Redshift?
Yes
Is Redshift a columnar database?
Yes
Is Redshift MPP?
Yes
Is Redshift ANSI SQL Compliant?
Yes
In addition to data compression and columnar storage, how is I/O reduced in Redshift?
Zone maps: A zone map exists for each 1 MB block and consists of in-memory metadata that tracks the minimum and maximum values within the block. If a column (e.g. a date column) is sorted, Redshift can use the zone maps to skip blocks that cannot contain the requested values, so the blocks that hold the data are found faster. Unlike a conventional database, Amazon Redshift does not use indexes.
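A minimal sketch of the zone-map idea in plain Python, not Redshift's actual implementation. The block size of 4 rows and the names `build_zone_map` and `blocks_to_scan` are assumptions made for illustration. It shows why a sorted column lets most blocks be skipped.

```python
from typing import List, Tuple

BLOCK_ROWS = 4  # hypothetical tiny block; Redshift blocks are 1 MB


def build_zone_map(values: List[int], block_rows: int = BLOCK_ROWS) -> List[Tuple[int, int]]:
    """Return one (min, max) pair per block of the column."""
    return [
        (min(values[i:i + block_rows]), max(values[i:i + block_rows]))
        for i in range(0, len(values), block_rows)
    ]


def blocks_to_scan(zone_map: List[Tuple[int, int]], target: int) -> List[int]:
    """Indexes of blocks whose [min, max] range could contain the target."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= target <= hi]


unsorted_col = [7, 1, 12, 3, 9, 2, 11, 5, 4, 10, 6, 8]
sorted_col = sorted(unsorted_col)

# Unsorted: every block's range overlaps the target, so nothing can be skipped.
unsorted_hits = blocks_to_scan(build_zone_map(unsorted_col), 6)
# Sorted: only one block's range can contain the target.
sorted_hits = blocks_to_scan(build_zone_map(sorted_col), 6)
```

With the sample data, all three blocks must be read for the unsorted column, but only one for the sorted column.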
Can Redshift Clusters be managed via API?
Yes
Does Redshift support ODBC and JDBC?
Yes
Describe the Redshift architecture.
One leader node that communicates with multiple compute nodes, which house the data.
Does Redshift encrypt data at rest?
Yes, with AES-256
Does Amazon Redshift take care of key management?
Yes
Anti-Patterns for Redshift
Small datasets, OLTP, Unstructured data, BLOB data
What are the 2 API methods used to send data to Kinesis Firehose?
PutRecord and PutRecordBatch
What is the max size for a Firehose PutRecord?
1,000 KB
Kinesis Agent
A stand-alone Java application that sends data to Kinesis Streams and Kinesis Firehose. It can be installed on Linux servers.
Can the Kinesis Agent monitor multiple files and write to multiple streams?
Yes
What is the max buffer size for Kinesis Firehose?
3 MB
Can Kinesis Firehose invoke a Lambda function?
Yes
Why should a record separator be added to Kinesis Stream data?
Kinesis stream bundles records together. If you don’t add a record separator, you can’t split the records later.
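A small sketch of the point above, assuming records are JSON objects and the separator is a newline (both are illustrative choices, not requirements stated in the card):

```python
import json

records = [{"id": 1, "event": "click"}, {"id": 2, "event": "view"}]

# Without a separator the JSON objects run together on one line.
no_separator = "".join(json.dumps(r) for r in records)

# With a trailing newline each record sits on its own line.
with_separator = "".join(json.dumps(r) + "\n" for r in records)

# The separated payload can be split back into the original records.
recovered = [json.loads(line) for line in with_separator.splitlines()]
```

Here `no_separator` is a single run of text with no record boundary to split on, while `recovered` equals the original list.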
What are buffer sizes for S3?
1 MB - 128 MB
What are the buffer intervals for S3?
60 to 900 seconds
Can Kinesis Firehose dynamically raise the buffer size?
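Firehose delivers a buffer when either the size or the interval threshold is reached, whichever comes first. The toy class below sketches that rule only. `DeliveryBuffer` and its tiny thresholds are hypothetical, and the clock is passed in by hand so the behaviour is easy to follow.

```python
from typing import List, Optional


class DeliveryBuffer:
    """Toy buffer that flushes on a size or time threshold, whichever comes first."""

    def __init__(self, max_bytes: int, max_seconds: float) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.pending: List[bytes] = []
        self.pending_bytes = 0
        self.opened_at: Optional[float] = None  # time the first pending record arrived
        self.flushed: List[bytes] = []          # stands in for objects delivered to S3

    def put(self, record: bytes, now: float) -> None:
        if self.opened_at is None:
            self.opened_at = now
        self.pending.append(record)
        self.pending_bytes += len(record)
        self.maybe_flush(now)

    def maybe_flush(self, now: float) -> None:
        if not self.pending:
            return
        too_big = self.pending_bytes >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            self.flushed.append(b"".join(self.pending))
            self.pending, self.pending_bytes, self.opened_at = [], 0, None


buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
buf.put(b"aaaa", now=0)       # 4 bytes, stays buffered
buf.put(b"bbbbbbb", now=1)    # 11 bytes total, size threshold triggers a flush
buf.put(b"cc", now=2)         # 2 bytes, stays buffered
buf.maybe_flush(now=62)       # 60 s since the buffer opened, time threshold triggers a flush
```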
Yes
What does the Redshift copy command do?
Copies data from DynamoDB or S3 into an existing Redshift table
Before you send a record to Kinesis Firehose, what do you need to do?
Flatten the record into a single JSON object and make sure it is UTF-8 encoded
What is the Elasticsearch buffer size range?
1 MB to 100 MB
What is the buffer interval for Elasticsearch?
60 to 900 seconds
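One possible way to do this, sketched in plain Python. The `flatten` helper, the underscore key-joining convention, and the sample record are all assumptions for illustration.

```python
import json
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Collapse nested dicts into one level, joining key names with underscores."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


nested = {"device": {"id": "d-1", "loc": {"city": "Zürich"}}, "temp": 21.5}
flat = flatten(nested)

# One JSON object, newline-terminated, encoded as UTF-8 bytes.
payload = (json.dumps(flat, ensure_ascii=False) + "\n").encode("utf-8")
```

The resulting `payload` is the kind of byte string that would be passed as the record data in a PutRecord call.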
Describe Kinesis Analytics
A SQL-based service that can aggregate data in a stream and send the output to a Kinesis stream or a Lambda function
What is the maximum time a Lambda Function can run?
15 minutes (900 seconds). The limit was 5 minutes until it was raised in October 2018.
How do Kinesis Stream and Kinesis Firehose differ?
Kinesis Streams. The more customizable option, Streams is best suited for developers building custom applications or streaming data for specialized needs. The customizability of the approach, however, requires manual scaling and provisioning. Data typically is made available in a stream for 24 hours, but for an additional cost, users can gain data availability for up to seven days.
Kinesis Firehose. The simpler approach, Firehose handles loading data streams directly into AWS products for processing. Scaling is handled automatically, up to gigabytes per second, and allows for batching, encrypting, and compressing. Firehose also allows for streaming to S3, Elasticsearch Service, or Redshift, where data can be copied for processing through additional services.
What are some destinations for Kinesis Analytics?
Firehose, Streams, S3, Redshift, Elasticsearch
Can data be enriched via Kinesis Stream?
Yes, but the reference data must be stored in S3; an in-application reference table is then created from it
What is a common use case for Kinesis Stream?
Read streaming data and analyze and aggregate it and drop to EMR or Redshift
Why would one use the KPL and KCL?
The KPL (Kinesis Producer Library) and KCL (Kinesis Client Library) are the Kinesis libraries that take care of load balancing, multi-threading, aggregation and de-aggregation, retries, scaling, and other functionality not in the Kinesis API. They sit between the producer and consumer programs and the streams.
How else is data placed into a Kinesis Stream?
Via the API or via an agent that is installed on each client. The agent monitors for file changes (e.g. log files)
What are the two modes of operation for the KPL?
Synchronous and asynchronous
Which mode is preferred practice?
Asynchronous
If you had to reduce end-to-end latency would you use KPL, Kinesis Agent, or the Kinesis API?
The Kinesis API
What languages does Lambda support?
AWS Lambda supports code written in Node.js (JavaScript), Python, Java (Java 8 compatible), C# (.NET Core), and Go. Your code can include existing libraries, even native ones.
What are the 3 ways to provision I/O in a Kinesis stream?
Throughput can be provisioned in 1 MB increments via the API, the console, or the SDK
What can you tell me about data in a Kinesis stream?
It is stored for 24 hours by default, and replicated across 3 AZs.
Ideal Patterns for Kinesis Stream?
Real-time data analytics, log and data intake and processing, Real-time metrics and reporting
Is a Kinesis stream made up of shards?
Yes
How many read transactions per second does each shard give you?
5
How many MB per second can 5 read transactions give you?
2 MB
How many writes per second can a shard support?
1,000 records
A shard can support how much per second?
1 MB data written per second
What determines the data capacity of your stream?
The number of shards
Each shard can capture how many MB per second?
1 MB
Each shard can be read at how many MB per second?
2 MB
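A rough sizing sketch using the per-shard limits from the cards above (1 MB/s and 1,000 records/s written, 2 MB/s read). The function name `shards_needed` and the sample workloads are assumptions, and real sizing should leave headroom for traffic spikes.

```python
import math

WRITE_MB_PER_SHARD = 1          # MB written per second per shard
WRITE_RECORDS_PER_SHARD = 1000  # records written per second per shard
READ_MB_PER_SHARD = 2           # MB read per second per shard


def shards_needed(write_mb_s: float, records_s: float, read_mb_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


# Bandwidth-bound workload: 4.5 MB/s in, 3,000 records/s, 9 MB/s out.
bandwidth_bound = shards_needed(write_mb_s=4.5, records_s=3000, read_mb_s=9.0)

# Record-count-bound workload: many tiny records.
record_bound = shards_needed(write_mb_s=0.2, records_s=2500, read_mb_s=0.4)
```

The first workload needs 5 shards because of write and read bandwidth. The second needs 3 shards because of the record rate alone.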
In case of failure, where can you store the cursor for Kinesis?
DynamoDB
What is the Kinesis Storm Spout?
The Amazon Kinesis Storm Spout helps developers use Amazon Kinesis with Storm, an open source, distributed real-time computation system. This version of the Amazon Kinesis Storm Spout fetches data from the Amazon Kinesis stream and emits it as tuples that Storm topologies can process. Developers can add the Spout to their existing Storm topologies, and leverage Amazon Kinesis as a reliable, scalable, stream capture, storage, and replay service that powers their Storm processing applications.
Name two anti-patterns for Kinesis?
Long term storage and small scale consistent throughput
Name 5 ideal patterns for Lambda.
Real-time processing, real-time file processing, cron, AWS events, ETL
In what two modes can Lambda functions be invoked?
Synchronously and Asynchronously
What happens when a synchronously called Lambda function fails?
It throws an exception
What happens when an asynchronously invoked Lambda function fails?
It is invoked up to 3 times in total (the initial attempt plus two retries).
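The retry behaviour can be pictured with a local simulation. This is only a sketch: `invoke_async` and `flaky_handler` are hypothetical names, no AWS API is called, and the real service also waits between attempts.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def invoke_async(
    handler: Callable[[Dict[str, Any]], Any],
    event: Dict[str, Any],
    max_attempts: int = 3,
) -> Tuple[Optional[Any], int, List[str]]:
    """Call handler until it succeeds or max_attempts is exhausted."""
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(event), attempt, errors
        except Exception as exc:  # a failed attempt is recorded, then retried
            errors.append(str(exc))
    return None, max_attempts, errors


calls = {"count": 0}


def flaky_handler(event: Dict[str, Any]) -> Any:
    """Hypothetical handler that always fails, to exercise every retry."""
    calls["count"] += 1
    raise RuntimeError(f"attempt {calls['count']} failed")


result, attempts, errors = invoke_async(flaky_handler, {"id": 1})
```

After the run, the handler has been called three times and no result is returned.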
How many Lambda functions can run concurrently per account?
1,000 per region by default (a soft limit that can be raised). The default was previously 100.
What are the 3 anti-patterns for Lambda?
Long running apps. Dynamic websites. Stateful apps.
Ideal usage patterns for EMR?
log processing, ETL, Big Data, data mining
Is EMR fault-tolerant for core node failure?
Yes
Does EMR provision for failed slave nodes?
No
Amazon EMR with MapR distribution has what advantage?
A no-NameNode architecture that can tolerate failures
Does EMR integrate with S3 and DynamoDB?
Yes
What is Spark?
An open-source, in-memory analytics engine
What is Impala?
A SQL query engine for Hadoop
What is HBase?
An open-source distributed database running on top of Hadoop
What is S3DistCp?
Apache DistCp is an open-source tool you can use to copy large amounts of data. S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly Amazon S3. During a copy operation, S3DistCp stages a temporary copy of the output in HDFS on the cluster.
What is EMRFS?
An implementation of HDFS on S3. You can enable client-side and server-side encryption. Metadata is stored in DynamoDB.
Name 2 anti-patterns for EMR?
small data sets and ACID transactions
Name 2 anti-patterns for ML?
Very large datasets and unsupported learning tasks
What is DynamoDB Streams?
DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table, and stores this information in a log for up to 24 hours. … A DynamoDB stream is an ordered flow of information about changes to items in an Amazon DynamoDB table.
What is the limit of data storage for DynamoDB?
None
What are the anti-patterns for DynamoDB?
Joins, ad hoc queries, BLOBs, and large data with a low I/O rate
What service would you use for OLAP/BI?
Redshift, because it has columnar storage. It is scalable and works with BI tools.
Where do Redshift clusters reside?
Within an AZ
Can Redshift clusters reside across multiple AZs?
If you set it up for replication manually, yes.
Name 4 Redshift anti-patterns.
ACID, BLOB, Unstructured and small datasets
What types of searches are done with Elasticsearch?
Text, structured data, analytics
Is Elasticsearch self-healing?
Yes. Failed nodes are automatically detected and replaced.
What does ES integrate with?
Logstash (log pipeline) and Kibana (analytics and visualization)
What is Elasticsearch suited for?
Log analysis and streaming data
Elasticsearch anti-patterns
OLTP and petabyte storage
QuickSight
Cloud-powered BI for visualization and ad hoc queries
AWS Shield
Managed DDoS protection
What is Cost Explorer?
Service that lets you gain insight into where costs are spent.
Spark Streaming
Extends the Spark API; can be installed on EMR.
SparkSQL
Extends the Spark API; allows SQL queries alongside complex calculations