Datawarehousing Flashcards

Question

What are the buffer intervals for S3?

Answer 1

60 to 900 Seconds

Answer 2

Copies data from DynamoDB or S3 into an existing Redshift table

Answer 3

Flatten the record into a single JSON object and make sure it is UTF-8 encoded
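
A minimal sketch of what flattening a record could look like in Python. The helper names `flatten` and `to_single_json_object`, the `.` key separator, and the sample record are assumptions made for illustration; the card does not say how nested keys should be joined.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Collapse nested dicts into one level, joining keys with `sep` (assumed separator)."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

def to_single_json_object(record):
    """Return the flattened record as UTF-8 bytes holding one JSON object."""
    # Compact separators keep the object on a single line with no padding.
    text = json.dumps(flatten(record), ensure_ascii=False, separators=(",", ":"))
    return text.encode("utf-8")

# Hypothetical sample record, not taken from the deck.
payload = to_single_json_object({"user": {"id": 7, "name": "Zoë"}, "event": "click"})
```

Lists are left untouched here; whether they also need flattening depends on the destination.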

Answer 4

1 MB to 100 MB

Answer 5

60 to 900 seconds

Answer 6

A SQL-based query that can aggregate data in a stream and output to a Kinesis stream or a Lambda function

Answer 7

Kinesis Streams. The more customizable option, Streams is best suited for developers building custom applications or streaming data for specialized needs. The customizability of the approach, however, requires manual scaling and provisioning. Data is typically made available in a stream for 24 hours, but for an additional cost, users can gain data availability for up to seven days.

Kinesis Firehose. The simpler approach, Firehose handles loading data streams directly into AWS products for processing. Scaling is handled automatically, up to gigabytes per second, and allows for batching, encrypting, and compressing. Firehose also allows for streaming to S3, Elasticsearch Service, or Redshift, where data can be copied for processing through additional services.

Answer 8

Firehose, Streams, S3, Redshift, Elasticsearch

Answer 9

Yes, but it must be stored in S3; an in-application reference table is then created from it by Kinesis Analytics

Answer 10

Read streaming data, analyze and aggregate it, and deliver it to EMR or Redshift

Answer 11

KPL and KCL are the Kinesis libraries that take care of load balancing, multi-threading, aggregation and de-aggregation, retries, scaling, and other functionality not in the Kinesis API. They sit between the producer and consumer programs and the streams.
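
To illustrate the aggregation and de-aggregation idea only, here is a toy sketch. It is not the libraries' actual wire format or API; the function names and the 1 MB default batch limit are assumptions chosen for the example.

```python
def aggregate(records, max_bytes=1024 * 1024):
    """Pack small byte records into batches no larger than max_bytes each."""
    batches, current, size = [], [], 0
    for rec in records:
        if len(rec) > max_bytes:
            raise ValueError("record larger than the batch limit")
        # Start a new batch when adding this record would exceed the limit.
        if current and size + len(rec) > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(current)
    return batches

def deaggregate(batches):
    """Recover the original records, in order, on the consumer side."""
    return [rec for batch in batches for rec in batch]
```

The point of batching on the producer side is to send fewer, fuller stream records; the consumer side then unpacks them back into the original user records.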

Answer 12

Via the API or via an agent that is installed on each client. The agent monitors for file changes (e.g. log files)

Answer 13

Synchronous and Asynchronous?

Answer 14

Asynchronous

Answer 15

AWS Lambda supports code written in Node.js (JavaScript), Python, Java (Java 8 compatible), C# (.NET Core), and Go. Your code can include existing libraries, even native ones.

Answer 16

They can be provisioned in 1 MB increments via API, Console, or SDK

Answer 17

It is stored for 24 hours by default, and replicated across 3 AZs.

Answer 18

Real-time data analytics, log and data intake and processing, Real-time metrics and reporting

Answer 19

1 MB of data written per second
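
A rough sizing sketch based on that figure, assuming it is the write limit of a single shard. The function name and the example workloads are hypothetical, and the sketch ignores any per-shard limit on record count or read throughput.

```python
import math

WRITE_MB_PER_SHARD_PER_SEC = 1  # figure from the card, assumed to apply per shard

def shards_needed(records_per_sec, avg_record_kb):
    """Estimate how many shards are required to absorb a given write rate."""
    write_mb_per_sec = records_per_sec * avg_record_kb / 1024
    # A stream always has at least one shard.
    return max(1, math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD_PER_SEC))

# 2,000 records/s at 2 KB each is about 3.9 MB/s of writes.
print(shards_needed(2000, 2))
```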

Answer 20

The number of shards

Answer 21

The Amazon Kinesis Storm Spout helps developers use Amazon Kinesis with Storm, an open source, distributed real-time computation system. This version of the Amazon Kinesis Storm Spout fetches data from the Amazon Kinesis stream and emits it as tuples that Storm topologies can process. Developers can add the Spout to their existing Storm topologies, and leverage Amazon Kinesis as a reliable, scalable, stream capture, storage, and replay service that powers their Storm processing applications.

Answer 22

Long term storage and small scale consistent throughput

Answer 23

real-time processing, real-time file processing, cron, AWS events, ETL

Answer 24

Synchronously and Asynchronously

Answer 25

It throws an exception

Answer 26

It gets called 3 times.

Answer 27

Long running apps. Dynamic websites. Stateful apps.

Answer 28

log processing, ETL, Big Data, data mining

Answer 29

No-NameNode architecture that can tolerate failure

Answer 30

An open-source in-memory analytics engine?

Answer 31

SQL for Hadoop

Answer 32

An open-source distributed database running on top of Hadoop

Answer 33

Apache DistCp is an open-source tool you can use to copy large amounts of data. S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly Amazon S3. During a copy operation, S3DistCp stages a temporary copy of the output in HDFS on the cluster.

Answer 34

An implementation of HDFS on S3. You can enable client-side and server-side encryption. Metadata is stored in DynamoDB

Answer 35

small data sets and ACID transactions

Answer 36

Very large dataset and unsupported learning tasks?

Answer 37

DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table, and stores this information in a log for up to 24 hours. ... A DynamoDB stream is an ordered flow of information about changes to items in an Amazon DynamoDB table.
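
A minimal sketch of reading such a stream from a Lambda-style handler, assuming the usual stream event shape with a `Records` list whose entries carry `eventName` and `dynamodb.Keys`. The function name and the sample event are made up for illustration.

```python
def summarize_changes(event):
    """Return (event_name, keys) pairs in the order the records arrive."""
    changes = []
    for record in event.get("Records", []):
        name = record["eventName"]         # INSERT, MODIFY, or REMOVE
        keys = record["dynamodb"]["Keys"]  # key attributes in DynamoDB-typed form
        changes.append((name, keys))
    return changes

# Hypothetical event describing three changes to the same item.
sample_event = {
    "Records": [
        {"eventName": "INSERT", "dynamodb": {"Keys": {"id": {"S": "a1"}}}},
        {"eventName": "MODIFY", "dynamodb": {"Keys": {"id": {"S": "a1"}}}},
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"id": {"S": "a1"}}}},
    ]
}
```

Because the stream is time-ordered per item, replaying the pairs in sequence reconstructs the item's history.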

Answer 38

Joins, ad-hoc queries, blobs, and large data with a low I/O rate

Answer 39

Redshift, because it has columnar storage. It is scalable and works with BI tools

Answer 40

Within an AZ

Answer 41

If you set it up for replication manually, yes.

Answer 42

ACID, BLOB, Unstructured and small datasets

Answer 43

Text, structured data, analytics

Answer 44

Failed clusters are replaced auto-magically

Answer 45

Logstash (log pipeline) and Kibana (analytics and visualization)

Answer 46

Log analysis, streaming data

Answer 47

OLTP and Petabyte Storage

Answer 48

Cloud-powered BI for visualization and ad-hoc queries

Answer 49

Managed DDoS protection

Answer 50

Service that lets you gain insight into where costs are spent.

Answer 51

Extends the Spark API; can be installed on EMR.

Answer 52

Extends the Spark API; allows SQL queries alongside complex calculations

(91 cards)