Data & Analytics Flashcards by Leo Crosa

Which service is built on the Presto engine?

Athena

What is Amazon Athena?

It's a serverless SQL query service for analyzing data stored in S3 without moving it.

You can also analyze data from other sources by using data source connectors: relational or non-relational databases, and custom data sources such as on-premises systems.

What do you pay for in Athena?

$5 per TB of data scanned.

Which service is Athena used in conjunction with?

It's commonly used with Amazon QuickSight for reporting and dashboards.

What is the best service to perform Analytics over your S3 data?

Athena.

What are Athena use cases?

Analyzing service logs. Many AWS services store their logs in S3, for example VPC Flow Logs, ELB logs, and CloudTrail trails.

What to use if I want to analyze data in S3 using serverless SQL?

Athena

What do you pay for in Athena?

The amount of data scanned

How can you save costs when using Athena?

1) Use columnar data:

Since with Athena you pay for the data you scan, you can save money by storing data in a columnar format. A query then reads only the columns it needs instead of every full row, so it scans less data.
You can use the AWS Glue service to transform data into a columnar format. The columnar formats you can use are Parquet and ORC.

2) Compress the data: compressed data means fewer bytes to scan.

3) Partition your S3 datasets so queries scan only the partitions they need.
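
To make the savings concrete, here is a rough back-of-the-envelope sketch. The dataset size, column count, and partition count are hypothetical, the price is the $5 per TB figure from the card above, and a TB is treated as 1024^4 bytes for simplicity. It also assumes all columns are the same size and ignores compression.

```python
TB = 1024 ** 4            # bytes per terabyte (assumption: binary TB)
PRICE_PER_TB_USD = 5.0    # price quoted in the pricing card


def scan_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Return the cost of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TB * price_per_tb


# Hypothetical dataset: 2 TB of row-based data with 10 equally sized columns.
dataset_bytes = 2 * TB
full_scan = scan_cost_usd(dataset_bytes)

# Columnar format, query reads 2 of the 10 columns: about 1/5 of the bytes.
columnar_bytes = dataset_bytes * 2 // 10
columnar_scan = scan_cost_usd(columnar_bytes)

# Also partitioned by day, query touches 1 of 30 daily partitions.
partitioned_scan = scan_cost_usd(columnar_bytes // 30)

print(f"full scan:              ${full_scan:.2f}")
print(f"columnar, 2 columns:    ${columnar_scan:.2f}")
print(f"columnar + 1 partition: ${partitioned_scan:.4f}")
```

Each technique cuts the bytes scanned, and the effects multiply when they are combined.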

What is Glue?

A managed ETL (extract, transform, load) service. For example, it can transform data from CSV to Parquet or ORC.

How can you use Glue with Athena?

You can first transform the data into a columnar format with Glue, then query only the columns you need with Athena. Scanning less data lowers the cost.

What are data source connectors?

A feature of Athena that relies on Lambda functions and lets you run federated queries against other data sources, such as AWS databases or external databases.

The query results are stored in S3.

How does Athena work?

You define a table (a schema) over the data in an S3 bucket and run standard SQL queries against it. Athena reads the objects in place; the data is not loaded into a separate database. In the query you select the columns you want from the S3 data.
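
As an illustration, the sketch below builds the two kinds of SQL text you would typically submit: a table definition pointing at an S3 prefix, and a query over it. The table name, columns, bucket, and delimiter are all hypothetical, and the code only produces strings; it does not talk to AWS.

```python
def create_table_ddl(table, columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for comma-delimited files in S3."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}';"
    )


# Hypothetical log table stored under an example bucket prefix.
ddl = create_table_ddl(
    table="access_logs",
    columns={"user_name": "string", "action": "string", "status": "string"},
    s3_location="s3://example-bucket/logs/",
)

# Query only the columns needed, straight from the S3-backed table.
query = (
    "SELECT user_name, COUNT(*) AS attempts\n"
    "FROM access_logs\n"
    "WHERE status = 'denied'\n"
    "GROUP BY user_name;"
)

print(ddl)
print(query)
```

The table definition is only metadata: dropping it leaves the objects in S3 untouched.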

What is Redshift?

A database service based on PostgreSQL, used for analytics and data warehousing.

What is data warehousing?

Using a database suited to storing large volumes of data from many different sources and running analytics on it.

Is Redshift serverless?

It has two modes: a serverless mode and a provisioned mode.

What kind of database is Redshift?

It's a customized PostgreSQL-based database that stores data in columns. It's used for analytics and data warehousing.

How do Redshift and Athena compare?

Athena is serverless and its data lives in S3.

With Redshift you need to deploy a cluster (in provisioned mode) and load the data, but it can be a lot faster than Athena.

What is a Redshift cluster composed of?

Leader node: plans queries and aggregates results.
Compute node(s): run the queries and send results to the leader node.

(Queries run on data that has already been loaded into Redshift, for example from S3. This differs from Redshift Spectrum, which queries data directly in S3 using Spectrum nodes.)

Provisioned mode lets you choose instance types and use reserved instances to save money.

In serverless mode, capacity is managed by AWS.

How do you perform disaster recovery of Redshift?

Redshift backups are incremental snapshots stored in S3.

You can restore a snapshot into a new cluster.

You can configure Redshift to copy snapshots to a different Region. If a Region fails, you restore the copied snapshot into a new Redshift cluster in the other Region.
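
A minimal sketch of turning on cross-Region snapshot copy from code. It assumes `client` behaves like a boto3 Redshift client (`boto3.client("redshift")`) and calls its `enable_snapshot_copy` operation; the cluster name, Region, and retention period are hypothetical. A small recording stand-in is used so the example runs without AWS access.

```python
class RecordingClient:
    """Stand-in for a Redshift client so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def enable_snapshot_copy(self, **kwargs):
        # Record the request instead of sending it to AWS.
        self.calls.append(kwargs)
        return {}


def enable_cross_region_copy(client, cluster_id, destination_region, retention_days=7):
    """Ask Redshift to copy the cluster's snapshots to another Region."""
    client.enable_snapshot_copy(
        ClusterIdentifier=cluster_id,
        DestinationRegion=destination_region,
        RetentionPeriod=retention_days,
    )


# Dry run with hypothetical names; swap in a real client to apply it.
dry_run = RecordingClient()
enable_cross_region_copy(dry_run, "example-cluster", "eu-west-1")
print(dry_run.calls)
```

Once copies exist in the second Region, recovery is the restore step described above, performed there.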

In which ways can you ingest data into Redshift?

From Kinesis Data Firehose: Firehose receives data from other sources and stores it in an S3 bucket. Redshift then runs an S3 COPY from this bucket into a Redshift table.

Or you can run the COPY command yourself to load data from an S3 bucket.
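
For the manual route, the statement you run looks roughly like the one this sketch builds. The table name, bucket path, and IAM role ARN are hypothetical placeholders, and the helper only assembles the SQL text; running it requires a connection to your cluster.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, data_format="PARQUET"):
    """Build a Redshift COPY statement that loads S3 data into `table`."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {data_format};"
    )


# Hypothetical values for illustration only.
copy_sql = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(copy_sql)
```

The IAM role is what allows the cluster to read from the bucket.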

What is Redshift Spectrum?

Normally with Redshift, you copy data from S3 into Redshift to analyze it.

With the Redshift Spectrum feature, you can use an existing Redshift cluster to query data that is still in S3, without loading it into Redshift.

What are Spectrum nodes?

Nodes used by the Redshift Spectrum feature to query data directly from S3 without loading it into the Redshift database.

Spectrum nodes are not part of your Redshift cluster. They work behind the scenes when you query data directly from S3 with this feature.

What is OpenSearch?

A service for searching and analyzing data. It's commonly used as a complement to other databases.

OpenSearch is great for searching fields with partial matches.

How do you provision OpenSearch?

You have two options: managed mode and serverless mode. With managed mode, actual EC2 instances are provisioned.

What is OpenSearch for?

You can use OpenSearch to search a copy of data that lives in another database, including by partial match. Once you have found an item with OpenSearch, you retrieve the full record from its original database. TL;DR: it helps you find data held in other databases, which is why it counts as an analytics database.
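
The search-then-fetch pattern can be sketched in plain Python. Everything here is a stand-in: a dict plays the primary database, a list of lowercased names plays the search index, and a simple substring test plays the partial match. A real OpenSearch domain uses its own indexing and query machinery.

```python
# Hypothetical primary store (standing in for e.g. a DynamoDB table).
PRIMARY_DB = {
    "g-001": {"name": "NightOwl", "level": 42},
    "g-002": {"name": "OwlHunter", "level": 17},
    "g-003": {"name": "PixelFox", "level": 30},
}


def build_search_index(db):
    """Copy only the searchable field and the record ID into the index."""
    return [(record["name"].lower(), key) for key, record in db.items()]


def search_ids(index, fragment):
    """Return the IDs whose indexed name contains the fragment."""
    needle = fragment.lower()
    return [key for name, key in index if needle in name]


def find_gamers(fragment):
    """Search the index first, then fetch full records from the primary store."""
    index = build_search_index(PRIMARY_DB)
    return [PRIMARY_DB[key] for key in search_ids(index, fragment)]


matches = find_gamers("owl")
print([m["name"] for m in matches])
```

The index holds only what is needed to find a record; the primary database stays the source of truth for the full item.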

Which database is best for fast searches on partial data?

OpenSearch

Which database is best for analysis of large, complex, long-term data?

Redshift

Which services can OpenSearch be used with?

Kinesis Data Streams and Kinesis Data Firehose, in conjunction with Lambda functions. CloudWatch Logs. DynamoDB, with DynamoDB Streams and Lambda.

Which database service is great for analyzing big data?

EMR: Elastic MapReduce. For analysis and processing of big data.

What does EMR infrastructure look like?

It's made of clusters. An EMR cluster can be made of hundreds of EC2 instances. Node types:
Master node: manages the cluster.
Core node: runs tasks and stores data.
Task node: optional, runs tasks only. Used for scaling and usually made of Spot Instances.

What does EMR simplify?

It comes bundled with Apache Spark, HBase, Presto, Flink, etc. These are open-source tools for big data that are difficult and time-consuming to set up. EMR simplifies all that by including these tools.

What could you use spot instances for when running EMR?

For task nodes, which are optional and only run tasks.

What is Amazon QuickSight and what is it for?

A service that takes data from other data sources to create interactive dashboards.

What are QuickSight use cases?

Business analytics. Get business insights from data.

What data sources can you use with QuickSight?

RDS, Aurora, Athena, Redshift, S3, OpenSearch, Timestream. Imports with SPICE: XLS, CSV, JSON, etc. Third-party databases, such as on-premises databases. And third-party data sources like Salesforce, Jira, etc.

What is the SPICE engine?

An in-memory engine in QuickSight. It's used when you import data directly into QuickSight instead of querying it from another database.

What is the AWS QuickSight Enterprise edition?

A more expensive version with more features than Standard. It lets you have groups of users (Standard has users only). These users and groups exist only within QuickSight and are unrelated to IAM. It also lets you hide certain columns from users with fewer privileges. This is called column-level security (CLS).

How does AWS QuickSight work?

You analyze data in a certain way, with filters and so on, and publish the result as a read-only dashboard that you can share with other QuickSight users.

What is Glue useful for? What are the service's characteristics?

It's an ETL service: you extract data from a source, transform it, then load it into another database for analysis. It's fully serverless. For example: extract data from S3 or RDS, transform it so Redshift can analyze it, and then load it into Redshift. Another example is transforming data into Parquet (a columnar format) to be used by Athena.

What is the Glue Data Catalog?

A Glue feature. Using Glue crawlers, Glue reads different AWS data stores and writes metadata about tables, columns, data types, etc. into the Glue Data Catalog.

Why is the Glue Data Catalog important?

Other analytics services rely on the Glue crawlers and Data Catalog behind the scenes to discover data. For example Athena, Redshift, and EMR.

What are Glue job bookmarks?

A Glue feature that prevents reprocessing old data.
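
The idea behind a bookmark can be shown with a small simulation. This is a conceptual sketch only, not how Glue stores or tracks its state: a set remembers which input files earlier runs handled, so a rerun only picks up what is new. The file names are hypothetical.

```python
def run_job(available_keys, bookmark):
    """Process only the keys not seen in earlier runs, then record them."""
    new_keys = [key for key in available_keys if key not in bookmark]
    # A real job would transform and load the data for new_keys here.
    bookmark.update(new_keys)
    return new_keys


bookmark = set()  # state persisted between runs (kept in memory here)

first_run = run_job(["day1.csv", "day2.csv"], bookmark)
second_run = run_job(["day1.csv", "day2.csv", "day3.csv"], bookmark)

print(first_run)
print(second_run)
```

The second run skips the two files handled earlier and processes only the newly arrived one.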

What is AWS Lake Formation?

A fully managed AWS service that lets you set up a data lake in days. A data lake is a central place to keep your data for analytics purposes.

What are Lake Formation features?

Discover, cleanse, transform, and ingest data into your data lake. It automates many of these steps so you can set up your data lake faster. It's integrated with other AWS databases.

Where is the Lake Formation data lake stored?

In Amazon S3.

Which AWS database service has fine-grained access control?

Lake Formation.

What are Lake Formation blueprints?

These are templates the service provides for the different sources you can gather data from to build your data lake. They make collecting and extracting data easier, so the data lake is set up faster.

Which service is built on AWS Glue?

Lake Formation. Its crawlers, ETL functions, and data catalog are features that come from Glue.

What is the purpose of Lake Formation?

You build a data lake so that its data can be analyzed by other services. Services that leverage Lake Formation include Athena, Redshift, and EMR. You work with the data lake through these analytics services.

Why in a security sense would you use Lake Formation?

It's a way to centralize access control. Lake Formation has these capabilities, so you save the trouble of building access control in each analytics service. Instead you set it up once in Lake Formation and run all your analytics on the data lake through Lake Formation.

Which service lets you centralize access control to your databases?

Lake Formation.

What are the two options for analysis when using Kinesis Data Analytics?

Apache Flink or SQL applications.

What is MSK?

Amazon Managed Streaming for Apache Kafka: a fully managed Apache Kafka service. It's an alternative to Kinesis. It has a provisioned mode and a serverless mode.

You would like to have a database that is efficient at performing analytical queries on large sets of columnar data. You would like to connect to this Data Warehouse using a reporting and dashboard tool such as Amazon QuickSight. Which AWS technology do you recommend?

Redshift

You have a lot of log files stored in an S3 bucket on which you want to perform a quick analysis, serverless if possible, to filter the logs and find users that attempted an unauthorized action. Which AWS service allows you to do so?

Athena

You are running a gaming website that is using DynamoDB as its data store. Users have been asking for a search feature to find other gamers by name, with partial matches if possible. Which AWS technology do you recommend to implement this feature?

OpenSearch

As a Solutions Architect, you have been instructed to prepare a disaster recovery plan for a Redshift cluster. What should you do?

Enable Automated Snapshots, then configure your Redshift Cluster to automatically copy snapshots to another region.

Which feature in Redshift forces all COPY and UNLOAD traffic moving between your cluster and data repositories through your VPCs?

Enhanced VPC Routing

A company is using AWS to host its public websites and internal applications. Those different websites and applications generate a lot of logs and traces. There is a requirement to centrally store those logs and efficiently search and analyze those logs in real-time for detection of any errors and if there is a threat. Which AWS service can help them efficiently store and analyze logs?

OpenSearch

Which AWS service allows you to create, run, and monitor ETL (extract, transform, and load) jobs in a few clicks?

Glue

……………………….. makes it easy and cost-effective for data engineers and analysts to run applications built using open source big data frameworks such as Apache Spark, Hive, or Presto without having to operate or manage clusters.

EMR

An e-commerce company has all its historical data such as orders, customers, revenues, and sales for the previous years hosted on a Redshift cluster. There is a requirement to generate some dashboards and reports indicating the revenues from the previous years and the total sales, so it will be easy to define the requirements for the next year. The DevOps team is assigned to find an AWS service that can help define those dashboards and have native integration with Redshift. Which AWS service is best suited?

QuickSight

Which AWS Glue feature allows you to save and track the data that has already been processed during a previous run of a Glue ETL job?

Glue Job Bookmarks. They avoid reprocessing data.

You are a DevOps engineer in a machine learning company with 3 TB of JSON files stored in an S3 bucket. There’s a requirement to do some analytics on those files using Amazon Athena and you have been tasked to find a way to convert those files’ format from JSON to Apache Parquet. Which AWS service is best suited?

AWS Glue

You have an on-premises application that is used together with an on-premises Apache Kafka to receive a stream of clickstream events from multiple websites. You have been tasked to migrate this application as soon as possible without any code changes. You decided to host the application on an EC2 instance. What is the best option you recommend to migrate Apache Kafka?

MSK

You have data stored in RDS, S3 buckets and you are using AWS Lake Formation as a data lake to collect, move and catalog data so you can do some analytics. You have a lot of big data and ML engineers in the company and you want to control access to part of the data as it might contain sensitive information. What can you use?

Lake Formation Fine-Grained Access Control

Which AWS service is most appropriate when you want to perform real-time analytics on streams of data?

Kinesis Data Analytics

Data & Analytics Flashcards

(68 cards)