Data & Analytics Flashcards
Which service is built on the Presto engine?
Athena
What is Amazon Athena?
It’s a serverless SQL query service to analyze data stored in S3, without moving it.
You can also analyze data from other sources by using data source connectors, for example relational or non-relational databases and custom data sources, including on-premises ones.
What do you pay for in Athena?
$5 per TB of data scanned (the exact rate varies by region).
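As a worked example of this pricing, the sketch below estimates the cost of a query from the number of bytes it scans. It assumes the $5 per TB rate from this card and treats 1 TB as 1024^4 bytes; both are assumptions, and the function name is made up for illustration.

```python
# Hypothetical helper: estimate Athena query cost from bytes scanned.
PRICE_PER_TB_USD = 5.0      # rate from the card; assumption, varies by region
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte


def estimate_query_cost(bytes_scanned: int) -> float:
    """Return the estimated cost in USD for scanning `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Scanning 200 GB costs a bit under one dollar at this rate.
cost = estimate_query_cost(200 * 1024 ** 3)
print(f"${cost:.2f}")
```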
Which service is Athena commonly used in conjunction with?
Amazon QuickSight, for reporting and dashboards.
What is the best service to perform Analytics over your S3 data?
Athena.
What are Athena use cases?
To perform analytics on service logs. Many services store their logs in S3, for example VPC Flow Logs, ELB logs, CloudTrail trails, etc.
What to use if I want to analyze data in S3 using serverless SQL?
Athena
What do you pay for in Athena?
The amount of data scanned
How can you save costs when using Athena?
1) Only scan columnar data:
Since with Athena you pay for the data you scan, you can save money by storing data in a columnar format, so a query reads only the columns it needs instead of every full row. This way you scan less data and save money.
You can use the Glue service to transform data into a columnar format. The columnar formats you can use are Parquet and ORC.
2) Compress the data: compressed data is smaller, so you scan less and save money.
3) Partition your S3 datasets so that a query scans only the partitions it needs.
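The combined effect of these three techniques can be sketched with some rough arithmetic. The function below is a simplification, not an Athena formula: it assumes all columns and all partitions are the same size and that compression shrinks the data uniformly. The function and parameter names are hypothetical.

```python
def estimated_scan_gb(total_gb: float, columns_read: int, columns_total: int,
                      partitions_read: int, partitions_total: int,
                      compression_ratio: float) -> float:
    """Rough estimate of the GB scanned after applying the three savings."""
    column_fraction = columns_read / columns_total            # columnar format
    partition_fraction = partitions_read / partitions_total   # partitioning
    return total_gb * column_fraction * partition_fraction / compression_ratio


# 1200 GB of raw data, reading 2 of 20 columns, 1 of 12 monthly partitions,
# with data compressed to a quarter of its original size.
scan = estimated_scan_gb(1200, 2, 20, 1, 12, 4.0)
print(f"{scan:.1f} GB scanned instead of 1200 GB")
```

Because Athena bills by data scanned, the cost drops by the same factor as the scan size.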
What is Glue?
A managed, serverless ETL (extract, transform, load) service from AWS. For example, it can transform data from CSV to Parquet or ORC.
How can you use Glue with Athena?
You can first transform the data into a columnar format with Glue, and then scan only the columns you need with Athena, saving costs by analyzing less data.
What are data source connectors?
A feature of Athena that relies on Lambda functions and lets you run federated queries against other data sources, such as AWS databases or external databases.
The query results can be stored in S3 and analyzed further with Athena.
How does Athena work?
You define a table (a schema) over the objects in an S3 bucket, and Athena runs SQL queries against that data in place; the data is not copied into a separate database. You write the query in SQL, selecting the columns you want from the data in S3.
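A minimal sketch of what such a query might look like when prepared from Python. The database, table, column, and bucket names are hypothetical. The dictionary keys follow the parameters of boto3's Athena `start_query_execution` call; the call itself is left commented out so the snippet runs without AWS credentials.

```python
def build_athena_request(database: str, table: str, columns: list[str],
                         output_location: str) -> dict:
    """Build keyword arguments for an Athena query over data stored in S3."""
    query = f"SELECT {', '.join(columns)} FROM {table} LIMIT 10"
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    database="logs_db",                        # hypothetical database
    table="vpc_flow_logs",                     # hypothetical table
    columns=["srcaddr", "dstaddr"],            # hypothetical columns
    output_location="s3://my-athena-results/",  # hypothetical bucket
)
print(request["QueryString"])

# With AWS credentials configured, the request could be submitted like this:
# import boto3
# boto3.client("athena").start_query_execution(**request)
```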
What is Redshift?
A database service based on PostgreSQL, used for analytics and data warehousing.
What is data warehousing?
Using a database designed to store large volumes of data coming from many different sources and to perform analytics on it.
Is Redshift serverless?
It actually has 2 modes, a serverless mode and a provisioned mode.
What kind of database is Redshift?
It’s a customized PostgreSQL-based database that stores data in columns. It’s used for analytics and data warehousing.
How do Redshift and Athena compare?
Athena is serverless and its data lives in S3.
With Redshift you need to deploy a cluster and load the data into it, but queries are generally much faster than with Athena.
What is a Redshift cluster composed of?
Leader node: plans the queries and aggregates the results.
Compute node(s): perform the queries and send the results to the leader node.
(Queries run on data that has already been loaded into Redshift, for example from S3. This differs from Redshift Spectrum, which queries data directly in S3 using Spectrum nodes.)
Provisioned mode lets you choose instance types, and you can use reserved instances to save money.
Serverless mode is managed by AWS.
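To make the division of labor between the leader node and the compute nodes concrete, here is a toy sketch in plain Python. It is not how Redshift is implemented; the functions and the way the data is split are invented for illustration. Each "compute node" sums its own slice of the data, and the "leader" combines the partial results.

```python
def compute_node_sum(rows: list[int]) -> int:
    """A compute node runs the query on its own slice of the data."""
    return sum(rows)


def leader_node_total(slices: list[list[int]]) -> int:
    """The leader hands work to the compute nodes and aggregates their results."""
    partial_results = [compute_node_sum(s) for s in slices]
    return sum(partial_results)


# Three compute nodes, each holding a different slice of the same table.
slices = [[1, 2, 3], [4, 5], [6]]
print(leader_node_total(slices))  # 21
```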
How do you perform disaster recovery of Redshift?
Redshift backups are incremental snapshots stored in S3.
You can restore a snapshot into a new cluster.
You can configure Redshift to copy snapshots to a different region. In case of a region failure, you can restore the snapshot into a new Redshift cluster in the other region.
In which ways can you ingest data into Redshift?
From Kinesis Data Firehose: Firehose receives data from other sources and writes it to an S3 bucket. A Redshift COPY command then loads the data from this bucket into a Redshift database.
Or you can run the COPY command yourself to load data from an S3 bucket.
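For the manual path, a sketch of how a COPY statement might be assembled in Python. The table name, bucket path, and IAM role ARN are hypothetical placeholders, and the Parquet format is an assumption about how the files were written; the statement would still have to be run against a Redshift database by a SQL client.

```python
def build_copy_statement(table: str, s3_path: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


statement = build_copy_statement(
    table="sales",                                                # hypothetical
    s3_path="s3://my-bucket/sales/",                              # hypothetical
    iam_role_arn="arn:aws:iam::123456789012:role/MyRedshiftRole",  # hypothetical
)
print(statement)
```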
What is Redshift Spectrum?
Normally with Redshift, you would copy data from S3 into Redshift to analyze it.
With the Redshift Spectrum feature, you can use an existing Redshift cluster to query data that stays in S3, without loading it into Redshift.
What are Spectrum nodes?
Nodes used by the Redshift Spectrum feature to query data directly from S3 without loading it into the Redshift database.
Spectrum nodes are not part of your Redshift cluster. They work behind the scenes when you query data directly from S3 with this feature.
What is OpenSearch?
A service for searching and analyzing data. It’s commonly used as a complement to other databases.
OpenSearch is great for searching fields with partial matches.
How do you provision OpenSearch?
You have 2 options: managed mode and serverless mode.
With managed mode, actual EC2 instances are provisioned.
What is OpenSearch for?
You can use OpenSearch to search data that comes from other databases, including by partial matches.
Once you have found a record with OpenSearch, you can retrieve the full data from its original database.
TLDR: It helps you find data that lives in other databases. That is why it’s classed as an analytics DB.
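The search-then-retrieve pattern can be illustrated with a toy example in plain Python. This is not the OpenSearch API: the dictionaries stand in for a primary database and a search index, and the substring check is a naive stand-in for a partial-match search. All names and records are made up.

```python
# Primary "database": full records keyed by id.
products = {
    1: {"name": "wireless headphones", "price": 59},
    2: {"name": "phone charger", "price": 15},
    3: {"name": "desk lamp", "price": 25},
}

# Search "index": only the searchable field, stored alongside the id.
search_index = {pid: record["name"] for pid, record in products.items()}


def search_partial(term: str) -> list[int]:
    """Return the ids whose indexed name contains `term`."""
    return [pid for pid, name in search_index.items() if term.lower() in name]


# Step 1: find matching ids in the search index.
ids = search_partial("phone")
# Step 2: fetch the full records from the primary database.
records = [products[pid] for pid in ids]
print(ids)  # [1, 2]
```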
Which database is best for fast searches of partial data?
OpenSearch