Data Ingestion and Transformation Flashcards

1
Q

What are the five steps to data discovery in a project?

A
  1. Define Business Value
  2. Identify the data consumers
  3. Identify your data sources
  4. Define your storage, catalog, and access needs
  5. Define your processing needs.
2
Q

What is the best definition of data discovery?

A

The process of finding and understanding relevant data sources within an organization and the relationships between them

3
Q

What are the stages of a workflow in modern data architecture?

A
  1. Ingest
  2. Storage
  3. Catalog
  4. Process
  5. Deliver
  6. Security and Governance (spans the other five stages of the workflow)
4
Q

A data engineer at a large e-commerce company has built various data processing pipelines on AWS that need to run on daily, weekly, and monthly schedules. They want to implement an orchestration layer to automate the scheduling and operation of these pipelines. Which tools would BEST fit this requirement?

A

AWS Step Functions with AWS Lambda and AWS Glue to schedule and automate data-driven workflows
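As a rough illustration of that pattern, here is a hedged boto3 sketch: a Step Functions state machine that runs a Glue job and a Lambda step, scheduled daily with EventBridge. The job name, function name, role ARNs, and schedule are hypothetical placeholders, not part of the exam answer.

```python
# Hypothetical sketch: a Step Functions state machine that runs an AWS Glue job
# and then a Lambda post-processing step, scheduled daily with EventBridge.
# Job name, function name, ARNs, and the cron expression are placeholders.
import json
import boto3

sfn = boto3.client("stepfunctions")
events = boto3.client("events")

definition = {
    "Comment": "Daily ETL: Glue job followed by a Lambda step",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "daily-sales-etl"},
            "Next": "PostProcess",
        },
        "PostProcess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "publish-etl-metrics"},
            "End": True,
        },
    },
}

state_machine = sfn.create_state_machine(
    name="daily-etl-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEtlRole",
)

# EventBridge rule that starts the state machine every day at 06:00 UTC.
events.put_rule(Name="daily-etl-schedule", ScheduleExpression="cron(0 6 * * ? *)")
events.put_targets(
    Rule="daily-etl-schedule",
    Targets=[{
        "Id": "daily-etl",
        "Arn": state_machine["stateMachineArn"],
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)
```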

5
Q

What are the key benefits to using a CI/CD pipeline?

A
  1. Faster time-to-insight by automating data pipeline builds, tests, and deployments
  2. Improved data quality and reliability by catching issues early in the development process
  3. Reduced operational overhead by automatically managing infrastructure and deployments
6
Q

Define Infrastructure as Code

A

IaC refers to managing and provisioning infrastructure through machine-readable definition files instead of physical hardware configuration or interactive configuration tools.

7
Q

What sections exist in a CloudFormation template?

A
  1. Format Version (AWSTemplateFormatVersion) and Description
  2. Parameters
  3. Resources
  4. Outputs
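For orientation, here is a minimal, hypothetical template that shows those four sections, wrapped in Python only so it can be checked with the CloudFormation ValidateTemplate API. The parameter, bucket, and output names are made up.

```python
# Minimal illustration of the four sections listed above, embedded as a YAML
# string and checked with the CloudFormation ValidateTemplate API.
# The parameter, bucket, and output names are placeholders.
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal template showing the common sections
Parameters:
  Environment:
    Type: String
    Default: dev
Resources:
  RawDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'raw-data-${Environment}'
Outputs:
  BucketName:
    Value: !Ref RawDataBucket
"""

cfn = boto3.client("cloudformation")
print(cfn.validate_template(TemplateBody=TEMPLATE))
```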
8
Q

What are the main benefits of using AWS Serverless Application Model?

A
  1. Simplified development and deployment process
  2. Local testing capabilities
  3. Seamless integration with AWS services and resources
9
Q

What are the three key components of AWS Site-to-Site VPN?

A
  1. Virtual private gateway (or transit gateway) on the AWS side
  2. Customer gateway on the on-premises side
  3. An encrypted VPN connection
10
Q

What are spot instances?

A

Spot Instances are spare Amazon EC2 computing capacity offered at significantly discounted prices compared to On-Demand Instance pricing. The prices fluctuate based on supply and demand, but you can set a maximum price you’re willing to pay. These instances can be used for batch processing jobs or non-critical workloads that can tolerate interruptions such as data preprocessing, model training, or batch analytics jobs.
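A minimal boto3 sketch of launching a Spot Instance for a batch job, assuming a placeholder AMI ID, instance type, and maximum price:

```python
# Hedged sketch: requesting a Spot Instance for a batch workload with boto3.
# The AMI ID and the maximum price are placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # placeholder AMI
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "0.10",             # highest hourly price you will pay
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```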

11
Q

What is the primary purpose of AWS PrivateLink?

A

To provide private connectivity between AWS services and applications without using the public internet

12
Q

What are the stages of the data engineering lifecycle?

A
  1. Generation
  2. Storage
  3. Ingestion
  4. Transformation
  5. Serving
13
Q

What are the five Vs of data?

A

Variety
Volume
Velocity
Veracity / Validity
Value

14
Q

What AWS services support stateful data transfer?

A

Amazon ElastiCache
Amazon RDS

15
Q

What AWS services support stateless data transfer?

A

Lambda
API Gateway
Amazon S3

16
Q

What should you look at when troubleshooting and optimizing performance?

A
  1. Bottlenecks
  2. High processing times, memory usage, or I/O operations
  3. Algorithms, partitioning strategy, or parallel processing
  4. Resource allocation
  5. Caching
17
Q

What steps should you go through when troubleshooting problems in a pipeline?

A
  1. Check logs
  2. Verify data at different stages of the process
  3. Implement incremental processing
  4. Add retries
  5. Test
18
Q

What is needed to set up a JDBC (Java Database Connectivity) connection?

A
  1. Prepare the JDBC driver
  2. Configure security groups
  3. Use a connection URL, username, and password (see the sketch below)
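One hedged way to wire those pieces together on AWS is an AWS Glue JDBC connection. Everything below (connection name, JDBC URL, credentials, subnet, and security group) is a placeholder, and in practice the password would come from AWS Secrets Manager rather than code.

```python
# Hedged sketch: registering a JDBC connection in AWS Glue with boto3, using
# the URL/username/password pieces from the card. All identifiers are
# placeholders; store real credentials in AWS Secrets Manager.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "orders-postgres",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://db.example.internal:5432/orders",
            "USERNAME": "etl_user",
            "PASSWORD": "replace-me",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
        },
    }
)
```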
19
Q

What is needed to set up an ODBC (Open Database Connectivity) connection?

A
  1. Select an ODBC driver
  2. Install an ODBC driver
  3. Set up an ODBC data source name (DSN)
  4. Specify the DSN when connecting
20
Q

What services exist to help cache data in your APIs?

A
  1. ElastiCache
  2. API Gateway
  3. Amazon CloudWatch - to monitor and track API usage (see the sketch below)
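A small, hypothetical boto3 sketch of the API Gateway piece: enabling a stage cache and a response TTL. The REST API ID, stage name, cache size, and TTL are assumptions.

```python
# Hedged sketch: enabling API Gateway response caching for a stage with boto3.
# The REST API ID, stage name, cache size, and TTL are placeholders.
import boto3

apigw = boto3.client("apigateway")

apigw.update_stage(
    restApiId="a1b2c3d4e5",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/cacheClusterEnabled", "value": "true"},
        {"op": "replace", "path": "/cacheClusterSize", "value": "0.5"},
        # Enable caching on all methods and keep responses for 5 minutes.
        {"op": "replace", "path": "/*/*/caching/enabled", "value": "true"},
        {"op": "replace", "path": "/*/*/caching/ttlInSeconds", "value": "300"},
    ],
)
```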
21
Q

How can you scale an API?

A
  1. Use auto-scaling in Lambda functions
  2. Configure API Gateway to handle high concurrency and availability requirements.
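A hedged sketch of two common scaling knobs for this setup: reserving concurrency on the backend Lambda function and throttling the API Gateway stage. The function name, API ID, stage, and limits are placeholders.

```python
# Hedged sketch: reserve Lambda concurrency for the backend function and set
# stage-level throttling on API Gateway. All names and limits are placeholders.
import boto3

lam = boto3.client("lambda")
apigw = boto3.client("apigateway")

# Cap (and guarantee) concurrency for the function behind the API.
lam.put_function_concurrency(
    FunctionName="orders-api-handler",
    ReservedConcurrentExecutions=200,
)

# Stage-level throttling so traffic bursts degrade gracefully.
apigw.update_stage(
    restApiId="a1b2c3d4e5",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/*/*/throttling/rateLimit", "value": "500"},
        {"op": "replace", "path": "/*/*/throttling/burstLimit", "value": "1000"},
    ],
)
```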
22
Q

What are the two types of data architecture?

A

Operational architecture - the functional requirements
Technical architecture - how the data is ingested, stored, transformed, and served.

23
Q

What types of issues can cause a pipeline failure?

A
  1. Data quality problems - e.g., wrong file format
  2. Code errors
  3. Endpoint errors - e.g., a service is temporarily offline
  4. Dependency errors
24
Q

Which services can you use for data pipeline orchestration?

A
  1. AWS Data Pipeline
  2. AWS Glue
  3. AWS Step Functions
  4. Amazon MWAA (Managed Workflows for Apache Airflow)
25
Q

What are best practices for data pipelines?

A
  1. Distributed processing (Spark)
  2. Auto scaling
  3. Data partitioning
  4. Fault-tolerant storage (Amazon S3 and Amazon EFS)
  5. Backups (Amazon S3 or AWS Backup)
  6. Monitoring (Amazon CloudWatch)
  7. Validation and quality checks
  8. Automated testing
  9. CI/CD
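As an example of how several of these practices combine in code, here is a small, hypothetical PySpark sketch (the kind of script a Glue or EMR job might run) that deduplicates records, partitions by date, and writes Parquet to S3. The bucket paths and column names are made up.

```python
# Hedged PySpark sketch tying a few best practices together: distributed
# processing with Spark, a basic quality check, date-based partitioning, and
# durable columnar output in S3. Buckets and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

orders = spark.read.json("s3://example-raw-bucket/orders/")

(
    orders
    .dropDuplicates(["order_id"])                     # basic quality check
    .write
    .mode("append")
    .partitionBy("order_date")                        # data partitioning
    .parquet("s3://example-curated-bucket/orders/")   # fault-tolerant storage
)
```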
26
Q

A finance company has developed a machine learning (ML) model to enhance its investment strategy. The model uses various sources of data about stock, bond, and commodities markets. The model has been approved for production. A data engineer must ensure that the data being used to run ML decisions is accurate, complete, and trustworthy. The data engineer must automate the data preparation for the model’s production deployment.

Which solution will meet these requirements?

A

Use Amazon SageMaker ML Lineage Tracking. It creates and stores information about the steps of an ML workflow, which lets you establish model governance and audit standards and helps ensure that the data used to drive ML decisions is accurate, complete, and trustworthy.

27
Q

An ecommerce company runs several applications on AWS. The company wants to design a centralized streaming log ingestion solution. The solution needs to be able to convert the log files to Apache Parquet format. Then, the solution must store the log files in Amazon S3. The number of log files being created varies throughout the day. A data engineer must configure a solution that ensures the log files are delivered in near real time.

Which solution will meet these requirements with the LEAST operational overhead?

A

Configure the applications to send the log files to Kinesis Data Firehose. Configure Firehose to invoke a Lambda function that converts the log files to Parquet format. Configure Firehose to deliver the Parquet files to an output S3 bucket.

You can use Kinesis Data Firehose to deliver log files to Amazon S3 with the least operational overhead. You can use a data-transformation Lambda function with Kinesis Data Firehose. This solution can convert log files to the correct format before the log files are delivered to Amazon S3.
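For reference, a hedged skeleton of the Firehose transformation Lambda contract: each record must come back with its recordId, a result status, and base64-encoded data. This sketch only normalizes the JSON log line; an actual Parquet conversion would typically rely on a library such as pyarrow or on Firehose's built-in record format conversion.

```python
# Hedged sketch of a Firehose data-transformation Lambda. Each incoming record
# carries base64-encoded data and must be returned with the same recordId, a
# result of Ok/Dropped/ProcessingFailed, and base64-encoded output data.
import base64
import json


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        try:
            log = json.loads(payload)
            log["ingested_by"] = "firehose-transform"   # example enrichment
            transformed = json.dumps(log).encode("utf-8")
            result = "Ok"
        except json.JSONDecodeError:
            transformed = payload
            result = "ProcessingFailed"
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```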

28
Q

A company has deployed a data pipeline that uses AWS Glue to process records. The records include a JSON-formatted event and can sometimes include base64-encoded images. The AWS Glue job is configured with 10 data processing units (DPUs). However, the AWS Glue job regularly scales to several hundred DPUs and can take a long time to run.

A data engineer must monitor the data pipeline to determine the appropriate DPU capacity.

Which solution will meet these requirements?

A

Inspect the job monitoring section of the AWS Glue console. Review the results of the previous job runs. Visualize the profiled metrics to determine the appropriate number of DPUs.

You can use the job run monitoring section of the AWS Glue console to determine the appropriate DPU capacity for this scenario. The job monitoring section of the AWS Glue console uses the results of previous job runs to determine the appropriate DPU capacity.
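If you prefer the API to the console, a hypothetical boto3 sketch that pulls recent runs of a placeholder job and compares allocated capacity with reported consumption:

```python
# Hedged sketch: list recent Glue job runs and print allocated vs. consumed
# capacity. The job name is a placeholder; DPUSeconds is only present when
# Glue reports it for a run.
import boto3

glue = boto3.client("glue")

runs = glue.get_job_runs(JobName="image-event-etl", MaxResults=10)["JobRuns"]
for run in runs:
    print(
        run["Id"],
        run.get("ExecutionTime"),   # seconds
        run.get("MaxCapacity"),     # DPUs allocated
        run.get("DPUSeconds"),      # DPU-seconds actually consumed, if reported
    )
```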

29
Q

An Amazon Kinesis application is trying to read data from a Kinesis data stream. However, the read data call is rejected. The following error message is displayed: ProvisionedThroughputExceededException.

Which combination of steps will resolve the error? (Select TWO.)

A

  1. Increase the number of shards within the stream to provide enough capacity for the read data calls.
  2. Make the application retry reading data from the stream.

The ProvisionedThroughputExceededException error is caused by the data stream exceeding its provisioned capacity quotas, often after a sustained rise in the stream's output data rate. Increasing the number of shards provides enough capacity for the read data calls to consistently succeed, and retrying the reads (ideally with exponential backoff) lets the application's requests eventually complete.
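A hedged boto3 sketch of both fixes: raising the shard count with UpdateShardCount and retrying throttled GetRecords calls with exponential backoff. The stream name, target shard count, and shard-iterator handling are placeholders.

```python
# Hedged sketch of both fixes for ProvisionedThroughputExceededException:
# add capacity with UpdateShardCount and retry throttled reads with backoff.
# Stream name, target count, and shard-iterator handling are simplified.
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

# Fix 1: add read capacity by increasing the number of shards.
kinesis.update_shard_count(
    StreamName="app-event-stream",
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)

# Fix 2: retry GetRecords calls that are throttled, backing off between attempts.
def get_records_with_retry(shard_iterator, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
        except ClientError as error:
            if error.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("GetRecords kept getting throttled")
```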