Data Ingestion and Transformation Flashcards

1
Q

What are the five steps to data discovery in a project?

A
  1. Define Business Value
  2. Identify the data consumers
  3. Identify your data sources
  4. Define your storage, catalog, and access needs
  5. Define your processing needs.
2
Q

What is the best definition of data discovery?

A

The process of finding and understanding relevant data sources within an organization and the relationships between them

3
Q

What are the stages of a workflow in modern data architecture?

A
  1. Ingest
  2. Storage
  3. Catalog
  4. Process
  5. Deliver
  6. Security and Governance (spans the other five stages of the workflow)
4
Q

A data engineer at a large e-commerce company has built various data processing pipelines on AWS that need to run on daily, weekly, and monthly schedules. They want to implement an orchestration layer to automate the scheduling and operation of these pipelines. Which tools would BEST fit this requirement?

A

AWS Step Functions with AWS Lambda and AWS Glue to schedule and automate data-driven workflows
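As a rough illustration of that pattern, here is a hedged boto3 sketch: a Step Functions state machine that runs a Glue job and a Lambda step, scheduled daily with EventBridge. The job name, function name, role ARNs, and schedule are hypothetical placeholders, not part of the exam answer.

```python
# Hypothetical sketch: a Step Functions state machine that runs an AWS Glue job
# and then a Lambda post-processing step, scheduled daily with EventBridge.
# Job name, function name, ARNs, and the cron expression are placeholders.
import json
import boto3

sfn = boto3.client("stepfunctions")
events = boto3.client("events")

definition = {
    "Comment": "Daily ETL: Glue job followed by a Lambda step",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "daily-sales-etl"},
            "Next": "PostProcess",
        },
        "PostProcess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "publish-etl-metrics"},
            "End": True,
        },
    },
}

state_machine = sfn.create_state_machine(
    name="daily-etl-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEtlRole",
)

# EventBridge rule that starts the state machine every day at 06:00 UTC.
events.put_rule(Name="daily-etl-schedule", ScheduleExpression="cron(0 6 * * ? *)")
events.put_targets(
    Rule="daily-etl-schedule",
    Targets=[{
        "Id": "daily-etl",
        "Arn": state_machine["stateMachineArn"],
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)
```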

5
Q

What are the key benefits to using a CI/CD pipeline?

A
  1. Faster time-to-insight by automating data pipeline builds, tests, and deployments
  2. Improved data quality and reliability by catching issues early in the development process
  3. Reduced operational overhead by automatically managing infrastructure and deployments
6
Q

Define Infrastructure as Code

A

IaC refers to managing and provisioning infrastructure through machine-readable definition files instead of physical hardware configuration or interactive configuration tools.

7
Q

What sections exist in a CloudFormation template?

A
  1. Format Version (AWSTemplateFormatVersion) and Description
  2. Parameters
  3. Resources
  4. Outputs
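For orientation, here is a minimal, hypothetical template that shows those four sections, wrapped in Python only so it can be checked with the CloudFormation ValidateTemplate API. The parameter, bucket, and output names are made up.

```python
# Minimal illustration of the four sections listed above, embedded as a YAML
# string and checked with the CloudFormation ValidateTemplate API.
# The parameter, bucket, and output names are placeholders.
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal template showing the common sections
Parameters:
  Environment:
    Type: String
    Default: dev
Resources:
  RawDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'raw-data-${Environment}'
Outputs:
  BucketName:
    Value: !Ref RawDataBucket
"""

cfn = boto3.client("cloudformation")
print(cfn.validate_template(TemplateBody=TEMPLATE))
```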
8
Q

What are the main benefits of using AWS Serverless Application Model?

A
  1. Simplified development and deployment process
  2. Local testing capabilities
  3. Seamless integration with AWS services and resources
9
Q

What are the three key components of AWS Site-to-Site VPN?

A
  1. Virtual private gateway (or transit gateway) on the AWS side
  2. Customer gateway on the on-premises side
  3. An encrypted VPN connection
10
Q

What are spot instances?

A

Spot Instances are spare Amazon EC2 computing capacity offered at significantly discounted prices compared to On-Demand Instance pricing. The prices fluctuate based on supply and demand, but you can set a maximum price you’re willing to pay. These instances can be used for batch processing jobs or non-critical workloads that can tolerate interruptions such as data preprocessing, model training, or batch analytics jobs.
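A minimal boto3 sketch of launching a Spot Instance for a batch job, assuming a placeholder AMI ID, instance type, and maximum price:

```python
# Hedged sketch: requesting a Spot Instance for a batch workload with boto3.
# The AMI ID and the maximum price are placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # placeholder AMI
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "0.10",             # highest hourly price you will pay
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```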

11
Q

What is the primary purpose of AWS PrivateLink?

A

To provide private connectivity between AWS services and applications without using the public internet

12
Q

What are the stages of the data engineering lifecycle?

A
  1. Generation
  2. Storage
  3. Ingestion
  4. Transformation
  5. Serving
13
Q

What are the five Vs of data?

A

Variety
Volume
Velocity
Veracity / Validity
Value

14
Q

What AWS services support stateful data transfer?

A

Amazon ElastiCache
Amazon RDS

15
Q

What AWS services support stateless data transfer?

A

Lambda
API Gateway
Amazon S3

16
Q

What should you look at when troubleshooting and optimizing performance?

A
  1. Bottlenecks
  2. High processing times, memory usage, or I/O operations
  3. Algorithms, partitioning strategy, or parallel processing
  4. Resource allocation
  5. Caching
17
Q

What steps should you go through when troubleshooting problems in a pipeline?

A
  1. Check logs
  2. Verify data at different stages of the process
  3. Implement incremental processing
  4. Add retries
  5. Test
18
Q

What is needed to set up a JDBC (Java Database Connectivity) connection?

A
  1. Prepare the JDBC driver
  2. Configure security groups
  3. Use a connection URL, username, and password (see the sketch below)
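One hedged way to wire those pieces together on AWS is an AWS Glue JDBC connection. Everything below (connection name, JDBC URL, credentials, subnet, and security group) is a placeholder, and in practice the password would come from AWS Secrets Manager rather than code.

```python
# Hedged sketch: registering a JDBC connection in AWS Glue with boto3, using
# the URL/username/password pieces from the card. All identifiers are
# placeholders; store real credentials in AWS Secrets Manager.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "orders-postgres",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://db.example.internal:5432/orders",
            "USERNAME": "etl_user",
            "PASSWORD": "replace-me",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
        },
    }
)
```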
19
Q

What is needed to set up an ODBC (Open Database Connectivity) connection?

A
  1. Select an ODBC driver
  2. Install an ODBC driver
  3. Set up an ODBC data source name (DSN)
  4. Specify the DSN when connecting
20
Q

What services exist to help cache data in your APIs?

A
  1. ElastiCache
  2. API Gateway
  3. Amazon CloudWatch - to monitor and track API usage (see the sketch below)
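A small, hypothetical boto3 sketch of the API Gateway piece: enabling a stage cache and a response TTL. The REST API ID, stage name, cache size, and TTL are assumptions.

```python
# Hedged sketch: enabling API Gateway response caching for a stage with boto3.
# The REST API ID, stage name, cache size, and TTL are placeholders.
import boto3

apigw = boto3.client("apigateway")

apigw.update_stage(
    restApiId="a1b2c3d4e5",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/cacheClusterEnabled", "value": "true"},
        {"op": "replace", "path": "/cacheClusterSize", "value": "0.5"},
        # Enable caching on all methods and keep responses for 5 minutes.
        {"op": "replace", "path": "/*/*/caching/enabled", "value": "true"},
        {"op": "replace", "path": "/*/*/caching/ttlInSeconds", "value": "300"},
    ],
)
```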
21
Q

How can you scale an API?

A
  1. Use auto-scaling in Lambda functions
  2. Configure API Gateway to handle high concurrency and availability requirements.
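A hedged sketch of two common scaling knobs for this setup: reserving concurrency on the backend Lambda function and throttling the API Gateway stage. The function name, API ID, stage, and limits are placeholders.

```python
# Hedged sketch: reserve Lambda concurrency for the backend function and set
# stage-level throttling on API Gateway. All names and limits are placeholders.
import boto3

lam = boto3.client("lambda")
apigw = boto3.client("apigateway")

# Cap (and guarantee) concurrency for the function behind the API.
lam.put_function_concurrency(
    FunctionName="orders-api-handler",
    ReservedConcurrentExecutions=200,
)

# Stage-level throttling so traffic bursts degrade gracefully.
apigw.update_stage(
    restApiId="a1b2c3d4e5",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/*/*/throttling/rateLimit", "value": "500"},
        {"op": "replace", "path": "/*/*/throttling/burstLimit", "value": "1000"},
    ],
)
```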
22
Q

What are the two types of data architecture?

A

Operational architecture - the functional requirements
Technical architecture - how the data is ingested, stored, transformed, and served.

23
Q

What types of issues can cause a pipeline failure?

A
  1. Data quality problems - e.g., wrong file format
  2. Code errors
  3. Endpoint errors - e.g., a service is temporarily offline
  4. Dependency errors
24
Q

Which services can you use for data pipeline orchestration?

A
  1. AWS Data Pipeline
  2. AWS Glue
  3. AWS Step Functions
  4. Amazon MWAA (Managed Workflows for Apache Airflow)
25
Q

What are best practices for data pipelines?

A
  1. Distributed processing (Spark)
  2. Auto scaling
  3. Data partitioning
  4. Fault-tolerant storage (Amazon S3 and Amazon EFS)
  5. Backups (Amazon S3 or AWS Backup)
  6. Monitoring (Amazon CloudWatch)
  7. Validation and quality checks
  8. Automated testing
  9. CI/CD
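As an example of how several of these practices combine in code, here is a small, hypothetical PySpark sketch (the kind of script a Glue or EMR job might run) that deduplicates records, partitions by date, and writes Parquet to S3. The bucket paths and column names are made up.

```python
# Hedged PySpark sketch tying a few best practices together: distributed
# processing with Spark, a basic quality check, date-based partitioning, and
# durable columnar output in S3. Buckets and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

orders = spark.read.json("s3://example-raw-bucket/orders/")

(
    orders
    .dropDuplicates(["order_id"])                     # basic quality check
    .write
    .mode("append")
    .partitionBy("order_date")                        # data partitioning
    .parquet("s3://example-curated-bucket/orders/")   # fault-tolerant storage
)
```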
26
Q

A finance company has developed a machine learning (ML) model to enhance its investment strategy. The model uses various sources of data about stock, bond, and commodities markets. The model has been approved for production. A data engineer must ensure that the data being used to run ML decisions is accurate, complete, and trustworthy. The data engineer must automate the data preparation for the model’s production deployment.

Which solution will meet these requirements?

A

Use Amazon SageMaker ML Lineage Tracking. It creates and stores information about the steps of an ML workflow, which lets you establish model governance and audit standards and helps ensure that the data used to drive ML decisions is accurate, complete, and trustworthy.

27
Q

An ecommerce company runs several applications on AWS. The company wants to design a centralized streaming log ingestion solution. The solution needs to be able to convert the log files to Apache Parquet format. Then, the solution must store the log files in Amazon S3. The number of log files being created varies throughout the day. A data engineer must configure a solution that ensures the log files are delivered in near real time.

Which solution will meet these requirements with the LEAST operational overhead?

A

Configure the applications to send the log files to Kinesis Data Firehose. Configure Firehose to invoke a Lambda function that converts the log files to Parquet format. Configure Firehose to deliver the Parquet files to an output S3 bucket.

You can use Kinesis Data Firehose to deliver log files to Amazon S3 with the least operational overhead. You can use a data-transformation Lambda function with Kinesis Data Firehose. This solution can convert log files to the correct format before the log files are delivered to Amazon S3.
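For reference, a hedged skeleton of the Firehose transformation Lambda contract: each record must come back with its recordId, a result status, and base64-encoded data. This sketch only normalizes the JSON log line; an actual Parquet conversion would typically rely on a library such as pyarrow or on Firehose's built-in record format conversion.

```python
# Hedged sketch of a Firehose data-transformation Lambda. Each incoming record
# carries base64-encoded data and must be returned with the same recordId, a
# result of Ok/Dropped/ProcessingFailed, and base64-encoded output data.
import base64
import json


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        try:
            log = json.loads(payload)
            log["ingested_by"] = "firehose-transform"   # example enrichment
            transformed = json.dumps(log).encode("utf-8")
            result = "Ok"
        except json.JSONDecodeError:
            transformed = payload
            result = "ProcessingFailed"
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```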

28
Q

A company has deployed a data pipeline that uses AWS Glue to process records. The records include a JSON-formatted event and can sometimes include base64-encoded images. The AWS Glue job is configured with 10 data processing units (DPUs). However, the AWS Glue job regularly scales to several hundred DPUs and can take a long time to run.

A data engineer must monitor the data pipeline to determine the appropriate DPU capacity.

Which solution will meet these requirements?

A

Inspect the job monitoring section of the AWS Glue console. Review the results of the previous job runs. Visualize the profiled metrics to determine the appropriate number of DPUs.

You can use the job run monitoring section of the AWS Glue console to determine the appropriate DPU capacity for this scenario. The job monitoring section of the AWS Glue console uses the results of previous job runs to determine the appropriate DPU capacity.
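If you prefer the API to the console, a hypothetical boto3 sketch that pulls recent runs of a placeholder job and compares allocated capacity with reported consumption:

```python
# Hedged sketch: list recent Glue job runs and print allocated vs. consumed
# capacity. The job name is a placeholder; DPUSeconds is only present when
# Glue reports it for a run.
import boto3

glue = boto3.client("glue")

runs = glue.get_job_runs(JobName="image-event-etl", MaxResults=10)["JobRuns"]
for run in runs:
    print(
        run["Id"],
        run.get("ExecutionTime"),   # seconds
        run.get("MaxCapacity"),     # DPUs allocated
        run.get("DPUSeconds"),      # DPU-seconds actually consumed, if reported
    )
```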

29
Q

An Amazon Kinesis application is trying to read data from a Kinesis data stream. However, the read data call is rejected. The following error message is displayed: ProvisionedThroughputExceededException.

Which combination of steps will resolve the error? (Select TWO.)

A

  1. Increase the number of shards within the stream to provide enough capacity for the read data calls.
  2. Make the application retry reading data from the stream.

The ProvisionedThroughputExceededException error is caused by the data stream exceeding its provisioned capacity quotas, often after a sustained rise in the stream's output data rate. Increasing the number of shards provides enough capacity for the read data calls to consistently succeed, and retrying the reads (ideally with exponential backoff) lets the application's requests eventually complete.
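A hedged boto3 sketch of both fixes: raising the shard count with UpdateShardCount and retrying throttled GetRecords calls with exponential backoff. The stream name, target shard count, and shard-iterator handling are placeholders.

```python
# Hedged sketch of both fixes for ProvisionedThroughputExceededException:
# add capacity with UpdateShardCount and retry throttled reads with backoff.
# Stream name, target count, and shard-iterator handling are simplified.
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

# Fix 1: add read capacity by increasing the number of shards.
kinesis.update_shard_count(
    StreamName="app-event-stream",
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)

# Fix 2: retry GetRecords calls that are throttled, backing off between attempts.
def get_records_with_retry(shard_iterator, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
        except ClientError as error:
            if error.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("GetRecords kept getting throttled")
```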