Data Ingestion and Transformation Flashcards
What are the five steps to data discovery in a project?
- Define Business Value
- Identify the data consumers
- Identify your data sources
- Define your storage, catalog, and access needs
- Define your processing needs
What is the best definition of data discovery?
The process of finding and understanding relevant data sources within an organization and the relationships between them
What are the stages of a workflow in modern data architecture?
- Ingest
- Storage
- Catalog
- Process
- Deliver
- Security and governance (spans all five of the other stages)
A data engineer at a large e-commerce company has built various data processing pipelines on AWS that need to run on daily, weekly, and monthly schedules. They want to implement an orchestration layer to automate the scheduling and operation of these pipelines. Which tools would BEST fit this requirement?
AWS Step Functions with AWS Lambda and AWS Glue to schedule and automate data-driven workflows
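As a rough illustration of the answer above, a Step Functions workflow is described in Amazon States Language (JSON). The sketch below builds a minimal two-step definition as a Python dictionary. The function name `build_pipeline_definition`, the Lambda ARN, and the Glue job name are placeholders chosen for this example, not values from any real account.

```python
import json


def build_pipeline_definition(lambda_arn: str, glue_job_name: str) -> dict:
    """Return an Amazon States Language definition with two sequential tasks."""
    return {
        "Comment": "Validate input with Lambda, then run a Glue job",
        "StartAt": "ValidateInput",
        "States": {
            "ValidateInput": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": lambda_arn, "Payload.$": "$"},
                "Next": "RunGlueJob",
            },
            "RunGlueJob": {
                "Type": "Task",
                # The .sync suffix makes Step Functions wait for the job to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "End": True,
            },
        },
    }


# Placeholder identifiers, assumed for illustration only.
definition = build_pipeline_definition(
    "arn:aws:lambda:us-east-1:123456789012:function:validate-input",
    "daily-sales-etl",
)
print(json.dumps(definition, indent=2))
```

The definition only describes the steps. The daily, weekly, or monthly trigger would typically come from a separate schedule, for example an Amazon EventBridge rule that starts the state machine.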
What are the key benefits to using a CI/CD pipeline?
- Faster time-to-insight by automating data pipeline builds, tests, and deployments
- Improved data quality and reliability by catching issues early in the development process
- Reduced operational overhead by automatically managing infrastructure and deployments
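To make "catching issues early" concrete, here is a minimal sketch of the kind of automated check a CI stage might run on every commit. The transformation `normalize_order` and its field names are hypothetical, invented for this example.

```python
def normalize_order(record: dict) -> dict:
    """Hypothetical transformation: trim the ID and convert the amount to cents."""
    order_id = record["order_id"].strip()
    if not order_id:
        raise ValueError("order_id must not be empty")
    return {
        "order_id": order_id,
        "amount_cents": round(float(record["amount"]) * 100),
    }


def test_normalize_order_converts_amount():
    result = normalize_order({"order_id": " A1 ", "amount": "19.99"})
    assert result == {"order_id": "A1", "amount_cents": 1999}


def test_normalize_order_rejects_blank_id():
    try:
        normalize_order({"order_id": "   ", "amount": "1.00"})
    except ValueError:
        return
    raise AssertionError("expected ValueError for a blank order_id")


# A test runner such as pytest would discover these; here they are called directly.
test_normalize_order_converts_amount()
test_normalize_order_rejects_blank_id()
print("all checks passed")
```

If a later change breaks the conversion or drops the validation, the pipeline fails at the test stage instead of after bad data reaches consumers.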
Define Infrastructure as Code
IaC refers to managing and provisioning infrastructure through machine-readable definition files instead of physical hardware configuration or interactive configuration tools.
What sections exist in a CloudFormation template?
- Format version and description (AWSTemplateFormatVersion, Description)
- Parameters
- Resources
- Outputs
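The sections listed above can be seen together in a small template. The sketch below builds one as a Python dictionary and prints it as JSON. The logical names `BucketName`, `RawDataBucket`, and `RawDataBucketArn` are made up for illustration, and real templates may also use optional sections not listed on this card.

```python
import json

template = {
    # Format version and description
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Example: a single S3 bucket for raw data",
    # Parameters: values supplied when the stack is created
    "Parameters": {
        "BucketName": {
            "Type": "String",
            "Description": "Name of the raw-data bucket",
        }
    },
    # Resources: the only section CloudFormation requires
    "Resources": {
        "RawDataBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": {"Ref": "BucketName"}},
        }
    },
    # Outputs: values exposed after the stack is created
    "Outputs": {
        "RawDataBucketArn": {
            "Description": "ARN of the raw-data bucket",
            "Value": {"Fn::GetAtt": ["RawDataBucket", "Arn"]},
        }
    },
}

print(json.dumps(template, indent=2))
```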
What are the main benefits of using AWS Serverless Application Model?
- Simplified development and deployment process
- Local testing capabilities
- Seamless AWS services and resources integration
What are the three key components of AWS Site-to-Site VPN?
- Virtual private gateway (or transit gateway) on the AWS side
- Customer gateway on the on-premises side
- An encrypted VPN connection
What are spot instances?
Spot Instances are spare Amazon EC2 capacity offered at a significant discount compared to On-Demand pricing. Prices adjust with supply and demand, and you can optionally set a maximum price you’re willing to pay. Because AWS can reclaim the capacity on short notice, Spot Instances suit workloads that tolerate interruptions, such as data preprocessing, model training, or batch analytics jobs.
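One common way to make a batch job tolerate interruptions is to checkpoint progress so a replacement instance can resume where the last one stopped. The sketch below simulates that idea locally. The function `process_batch`, the `interrupt_after` parameter, and the checkpoint file layout are assumptions for illustration and do not call any AWS API.

```python
import json
import os
import tempfile


def process_batch(items, checkpoint_path, interrupt_after=None):
    """Process items in order, recording progress so a restart can resume."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    results = []
    for i in range(start, len(items)):
        # Simulated interruption: stop after a fixed number of items in this run.
        if interrupt_after is not None and i - start >= interrupt_after:
            return results, False
        results.append(items[i] * 2)  # stand-in for the real work
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)
    return results, True


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
data = [1, 2, 3, 4, 5]

# First run is interrupted after two items; the second run resumes at item three.
first, first_done = process_batch(data, checkpoint, interrupt_after=2)
second, second_done = process_batch(data, checkpoint)
print(first, first_done)
print(second, second_done)
```

In a real deployment the checkpoint would live in durable storage outside the instance, since local disk is lost when the instance is reclaimed.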
What is the primary purpose of AWS PrivateLink?
To provide private connectivity between VPCs, AWS services, and on-premises networks without exposing traffic to the public internet
What are the stages of the data engineering lifecycle?
- Generation
- Storage
- Ingestion
- Transformation
- Serving
What are the five Vs of data?
- Variety
- Volume
- Velocity
- Veracity / Validity
- Value
What AWS services support stateful data transfer?
- Amazon ElastiCache
- Amazon RDS
What AWS services support stateless data transfer?
- AWS Lambda
- Amazon API Gateway
- Amazon S3
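The stateless and stateful distinction in the last two cards can be shown with a small contrast. The sketch below is a general illustration, not AWS-specific code: `handler` mimics the shape of a Lambda-style function whose output depends only on its input, while the hypothetical `RunningTotal` class keeps state between calls.

```python
def handler(event, context=None):
    """Stateless: the result depends only on the incoming event."""
    items = event.get("items", [])
    return {"count": len(items), "total": sum(items)}


class RunningTotal:
    """Stateful: each result depends on what earlier calls stored."""

    def __init__(self):
        self.total = 0

    def add(self, items):
        self.total += sum(items)
        return self.total


# The same event always produces the same response from the stateless handler.
print(handler({"items": [1, 2, 3]}))
print(handler({"items": [1, 2, 3]}))

# The same input produces different results from the stateful object.
tracker = RunningTotal()
print(tracker.add([1, 2, 3]))
print(tracker.add([1, 2, 3]))
```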