Data Ingestion and Transformation Flashcards
What are the five steps to data discovery in a project?
- Define the business value
- Identify the data consumers
- Identify your data sources
- Define your storage, catalog, and access needs
- Define your processing needs
What is the best definition of data discovery?
The process of finding and understanding relevant data sources within an organization and the relationships between them
What are the stages of a workflow in modern data architecture?
- Ingest
- Storage
- Catalog
- Process
- Deliver
- Security and Governance (spans all five of the other stages)
A data engineer at a large e-commerce company has built various data processing pipelines on AWS that need to run on daily, weekly, and monthly schedules. They want to implement an orchestration layer to automate the scheduling and operation of these pipelines. Which tools would BEST fit this requirement?
AWS Step Functions with AWS Lambda and AWS Glue to schedule and automate data-driven workflows
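As a rough illustration of the answer above, the sketch below writes a two-step Step Functions state machine in Amazon States Language as a Python dict: a Glue job followed by a Lambda call. The job name, function name, and state names are hypothetical placeholders. The schedule expressions assume an Amazon EventBridge schedule starts the state machine, which the card itself does not state.

```python
import json

# Hypothetical state machine: "example-daily-etl" and "example-notify"
# are placeholder names, not real resources.
STATE_MACHINE = {
    "Comment": "Sketch: run a Glue job, then invoke a Lambda function",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # The .sync suffix makes the state wait for the job to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "example-daily-etl"},
            "Retry": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "NotifyDownstream",
        },
        "NotifyDownstream": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "example-notify"},
            "End": True,
        },
    },
}

# Assumed EventBridge cron expressions (UTC) for the three cadences.
SCHEDULES = {
    "daily": "cron(0 2 * * ? *)",
    "weekly": "cron(0 2 ? * MON *)",
    "monthly": "cron(0 2 1 * ? *)",
}


def execution_order(definition):
    """Follow Next links from StartAt (linear workflows only, no branching)."""
    order = []
    name = definition["StartAt"]
    while name is not None:
        order.append(name)
        state = definition["States"][name]
        name = None if state.get("End") else state.get("Next")
    return order


definition_json = json.dumps(STATE_MACHINE, indent=2)
print(execution_order(STATE_MACHINE))
```

The JSON string in definition_json is the form a state machine definition takes when it is deployed. The helper only checks that the states chain together as intended.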
What are the key benefits to using a CI/CD pipeline?
- Faster time-to-insight by automating data pipeline builds, tests, and deployments
- Improved data quality and reliability by catching issues early in the development process
- Reduced operational overhead by automatically managing infrastructure and deployments
Define Infrastructure as Code
IaC refers to managing and provisioning infrastructure through machine-readable definition files instead of physical hardware configuration or interactive configuration tools.
What sections exist in a CloudFormation template?
- Format version (AWSTemplateFormatVersion) and Description
- Parameters
- Resources
- Outputs
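A minimal sketch of a template with the sections listed above, built as a Python dict and serialized to JSON. The bucket, parameter, and output names are hypothetical. Real templates may also contain optional sections such as Mappings and Conditions, which are left out here.

```python
import json

# Hypothetical template: one S3 bucket whose name comes from a parameter.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Sketch: one S3 bucket for raw ingested data",
    "Parameters": {
        "BucketName": {
            "Type": "String",
            "Description": "Name for the raw data bucket",
        }
    },
    # Resources is the only section a template is required to have.
    "Resources": {
        "RawDataBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": {"Ref": "BucketName"}},
        }
    },
    "Outputs": {
        "RawDataBucketArn": {
            "Value": {"Fn::GetAtt": ["RawDataBucket", "Arn"]}
        }
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

Because the infrastructure is described in a file like this, it can be versioned and reviewed like any other code, which is the point of Infrastructure as Code.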
What are the main benefits of using AWS Serverless Application Model?
- Simplified development and deployment process
- Local testing capabilities
- Seamless AWS services and resources integration
What are the three key components of AWS Site-to-Site VPN?
- Virtual private gateway (or transit gateway) on the AWS side
- Customer gateway on the on-premises side
- An encrypted VPN connection
What are spot instances?
Spot Instances are spare Amazon EC2 computing capacity offered at significantly discounted prices compared to On-Demand Instance pricing. The prices fluctuate based on supply and demand, but you can set a maximum price you’re willing to pay. These instances can be used for batch processing jobs or non-critical workloads that can tolerate interruptions such as data preprocessing, model training, or batch analytics jobs.
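One way a batch job can tolerate interruptions is to checkpoint its progress and resume from there. The sketch below is only an illustration with hypothetical names: the checkpoint is an in-memory dict (a real job would keep it in durable storage), and should_stop stands in for whatever check the job uses to detect the Spot interruption notice.

```python
def process_batches(batches, checkpoint, handle, should_stop=lambda: False):
    """Process batches in order, recording progress after each one.

    Returns True when every batch is done, False if stopped early.
    """
    start = checkpoint.get("next_index", 0)
    for i in range(start, len(batches)):
        if should_stop():
            return False  # interrupted; checkpoint already holds progress
        handle(batches[i])
        checkpoint["next_index"] = i + 1
    return True


# Usage: simulate an interruption after two batches, then resume.
processed = []
checkpoint = {}
calls = {"n": 0}


def stop_after_two():
    calls["n"] += 1
    return calls["n"] > 2


batches = ["a", "b", "c", "d"]
first_run = process_batches(batches, checkpoint, processed.append, stop_after_two)
second_run = process_batches(batches, checkpoint, processed.append)
print(first_run, second_run, processed)
```

The second call picks up at the batch where the first one stopped, so no batch is processed twice.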
What is the primary purpose of AWS PrivateLink?
To provide private connectivity between VPCs, supported AWS services, and on-premises networks without exposing traffic to the public internet
What are the stages of the data engineering lifecycle?
- Generation
- Storage
- Ingestion
- Transformation
- Serving
What are the five Vs of data?
Variety
Volume
Velocity
Veracity / Validity
Value
What AWS services support stateful data transfer?
Amazon ElastiCache
Amazon RDS
What AWS services support stateless data transfer?
Lambda
API Gateway
Amazon S3
What should you look for when troubleshooting and optimizing performance?
- Bottlenecks
- High processing times, memory usage, or I/O operations
- Algorithms, partition strategy, or parallel processing
- Resource allocation
- Caching
What steps should you go through when troubleshooting problems in a pipeline?
- Check logs
- Verify data at different stages of process
- Implement incremental processing
- Add retries
- Test
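Two of the steps above, incremental processing and retries, can be sketched in a few lines. This is a minimal illustration, not a production pattern: the function names are hypothetical, records are assumed to carry an increasing integer id, and the sleep function is injectable so the example can run without waiting.

```python
import time


def run_with_retries(step, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call step(), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")  # stands in for logging
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def process_incrementally(records, last_processed_id):
    """Return only records newer than the watermark, plus the new watermark."""
    new = [r for r in records if r["id"] > last_processed_id]
    watermark = max((r["id"] for r in new), default=last_processed_id)
    return new, watermark


# Usage: a step that fails twice before succeeding.
attempts = {"n": 0}
delays = []


def flaky_step():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("endpoint temporarily offline")
    return "ok"


result = run_with_retries(flaky_step, sleep=delays.append)
new_records, watermark = process_incrementally(
    [{"id": 1}, {"id": 2}, {"id": 3}], last_processed_id=1
)
```

Retries help with transient failures such as a briefly unavailable endpoint. They will not fix bad data or code errors, which is why checking logs and verifying data at each stage come first.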
What is needed to set up a JDBC (Java Database Connectivity) connection?
- Prepare the JDBC driver
- Configure security groups
- Use a connection URL, username, and password
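For reference, a JDBC connection URL for PostgreSQL or MySQL has the form jdbc:engine://host:port/database. The helper below is a hypothetical sketch that only builds that string. Other databases (for example SQL Server or Oracle) use different URL layouts, and the username and password are normally passed separately as connection properties.

```python
def jdbc_url(engine, host, port, database):
    """Build a JDBC URL in the PostgreSQL/MySQL style (other engines differ)."""
    return f"jdbc:{engine}://{host}:{port}/{database}"


# Placeholder host and database names for illustration.
url = jdbc_url("postgresql", "db.example.internal", 5432, "sales")
print(url)
```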
What is needed to set up an ODBC (Open Database Connectivity) connection?
- Select an ODBC driver
- Install an ODBC driver
- Set up an ODBC data source name (DSN)
- Specify the DSN when connecting
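Once a DSN exists, a client refers to it in an ODBC connection string using the standard DSN, UID, and PWD keywords. The helper below is a hypothetical sketch that only assembles the string. It does not escape values containing semicolons or braces, and the DSN name is a placeholder.

```python
def odbc_connection_string(dsn, user=None, password=None):
    """Assemble a DSN-based ODBC connection string (no value escaping)."""
    parts = [f"DSN={dsn}"]
    if user is not None:
        parts.append(f"UID={user}")
    if password is not None:
        parts.append(f"PWD={password}")
    return ";".join(parts)


# Placeholder DSN and credentials for illustration.
conn_str = odbc_connection_string("ExampleWarehouse", "analyst", "s3cret")
print(conn_str)
```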
What services exist to help cache data in your APIs?
- ElastiCache
- API Gateway
- Amazon CloudWatch - monitors and tracks API usage
How can you scale an API?
- Use auto-scaling in Lambda functions
- Configure API Gateway to handle high concurrency and availability requirements.
What are the two types of data architecture?
Operational architecture - the functional requirements
Technical architecture - how the data is ingested, stored, transformed, and served.
What types of issues can cause a pipeline failure?
- Data quality problems - e.g., a wrong file format
- Code errors
- Endpoint errors - e.g., a service is temporarily offline
- Dependency errors
Which services can you use for data pipeline orchestration?
- AWS Data Pipeline
- AWS Glue
- Step Functions
- Amazon Managed Workflows for Apache Airflow (MWAA)