Chapter 1: Fundamentals of Data Engineering Flashcards
What questions should I be asking if I want to display information to my end user?
Who will consume the data? What data sources should I use? Where should I store the data? When should the data arrive? Why does the data need to be stored in this place? How should the data be processed?
What is the first thing to learn as a data engineer?
The data lifecycle
What is it called when data is stored in different places?
Data silos
What is a data silo?
Data that is stored in different, independent places
What are the four logical building blocks in a data warehouse?
sql interface, schema, compute, and storage
What is a data lake?
A centralized repository that allows you to store all your structured and unstructured data
Is a schema mandatory for a data lake?
No. Why?
Where is a schema mandatory: a data lake or a data warehouse?
data warehouse
What are two benefits of a data lake?
Scalability and cheap storage
Data in a data warehouse is modeled for a …?
business purpose
What is one of the key requirements for building a data warehouse?
know the business requirements
In most cases, the frontend application acts as what?
The first upstream data source
What is a data mart?
A data mart is an area for storing data that serves specific user groups
Each data mart is usually under the control of …?
Each department within an organization
What are the three most common uses for the last stage in the data lifecycle?
Reporting and dashboards, ad hoc queries, machine learning
Based on business needs, data should only be stored where?
data warehouse
Where is a schema mandatory?
data warehouse
A data engineer is someone who …?
designs and builds data pipelines
What is a Job Orchestrator?
A program that designs and builds job dependencies and a scheduler that runs data movement from upstream to downstream. Why?
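A minimal sketch of the dependency part of an orchestrator, assuming four hypothetical jobs (the job names are invented for illustration). It only works out a valid run order; a real orchestrator would also schedule the jobs and retry failures.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key depends on the jobs listed in its set.
dependencies = {
    "extract_orders": set(),
    "transform_orders": {"extract_orders"},
    "load_warehouse": {"transform_orders"},
    "refresh_dashboard": {"load_warehouse"},
}

def run_order(deps):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
for job in order:
    # A real orchestrator would trigger the job here instead of printing.
    print(f"running {job}")
```

Upstream jobs always appear before the downstream jobs that depend on them.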
What is ETL?
Extract Transform Load. Why?
What is the difference between ETL and ELT?
In ETL, you make your transformations before the data is loaded into the downstream system. In ELT, you load the data into the downstream system first, then make the necessary transformations there. Why?
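The difference can be sketched in a few lines, assuming an in-memory SQLite database stands in for the downstream system and `raw_rows` stands in for the upstream data (both are illustrative choices, not part of the cards). The ETL version transforms in Python before loading; the ELT version loads the raw rows and transforms with SQL inside the target.

```python
import sqlite3

# Hypothetical upstream rows: (customer, amount stored as text).
raw_rows = [("ann", "10.5"), ("bob", "4.5"), ("ann", "2.0")]

def etl(rows):
    """ETL: transform outside the target, then load the finished result."""
    totals = {}
    for name, amount in rows:
        totals[name] = totals.get(name, 0.0) + float(amount)
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE totals (customer TEXT, total REAL)")
    db.executemany("INSERT INTO totals VALUES (?, ?)", sorted(totals.items()))
    return db.execute(
        "SELECT customer, total FROM totals ORDER BY customer"
    ).fetchall()

def elt(rows):
    """ELT: load the raw rows first, then transform inside the target."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw (customer TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw VALUES (?, ?)", rows)
    db.execute(
        "CREATE TABLE totals AS "
        "SELECT customer, SUM(CAST(amount AS REAL)) AS total "
        "FROM raw GROUP BY customer"
    )
    return db.execute(
        "SELECT customer, total FROM totals ORDER BY customer"
    ).fetchall()

etl_result = etl(raw_rows)
elt_result = elt(raw_rows)
```

Both functions end with the same `totals` table; what changes is where the transformation runs and whether the raw rows are kept in the target.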
What is extract?
This is the step to get the data from the upstream system. Why?
What is transform?
This is the step to apply any transformations to the extracted data. Why?
What is load?
This is the step to put the transformed data into a downstream system. Why?
All things in the data lifecycle are…?
ETL
Any part of the data lifecycle that happens from upstream to downstream is …?
ETL … Why?
Big data technology needs to …?
Be able to distribute the data across multiple servers. Why?
What is the key foundation of data engineering?
ETL - Why ?
Not all systems can ____ a large dataset
Transform. Why?
What is the common term for multiple servers that are working together?
Cluster. Why?
What is a cluster?
multiple servers working together. Why?
In a distributed filesystem, a large file will be ….?
Split into multiple small parts. Why?
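A rough sketch of the idea, assuming a tiny block size and round-robin placement across two hypothetical servers. Real distributed filesystems use much larger blocks and their own placement and replication rules.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the file contents into fixed-size parts (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks: list[bytes], servers: list[str]) -> dict[str, list[int]]:
    """Place block indexes on servers in round-robin order (an assumption)."""
    placement = {server: [] for server in servers}
    for index in range(len(blocks)):
        placement[servers[index % len(servers)]].append(index)
    return placement

data = b"abcdefghij"
blocks = split_into_blocks(data, block_size=4)
placement = assign_blocks(blocks, ["server-1", "server-2"])
print(blocks)
print(placement)
```

Because each server holds only some of the parts, the servers can work on their parts at the same time.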
What are the four steps of MapReduce?
Map, Shuffle, Reduce, Result. Why?
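The four steps can be sketched as a single-process word count, assuming three hypothetical text chunks stand in for the parts of a distributed file. On a real cluster the map and reduce calls would run in parallel on different servers; here they run one after another.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into one number."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big", "data lake", "big lake"]
mapped = [map_phase(chunk) for chunk in chunks]
result = reduce_phase(shuffle_phase(mapped))  # Result: the final counts
print(result)
```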
What is the use case of MapReduce?
To process data stored in a distributed filesystem. Why?
MapReduce happens in ….?
Parallel. Why?
MapReduce always maintains …?
Three parallel boxes when performing map, shuffle, and reduce. Why?
What is the definition of a data lake?
A centralized repository that allows you to store all your Structured and unstructured data 
Outline or draw a detailed data lifecycle diagram
The last stage of data will come back to humans as what?
Information
The application is the interface from blank to blank?
The application is the interface from human to the machine.
Why?
A data engineer is someone who designs and builds …?
Data pipelines
Explain what a job orchestrator is:
A program or piece of software that designs and builds job dependencies and a scheduler that runs data movement from upstream to downstream
What are the focus points of a data engineer?
Why is designing and building a data mart not as simple as it seems?
What does ETL stand for?
Extract, Transform, Load
What is big data?
Any data that’s large enough that it needs to be distributed across several servers in order for you to run calculations
Draw the diagram for ETL
What is extract?
This is the step to get the data from the upstream system
What is transform?
This is a step to apply any transformation to the extracted data 
Why is every step in the data lifecycle ETL?
Draw the diagram for ELT
You have two options of where to transform your data. What are they?
In an intermediary system or in the target system.