Chapter 1: Fundamentals of Data Engineering Flashcards

1
Q

What questions should I ask if I want to display information to my end user?

A

Who will consume the data? What data sources should I use? Where should I store the data? When should the data arrive? Why does the data need to be stored in this place? How should the data be processed?

2
Q

What is the first thing to learn as a data engineer?

A

The data lifecycle

3
Q

What is it called when data is stored in different places?

A

Data silos

4
Q

What is a data silo?

A

A data silo exists when data is stored in different, independent places

5
Q

What are the four logical building blocks in a data warehouse?

A

SQL interface, schema, compute, and storage

6
Q

What is a data lake?

A

A centralized repository that allows you to store all your structured and unstructured data

7
Q

Is a schema mandatory for a data lake?

A

No. Why?

8
Q

Where is a schema mandatory: data lake or data warehouse?

A

Data warehouse

9
Q

What are two benefits of a data lake?

A

Scalability and cheap storage

10
Q

Data in a data warehouse is modeled for a …?

A

Business purpose

11
Q

What is one of the key requirements for building a data warehouse?

A

Knowing the business requirements

12
Q

The frontend application in most cases acts as what?

A

The first upstream data source

13
Q

What is a data mart?

A

A data mart is an area for storing data that serves specific user groups

14
Q

Each data mart is usually under the control of …?

A

Each department within an organization

15
Q

What are the three most common uses for the last stage in the data lifecycle?

A

Reporting and dashboards, ad hoc queries, and machine learning

16
Q

Where should data be stored only based on business needs?

A

Data warehouse

17
Q

Where is a schema mandatory?

A

Data warehouse

18
Q

A data engineer is someone who…?

A

Designs and builds data pipelines

19
Q

What is a job orchestrator?

A

Software that defines job dependencies and a schedule, and runs data movement from upstream to downstream. Why?

20
Q

What is ETL?

A

Extract, Transform, Load. Why?

21
Q

What is the difference between ETL and ELT?

A

ETL transforms the data before it is loaded into the downstream system; ELT loads the data into the downstream system and then makes the necessary transformations. Why?
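
The difference in ordering can be sketched in Python. This is a minimal illustration, not a production pipeline: the `SOURCE_ROWS` data, the table names, and the use of an in-memory SQLite database as a stand-in for the downstream system are all assumptions made for the example.

```python
import sqlite3

# Hypothetical upstream rows: (order_id, amount stored as text).
SOURCE_ROWS = [(1, "10.50"), (2, "3.25"), (3, "7.00")]


def run_etl(rows):
    """ETL: transform outside the target first, then load the clean rows."""
    # Transform step happens in Python, before anything is loaded.
    transformed = [(order_id, float(amount)) for order_id, amount in rows]
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    target.executemany("INSERT INTO orders VALUES (?, ?)", transformed)
    return target.execute("SELECT SUM(amount) FROM orders").fetchone()[0]


def run_elt(rows):
    """ELT: load the raw rows as-is, then transform inside the target."""
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT)")
    target.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    # Transform step happens in the target system, using its SQL engine.
    target.execute(
        "CREATE TABLE orders AS "
        "SELECT order_id, CAST(amount AS REAL) AS amount FROM raw_orders"
    )
    return target.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Both functions end with the same `orders` table; only the place where the transformation runs differs.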

22
Q

What is extract?

A

This is the step to get the data from the upstream system. Why?

23
Q

What is transform?

A

This is the step to apply any transformations to the extracted data. Why?

24
Q

What is load?

A

This is the step to put the transformed data into a downstream system. Why?

25
Q

All things in the data lifecycle are…?

A

ETL

26
Q

Any part of the data lifecycle that happens from upstream to downstream is …?

A

ETL. Why?

27
Q

Big data technology needs to …?

A

Be able to distribute the data across multiple servers. Why?

28
Q

What is the key foundation of data engineering?

A

ETL. Why?

29
Q

Not all systems can ____ a large dataset

A

Transform. Why?

30
Q

What is the common term for multiple servers that are working together?

A

Cluster. Why?

31
Q

What is a cluster?

A

Multiple servers working together. Why?

32
Q

In a distributed filesystem, a large file will be …?

A

Split into multiple small parts. Why?
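
As a rough illustration of the splitting idea, the sketch below cuts a byte string into fixed-size parts. The function name `split_into_blocks` and the tiny block size are hypothetical; real distributed filesystems use much larger blocks and also decide which server stores each part, which this sketch does not model.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split data into consecutive parts of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Step through the data one block at a time; the last part may be shorter.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Usage: a 10-byte "file" split into parts of up to 4 bytes.
blocks = split_into_blocks(b"abcdefghij", 4)
```

Joining the parts back in order reproduces the original file, which is what lets each part be stored and processed on a different server.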

33
Q

What are the four steps to mapreduce?

A

Map, Shuffle, Reduce, Result. Why?
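
The four steps can be sketched with a word count, a common teaching example. This is a single-process sketch under stated assumptions: the function names are hypothetical, and a real MapReduce framework would run the map and reduce work in parallel across the servers of a cluster rather than in one loop.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one file part."""
    return [(word, 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one number."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map, shuffle, and reduce, then return the result."""
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle_phase(mapped)
    return reduce_phase(grouped)


# Usage: each string stands for one part of a larger file.
result = word_count(["a b a", "b c"])
```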

34
Q

What is the use case of MapReduce?

A

To process data stored in a distributed filesystem. Why?

35
Q

MapReduce happens in …?

A

Parallel. Why?

36
Q

MapReduce always maintains …?

A

Three parallel boxes when performing map, shuffle, and reduce. Why?

37
Q

What is the definition of a data lake?

A

A centralized repository that allows you to store all your structured and unstructured data

38
Q

Outline or draw a detailed data lifecycle diagram

A
39
Q

The last stage of data will come back to humans as what?

A

Information

40
Q

The application is the interface from ____ to ____?

A

The application is the interface from human to the machine.

Why?

41
Q

A data engineer is someone who designs and builds…?

A

Data pipelines

42
Q

Explain what a job orchestrator is:

A

A program that defines job dependencies and a schedule, and runs data movement from upstream to downstream
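
The dependency part of that definition can be sketched with Python's standard-library `graphlib` module (Python 3.9+). This is only an illustration: the `run_jobs` helper and the job names are hypothetical, and a real orchestrator also handles scheduling, retries, and monitoring, which are left out here.

```python
from graphlib import TopologicalSorter


def run_jobs(dependencies, jobs):
    """Run each job only after the jobs it depends on have finished.

    dependencies maps a job name to the set of job names it waits for.
    jobs maps a job name to a zero-argument callable.
    """
    # static_order() yields upstream jobs before the jobs that need them.
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        jobs[name]()
    return order


# Usage: a hypothetical three-step pipeline from upstream to downstream.
log = []
pipeline_jobs = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
pipeline_dependencies = {"transform": {"extract"}, "load": {"transform"}}
order = run_jobs(pipeline_dependencies, pipeline_jobs)
```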

43
Q

What are the focus points of a data engineer?

A
44
Q

Why is designing and building a data mart not as simple as it seems?

A
45
Q

What does ETL stand for?

A

Extract, Transform, Load

46
Q

What is big data?

A

Any data that’s large enough that it needs to be distributed across several servers in order for you to run calculations

47
Q

Draw the diagram for ETL

A
48
Q

What is extract?

A

This is the step to get the data from the upstream system

49
Q

What is transform?

A

This is the step to apply any transformations to the extracted data

50
Q

Why is every step in the data lifecycle ETL?

A
51
Q

Draw the diagram for ELT

A
52
Q

You have two options of where to transform your data. What are they?

A

In an intermediary system or in the target system.