Data Engineering Flashcards

Question 1

Q

Deploy Data Lake (s)

Answer

A

Distributed Data Objects

Use S3 in AWS

Question 2

Q

Data Vault (s)

Answer

A

Schema on Read to Schema on Write
Consists of hubs, links, and satellites
Hubs hold business keys; links are tables that relate business keys to each other
Satellites store the descriptive data
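
A minimal sketch of the hub/link/satellite split, using an in-memory SQLite database. The table and column names (hub_customer, hub_product, link_purchase, sat_customer) and the sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hubs hold only business keys.
    CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY);
    CREATE TABLE hub_product  (product_key  TEXT PRIMARY KEY);
    -- A link records a relationship between business keys.
    CREATE TABLE link_purchase (
        customer_key TEXT REFERENCES hub_customer(customer_key),
        product_key  TEXT REFERENCES hub_product(product_key)
    );
    -- A satellite stores descriptive data for a hub, versioned by load date.
    CREATE TABLE sat_customer (
        customer_key TEXT REFERENCES hub_customer(customer_key),
        load_date    TEXT,
        name         TEXT,
        city         TEXT
    );
""")
conn.execute("INSERT INTO hub_customer VALUES ('C1')")
conn.execute("INSERT INTO hub_product VALUES ('P1')")
conn.execute("INSERT INTO link_purchase VALUES ('C1', 'P1')")
conn.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
    [("C1", "2024-01-01", "Ada", "London"),
     ("C1", "2024-06-01", "Ada", "Paris")],  # a later version of the same customer
)

def current_customer(key):
    """Return the most recent satellite row for a customer business key."""
    return conn.execute(
        "SELECT name, city FROM sat_customer "
        "WHERE customer_key = ? ORDER BY load_date DESC LIMIT 1",
        (key,),
    ).fetchone()
```

The hub row never changes when the customer moves; only a new satellite row is added, which keeps history without touching keys or links.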

Question 3

Q

Spark (s)

Answer

A

Cluster computing framework
Components: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX
Spark Core - task scheduling and dispatching, cluster computing, distributed processing
Spark SQL - abstracts data as DataFrames

Question 4

Q

Mesos

Answer

A

Cluster manager

Fine-grained resource sharing across the cluster

Question 5

Q

Akka

Answer

A

Actor-based, message-driven runtime
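
A toy sketch of the actor idea in Python (not the Akka API): an actor owns private state and a mailbox, and processes one message at a time. The class name CounterActor and the message names are hypothetical.

```python
import queue
import threading

class CounterActor:
    """A toy actor: private state, a mailbox, and one thread that handles messages in order."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # state touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message, reply_to=None):
        """Send a message asynchronously; the caller does not wait for processing."""
        self._mailbox.put((message, reply_to))

    def stop(self):
        """Ask the actor to finish and wait for its thread to exit."""
        self._mailbox.put(("stop", None))
        self._thread.join()

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)

def demo():
    """Send three increments, then ask for the count through a reply queue."""
    actor = CounterActor()
    for _ in range(3):
        actor.tell("increment")
    reply = queue.Queue()
    actor.tell("get", reply_to=reply)
    result = reply.get(timeout=5)
    actor.stop()
    return result
```

Because only the actor's thread touches the counter, no locks are needed; callers interact purely by sending messages.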

Question 6

Q

Cassandra

Answer

A

Distributed NoSQL wide-column database

Question 7

Q

Kafka

Answer

A

Messaging and streaming platform
An Apache open-source project; Confluent is a commercial vendor founded by Kafka's original developers
Kafka Core, Kafka Streams, Kafka Connect

Question 8

Q

CRISP-DM

HORUS

Answer

A

Data Lake Data Science Standard

A standard that reduces the number of converters required for source data by promoting a common data format.

Question 9

Q

Layered Approach (s)

Answer

A

Business, Utility, Operations Management, Audit/Balance/Control, and Functional layers

Question 10

Q

Functional Layer

Answer

A

Sun Model, Conformed Dimension, Availability, Capacity, Extendability, Interoperability

Question 11

Q

Data Democratization (s)

Answer

A

Treat all data as equally relevant and bring it into a single platform to cut through barriers of architecture, framework, and models.

Question 12

Q

Argument against data democratization

Answer

A

The lake will soon become a swamp.

This can be controlled with better governance.

Question 13

Q

In 3 words, what makes unstructured data complex?

Answer

A

Searching, Fetching, Consolidating

Question 14

Q

2 Main pieces of Data Ingestion

Answer

A

Data Collection and Data Integration

Question 15

Q

File Formats used in Data Lake

Answer

A

Parquet, ORC
Both are columnar file formats that support predicate pushdown through built-in statistics and indexes (per stripe in ORC, per row group in Parquet). Sorting data before insert gives better read performance but adds overhead during write.
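
A toy sketch of why sorting helps predicate pushdown. It is not the Parquet or ORC format itself: each "row group" here is just a Python dict with min/max statistics, and the function names are hypothetical.

```python
def write_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max statistics per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_equal(groups, target):
    """Return the matching rows and the number of groups that had to be read."""
    groups_read = 0
    matches = []
    for group in groups:
        if target < group["min"] or target > group["max"]:
            continue  # statistics prove there is no match, so the group is skipped
        groups_read += 1
        matches.extend(v for v in group["rows"] if v == target)
    return matches, groups_read

data = [7, 1, 9, 3, 8, 2, 6, 4, 5, 0, 11, 10]
unsorted_groups = write_row_groups(data, group_size=4)
# Sorting first is the extra write-time cost that buys tighter min/max ranges.
sorted_groups = write_row_groups(sorted(data), group_size=4)
```

With this sample data, a lookup for the value 5 has to read all three unsorted groups but only one sorted group, since the sorted groups have non-overlapping ranges.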

Question 16

Q

Data Ingestion from Relational Source

Answer

A

Almost the same as extraction for a data warehouse (DWH).
Use predicate filtering.
Capture changed records through a flag.
Perform incremental extraction using a timestamp or ID field.
Perform full extraction if the source volume is low.
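
A minimal sketch of timestamp-based incremental extraction, using an in-memory SQLite table as a stand-in for the relational source. The orders table, its columns, and the function name extract_incremental are hypothetical.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"),
     (2, 20.0, "2024-01-02T00:00:00"),
     (3, 30.0, "2024-01-03T00:00:00")],
)

def extract_incremental(conn, last_watermark):
    """Fetch rows changed after the watermark; return them with the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed, so the next run starts from the same point.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# First run: everything after the stored watermark.
first_batch, watermark = extract_incremental(source, "2024-01-01T00:00:00")
# A later change in the source is picked up by the next run only.
source.execute(
    "UPDATE orders SET amount = 15.0, updated_at = '2024-01-04T00:00:00' WHERE id = 1"
)
second_batch, watermark = extract_incremental(source, watermark)
```

The sketch relies on the source updating its timestamp column on every change; rows deleted at the source are not detected this way.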

Question 17

Q

Sqoop

Answer

A

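Bidirectional transfer between SQL databases and Hadoop (import and export)
Runs as map-only MapReduce jobs that read from or write to HDFS
--split-by <column> divides the rows across mappers
--autoreset-to-one-mapper falls back to a single mapper when the table has no primary key and no --split-by column is given
--split-by can also take a character-based column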
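
A simplified sketch of the idea behind --split-by on a numeric column: the column's min-to-max range is divided into contiguous slices, one per mapper. This is an approximation for illustration, not Sqoop's actual splitter code; the function names are hypothetical.

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide the inclusive integer range [min_val, max_val] into contiguous per-mapper ranges."""
    total = max_val - min_val + 1
    num_mappers = min(num_mappers, total)  # never create empty ranges
    base, extra = divmod(total, num_mappers)
    ranges = []
    lo = min_val
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

def where_clauses(column, ranges):
    """Build one bounding WHERE condition per mapper."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]
```

For example, split_ranges(1, 100, 4) gives four ranges of 25 values each. If the column's values are skewed, some mappers get far more rows than others, which is why an evenly distributed split column matters.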

Question 18

Q

Ingestion utilities by database vendors

Answer

A

Compared to native utilities, ingestion utilities from database vendors capitalize on their own database architecture.
Oracle provides CopyToBDA; Greenplum provides gphdfs and gpfdist.
Edge node, writable external table

Question 19

Q

Ingesting Unstructured Data

Flume

Answer

A

Collects and Transfers Streaming data as events.

Question 20

Q

Flume Components

Answer

A

Flume Event - The data to be transferred.
Source - PUTS data events from source system into Channel
Channel - Collects and persists events (temp storage)
Sink - TAKES (pulls) events from the channel and pushes them to a destination such as HDFS.
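
A toy sketch of the source, channel, and sink flow in plain Python. It is not the Flume API: MemoryChannel, run_source, and run_sink are hypothetical names, and a list stands in for HDFS.

```python
import queue

class MemoryChannel:
    """Toy stand-in for a channel: a bounded buffer between source and sink."""

    def __init__(self, capacity):
        self._events = queue.Queue(maxsize=capacity)

    def put(self, event):
        self._events.put_nowait(event)  # raises queue.Full when the channel is at capacity

    def take(self, batch_size):
        """Remove and return up to batch_size events."""
        batch = []
        while len(batch) < batch_size and not self._events.empty():
            batch.append(self._events.get_nowait())
        return batch

def run_source(lines, channel):
    """Source: wrap each incoming line as an event (headers plus body) and put it on the channel."""
    for line in lines:
        channel.put({"headers": {"origin": "demo"}, "body": line})

def run_sink(channel, destination, batch_size=2):
    """Sink: take events from the channel in batches and write their bodies to the destination."""
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        destination.extend(event["body"] for event in batch)

channel = MemoryChannel(capacity=10)
hdfs_stand_in = []  # a list standing in for HDFS
run_source(["log line 1", "log line 2", "log line 3"], channel)
run_sink(channel, hdfs_stand_in)
```

The channel decouples the two sides: the source can keep accepting events while the sink drains them at its own pace, up to the channel's capacity.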

Question 21

Q

Flume Design Considerations.

Answer

A

Channel choices: memory (fast, but events are lost if the agent fails), file (durable), or Kafka topic

Question 22

Q

Kafka

Answer

A

A distributed data streaming platform
Topics, Producers, Consumers.
Provides APIs to Produce, Consume and process streams.
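
A toy in-memory sketch of how topics, producers, and consumers relate. It is not the Kafka client API, and it ignores partitions, brokers, and replication; the class names and the "clicks" topic are hypothetical.

```python
class Topic:
    """Toy topic: an append-only list of records; reading does not remove them."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def append(self, record):
        self.log.append(record)
        return len(self.log) - 1  # offset of the new record

class Producer:
    def __init__(self, topic):
        self.topic = topic

    def send(self, value):
        """Append a value to the topic and return its offset."""
        return self.topic.append(value)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0  # each consumer tracks its own position in the log

    def poll(self, max_records=10):
        """Return up to max_records records from the current offset and advance it."""
        records = self.topic.log[self.offset:self.offset + max_records]
        self.offset += len(records)
        return records

clicks = Topic("clicks")
producer = Producer(clicks)
for page in ["home", "cart", "checkout"]:
    producer.send(page)

analytics = Consumer(clicks)
audit = Consumer(clicks)
first = analytics.poll(max_records=2)
rest = analytics.poll()
everything = audit.poll()
```

Each consumer keeps its own offset, so the audit consumer still sees every record after the analytics consumer has read them. That independence is the main difference from a queue where a message is gone once consumed.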

Data Lakes to Analysis