Data Engineering Flashcards

Data Lakes to Analysis

1
Q

Deploy Data Lake (s)

A

Distributed Data Objects

Use S3 in AWS

2
Q

Data Vault (s)

A

Schema-on-read to schema-on-write
Consists of Hubs, Links, and Satellites
Hubs hold the business keys; Links relate business keys to each other (as link tables)
Satellites store the actual descriptive data
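
A minimal sketch of the Hub / Link / Satellite split, using Python's built-in sqlite3 module. The table and column names (hub_customer, hub_product, link_purchase, sat_customer) are hypothetical and chosen only for illustration; a real Data Vault model would typically add hash keys, record sources, and more metadata.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: one row per business key
CREATE TABLE hub_customer (customer_hk INTEGER PRIMARY KEY, customer_bk TEXT UNIQUE);
CREATE TABLE hub_product  (product_hk  INTEGER PRIMARY KEY, product_bk  TEXT UNIQUE);
-- Link: relates business keys from two hubs
CREATE TABLE link_purchase (
    customer_hk INTEGER REFERENCES hub_customer(customer_hk),
    product_hk  INTEGER REFERENCES hub_product(product_hk)
);
-- Satellite: descriptive attributes, versioned by load date
CREATE TABLE sat_customer (
    customer_hk INTEGER REFERENCES hub_customer(customer_hk),
    load_date   TEXT,
    name        TEXT,
    city        TEXT
);
""")
conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001')")
conn.execute("INSERT INTO hub_product VALUES (1, 'PROD-042')")
conn.execute("INSERT INTO link_purchase VALUES (1, 1)")
conn.executemany("INSERT INTO sat_customer VALUES (?, ?, ?, ?)", [
    (1, "2024-01-01", "Ada", "London"),
    (1, "2024-06-01", "Ada", "Paris"),  # a change adds a row; history is kept
])

def latest_customer(conn, business_key):
    """Return the most recent satellite attributes for a business key."""
    return conn.execute("""
        SELECT s.name, s.city FROM hub_customer h
        JOIN sat_customer s ON s.customer_hk = h.customer_hk
        WHERE h.customer_bk = ?
        ORDER BY s.load_date DESC LIMIT 1
    """, (business_key,)).fetchone()
```

The hub row never changes when the customer moves city; only a new satellite row is appended.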

3
Q

Spark (s)

A

Cluster computing framework
Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX
Spark Core - Scheduling & Dispatching, Cluster Computing, Distributed processing
Spark SQL - abstracts data as DataFrames

4
Q

Mesos

A

Cluster Controller

Fine-grained resource sharing across the cluster

5
Q

Akka

A

Actor-based, message-driven runtime

6
Q

Cassandra

A

Distributed database

7
Q

Kafka

A

Messaging & Streaming platform
Open-source Apache project; Confluent.io is a major commercial provider
Kafka Core, Kafka Streams, Kafka Connect

8
Q

CRISP-DM

HORUS

A

Data Lake Data Science Standard

A standard that reduces the number of converters required for source data by advocating a common data format.

9
Q

Layered Approach (s)

A

Business, Utility, Operations Management, Audit/Balance/Control, and Functional layers

10
Q

Functional Layer

A

Sun Model, Conformed Dimension, Availability, Capacity, Extensibility, Interoperability

11
Q

Data Democratization (s)

A

Treat all data as equally relevant and source it into a single platform to cut through the barriers of architecture, frameworks, and models.

12
Q

Argument against data democratization

A

The lake will soon become a swamp.

But this can be controlled with better governance.

13
Q

In 3 words, what makes unstructured data complex?

A

Searching, Fetching, Consolidating

14
Q

2 Main pieces of Data Ingestion

A

Data Collection and Data Integration

15
Q

File Formats used in Data Lake

A

Parquet, ORC
Both are columnar file formats that support predicate pushdown through built-in statistics (stripe indexes in ORC, row-group statistics in Parquet). Sorting the data before inserting it gives better read performance but adds overhead during the write.
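
A toy model of why sorted data reads faster under predicate pushdown. This is plain Python, not the Parquet or ORC API: the helper names build_row_groups and scan_equal are hypothetical, and the "row groups" are just lists with a recorded min and max. The idea sketched is that a reader can skip any group whose min/max range cannot contain the filter value.

```python
import random

def build_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max per group."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_equal(groups, target):
    """Return matching rows and the number of groups that had to be read."""
    groups_read, hits = 0, []
    for g in groups:
        # Statistics check: skip the group if the target cannot be inside it.
        if g["min"] <= target <= g["max"]:
            groups_read += 1
            hits.extend(v for v in g["rows"] if v == target)
    return hits, groups_read

rng = random.Random(0)
unsorted_values = list(range(1000))
rng.shuffle(unsorted_values)

sorted_groups = build_row_groups(sorted(unsorted_values), 100)
unsorted_groups = build_row_groups(unsorted_values, 100)

hits_sorted, read_sorted = scan_equal(sorted_groups, 250)
hits_unsorted, read_unsorted = scan_equal(unsorted_groups, 250)
```

With sorted input, each group covers a narrow, non-overlapping range, so only one group is read. With shuffled input, most groups span nearly the whole value range and the statistics rule out very little. The sort is the write-time overhead the card mentions.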

16
Q

Data Ingestion from Relational Source

A

Almost the same as extraction for a data warehouse (DWH).
Use predicate filtering.
Capture changed records through a flag.
Perform incremental extraction with a timestamp or id field.
Perform full extraction if the source volume is low.
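
A minimal sketch of incremental extraction with a timestamp watermark, using Python's built-in sqlite3 as a stand-in for the relational source. The orders table, its columns, and the extract_incremental helper are assumptions made for illustration.

```python
import sqlite3

def extract_incremental(conn, last_watermark):
    """Fetch rows changed after last_watermark; return rows and new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing new arrived.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
source.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01T00:00:00"),
    (2, 20.0, "2024-01-02T00:00:00"),
])

# First run: an empty watermark pulls everything (a full extraction).
first_batch, watermark = extract_incremental(source, "")

# A new row lands in the source; the next run picks up only that row.
source.execute("INSERT INTO orders VALUES (3, 30.0, '2024-01-03T00:00:00')")
second_batch, watermark = extract_incremental(source, watermark)
```

The WHERE clause is the predicate filter: it runs in the source database, so only changed rows cross the network. Persisting the watermark between runs is left out of this sketch.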

17
Q

Sqoop

A
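SQL-to-Hadoop; transfers data in both directions (import and export)
Runs as MapReduce jobs (imports are map-only) that read from or write to HDFS
--split-by <column> chooses the column used to divide the work across mappers
--autoreset-to-one-mapper falls back to a single mapper when the table has no primary key and no --split-by column is given
A character-based (text) column can also be used with --split-by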
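
A rough illustration of what splitting on a numeric column means. This is not Sqoop's code: split_ranges and where_clauses are hypothetical helpers that approximate the idea of taking the column's minimum and maximum and giving each mapper a contiguous slice to query.

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide the inclusive range [min_val, max_val] into contiguous chunks."""
    total = max_val - min_val + 1
    # Never create more chunks than there are distinct values.
    num_mappers = max(1, min(num_mappers, total))
    base, extra = divmod(total, num_mappers)
    ranges, lo = [], min_val
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

def where_clauses(column, ranges):
    """Build one bounding predicate per mapper."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]

clauses = where_clauses("id", split_ranges(1, 1000, 4))
```

Each clause would be appended to one mapper's query, so the mappers read disjoint slices in parallel. A heavily skewed split column gives uneven slices, which is one reason the choice of column matters.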
18
Q

Ingestion utilities by database vendors

A

Compared with native utilities, ingestion utilities from database vendors capitalize on their own database architecture.
Oracle provides CopyToBDA; Greenplum provides gphdfs and gpfdist.
Edge node, writable external table

19
Q

Ingesting Unstructured Data

Flume

A

Collects and Transfers Streaming data as events.

20
Q

Flume Components

A

Flume Event - The data to be transferred.
Source - PUTS data events from source system into Channel
Channel - Collects and persists events (temp storage)
Sink - TAKES (pulls) events from the Channel and pushes them to the destination (e.g. HDFS).
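
The put/take flow above can be sketched in plain Python. This is a toy model, not the Flume API: the Channel class, run_source, and run_sink are hypothetical names, and a list stands in for HDFS.

```python
from collections import deque

class Channel:
    """Temporary buffer between source and sink (held in memory here)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        if len(self._events) >= self.capacity:
            raise OverflowError("channel full")
        self._events.append(event)

    def take(self, batch_size):
        batch = []
        while self._events and len(batch) < batch_size:
            batch.append(self._events.popleft())
        return batch

def run_source(lines, channel):
    """Source: wraps each input line as an event and PUTS it into the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line})

def run_sink(channel, destination, batch_size=2):
    """Sink: TAKES events from the channel in batches and writes them out."""
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        destination.extend(event["body"] for event in batch)

channel = Channel(capacity=10)
written = []  # stands in for HDFS
run_source(["log line 1", "log line 2", "log line 3"], channel)
run_sink(channel, written)
```

The channel decouples the two sides: the source can keep accepting events while the sink drains at its own pace, up to the channel's capacity.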

21
Q

Flume Design Considerations.

A

Choose the Channel type: Memory (fast, but events are lost if the agent fails), File (durable, but slower), or Kafka topic.

22
Q

Kafka

A

A distributed data-streaming platform
Topics, Producers, Consumers.
Provides APIs to Produce, Consume and process streams.
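
A toy in-memory model of topics, producers, and consumers. This is not the Kafka client API: ToyBroker and its produce and consume methods are hypothetical, and there is no partitioning, replication, or networking. It only sketches the idea that a topic is an append-only log and that each consumer group tracks its own read offset.

```python
from collections import defaultdict

class ToyBroker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only log
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def produce(self, topic, message):
        """Append a message to the topic and return its offset."""
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def consume(self, group, topic, max_messages=10):
        """Return unread messages for this group and advance its offset."""
        start = self.offsets[(group, topic)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(group, topic)] = start + len(batch)
        return batch

broker = ToyBroker()

# Producer: publish sensor readings.
for reading in (3, 8, 5):
    broker.produce("sensor-readings", reading)

# Stream processing: consume, filter, and produce to a second topic.
for reading in broker.consume("alerting", "sensor-readings"):
    if reading > 4:
        broker.produce("alerts", reading)
```

Messages are not removed when read, so a second consumer group (for example a dashboard) can still read the same topic from the start.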