Data Engineering Flashcards
Data Lakes to Analysis
Deploy Data Lake (s)
Distributed Data Objects
Use S3 in AWS
Data Vault (s)
Schema on Read to Schema on Write
Consists of - Hubs, Links, Satellites
Hubs - hold business keys; Links - link tables that relate business keys to each other
Satellites store the actual data
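Example: a minimal sketch of the hub / link / satellite split, using Python's built-in sqlite3. All table and column names (hub_customer, link_purchase, sat_customer, load_ts, ...) are hypothetical and chosen only for illustration; real Data Vault models typically carry more metadata columns than shown here.

```python
import sqlite3

# Hypothetical schema: two hubs, one link, one satellite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY);  -- business key only
CREATE TABLE hub_product  (product_key  TEXT PRIMARY KEY);  -- business key only
CREATE TABLE link_purchase (                                 -- relates two business keys
    customer_key TEXT REFERENCES hub_customer(customer_key),
    product_key  TEXT REFERENCES hub_product(product_key),
    PRIMARY KEY (customer_key, product_key)
);
CREATE TABLE sat_customer (                                  -- the actual descriptive data
    customer_key TEXT REFERENCES hub_customer(customer_key),
    load_ts      TEXT,
    name         TEXT,
    city         TEXT,
    PRIMARY KEY (customer_key, load_ts)
);
""")

conn.execute("INSERT INTO hub_customer VALUES ('C-001')")
conn.execute("INSERT INTO hub_product VALUES ('P-100')")
conn.execute("INSERT INTO link_purchase VALUES ('C-001', 'P-100')")
conn.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
    [
        ("C-001", "2024-01-01", "Ada", "London"),
        ("C-001", "2024-06-01", "Ada", "Paris"),  # new version; the old row is kept
    ],
)


def latest_customer_purchases(connection):
    """Join link -> satellite, keeping only the newest satellite row per customer."""
    return connection.execute("""
        SELECT s.name, s.city, l.product_key
        FROM link_purchase AS l
        JOIN sat_customer AS s ON s.customer_key = l.customer_key
        WHERE s.load_ts = (SELECT MAX(load_ts) FROM sat_customer
                           WHERE customer_key = l.customer_key)
    """).fetchall()


rows = latest_customer_purchases(conn)
history_count = conn.execute("SELECT COUNT(*) FROM sat_customer").fetchone()[0]
```

The hubs and the link never change when a customer moves city; only a new satellite row is appended, which is why history is preserved.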
Spark (s)
Cluster computing framework
Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX
Spark Core - Scheduling & Dispatching, Cluster Computing, Distributed processing
Spark SQL - abstracts data as DataFrames
Mesos
Cluster Controller
Fine-grained resource sharing across the cluster
Akka
Actor-based, message-driven runtime
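Example: a toy sketch of the actor idea in plain Python. This is not Akka's API (Akka runs on the JVM); the class name CounterActor and its methods tell and stop are hypothetical. It only illustrates the pattern: private state, a mailbox, and messages handled one at a time.

```python
import queue
import threading

_STOP = object()  # sentinel message that tells the actor to shut down


class CounterActor:
    """Toy actor: private state, a mailbox, one thread handling messages in order."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._total = 0  # touched only by the actor's own thread while it runs
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget send; callers never touch the actor's state directly."""
        self._mailbox.put(message)

    def stop(self):
        """Let the actor drain queued messages, wait for it to exit, return its total."""
        self._mailbox.put(_STOP)
        self._thread.join()
        return self._total

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            self._total += message  # no lock needed: one message at a time


actor = CounterActor()
for n in (1, 2, 3):
    actor.tell(n)
final_total = actor.stop()
```

Because the mailbox serialises all messages, the state needs no locks even when many senders call tell concurrently.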
Cassandra
Distributed database
Kafka
Messaging & Streaming platform
Apache open-source project; Confluent (confluent.io) is a major commercial provider
Kafka Core, Kafka Streams, Kafka Connect
CRISP-DM
HORUS
Data Lake Data Science Standard
A standard that reduces the number of converters required for source data by promoting a common data format.
Layered Approach (s)
Business, Utility, Operations Management, Audit/Balance/Control, Functional Layer
Functional Layer
Sun Model, Conformed Dimension, Availability, Capacity, Extendability, Interoperability
Data Democratization (s)
Treat all data as equally relevant and source it into a single platform to cut through barriers of architecture, framework and models.
Argument against data democratization
Soon the lake will become a swamp.
But this can be controlled with better governance.
In 3 words, what makes unstructured data complex?
Searching, Fetching, Consolidating
2 Main pieces of Data Ingestion
Data Collection and Data Integration
File Formats used in Data Lake
Parquet, ORC
Both are columnar file formats that support predicate pushdown using stored min/max statistics (stripe indexes in ORC, row-group statistics in Parquet). Sorting data before it is written improves read performance but adds overhead during the write.
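Example: a simplified sketch of why sorted data reads faster under predicate pushdown. It does not use a real Parquet or ORC library; the functions write_chunks and scan_greater_than are hypothetical, and each "chunk" stands in for an ORC stripe or a Parquet row group with its min/max statistics.

```python
def write_chunks(values, chunk_size):
    """Split values into chunks and record min/max statistics for each chunk."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        chunks.append({"min": min(part), "max": max(part), "values": part})
    return chunks


def scan_greater_than(chunks, threshold):
    """Return (matching values, number of chunks actually read) for value > threshold."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue  # statistics prove no row can match, so the chunk is skipped
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v > threshold)
    return matches, chunks_read


data = [7, 1, 9, 3, 8, 2, 6, 4, 10]

# Written in arrival order: every chunk holds a large value, so none can be skipped.
unsorted_result = scan_greater_than(write_chunks(data, 3), 7)

# Sorted before writing (extra write cost): large values end up in a single chunk.
sorted_result = scan_greater_than(write_chunks(sorted(data), 3), 7)
```

With this sample data the unsorted layout reads all three chunks, while the sorted layout reads only one, at the price of sorting before the write.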