Data Platforms Flashcards
Data-Driven Innovation
Refers to the use of analytics to drive innovation and business value from data
Analytics
In this context, we mean the different types of business intelligence initiatives
Advanced Analytics
Semi-autonomous examination of data to get deeper insights (Machine Learning)
Augmented Analytics
Augment how people explore data with the incorporation of AI
Database
Structured and persistent collection of information with efficient retrieval and modification (relational databases)
Data Warehouse
Subject-oriented collection of data that supports decision-making processes
OLTP
Online transaction processing: constant queries and updates, short-term data retention (e.g. accounting database, online retail transactions)
OLAP
Online analytical processing: periodic large updates, complex queries for reporting and decision support
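To make the OLTP/OLAP contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the helper names `place_order` and `revenue_by_region` are hypothetical, chosen only for illustration; a real deployment would normally use separate systems for the two workloads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

def place_order(region, amount):
    """OLTP-style work: a short transaction that touches a single row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)",
            (region, amount),
        )

def revenue_by_region():
    """OLAP-style work: one read-only query aggregating many rows for a report."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)

# Illustrative data only.
for region, amount in [("north", 10.0), ("south", 25.0), ("north", 5.0)]:
    place_order(region, amount)
```

Calling `revenue_by_region()` after the inserts returns the per-region totals; the point is the shape of the workload (many tiny writes vs one large aggregate read), not the specific schema.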
Data Lake
Central repository where data is kept in its original format (unstructured, semi-structured, or structured) and queried only when needed.
Supports storage, processing and analysis
What kind of users use Data Warehouses vs Data Lakes?
Data warehouses: business analysts
Vs
Data lakes: data scientists, data developers, and business analysts
Data Platform
Meets end-to-end data needs such as acquisition, storage, preparation, delivery, governance and security, so users can focus ONLY on functional aspects
How do we prevent a data platform (DP) from becoming a swamp?
We MUST govern data transformations and leverage metadata and maintenance to keep control over data
What are 5 areas of data management? (PCPED) Plankton chokes Patrick every day
- Data provenance
- Compression
- Data profiling
- Entity resolution
- Data versioning
Data Provenance
Description of the origins of data and the process by which it arrived
Data Provenance Granularity
Fine-grained (instance level): tracks individual data items
Coarse-grained (schema level): tracks dataset transformations
Three levels (types) of data provenance (EAA)
Entity (physical/conceptual thing)
Activity (what generated the thing)
Agent (associated with the activity)
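The Entity / Activity / Agent split can be sketched as three small record types. This is only an illustration under assumed names: the fields `generated_by` and `derived_from` and the `lineage` helper are hypothetical and do not come from any particular provenance library.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Agent:
    name: str  # person, team, or job associated with an activity

@dataclass(frozen=True)
class Activity:
    name: str     # what generated the entity
    agent: Agent  # who or what was associated with the activity

@dataclass(frozen=True)
class Entity:
    name: str                                 # physical or conceptual thing
    generated_by: Optional[Activity] = None   # None for a raw source
    derived_from: Tuple["Entity", ...] = ()   # upstream entities

def lineage(entity):
    """Walk back towards the origins, listing (entity, activity, agent) steps."""
    steps = []
    stack = [entity]
    while stack:
        current = stack.pop()
        if current.generated_by is not None:
            activity = current.generated_by
            steps.append((current.name, activity.name, activity.agent.name))
        stack.extend(current.derived_from)
    return steps
```

For a dataset built by a deduplication job from a raw file, `lineage` would list one step naming the cleaned dataset, the job's activity, and the agent that ran it.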
Compression
Concise representation of a dataset in a comprehensible manner
Data profiling
Analyzing the structure and quality of a dataset
The data is scanned for metadata such as completeness and uniqueness of columns, keys and foreign keys
Two things data profiling can help with
- Optimizing queries
- Cleansing (errors in data)
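A minimal sketch of the completeness and uniqueness checks mentioned above, assuming the dataset is a list of dict rows and that `None` marks a missing value. The function name `profile` and the sample rows are hypothetical.

```python
def profile(rows):
    """Return per-column completeness and uniqueness for a list of dict rows."""
    if not rows:
        return {}
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            # share of rows that have a value in this column
            "completeness": len(present) / len(rows),
            # True when no value repeats among the present values
            "unique": len(set(present)) == len(present),
        }
    return report

# Illustrative rows only.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
]
report = profile(rows)
```

A column that is fully complete and unique (here `id`) is a candidate key, which helps query optimization; a column with gaps or repeats (here `email`) is a candidate for cleansing.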
Entity resolution
Find records that refer to the same entity
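One simple way to approach this is to normalize each record into a key and group records that share it. The sketch below assumes records with hypothetical `id`, `name`, and `email` fields and uses exact matching on the normalized key; real systems typically add fuzzy matching and scoring on top.

```python
from collections import defaultdict

def normalize(name):
    """Lowercase, drop simple punctuation, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())

def resolve(records):
    """Group record ids that share a normalized (name, email) key."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["name"]), rec["email"].lower())
        groups[key].append(rec["id"])
    # keep only keys matched by more than one record
    return [ids for ids in groups.values() if len(ids) > 1]

# Illustrative records only.
records = [
    {"id": 1, "name": "Jane  Doe", "email": "JANE@example.org"},
    {"id": 2, "name": "jane doe", "email": "jane@example.org"},
    {"id": 3, "name": "John Roe", "email": "john@example.org"},
]
```

Here `resolve(records)` groups records 1 and 2 as the same entity, since they differ only in case and spacing.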
Version Control
Managing changes to computer programs or data collections, with each change identified by a version number (code).
Data versioning
Version control that extends to data models, model parameter tracking and performance comparison
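As a rough sketch of the idea, a dataset can be given a version identifier derived from its content, and each model run can record that identifier next to its parameters and score so runs can be compared. The names `dataset_version`, `log_run`, `best_run`, and the in-memory `registry` are hypothetical; dedicated tools handle this at scale.

```python
import hashlib
import json

def dataset_version(rows):
    """Derive a short version id from the dataset content (JSON-serializable rows)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

registry = []  # in-memory stand-in for a run-tracking store

def log_run(rows, params, accuracy):
    """Record which data version and parameters produced which score."""
    entry = {
        "data_version": dataset_version(rows),
        "params": params,
        "accuracy": accuracy,
    }
    registry.append(entry)
    return entry

def best_run():
    """Compare logged runs and return the one with the highest accuracy."""
    return max(registry, key=lambda entry: entry["accuracy"])
```

Because the identifier depends only on content, the same data always maps to the same version, and any change to the data yields a new one.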
Data lakehouse
Combines the flexibility of data lakes with the structure of data warehouses (ACID transactions) to support both BI and ML
Vendor lock in…?
Data Platform Engineer job description
Implements cloud technologies within the data structure of the business; in charge of purchasing decisions for cloud services and approval of data architectures
DevOps
Enables software DEVelopment and OPerationS teams to accelerate delivery through collaboration and iterative improvement
DataOps
Uses automation to shorten the data analytics lifecycle
Data fabric
Seamless data access and sharing in a distributed environment
Fabric is a smooth, unified surface
Data mesh
Decentralized, distributed governance, with domains owning data products.
Mesh is a grid-like surface with interconnected “nodes”/“domains”