Data Engineering Flashcards
What does DAMA stand for?
The Data Management Association International
What does DMBOK stand for?
Data Management Body of Knowledge. Published by DAMA as the DAMA-DMBOK, it is widely regarded as a definitive reference for data management.
What are key principles of SPC?
Statistical Process Control is a method used to monitor and control a process over time, by using statistical methods to analyze the data and determine if the process is performing within acceptable limits.
- Variation: All processes have variation. This variation can be either random (common-cause) or assignable (special-cause). Random variation is inherent to the process and cannot be eliminated without changing the process itself. Assignable variation is due to specific causes that can be identified and eliminated.
- Control: A process is in control when the variation is random. A process is out of control when the variation is due to assignable causes.
- Prevention: The goal of SPC is to prevent assignable causes of variation from occurring. This is done by identifying and eliminating the causes of variation.
- Continuous improvement: SPC is a continuous process. Once a process is in control, it is important to continuously monitor the process to ensure that it remains in control.
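As an illustration of the "in control" check described above, here is a minimal sketch of a control chart in Python. It assumes a baseline sample taken while the process was stable and uses the conventional three-sigma limits; the sample values and the function names `control_limits` and `out_of_control` are made up for this card. Estimating sigma with the sample standard deviation is a simplification, since real control charts often use moving ranges or subgroup statistics instead.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits estimated from a stable baseline."""
    center = mean(baseline)
    spread = stdev(baseline)  # simplification: sample standard deviation
    return center - sigmas * spread, center, center + sigmas * spread

def out_of_control(values, lower, upper):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical measurements from a period when the process was stable.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
lower, center, upper = control_limits(baseline)

# New measurements to monitor; points outside the limits suggest an
# assignable cause worth investigating.
new_values = [101, 99, 120, 100]
flagged = out_of_control(new_values, lower, upper)
print(lower, center, upper, flagged)
```

Points inside the limits are treated as random variation and left alone; only flagged points trigger a search for an assignable cause.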
What are the three key principles of DODD?
Data Observability Driven Development is a software development methodology that focuses on ensuring the quality and reliability of data throughout the development lifecycle.
- Contextual observability: Data should be observed in the context of its use. This means understanding how data is used throughout the development lifecycle, from ingestion to analysis: which other systems it interacts with, and how it is affected by external factors such as load, network latency, and environmental conditions. This goes beyond the metrics and logs that observability traditionally focused on.
- Synchronized observability: Data should be observed in real time. This means being able to identify problems with data as soon as they occur.
- Continuous validation: Data should be validated continuously throughout the development lifecycle. This means testing data quality at every stage of development.
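One way to picture continuous validation is a check that runs after every pipeline stage rather than only at the end. The sketch below is an assumption-laden illustration, not a DODD reference implementation: `validate_rows`, `run_stage`, the stage name, and the field names are all hypothetical, and the only rule checked is that required fields are present.

```python
def validate_rows(rows, required_fields):
    """Return (row_index, problem) tuples; an empty list means the batch passed."""
    problems = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) is None:
                problems.append((i, f"missing {field}"))
    return problems

def run_stage(name, rows, transform, required_fields):
    """Apply a transform, then validate its output before passing it on."""
    output = [transform(row) for row in rows]
    problems = validate_rows(output, required_fields)
    if problems:
        # Fail at the stage that produced the bad data, not downstream.
        raise ValueError(f"stage {name!r} failed validation: {problems}")
    return output

# Hypothetical stage: add a total to each order row.
orders = [{"id": 1, "qty": 2, "price": 5.0}, {"id": 2, "qty": 1, "price": 3.5}]
priced = run_stage(
    "add_total",
    orders,
    lambda r: {**r, "total": r["qty"] * r["price"]},
    required_fields=["id", "total"],
)
print(priced)
```

Because each stage validates its own output, a problem surfaces at the step that introduced it.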
What does MDM stand for and what is it?
Master Data Management is the practice of building consistent entity definitions known as golden records. Golden records harmonize entity data across an organization and with its partners.
Master Data is about business entities such as employees, customers, products, and locations.
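To make the idea of a golden record concrete, here is a small sketch that merges records for the same customer from two source systems. The survivorship rule used (the most recently updated non-null value wins) is just one possible rule chosen for illustration; the records, field names, and the `golden_record` function are hypothetical.

```python
def golden_record(records):
    """Merge source records for one entity; newer non-null values win."""
    merged = {}
    # ISO-8601 date strings sort chronologically, so oldest records come first
    # and newer records overwrite them field by field.
    for record in sorted(records, key=lambda r: r["updated"]):
        for field, value in record.items():
            if field != "updated" and value is not None:
                merged[field] = value
    return merged

# Hypothetical records for the same customer from two systems.
crm = {"id": "c1", "name": "Ada Lovelace", "email": None, "updated": "2023-01-10"}
billing = {"id": "c1", "name": "A. Lovelace", "email": "ada@example.com", "updated": "2022-06-01"}

customer = golden_record([crm, billing])
print(customer)
```

Here the name comes from the newer CRM record, while the email survives from billing because the CRM value is missing.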
What is the process of converting data into usable form?
Data modeling and design
Define DAG
Directed Acyclic Graph
A DAG, or Directed Acyclic Graph, is a data structure used in computer science and data engineering to represent a set of tasks or operations and their dependencies. In a DAG, each node represents a task or operation, and each edge represents a dependency between tasks.
The “directed” aspect of a DAG means that each edge has a direction, indicating the flow of data or control between tasks. The “acyclic” aspect means that the graph does not contain any cycles or loops, so there are no circular dependencies between tasks. This is an important property of DAGs because it guarantees that a valid execution order (a topological order) exists, in which every task runs only after all the tasks it depends on have finished.
DAGs are commonly used in workflow management systems, such as Apache Airflow, to schedule and execute complex data pipelines. In a DAG-based workflow, each task represents a step in the pipeline, and the dependencies between tasks ensure that they are executed in the correct order. Apache NiFi also models dataflows as directed graphs, although its flows may contain loops and so are not strictly DAGs. DAGs are also used in distributed computing systems, such as Apache Spark, to represent data processing operations and their dependencies.
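The ordering guarantee can be sketched with Python's standard-library `graphlib` module (Python 3.9+). The task names below are hypothetical, and the mapping follows the `graphlib` convention of listing each task's prerequisites. This only illustrates the idea; it is not how any particular orchestrator is implemented.

```python
from graphlib import CycleError, TopologicalSorter

def execution_order(dependencies):
    """Return tasks in an order where each task follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())

def has_cycle(dependencies):
    """Return True if the dependency graph is not acyclic."""
    try:
        execution_order(dependencies)
    except CycleError:
        return True
    return False

# Hypothetical pipeline: each key maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "load": {"aggregate"},
    "report": {"aggregate"},
}

order = execution_order(pipeline)
print(order)

# A circular dependency makes a valid order impossible.
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

"load" and "report" do not depend on each other, so either may come first; a scheduler could also run them in parallel.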
What are windowing methods?
In the context of data processing and analysis, “windowing methods” are techniques for dividing a dataset into smaller subsets, or windows, and performing computations on those subsets. These methods are often used in time-series analysis or other types of data analysis where the order and sequence of data points are important.
Windowing methods can be used to calculate rolling or moving statistics, such as rolling averages or moving medians, which provide a way to smooth out noise or fluctuations in the data over time. They can also be used to perform aggregate computations, such as calculating the sum, mean, or maximum value of a subset of data points within a fixed time interval.
There are several different types of windowing methods, the most common being tumbling windows (also called fixed windows) and sliding windows. Tumbling windows divide the data into contiguous, non-overlapping subsets of a fixed size, often a fixed time interval such as an hour or a day. Sliding windows divide the data into overlapping subsets of a fixed size that move incrementally through the data. A window's size can be defined either by a number of data points or by a time interval.
Windowing methods are widely used in various data processing and analysis tools, including Apache Spark, Apache Flink, and pandas. They provide a powerful way to analyze and summarize large datasets, particularly in cases where the data is constantly evolving over time.
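The difference between non-overlapping and overlapping windows can be shown in plain Python. This is a minimal sketch that sizes windows by number of data points rather than by time; the readings and the function names `tumbling_windows` and `sliding_windows` are made up for illustration, and tools such as Spark, Flink, and pandas provide their own windowing APIs.

```python
def tumbling_windows(values, size):
    """Split values into consecutive, non-overlapping windows of `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def sliding_windows(values, size, step=1):
    """Return overlapping windows of `size` items, advancing `step` at a time."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, step)]

# Hypothetical sensor readings in arrival order.
readings = [10, 12, 11, 15, 14, 13]

tumbling = tumbling_windows(readings, 3)
sliding = sliding_windows(readings, 3)

# Aggregate per tumbling window, and a rolling mean over sliding windows.
window_sums = [sum(w) for w in tumbling]
rolling_means = [sum(w) / len(w) for w in sliding]

print(tumbling)     # [[10, 12, 11], [15, 14, 13]]
print(window_sums)  # [33, 42]
print(len(sliding)) # 4
```

Each reading lands in exactly one tumbling window, while a reading in the middle of the series appears in up to three of the sliding windows.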
TOGAF
The Open Group Architecture Framework, a standard of the Open Group. It’s touted as the most widely used architecture framework today.
EABOK
Enterprise Architecture Body of Knowledge, an enterprise architecture reference produced by the MITRE Corporation. Largely abandoned.
EA
Enterprise Architecture is the design of systems to support change in the enterprise, achieved by flexible and reversible decisions reached through careful evaluation of trade-offs.