Data Quality and Data Governance Flashcards
Why are data integration principles needed?
- Data is integrated from multiple sources
- Different technologies and processes are needed
- Multiple challenges to be addressed
What are the data integration principles?
- Standardisation
- Reconciliation
- Validation
- Transformation
- Cleansing
- Enrichment
- Privacy
What is data standardisation?
Transforming data from different sources into a common format and structure (such as ISO date format)
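As a minimal sketch of standardisation (assuming Python; the source date formats and field values are illustrative assumptions, not part of the flashcards):

```python
from datetime import datetime

# Illustrative input formats that different sources might use (assumption)
KNOWN_FORMATS = ["%d/%m/%Y", "%m-%d-%Y", "%Y.%m.%d"]

def to_iso_date(raw: str) -> str:
    """Normalise a date string from any known source format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

print(to_iso_date("31/12/2024"))  # -> 2024-12-31
print(to_iso_date("12-31-2024"))  # -> 2024-12-31
```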
What is data reconciliation?
Focusing on data consistency by resolving issues of inconsistency between data from different sources
Why might data inconsistencies arise?
- Integrating distinct data sources where entries may differ
- Failures during integration
- Data that is copied or transformed incorrectly
- Missing / duplicate records or values
- Incorrectly formatted values
- Broken relationships between tables
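A small sketch of how a couple of these inconsistencies (duplicate records and missing values) might be flagged after merging two sources; pandas and the column names are assumptions, not part of the flashcards:

```python
import pandas as pd

# Two illustrative source extracts describing the same entity type (assumed schemas)
source_a = pd.DataFrame({"id": [1, 2, 3], "email": ["a@x.com", None, "c@x.com"]})
source_b = pd.DataFrame({"id": [3, 4], "email": ["c@x.com", "d@x.com"]})

combined = pd.concat([source_a, source_b], ignore_index=True)

# Duplicate records: the same id appears in more than one source
duplicates = combined[combined.duplicated(subset="id", keep=False)]

# Missing values introduced by one of the sources
missing = combined[combined["email"].isna()]

print(duplicates)
print(missing)
```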
What is data validation?
Assessing the accuracy and completeness of data from different sources against defined constraints
What are examples of data validation?
Checking that email addresses, phone numbers or postcodes are complete and follow a set pattern
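A hedged sketch of pattern-based validation using Python's standard `re` module; the patterns are deliberately simplified illustrations and would need tightening for real use:

```python
import re

# Deliberately simplified patterns (illustrative, not exhaustive)
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uk_postcode": re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.IGNORECASE),
    "phone": re.compile(r"^\+?\d[\d\s-]{6,14}$"),
}

def validate(field: str, value: str) -> bool:
    """Return True if the value is non-empty and matches the expected pattern."""
    return bool(value) and PATTERNS[field].match(value) is not None

print(validate("email", "alice@example.com"))  # True
print(validate("uk_postcode", "SW1A 1AA"))     # True
print(validate("phone", "not-a-number"))       # False
```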
What is data transformation?
Converting data to a specified schema
What is data cleansing?
Removing inconsistent data such as duplicate, irrelevant or incorrect data
What might data cleansing be used to check?
Two records for the same entity which have slightly different entries
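One way to spot two records for the same entity with slightly different entries is fuzzy string matching; a minimal sketch using the standard library's `difflib` (the record fields and the 0.8 threshold are assumptions):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] describing how alike two strings are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two illustrative records that probably describe the same customer
record_1 = {"name": "Jon Smith", "city": "London"}
record_2 = {"name": "John Smith", "city": "london"}

name_score = similarity(record_1["name"], record_2["name"])
city_score = similarity(record_1["city"], record_2["city"])

# Flag as a likely duplicate if both fields are close enough (arbitrary threshold)
if name_score > 0.8 and city_score > 0.8:
    print("Likely duplicate records - candidate for merging or removal")
```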
When is data cleansing applied?
During Gross Error Detection
What is Gross Error Detection?
A process to discard erroneous data using outlier detection methods
When might Gross Error Detection be used?
During integration of raw sensor measurements which may be subject to calibration uncertainties or instrument failures
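A minimal sketch of gross error detection on raw sensor readings using a simple standard-deviation outlier test; the readings and the 2-sigma threshold are illustrative assumptions:

```python
import statistics

# Illustrative raw sensor readings, one of which is a gross error
readings = [20.1, 20.3, 19.9, 20.2, 85.0, 20.0, 20.1]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Keep readings within 2 standard deviations of the mean (illustrative threshold); discard the rest
clean = [r for r in readings if abs(r - mean) <= 2 * stdev]
discarded = [r for r in readings if abs(r - mean) > 2 * stdev]

print("kept:", clean)
print("discarded as gross errors:", discarded)
```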
What is data enrichment?
Including supplementary data sources to add further insights and value to the original data
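A small sketch of enrichment: joining an illustrative supplementary lookup (postcode to region) onto the original records; the data and field names are assumptions:

```python
# Original records from the primary source
customers = [
    {"id": 1, "postcode": "SW1A 1AA"},
    {"id": 2, "postcode": "M1 1AE"},
]

# Supplementary reference data used for enrichment (illustrative)
postcode_to_region = {"SW1A 1AA": "London", "M1 1AE": "Manchester"}

# Add a derived "region" field without altering the original values
enriched = [
    {**c, "region": postcode_to_region.get(c["postcode"], "Unknown")}
    for c in customers
]

print(enriched)
```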
What is data privacy?
- Ensures the protection of personal rights and confidentiality of personal data during integration
- Sensitive data must be encrypted or obfuscated during integration
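A minimal sketch of obfuscating sensitive fields before integration, using a salted one-way hash for identifiers and masking for display values; the salt handling is an assumption, and real deployments would rely on managed keys and proper encryption:

```python
import hashlib

SALT = "replace-with-a-secret-salt"  # assumption: a secret held outside the dataset

def pseudonymise(value: str) -> str:
    """One-way hash so records can still be joined without exposing the raw value."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

def mask_email(email: str) -> str:
    """Keep only the first character and the domain, e.g. a***@example.com."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

record = {"national_id": "AB123456C", "email": "alice@example.com"}
safe_record = {
    "national_id": pseudonymise(record["national_id"]),
    "email": mask_email(record["email"]),
}
print(safe_record)
```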
Why is data quality more of a social challenge than a technical challenge?
Because the way data is collected, organised, modified, accessed and interpreted, and the way conclusions are drawn from it, is fundamentally a communication effort between people.
What is offered by typical ETL solutions?
Rich toolsets for integrating data in line with defined policies and standards
How do general purpose cloud-based ETL tools work?
Pipeline data processing in a standardised way
What are some of the general purpose cloud-based ETL tools available?
- Azure Data Factory
- AWS Glue
- Google Cloud Data Fusion
- Google Cloud Dataflow
What are other available data mapping and processing solutions available and how do they work?
- Allow data integration from various sources to blend, map, cleanse and diversify data
- Pentaho Data Integration
- Talend Data Integration
- IBM InfoSphere
- Informatica PowerCenter
- Microsoft SQL Server Integration Services (SSIS)
- IBM InfoSphere DataStage
- Oracle Data Integrator
- Feature Manipulation Engine
- Altova MapForce Platform
- Apache Nifi
What data mapping and processing solutions can be used as standalone data reconciliation tools?
- OpenRefine
- TIBCO Clarity
- WinPure
What data mapping and processing solutions can carry out data enrichment?
- Clearbit
- Pipl
- FullContact
What is Apache Nifi?
An open source solution which can be used for distributed data processing by defining processing pipelines
What connectors does Apache Nifi offer?
A number of connectors compatible with different sources
What interface does Apache Nifi offer?
Uses a web-based GUI to construct dataflow pipelines
What is the set up of Apache Nifi?
- Uses parallel Java Virtual Machines (JVMs)
- Has a flow-based distributed architecture
What are the elements that make up the Apache Nifi architecture?
- Flow controller nodes: supervise the execution of threads
- Processor nodes: perform the ETL processing
What are the responsibilities of the processors in Apache Nifi?
- Pulling data from external sources
- Publishing data to external systems
- Routing data to downstream processors or external destinations
- Transforming and extracting information from Flowfiles
What are Apache Nifi’s Flowfiles?
- The basic unit of information
- Data objects that move through Nifi, holding the data content together with attributes stored as key-value pairs
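Purely to illustrate the concept (this is not Nifi's actual API), a Flowfile can be pictured as a content payload plus a dictionary of attributes:

```python
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Conceptual stand-in for a Nifi Flowfile: content plus key-value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)

ff = FlowFile(
    content=b'{"sensor": 7, "reading": 20.1}',
    attributes={"filename": "reading-0001.json", "mime.type": "application/json"},
)

# A "processor" step might read the attributes to decide how to route the Flowfile
if ff.attributes.get("mime.type") == "application/json":
    print("route to the JSON transformation processor")
```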
What does Apache Nifi use to keep its cluster of machines organised and consistent?
Zookeeper
What repositories are used to manage workflow in Apache Nifi?
- Flowfile repository: tracks workflow by recording metadata about how each data object is processed
- Content repository: stores the transferred data
- Provenance repository: holds data transfer events (i.e. Flowfile history)
What are the benefits of using Apache Nifi?
- Highly scalable
- Can process large amounts of data in a reasonable time through parallel processing
- Can be run as a single instance or operated within a cluster, managed by an orchestrator such as Zookeeper
What is Azure Data Factory?
A service for the orchestration of data movement and transformation on the Azure platform
How many connections can Azure Data Factory use?
90+
What are some of the connections that can be used by Azure Data Factory?
- Power Automate
- Kafka
- Apache Spark
- Logstash
How does Azure Data Factory increase data quality?
Applies transformation logic to data
What are the two ways that ETL can be set up in Azure Data Factory?
- Using the GUI
- Specifying the data processing pipeline programmatically (such as using JSON files for configuration)
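To give a flavour of the programmatic route, the sketch below builds a simplified pipeline definition as a Python dictionary and serialises it to JSON; the structure loosely follows the shape of an Azure Data Factory pipeline definition (name, properties, activities) but is an illustrative simplification rather than the exact ADF schema:

```python
import json

# Simplified, illustrative pipeline definition (not the complete ADF schema)
pipeline = {
    "name": "CopySalesData",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                "inputs": [{"referenceName": "SalesBlobDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SalesSqlDataset", "type": "DatasetReference"}],
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```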
What do unified data stores do?
Hold the outputs from data integration processes
What are the different types of unified data stores?
- Data Warehouses
- Data Lakes
- Master Data Management
- Data Federation
- Data Virtualisation
How do Data Warehouses work as a unified data store?
- Centralised repository
- Store large amounts of data in a standardised format (usually in tables) to enable efficient data analysis
What do Data Warehouses rely on ETL processes for?
Periodically copying data physically to the centralised storage
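A toy sketch of that periodic copy: one ETL batch run that extracts raw rows, standardises them, and loads them into an illustrative warehouse table (an in-memory sqlite database stands in for a real warehouse; schema and field names are assumptions):

```python
import sqlite3

# Illustrative in-memory "warehouse" (a real warehouse would be a dedicated service)
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, sale_date TEXT)")

def etl_batch(raw_rows):
    """Extract -> transform (standardise types) -> load one periodic batch."""
    transformed = [(r["id"], float(r["amount"]), r["date"]) for r in raw_rows]
    warehouse.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", transformed)
    warehouse.commit()

# One periodic run (in practice this would be scheduled, e.g. nightly)
etl_batch([{"id": 1, "amount": "19.99", "date": "2024-12-31"}])
print(warehouse.execute("SELECT * FROM sales").fetchall())
```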
How do Data Lakes work as a unified data store?
- Centralised repository
- Stores data from different sources in a raw format (data is not transformed to a standardised format)
What is the key difference between data lakes and data warehouses when acting as unified storage?
Data Lakes store data in a raw, unordered manner while Data Warehouses store data in a standardised, ordered format
Why is Master Data Management important for business?
- Stores an organisation’s critical data
- Provides data reconciliation and standardisation tools
Where does data in Master Data Management systems come from?
Multiple sources
What are the two key characteristics of a Master Data Management system and why?
- “General Truth”: acts as the single, authoritative version of the organisation’s core data
- Less volatile: records are updated infrequently (employee ID numbers, for example)
What is Data Federation?
Combines data from multiple sources to create a virtual, logical view (i.e. data model) without copying the data to centralised storage
What is Data Virtualisation?
Offers a virtual layer providing a unified view of one or more data sources without copying the data to centralised storage
What is the difference between data federation and data virtualisation?
Data virtualisation does not need to integrate multiple sources; it can be implemented on a single data source, providing a personalised view through abstraction and transformation of data
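As a rough illustration of the federation idea, the sketch below answers a query by pulling from two in-memory "sources" on demand, combining them into a virtual view without copying either into a central store (all names and data are assumptions):

```python
# Two illustrative sources that stay where they are (no central copy is made)
crm_source = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
billing_source = {1: {"balance": 120.0}, 2: {"balance": 0.0}}

def federated_view(customer_id: int) -> dict:
    """Build a combined, virtual view of one customer at query time."""
    return {
        "id": customer_id,
        **crm_source.get(customer_id, {}),
        **billing_source.get(customer_id, {}),
    }

print(federated_view(1))  # {'id': 1, 'name': 'Alice', 'balance': 120.0}
```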