Data Quality and Data Governance Flashcards
Why are data integration principles needed?
- Data is integrated from multiple sources
- Different technologies and processes are needed
- Multiple challenges to be addressed
What are the data integration principles?
- Standardisation
- Reconciliation
- Validation
- Transformation
- Cleansing
- Enrichment
- Privacy
What is data standardisation?
Transforming data from different sources into a common format and structure (such as the ISO 8601 date format)
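Example (a minimal Python sketch; the list of source date formats is an illustrative assumption):

```python
from datetime import datetime

# Date formats we assume the different sources use (illustrative only)
KNOWN_FORMATS = ["%d/%m/%Y", "%m-%d-%Y", "%Y.%m.%d"]

def to_iso_date(raw: str) -> str:
    """Standardise a date string from any known source format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

print(to_iso_date("31/01/2024"))  # -> 2024-01-31
print(to_iso_date("01-31-2024"))  # -> 2024-01-31
```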
What is data reconciliation?
Focusing on data consistency by resolving issues of inconsistency between data from different sources
Why might data inconsistencies arise?
- Integrating distinct data sources where entries may differ
- Failures during integration
- Data that is copied or transformed incorrectly
- Missing / duplicate records or values
- Incorrectly formatted values
- Broken relationships between tables
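Example (a minimal pandas sketch of reconciliation checks; the tables and column names are hypothetical):

```python
import pandas as pd

# Two hypothetical sources describing the same customers
crm = pd.DataFrame({"customer_id": [1, 2, 3], "email": ["a@x.com", "b@x.com", "c@x.com"]})
billing = pd.DataFrame({"customer_id": [2, 3, 3, 4],
                        "email": ["b@x.com", "c@x.com", "c@x.com", "d@x.com"]})

# Records present in one source but missing from the other
merged = crm.merge(billing, on="customer_id", how="outer",
                   indicator=True, suffixes=("_crm", "_billing"))
missing = merged[merged["_merge"] != "both"]

# Duplicate records within a single source
duplicates = billing[billing.duplicated(subset="customer_id", keep=False)]

# Values that disagree between sources for the same key
both = merged[merged["_merge"] == "both"]
conflicts = both[both["email_crm"] != both["email_billing"]]
```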
What is data validation?
Assessing the accuracy and completeness of data from different sources and checking that it meets defined constraints
What are examples of data validation?
Checking that email addresses, phone numbers or postcodes are complete and follow a set pattern
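Example (a minimal Python sketch; the regular expressions are simplified illustrations, real rules would come from the organisation's data policies):

```python
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
UK_POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$", re.IGNORECASE)

def validate_record(record: dict) -> list:
    """Return a list of validation errors for a single record."""
    errors = []
    if not record.get("email") or not EMAIL_RE.match(record["email"]):
        errors.append("invalid or missing email")
    if not record.get("postcode") or not UK_POSTCODE_RE.match(record["postcode"]):
        errors.append("invalid or missing postcode")
    return errors

print(validate_record({"email": "user@example.com", "postcode": "SW1A 1AA"}))  # []
print(validate_record({"email": "not-an-email", "postcode": ""}))              # two errors
```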
What is data transformation?
Converting data to a specified schema
What is data cleansing?
Removing inconsistent data such as duplicate, irrelevant or incorrect data
What might data cleansing be used to check?
Two records for the same entity which have slightly different entries
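Example (a minimal Python sketch using a simple string-similarity check; the records and the 0.85 threshold are illustrative assumptions):

```python
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Acme Ltd",  "city": "London"},
    {"id": 2, "name": "ACME Ltd.", "city": "London"},
    {"id": 3, "name": "Bolt plc",  "city": "Leeds"},
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag pairs of records that probably describe the same entity
for i, r1 in enumerate(records):
    for r2 in records[i + 1:]:
        if r1["city"] == r2["city"] and similarity(r1["name"], r2["name"]) > 0.85:
            print(f"Possible duplicate: records {r1['id']} and {r2['id']}")
```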
When is data cleansing applied?
During Gross Error Detection
What is Gross Error Detection?
A process to discard erroneous data using outlier detection methods
When might Gross Error Detection be used?
During integration of raw sensor measurements which may be subject to calibration uncertainties or instrument failures
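Example (a minimal Python sketch of z-score-based gross error detection; the readings and the 2-sigma threshold are illustrative assumptions, and more robust outlier methods exist):

```python
import statistics

# Hypothetical raw sensor readings; one value is a gross error (e.g. instrument failure)
readings = [20.1, 20.3, 19.8, 20.0, 20.2, 85.0, 20.1]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

kept = [x for x in readings if abs(x - mean) <= 2 * stdev]
discarded = [x for x in readings if abs(x - mean) > 2 * stdev]

print("kept:", kept)            # the 85.0 reading is discarded
print("discarded:", discarded)
```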
What is data enrichment?
Including supplementary data sources to add further insights and value to the original data
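Example (a minimal pandas sketch; the supplementary demographics table and its columns are hypothetical):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [101, 102], "postcode": ["SW1A 1AA", "LS1 4AP"]})
demographics = pd.DataFrame({"postcode": ["SW1A 1AA", "LS1 4AP"],
                             "region": ["London", "Yorkshire"],
                             "median_income": [42000, 31000]})

# Enrich the original data with attributes from the supplementary source
enriched = orders.merge(demographics, on="postcode", how="left")
print(enriched)
```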
What is data privacy?
- Ensures the protection of personal rights and confidentiality of personal data during integration
- Sensitive data must be encrypted or obfuscated during integration
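Example (a minimal Python sketch of pseudonymisation via salted hashing, one simple obfuscation technique; the record and salt value are illustrative):

```python
import hashlib

record = {"customer_id": 42, "email": "jane@example.com", "spend": 129.50}

def pseudonymise(value: str, salt: str) -> str:
    """Replace a direct identifier with a salted hash so it cannot be read during integration."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

safe_record = {
    "customer_id": record["customer_id"],
    "email_hash": pseudonymise(record["email"], salt="integration-salt"),
    "spend": record["spend"],
}
print(safe_record)
```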
Why is data quality more of a social challenge than a technical challenge?
The way data is collected, organised, modified, accessed and interpreted, and the way conclusions are drawn from it, is a communication effort.
What is offered by typical ETL solutions?
Rich toolsets for integrating data in line with defined policies and standards
How do general purpose cloud-based ETL tools work?
Pipeline data processing in a standardised way
What are some of the general purpose cloud-based ETL tools available?
- Azure Data Factory
- AWS Glue
- Google Cloud Data Fusion
- Google Cloud Dataflow
What are other available data mapping and processing solutions available and how do they work?
- Allow data integration from various sources to blend, map, cleanse and diversify data
- Pentaho Data Integration
- Talend Data Integration
- IBM InfoSphere
- Informatica PowerCenter
- Microsoft SQL Server Integration Services (SSIS)
- IBM InfoSphere DataStage
- Oracle Data Integrator
- Feature Manipulation Engine
- Altova MapForce Platform
- Apache Nifi
What data mapping and processing solutions can be used as standalone data reconciliation tools?
- OpenRefine
- TIBCO Clarity
- Winpure
What data mapping and processing solutions can carry out data enrichment?
- Clearbit
- Pipl
- FullContact
What is Apache Nifi?
An open-source solution that can be used for distributed data processing by defining processing pipelines
What connectors does Apache Nifi offer?
A number of connectors compatible with different sources
What interface does Apache Nifi offer?
Uses a web interface to construct dataflow pipelines in a GUI
What is the set up of Apache Nifi?
- Uses parallel Java Virtual Machines (JVM)
- Has a flow-based distributed architecture
What are the elements that make up the Apache Nifi architecture?
- Flow controller nodes: supervise the execution of threads
- Processor nodes: perform the ETL processing
What are the responsibilities of the processors in Apache Nifi?
- Pulling data from external sources
- Publication
- Routing data to external sources
- Transforming and recovering information from Flowfiles
What are Apache Nifi’s Flowfiles?
- The basic unit of information
- Data objects that move through Nifi, holding the data content together with attributes stored as key-value pairs
What does Apache Nifi use to keep its cluster of machines organised and consistent?
Zookeeper
What repositories are used to manage workflow in Apache Nifi?
- Flowfile repository: tracks workflow by recording metadata about how each data object is processed
- Content repository: stores the transferred data
- Provenance repository: holds data transfer events (i.e. Flowfile history)
What are the benefits of using Apache Nifi?
- Highly scalable
- Can process large amounts of data in a reasonable time through parallel processing
- Can be run as a single instance or operated within a cluster, managed by an orchestrator such as Zookeeper
What is Azure Data Factory?
A service for the orchestration of data movement and transformation on the Azure platform
How many connections can Azure Data Factory use?
90+
What are some of the connections that can be used by Azure Data Factory?
- Power Automate
- Kafka
- Apache Spark
- Logstash
How does Azure Data Factory increase data quality?
Applies transformation logic to data
What are the two ways that ETL can be set up in Azure Data Factory?
- Using the GUI
- Specifying the data processing pipeline programmatically (such as using JSON files for configuration)
What do unified data stores do?
Hold the outputs from data integration processes
What are the different types of unified data stores?
- Data Warehouses
- Data Lakes
- Master Data Management
- Data Federation
- Data Virtualisation
How do Data Warehouses work as a unified data store?
- Centralised repository
- Store large amounts of data in a standardised format (usually in tables) to enable efficient data analysis
What do Data Warehouses rely on ETL processes for?
Periodically copying data physically to the centralised storage
How do Data Lakes work as a unified data store?
- Centralised repository
- Stores data from different sources in a raw format (data is not transformed to a standardised format)
What is the key difference between data lakes and data warehouses when acting as unified storage?
Data Lakes store data in a raw, unordered manner while Data Warehouses store data in standardised, ordered format
Why is Master Data Management important for business?
- Stores an organisation's critical data
- Provides data reconciliation and standardisation tools
Where does data in Master Data Management systems come from?
Multiple sources
What are the two key characteristics of a Master Data Management system and why?
- “General Truth”
- Less volatile, as master data are updated infrequently (employee ID numbers, for example)
What is Data Federation?
Combines data from multiple sources to create a virtual, logical view (i.e. data model) without copying the data to centralised storage
What is Data Virtualisation?
Offers a virtual layer providing a unified view of one or more data sources without copying the data to centralised storage
What is the difference between data federation and data virtualisation?
Data virtualisation does not need to integrate multiple sources; it can be implemented on a single data source, providing a personalised view through abstraction and transformation of data
Why is data virtualisation used?
To hide the complexity of raw data stores to make accessing and working with data easier
What are the different techniques used in data virtualisation?
- Abstraction
- Virtual Data Access
- Data transformation
What is abstraction in data virtualisation?
Separating the data from its physical source to present the data in a consistent and unified format
What does an abstraction layer do?
Enables data to be accessed and managed from multiple sources, regardless of the underlying data’s diversity
What is virtual data access in data virtualisation?
The capability to access data from different sources without physically copying the data
How might virtual data access be implemented?
By using a front-end application that has pointers linking to the original data sources
How is data transformation used in data virtualisation?
- Used on the fly
- Creates standardised and clean data views which match a standard data model that is easy to access and analyse
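Example (a toy Python sketch covering both virtual data access and on-the-fly transformation: the layer only holds pointers to the sources and standardises records when queried; all source schemas are hypothetical):

```python
class VirtualView:
    """Toy virtualisation layer: holds pointers (callables) to sources and
    transforms records on the fly when queried, without copying the data."""

    def __init__(self):
        self._sources = []  # list of (fetch, transform) pairs

    def register(self, fetch, transform):
        self._sources.append((fetch, transform))

    def query(self):
        for fetch, transform in self._sources:
            for raw in fetch():        # data is pulled from the source only now
                yield transform(raw)   # ...and mapped to the standard data model

view = VirtualView()
view.register(lambda: [{"CustName": "Acme", "Rev": "1000"}],
              lambda r: {"name": r["CustName"], "revenue": float(r["Rev"])})
view.register(lambda: [{"company": "Bolt", "revenue_gbp": 2500}],
              lambda r: {"name": r["company"], "revenue": float(r["revenue_gbp"])})

print(list(view.query()))  # unified view over both sources
```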
What are the strengths of data virtualisation?
- Easy access to data
- Fast prototyping by adapting data views
- Easily expandable to accommodate additional data sources
- Storage space is optimised
- Data duplicates are avoided
What are the weaknesses of data virtualisation?
- Performance loss due to needing to access data sources
- No data versioning
How does data caching work in data virtualisation?
Loads data from its source into a cache such as an in-memory database
How do data services work in data virtualisation?
Use API endpoints to expose data
What blurs the line between data virtualisation and Data as a Service?
Using solutions such as RESTful APIs
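Example (a minimal Flask sketch of exposing data through a RESTful endpoint; the route names and in-memory data are illustrative, a real DaaS offering would sit on top of the virtualisation layer):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical unified view held in memory for the sketch
CUSTOMERS = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Bolt"}]

@app.route("/api/customers")
def list_customers():
    return jsonify(CUSTOMERS)

@app.route("/api/customers/<int:customer_id>")
def get_customer(customer_id):
    match = [c for c in CUSTOMERS if c["id"] == customer_id]
    return jsonify(match[0]) if match else ("Not found", 404)

if __name__ == "__main__":
    app.run()
```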
What software solutions are available for data virtualisation?
- TIBCO
- Informatica PowerCenter
- Denodo
- IBM Cloud Pak
What is Data as a Service?
- Model of access to data over a network as a service for data users
- It is a strategy as well as a technical solution which provides data as a value added service
Where is DaaS used?
- Internally as a data democratisation solution
- Externally as a commercial asset
What are the characteristics of DaaS?
- Scalable solutions
- Provide easy access to high quality data from large integrated data sources
- Customers access data through APIs or a web interface
What data management tasks is a DaaS provider responsible for?
All data management tasks including:
- Infrastructure maintenance
- Storage
- Security
- Data backups
Why use DaaS?
- Used for storing large and complex datasets in the cloud. These would be too expensive or complicated to manage in-house
- Data is integrated into a unique, high quality view
What sources is DaaS data integrated from?
- Public or private databases
- Corporate data
- Social media
What are the strengths of using DaaS for a business?
- Simplicity as minimal effort needed for setup
- Scalability through dynamic resource allocation
- Maintenance is the responsibility of the provider
- Interoperability
- Cost effective
What are the weaknesses of using DaaS for a business?
- Reliant on metadata provided by the service provider. This runs the risk of incomplete or ambiguous metadata which may result in wrong conclusions
- Limited analysis capabilities. Some analytic applications may perform worse than if they had access to the original datasets
- High bandwidth usage
- Security measures needed to protect data and ensure data integrity
How does DaaS architecture use data virtualisation?
Integrates various sources such as data warehouses, operational databases, data lakes or other external data into a unified view
How do DaaS providers offer access to the data?
Through standardised APIs
Who is responsible for API management in DaaS and what does this cover?
- DaaS provider
- Security
- Documentation
- Data quality management
- Infrastructure maintenance
- Orchestration
What scenarios might DaaS be applied to?
- Data Driven Strategies
- Commercialisation
How can DaaS be used for Data Driven Strategies?
- Used to manage large amounts of data
- Democratise access to the organisation’s data
- Monetise data as a business asset
What is the advantage of aggregating data to support data driven strategies?
Helps to overcome inconsistent or incorrect data that may arise from data silos or isolated datasets
How can DaaS be used for commercialisation?
- Aggregation of data from multiple sources and curating data to create high quality data that can be used to reach potential customers
What are Data Leeches?
Organisations that carry out systematic, large-scale data collection (for example through web scraping) with the intent of selling the data on to other organisations
Why is Data Governance important?
- An organisation’s data is an important asset to the business
- Data insights can help improve processes, reveal inefficiencies or gain market advantage over competitors
What characteristics of the data does data governance focus on to maximise its value to the organisation?
- Availability
- Integrity
- Usability
- Security
What is the specific aim of data governance?
To ensure that data are accurate, consistent, secure and compliant with regulations in all phases of the data lifecycle
What is data governance based on?
Defined standards, policies and procedures for data management
Which stakeholders are involved in data governance?
- Data analysts
- IT personnel
- Data users
- Administrative personnel
What is the difference between data management and data quality management?
- Data management provides frameworks and processes for handling the whole data lifecycle
- Data quality management is the acknowledgement that data quality is a communicative rather than a technical challenge
What activities come under data management?
- Technical requirement analysis
- Implementation of technical solutions
What does data quality management do?
- Identifies inconsistencies and designs processes to ensure data is accurate, consistent, relevant and complete
- Involves root cause analysis for flawed data
- Also includes installing standards and guidelines to prevent flawed data
What are the consequences of issues with data quality?
May lead to decisions based on flawed data analysis such as bad business decisions or adverse medical advice
What are the three stages that can potentially cause issues with data quality?
- Data Ingestion
- Changes in the system
- Changes in the data
What errors in data ingestion may cause issues with data quality?
- Initial data conversion including insufficient metadata
- System consolidations combining contradictory data
- Manual data entry is an obvious potential source of errors
- Batch feeds and real-time interfaces may cause data to become outdated quickly
What errors in changes to the system may cause issues with data quality?
- Changes to the source system may not be propagated to the target system
- Subsystems may not capture changes made during system updates
- Unintended data use risks misinterpretation or insufficient data protection
- Loss of expertise or process automation risks quality loss due to insufficient documentation
What errors in changes to the data may cause issues with data quality?
- Relates to data processing and cleansing
- Changes to data over time can lead to data inconsistency
What is the purpose of data quality dimensions?
To measure the effectiveness of data quality
What are the three perspectives that can be used in data quality dimensions?
- The Computer Science Perspective
- The Data Consumer Perspective
- The Pragmatic Perspective
What is the computer science perspective with respect to data quality dimensions?
- Uses the ISO 8000-8 standard
- Focuses on three data quality types: syntactic, semantic and pragmatic
What is the Data Consumer Perspective with respect to data quality dimensions?
- Four data categories which give guidance to improve data quality
- Intrinsic; believability, accuracy, objectivity, reputation
- Representational; interpretability, ease of understanding, consistency, conciseness
- Accessibility; access security and accessibility
- Contextual; value added, relevancy, timeliness, completeness, appropriate amount
What is the pragmatic perspective with respect to data quality dimensions?
- Accuracy: data are free from errors or typos
- Validity: data follows the rules, constraints and formats defined by data policies and standards
- Completeness: the extent to which a dataset covers a topic
- Timeliness: data is available when required
- Consistency: data follows uniform standards through all analysis and system components
- Uniqueness: no redundant data entries in the system
What are data governance metrics?
- Used to evaluate how well a data object in the data store describes a real world object
- Can provide insights into the effectiveness of data quality guidelines
What can data governance metrics be used for?
- To point to data quality decay over time
- To assess new data sources before they are ingested into the production system
- They should be evaluated regularly
What are examples of technical data governance metrics?
- The number of false entries in a dataset
- The number of primary or foreign key errors
- The total data volume of the system
- The time spent to handle data
- The number of unauthorised accesses
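Example (a minimal pandas sketch computing some of these metrics; the tables, columns and validity rule are illustrative assumptions):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 2, 4],
                          "email": ["a@x.com", None, "b@x.com", "bad"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 2, 99]})

metrics = {
    # False entries: emails missing or failing a simple validity check
    "invalid_emails": int((~customers["email"].str.contains("@", na=False)).sum()),
    # Primary key errors: duplicated customer IDs
    "pk_duplicates": int(customers["customer_id"].duplicated().sum()),
    # Foreign key errors: orders referencing a non-existent customer
    "fk_orphans": int((~orders["customer_id"].isin(customers["customer_id"])).sum()),
    # Total data volume (row counts as a simple proxy)
    "total_rows": len(customers) + len(orders),
}
print(metrics)
```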
What are drawbacks for data governance metrics?
- Difficult to measure
- Often not carried out on a regular basis
- No problem identification: they only tell us that there is a problem, not what the problem is
What is the role of the Data Governance board?
- Defines policies and procedures for data management
- Establishes the data governance framework
Who makes up the Data Governance board?
Stakeholders from different departments within the organisation
What is the role of the Chief Data Officer in Data Governance?
- Responsible for the overall management and strategy of the data assets
- If there is a data governance board, then the CDO collaborates with the board to define policies and standards
What is the role of the Data Steward in Data Governance?
Controls and ensures that data policies and standards are implemented throughout the entire organisation
What is the role of Data Owners in Data Governance?
Responsible for managing specific datasets or data processes
What is the role of the Data Analyst in Data Governance?
An IT professional who analyses the data to get further insights to inform decision making
What is the role of the Data User in Data Governance?
- The end user who consumes the data
- Can be a person or an application
What is the role of the Data Manager in Data Governance?
Cares about technical aspects such as cloud architecture and storage solutions
What is the role of the Information Security Officer in Data Governance?
Implements security measures to protect the organisation’s data
What are Data Quality Management Capability levels?
- Used alongside ISO 8000-61
- Make up a systematic framework that sets out a list of measures that can improve data quality
- There are 5 distinct levels
What happens at level one of the Data Quality Management Capability framework?
- Data are only processed for specific use cases
- Requires knowledge of the end user’s requirements
- No policies or processes describing data processing
What happens at level two of the Data Quality Management Capability framework?
- Clear specifications and work instructions for data structure and processing
- Data quality and control are continuously monitored
What happens at level three of the Data Quality Management Capability framework?
- Focuses on data quality planning
- Includes data-related support and resource provisioning
- Strategic long-term data quality plans and their respective policies and standards are developed
What happens at level four of the Data Quality Management Capability framework?
- An overall assessment of the entire system takes place
- Characteristics describing the overall effectiveness of the system are identified
- Data quality assurance is introduced as an additional evaluation step
What happens at level five of the Data Quality Management Capability framework?
- Data quality is integrated into all operations
- Constant awareness raising for data quality
- Periodic root cause analysis for data quality issues
- Managed system enabling continuous improvement
What are the three different approaches used in Data Governance?
- Top-down
- Bottom-up
- Hybrid
What is the top-down approach in Data Governance?
- Data governance strategy is defined by top management
- Middle managers are assigned responsibilities for implementing the transformation plan
What are the advantages of using the top-down approach in Data Governance?
- Because it is led by top management, provides a unique and long-term view of the data governance strategy
- Imposes a solid commitment to follow the strategies
What are the drawbacks of using the top-down approach in Data Governance?
- Strategies could be inadequate if senior management don’t understand the details or needs
- Can result in a difficult to manage macroproject
What is the bottom-up approach in Data Governance?
- Data governance implementation is initiated by the operational units
- Projects might be smaller (such as single departments)
- Easier to manage
- Starts by implementing a data management tool for a single asset
What are the drawbacks of using the bottom-up approach in Data Governance?
- Data governance strategy has no unique vision
- Different departments can follow different directions
- Cooperation between departments can be problematic
What is the hybrid approach in Data Governance?
- Combines the top-down and bottom-up approach
- Top managers define the company’s vision for data governance
- Operational units carry out the implementation as an agile and iterative process, whilst considering the functional requirements and constraints for the data processes
In the hybrid approach to Data Governance, what are the functional requirements that the operational units must take into consideration?
- Available resources
- Already existing processes
- Need for business continuity