Data Quality and Data Governance Flashcards

1
Q

Why are data integration principles needed?

A
  • Data is integrated from multiple sources
  • Different technologies and processes are needed
  • Multiple challenges to be addressed
2
Q

What are the data integration principles?

A
  • Standardisation
  • Reconciliation
  • Validation
  • Transformation
  • Cleansing
  • Enrichment
  • Privacy
3
Q

What is data standardisation?

A

Transforming data from different sources into a common format and structure (such as the ISO 8601 date format)
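As a concrete sketch (assuming a hypothetical set of source formats), standardising dates to ISO 8601 in Python might look like:

```python
from datetime import datetime

# Hypothetical formats seen in the source systems; a real pipeline would
# enumerate the formats that actually occur in each source.
KNOWN_FORMATS = ["%d/%m/%Y", "%m-%d-%Y", "%Y.%m.%d"]

def to_iso_date(value: str) -> str:
    """Standardise a date string to the ISO 8601 format (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognised date format: {value!r}")

print(to_iso_date("31/12/2024"))  # 2024-12-31
print(to_iso_date("12-31-2024"))  # 2024-12-31
```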

4
Q

What is data reconciliation?

A

Focusing on data consistency by resolving issues of inconsistency between data from different sources
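A minimal sketch of one common reconciliation rule (an assumption, not the only option): when two sources disagree about the same entity, trust the most recently updated record.

```python
from datetime import date

# Hypothetical records for the same customer held by two source systems.
source_a = {"id": 7, "phone": "0123 456", "updated": date(2024, 1, 5)}
source_b = {"id": 7, "phone": "0123 999", "updated": date(2024, 6, 2)}

def reconcile(*records):
    """Resolve a conflict by keeping the most recently updated record."""
    return max(records, key=lambda r: r["updated"])

print(reconcile(source_a, source_b)["phone"])  # 0123 999
```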

5
Q

Why might data inconsistencies arise?

A
  • Integrating distinct data sources where entries may differ
  • Failures during integration
  • Data that is copied or transformed incorrectly
  • Missing / duplicate records or values
  • Incorrectly formatted values
  • Broken relationships between tables
6
Q

What is data validation?

A

Assessing the accuracy and completeness of data from different sources against defined constraints

7
Q

What are examples of data validation?

A

Checking that email addresses, phone numbers or postcodes are complete and follow a set pattern
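A minimal sketch of pattern-based validation in Python; the regular expressions below are deliberately simplified assumptions (real email and postcode rules are far stricter):

```python
import re

# Illustrative patterns only; production rules (e.g. the full RFC 5322
# email grammar or an official postcode list) would be stricter.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
UK_POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.I)

def is_valid_email(value: str) -> bool:
    """Check that a value is complete and follows the email pattern."""
    return bool(EMAIL_RE.match(value))

def is_valid_postcode(value: str) -> bool:
    """Check that a value follows the UK postcode pattern."""
    return bool(UK_POSTCODE_RE.match(value))

print(is_valid_email("alice@example.com"))  # True
print(is_valid_postcode("SW1A 1AA"))        # True
print(is_valid_email("not-an-email"))       # False
```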

8
Q

What is data transformation?

A

Converting data to a specified schema

9
Q

What is data cleansing?

A

Removing inconsistent data such as duplicate, irrelevant or incorrect data
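A minimal cleansing sketch over hypothetical records, dropping exact duplicates and records with missing values:

```python
records = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 1, "email": "alice@example.com"},  # duplicate entry
    {"id": 2, "email": None},                 # incorrect / missing value
    {"id": 3, "email": "bob@example.com"},
]

seen, cleaned = set(), []
for rec in records:
    if rec["email"] is None:           # drop records with missing values
        continue
    key = (rec["id"], rec["email"])
    if key in seen:                    # drop exact duplicates
        continue
    seen.add(key)
    cleaned.append(rec)

print(cleaned)  # records 1 and 3 survive
```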

10
Q

What might data cleansing be used to check?

A

Two records for the same entity which have slightly different entries

11
Q

When is data cleansing applied?

A

During Gross Error Detection

12
Q

What is Gross Error Detection?

A

A process to discard erroneous data using outlier detection methods
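One simple outlier test that could implement this (an assumption — no specific method is prescribed here) uses the median absolute deviation, which stays robust even when a gross error would skew the mean:

```python
from statistics import median

def gross_error_filter(readings, k=3.0):
    """Discard readings far from the median, using the median absolute
    deviation (MAD) as a robust estimate of spread."""
    med = median(readings)
    mad = median(abs(x - med) for x in readings)
    if mad == 0:
        return list(readings)  # no spread: nothing to flag
    return [x for x in readings if abs(x - med) / mad <= k]

# A sensor trace with one obvious gross error (the 999.0 spike),
# e.g. from an instrument failure.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 999.0, 20.2]
print(gross_error_filter(readings))  # the 999.0 spike is removed
```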

13
Q

When might Gross Error Detection be used?

A

During integration of raw sensor measurements which may be subject to calibration uncertainties or instrument failures

14
Q

What is data enrichment?

A

Including supplementary data sources to add further insights and value to the original data
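A minimal enrichment sketch, assuming a hypothetical postcode-to-region lookup as the supplementary source:

```python
# Base customer records from the primary source.
customers = [
    {"id": 1, "postcode": "SW1A 1AA"},
    {"id": 2, "postcode": "M1 1AE"},
]

# Hypothetical supplementary source keyed by postcode.
region_lookup = {"SW1A 1AA": "London", "M1 1AE": "Manchester"}

# Enrichment: merge the supplementary attribute into each original record.
enriched = [
    {**c, "region": region_lookup.get(c["postcode"], "unknown")}
    for c in customers
]

print(enriched)
```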

15
Q

What is data privacy?

A
  • Ensures the protection of personal rights and confidentiality of personal data during integration
  • Sensitive data must be encrypted or obfuscated during integration
16
Q

Why is data quality more of a social challenge than a technical challenge?

A

The way data is collected, organised, modified, accessed and interacted with, and the way conclusions are drawn from it, is a communication effort.

17
Q

What is offered by typical ETL solutions?

A

Rich toolsets for integrating data in line with defined policies and standards

18
Q

How do general purpose cloud-based ETL tools work?

A

Pipeline data processing in a standardised way

19
Q

What are some of the general purpose cloud-based ETL tools available?

A
  • Azure Data Factory
  • AWS Glue
  • Google Cloud Data Fusion
  • Google Cloud Dataflow
20
Q

What are other available data mapping and processing solutions available and how do they work?

A
  • Allow data integration from various sources to blend, map, cleanse and diversify data
  • Pentaho Data Integration
  • Talend Data Integration
  • IBM InfoSphere
  • Informatica PowerCenter
  • Microsoft SQL Server Integration Services (SSIS)
  • IBM InfoSphere DataStage
  • Oracle Data Integrator
  • Feature Manipulation Engine
  • Altova MapForce Platform
  • Apache NiFi
21
Q

What data mapping and processing solutions can be used as standalone data reconciliation tools?

A
  • OpenRefine
  • TIBCO Clarity
  • Winpure
22
Q

What data mapping and processing solutions can carry out data enrichment?

A
  • Clearbit
  • Pipl
  • FullContact
23
Q

What is Apache NiFi?

A

An open-source solution that can be used for distributed data processing by defining processing pipelines

24
Q

What connectors does Apache NiFi offer?

A

A number of connectors compatible with different sources

25
Q

What interface does Apache NiFi offer?

A

Uses a web interface to construct dataflow pipelines in a GUI

26
Q

What is the setup of Apache NiFi?

A
  • Uses parallel Java Virtual Machines (JVMs)
  • Has a flow-based distributed architecture
27
Q

What are the elements that make up the Apache NiFi architecture?

A
  • Flow controller nodes: supervise the execution of threads
  • Processor nodes: perform the ETL processing
28
Q

What are the responsibilities of the processors in Apache NiFi?

A
  • Pulling data from external sources
  • Publication
  • Routing data to external sources
  • Transforming and recovering information from Flowfiles
29
Q

What are Apache NiFi’s Flowfiles?

A
  • The basic unit of information
  • Data objects that move through NiFi and hold data content as key-value pairs
30
Q

What does Apache NiFi use to keep its cluster of machines organised and consistent?

A

ZooKeeper

31
Q

What repositories are used to manage workflow in Apache NiFi?

A
  • Flowfile repository: tracks workflow by recording metadata about how each data object is processed
  • Content repository: stores the transferred data
  • Provenance repository: holds data transfer events (i.e. Flowfile history)
32
Q

What are the benefits of using Apache NiFi?

A
  • Highly scalable
  • Can process large amounts of data in a reasonable time through parallel processing
  • Can be run as a single instance or operated within a cluster, managed by an orchestrator such as ZooKeeper
33
Q

What is Azure Data Factory?

A

A service for the orchestration of data movement and transformation on the Azure platform

34
Q

How many connections can Azure Data Factory use?

A

90+

35
Q

What are some of the connections that can be used by Azure Data Factory?

A
  • Power Automate
  • Kafka
  • Apache Spark
  • Logstash
36
Q

How does Azure Data Factory increase data quality?

A

Applies transformation logic to data

37
Q

What are the two ways that ETL can be set up in Azure Data Factory?

A
  • Using the GUI
  • Specifying the data processing pipeline programmatically (such as using JSON files for configuration)
38
Q

What do unified data stores do?

A

Hold the outputs from data integration processes

39
Q

What are the different types of unified data stores?

A
  • Data Warehouses
  • Data Lakes
  • Master Data Management
  • Data Federation
  • Data Virtualisation
40
Q

How do Data Warehouses work as a unified data store?

A
  • Centralised repository
  • Store large amounts of data in a standardised format (usually in tables) to enable efficient data analysis
41
Q

What do Data Warehouses rely on ETL processes for?

A

Periodically copying data physically to the centralised storage

42
Q

How do Data Lakes work as a unified data store?

A
  • Centralised repository
  • Stores data from different sources in a raw format (data is not transformed to a standardised format)
43
Q

What is the key difference between data lakes and data warehouses when acting as unified storage?

A

Data Lakes store data in a raw, unordered manner, while Data Warehouses store data in a standardised, ordered format

44
Q

Why is Master Data Management important for business?

A

  • Stores an organisation’s critical data
  • Provides data reconciliation and standardisation tools

45
Q

Where does data in Master Data Management systems come from?

A

Multiple sources

46
Q

What are the two key characteristics of a Master Data Management system and why?

A
  • “General Truth”
  • Less volatile, because the data are updated infrequently (employee ID numbers, for example)
47
Q

What is Data Federation?

A

Combines data from multiple sources to create a virtual, logical view (i.e. data model) without copying the data to centralised storage

48
Q

What is Data Virtualisation?

A

Offers a virtual layer providing a unified view of one or more data sources without copying the data to centralised storage

49
Q

What is the difference between data federation and data virtualisation?

A

Data virtualisation does not need to integrate multiple sources; it can be implemented on a single data source, providing a personalised view through abstraction and transformation of data

50
Q

Why is data virtualisation used?

A

To hide the complexity of raw data stores to make accessing and working with data easier

51
Q

What are the different techniques used in data virtualisation?

A
  • Abstraction
  • Virtual Data Access
  • Data transformation
52
Q

What is abstraction in data virtualisation?

A

Separating the data from its physical source to present the data in a consistent and unified format

53
Q

What does an abstraction layer do?

A

Enables data to be accessed and managed from multiple sources, regardless of the underlying data’s diversity

54
Q

What is virtual data access in data virtualisation?

A

The capability to access data from different sources without physically copying the data

55
Q

How might virtual data access be implemented?

A

By using a front-end application that has pointers linking to the original data sources

56
Q

How is data transformation used in data virtualisation?

A
  • Used on the fly
  • Creates standardised and clean data views which match a standard data model that is easy to access and analyse
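The techniques above can be combined in a minimal virtual-layer sketch: the two "sources" below are hypothetical stand-ins (say, a CRM database and a CSV export), the data stays in place, and the view transforms records on the fly into one standard data model when queried.

```python
# Hypothetical source systems; in a real setup these would be live
# connections rather than in-memory lists.
crm_source = [{"CustomerName": "Alice", "Postcode": "sw1a 1aa"}]
csv_source = [{"name": "Bob", "post_code": "M1 1AE"}]

def unified_view():
    """Yield records from both sources in one standard data model,
    transforming on the fly without copying the data anywhere."""
    for row in crm_source:
        yield {"name": row["CustomerName"], "postcode": row["Postcode"].upper()}
    for row in csv_source:
        yield {"name": row["name"], "postcode": row["post_code"].upper()}

for record in unified_view():
    print(record)
```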
57
Q

What are the strengths of data virtualisation?

A
  • Easy access to data
  • Fast prototyping by adapting data views
  • Easily expandable to accommodate additional data sources
  • Storage space is optimised
  • Data duplicates are avoided
58
Q

What are the weaknesses of data virtualisation?

A
  • Performance loss due to needing to access data sources
  • No data versioning
59
Q

How does data caching work in data virtualisation?

A

Loads data from its source into a cache such as an in-memory database

60
Q

How do data services work in data virtualisation?

A

Use API endpoints to expose data

61
Q

What blurs the line between data virtualisation and Data as a Service?

A

Using solutions such as RESTful APIs

62
Q

What software solutions are available for data virtualisation?

A
  • TIBCO
  • Informatica PowerCenter
  • Denodo
  • IBM Cloud Pak
63
Q

What is Data as a Service?

A
  • Model of access to data over a network as a service for data users
  • It is a strategy as well as a technical solution which provides data as a value added service
64
Q

Where is DaaS used?

A
  • Internally as a data democratisation solution
  • Externally as a commercial asset
65
Q

What are the characteristics of DaaS?

A
  • Scalable solutions
  • Provide easy access to high quality data from large integrated data sources
  • Customers access data through APIs or a web interface
66
Q

What data management tasks is a DaaS provider responsible for?

A

All data management tasks including:
  • Infrastructure maintenance
  • Storage
  • Security
  • Data backups

67
Q

Why use DaaS?

A
  • Used for storing large and complex datasets in the cloud. These would be too expensive or complicated to manage in-house
  • Data is integrated into a unique, high quality view
68
Q

What sources is DaaS data integrated from?

A
  • Public or private databases
  • Corporate data
  • Social media
69
Q

What are the strengths of using DaaS for a business?

A
  • Simplicity as minimal effort needed for setup
  • Scalability through dynamic resource allocation
  • Maintenance is the responsibility of the provider
  • Interoperability
  • Cost effective
70
Q

What are the weaknesses of using DaaS for a business?

A
  • Reliant on metadata provided by the service provider. This runs the risk of incomplete or ambiguous metadata, which may lead to wrong conclusions
  • Limited analysis capabilities. Some analytic applications may perform worse than if they had access to the original datasets
  • High bandwidth usage
  • Security measures needed to protect data and ensure data integrity
71
Q

How does DaaS architecture use data virtualisation?

A

Integrates various sources such as data warehouses, operational databases, data lakes or other external data into a unified view

72
Q

How do DaaS providers offer access to the data?

A

Through standardised APIs

73
Q

Who is responsible for API management in DaaS and what does this cover?

A
  • DaaS provider
  • Security
  • Documentation
  • Data quality management
  • Infrastructure maintenance
  • Orchestration
74
Q

What scenarios might DaaS be applied to?

A
  • Data Driven Strategies
  • Commercialisation
75
Q

How can DaaS be used for Data Driven Strategies?

A
  • Used to manage large amounts of data
  • Democratise access to the organisation’s data
  • Monetise data as a business asset
76
Q

What is the advantage of aggregating data to support data driven strategies?

A

Helps to overcome inconsistent or incorrect data that may arise from data silos or isolated datasets

77
Q

How can DaaS be used for commercialisation?

A
  • Aggregating data from multiple sources and curating it to create high-quality data that can be used to reach potential customers
78
Q

What are Data Leeches?

A

Organisations that carry out systematic data collection through web scraping and massive data collection with the intent to sell this on to other organisations

79
Q

Why is Data Governance important?

A
  • An organisation’s data is an important asset to the business
  • Data insights can help improve processes, reveal inefficiencies or gain market advantage over competitors
80
Q

What characteristics of the data does data governance focus on to maximise its value to the organisation?

A
  • Availability
  • Integrity
  • Usability
  • Security
81
Q

What is the specific aim of data governance?

A

To ensure that data are accurate, consistent, secure and compliant with regulations in all phases of the data lifecycle

82
Q

What is data governance based on?

A

Defined standards, policies and procedures for data management

83
Q

Which stakeholders are involved in data governance?

A
  • Data analysts
  • IT personnel
  • Data users
  • Administrative personnel
84
Q

What is the difference between data management and data quality management?

A
  • Data management is frameworks for processes to handle the whole data lifecycle
  • Data quality management is the acknowledgement that data quality is a communicative rather than a technical challenge
85
Q

What activities come under data management?

A
  • Technical requirement analysis
  • Implementation of technical solutions
86
Q

What does data quality management do?

A
  • Identifies inconsistencies and designs processes to ensure data is accurate, consistent, relevant and complete
  • Involves root cause analysis for flawed data
  • Also includes establishing standards and guidelines to prevent flawed data
87
Q

What are the consequences of issues with data quality?

A

May lead to decisions based on flawed data analysis such as bad business decisions or adverse medical advice

88
Q

What are the three stages that can potentially cause issues with data quality?

A
  • Data Ingestion
  • Changes in the system
  • Changes in the data
89
Q

What errors in data ingestion may cause issues with data quality?

A
  • Initial data conversion including insufficient metadata
  • System consolidations combining contradictory data
  • Manual data entry is an obvious source of potential errors
  • Batch feeds and real-time interfaces may cause data to become outdated quickly
90
Q

What errors in changes to the system may cause issues with data quality?

A
  • Changes to the source system may not be propagated to the target system
  • Subsystems may not capture changes made during system updates
  • Unintended data use risks misinterpretation or insufficient data protection
  • Loss of expertise or process automation risks quality loss due to insufficient documentation
91
Q

What errors in changes to the data may cause issues with data quality?

A
  • Relates to data processing and cleansing
  • Changes to data over time can lead to data inconsistency
92
Q

What is the purpose of data quality dimensions?

A

To measure the effectiveness of data quality

93
Q

What are the three perspectives that can be used in data quality dimensions?

A
  • The Computer Science Perspective
  • The Data Consumer Perspective
  • The Pragmatic Perspective
94
Q

What is the computer science perspective with respect to data quality dimensions?

A
  • Uses the ISO 8000-8 standard
  • Focuses on three data quality types: syntactic, semantic and pragmatic
95
Q

What is the Data Consumer Perspective with respect to data quality dimensions?

A
  • Four data categories which give guidance to improve data quality
  • Intrinsic: believability, accuracy, objectivity, reputation
  • Representational: interpretability, ease of understanding, consistency, conciseness
  • Accessibility: access security and accessibility
  • Contextual: value added, relevancy, timeliness, completeness, appropriate amount
96
Q

What is the pragmatic perspective with respect to data quality dimensions?

A
  • Accuracy: data are free from errors or typos
  • Validity: data follows the rules, constraints and formats defined by data policies and standards
  • Completeness: the extent to which a dataset covers a topic
  • Timeliness: data is available when required
  • Consistency: data follows uniform standards through all analysis and system components
  • Uniqueness: no redundant data entries in the system
97
Q

What are data governance metrics?

A
  • Used to evaluate how well a data object in the data store describes a real world object
  • Can provide insights into the effectiveness of data quality guidelines
98
Q

What can data governance metrics be used for?

A
  • To point to data quality decay over time
  • To assess new data sources before they are ingested into the productive system
  • They should be evaluated regularly
99
Q

What are examples of technical data governance metrics?

A
  • The number of false entries in a dataset
  • The number of primary or foreign key errors
  • The total data volume of the system
  • The time spent to handle data
  • The number of unauthorised accesses
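Two of these metrics can be computed directly; a toy sketch over hypothetical tables (field names are assumptions) counting null entries and foreign-key errors:

```python
# Hypothetical orders table and the primary keys of the referenced
# customers table.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 99, "amount": None},  # null value + bad FK
    {"order_id": 3, "customer_id": 11, "amount": 12.5},
]
customer_ids = {10, 11}

# Metric: number of false (null) entries across the dataset.
null_entries = sum(1 for row in orders for v in row.values() if v is None)

# Metric: number of foreign-key errors (references to missing customers).
fk_errors = sum(1 for row in orders if row["customer_id"] not in customer_ids)

print(f"null entries: {null_entries}, foreign-key errors: {fk_errors}")
# null entries: 1, foreign-key errors: 1
```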
100
Q

What are drawbacks for data governance metrics?

A
  • Difficult to measure
  • Often not carried out on a regular basis
  • No problem identification: metrics only tell us that there is a problem, not what the problem is
101
Q

What is the role of the Data Governance board?

A
  • Defines policies and procedures for data management
  • Establishes the data governance framework
102
Q

Who makes up the Data Governance board?

A

Stakeholders from different departments within the organisation

103
Q

What is the role of the Chief Data Officer in Data Governance?

A
  • Responsible for the overall management and strategy of the data assets
  • If there is a data governance board, then the CDO collaborates with the board to define policies and standards
104
Q

What is the role of the Data Steward in Data Governance?

A

Controls and ensures that data policies and standards are implemented throughout the entire organisation

105
Q

What is the role of Data Owners in Data Governance?

A

Responsible for managing specific datasets or data processes

106
Q

What is the role of the Data Analyst in Data Governance?

A

An IT professional who analyses the data to get further insights to inform decision making

107
Q

What is the role of the Data User in Data Governance?

A
  • The end user who consumes the data
  • Can be a person or an application
108
Q

What is the role of the Data Manager in Data Governance?

A

Cares about technical aspects such as cloud architecture and storage solutions

109
Q

What is the role of the Information Security Officer in Data Governance?

A

Implements security measures to protect the organisation’s data

110
Q

What are Data Quality Management Capability levels?

A
  • Used alongside ISO 8000-61
  • Make up a systematic framework that sets out a list of measures that can improve data quality
  • There are 5 distinct levels
111
Q

What happens at level one of the Data Quality Management Capability framework?

A
  • Data are only processed for specific use cases
  • Requires knowledge of the end user’s requirements
  • No policies or processes describing data processing
112
Q

What happens at level two of the Data Quality Management Capability framework?

A
  • Clear specifications and work instructions for data structure and processing
  • Data quality and control are continuously monitored
113
Q

What happens at level three of the Data Quality Management Capability framework?

A
  • Focuses on data quality planning
  • Includes data-related support and resource provisioning
  • Strategic long-term data quality plans and their respective policies and standards are developed
114
Q

What happens at level four of the Data Quality Management Capability framework?

A
  • An overall assessment of the entire system takes place
  • Characteristics describing the overall effectiveness of the system are identified
  • Data quality assurance is introduced as an additional evaluation step
115
Q

What happens at level five of the Data Quality Management Capability framework?

A
  • Data quality is integrated into all operations
  • Constant awareness raising for data quality
  • Periodic root cause analysis for data quality issues
  • Managed system enabling continuous improvement
116
Q

What are the three different approaches used in Data Governance?

A
  • Top-down
  • Bottom-up
  • Hybrid
117
Q

What is the top-down approach in Data Governance?

A
  • Data governance strategy is defined by top management
  • Middle managers are assigned responsibilities for implementing the transformation plan
118
Q

What are the advantages of using the top-down approach in Data Governance?

A
  • Because it is led by top management, provides a unique and long-term view of the data governance strategy
  • Imposes a solid commitment to follow the strategies
119
Q

What are the drawbacks of using the top-down approach in Data Governance?

A
  • Strategies could be inadequate if senior management don’t understand the details or needs
  • Can result in a difficult-to-manage macroproject
120
Q

What is the bottom-up approach in Data Governance?

A
  • Data governance implementation is initiated by the operational units
  • Projects might be smaller (such as single departments)
  • Easier to manage
  • Starts by implementing a data management tool for a single asset
121
Q

What are the drawbacks of using the bottom-up approach in Data Governance?

A
  • Data governance strategy has no unique vision
  • Different departments can follow different directions
  • Cooperation between departments can be problematic
122
Q

What is the hybrid approach in Data Governance?

A
  • Combines the top-down and bottom-up approach
  • Top managers define the company’s vision for data governance
  • Operational units carry out the implementation as an agile and iterative process, whilst considering the functional requirements and constraints of the data processes
123
Q

In the hybrid approach to Data Governance, what are the functional requirements that the operational units must take into consideration?

A
  • Available resources
  • Already existing processes
  • Need for business continuity