Data Quality and Data Governance Flashcards
Why are data integration principles needed?
- Data is integrated from multiple sources
- Different technologies and processes are needed
- Multiple challenges to be addressed
What are the data integration principles?
- Standardisation
- Reconciliation
- Validation
- Transformation
- Cleansing
- Enrichment
- Privacy
What is data standardisation?
Transforming data from different sources into a common format and structure (such as ISO date format)
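As a minimal sketch of standardisation (assuming Python; the source date formats and field values are illustrative assumptions, not part of the flashcards):

```python
from datetime import datetime

# Illustrative input formats that different sources might use (assumption)
KNOWN_FORMATS = ["%d/%m/%Y", "%m-%d-%Y", "%Y.%m.%d"]

def to_iso_date(raw: str) -> str:
    """Normalise a date string from any known source format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

print(to_iso_date("31/12/2024"))  # -> 2024-12-31
print(to_iso_date("12-31-2024"))  # -> 2024-12-31
```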
What is data reconciliation?
Focusing on data consistency by resolving issues of inconsistency between data from different sources
Why might data inconsistencies arise?
- Integrating distinct data sources where entries may differ
- Failures during integration
- Data that is copied or transformed incorrectly
- Missing / duplicate records or values
- Incorrectly formatted values
- Broken relationships between tables
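A small sketch of how a couple of these inconsistencies (duplicate records and missing values) might be flagged after merging two sources; pandas and the column names are assumptions, not part of the flashcards:

```python
import pandas as pd

# Two illustrative source extracts describing the same entity type (assumed schemas)
source_a = pd.DataFrame({"id": [1, 2, 3], "email": ["a@x.com", None, "c@x.com"]})
source_b = pd.DataFrame({"id": [3, 4], "email": ["c@x.com", "d@x.com"]})

combined = pd.concat([source_a, source_b], ignore_index=True)

# Duplicate records: the same id appears in more than one source
duplicates = combined[combined.duplicated(subset="id", keep=False)]

# Missing values introduced by one of the sources
missing = combined[combined["email"].isna()]

print(duplicates)
print(missing)
```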
What is data validation?
Assessing the accuracy and completeness of data from different sources against defined constraints
What are examples of data validation?
Checking that email addresses, phone numbers or postcodes are complete and follow a set pattern
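A hedged sketch of pattern-based validation using Python's standard `re` module; the patterns are deliberately simplified illustrations and would need tightening for real use:

```python
import re

# Deliberately simplified patterns (illustrative, not exhaustive)
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uk_postcode": re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.IGNORECASE),
    "phone": re.compile(r"^\+?\d[\d\s-]{6,14}$"),
}

def validate(field: str, value: str) -> bool:
    """Return True if the value is non-empty and matches the expected pattern."""
    return bool(value) and PATTERNS[field].match(value) is not None

print(validate("email", "alice@example.com"))  # True
print(validate("uk_postcode", "SW1A 1AA"))     # True
print(validate("phone", "not-a-number"))       # False
```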
What is data transformation?
Converting data to a specified schema
What is data cleansing?
Removing inconsistent data such as duplicate, irrelevant or incorrect data
What might data cleansing be used to check?
Two records for the same entity which have slightly different entries
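One way to spot two records for the same entity with slightly different entries is fuzzy string matching; a minimal sketch using the standard library's `difflib` (the record fields and the 0.8 threshold are assumptions):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] describing how alike two strings are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two illustrative records that probably describe the same customer
record_1 = {"name": "Jon Smith", "city": "London"}
record_2 = {"name": "John Smith", "city": "london"}

name_score = similarity(record_1["name"], record_2["name"])
city_score = similarity(record_1["city"], record_2["city"])

# Flag as a likely duplicate if both fields are close enough (arbitrary threshold)
if name_score > 0.8 and city_score > 0.8:
    print("Likely duplicate records - candidate for merging or removal")
```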
When is data cleansing applied?
During Gross Error Detection
What is Gross Error Detection?
A process to discard erroneous data using outlier detection methods
When might Gross Error Detection be used?
During integration of raw sensor measurements which may be subject to calibration uncertainties or instrument failures
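A minimal sketch of gross error detection on raw sensor readings using a simple standard-deviation outlier test; the readings and the 2-sigma threshold are illustrative assumptions:

```python
import statistics

# Illustrative raw sensor readings, one of which is a gross error
readings = [20.1, 20.3, 19.9, 20.2, 85.0, 20.0, 20.1]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Keep readings within 2 standard deviations of the mean (illustrative threshold); discard the rest
clean = [r for r in readings if abs(r - mean) <= 2 * stdev]
discarded = [r for r in readings if abs(r - mean) > 2 * stdev]

print("kept:", clean)
print("discarded as gross errors:", discarded)
```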
What is data enrichment?
Including supplementary data sources to add further insights and value to the original data
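A small sketch of enrichment: joining an illustrative supplementary lookup (postcode to region) onto the original records; the data and field names are assumptions:

```python
# Original records from the primary source
customers = [
    {"id": 1, "postcode": "SW1A 1AA"},
    {"id": 2, "postcode": "M1 1AE"},
]

# Supplementary reference data used for enrichment (illustrative)
postcode_to_region = {"SW1A 1AA": "London", "M1 1AE": "Manchester"}

# Add a derived "region" field without altering the original values
enriched = [
    {**c, "region": postcode_to_region.get(c["postcode"], "Unknown")}
    for c in customers
]

print(enriched)
```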
What is data privacy?
- Ensures the protection of personal rights and confidentiality of personal data during integration
- Sensitive data must be encrypted or obfuscated during integration
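A minimal sketch of obfuscating sensitive fields before integration, using a salted one-way hash for identifiers and masking for display values; the salt handling is an assumption, and real deployments would rely on managed keys and proper encryption:

```python
import hashlib

SALT = "replace-with-a-secret-salt"  # assumption: a secret held outside the dataset

def pseudonymise(value: str) -> str:
    """One-way hash so records can still be joined without exposing the raw value."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

def mask_email(email: str) -> str:
    """Keep only the first character and the domain, e.g. a***@example.com."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

record = {"national_id": "AB123456C", "email": "alice@example.com"}
safe_record = {
    "national_id": pseudonymise(record["national_id"]),
    "email": mask_email(record["email"]),
}
print(safe_record)
```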
Why is data quality more of a social challenge than a technical challenge?
Because the way data is collected, organised, modified, accessed and interpreted, and the way conclusions are drawn from it, is fundamentally a communication effort between people.
What is offered by typical ETL solutions?
Rich toolsets for integrating data in line with defined policies and standards
How do general purpose cloud-based ETL tools work?
Pipeline data processing in a standardised way
What are some of the general purpose cloud-based ETL tools available?
- Azure Data Factory
- AWS Glue
- Google Cloud Data Fusion
- Google Cloud Dataflow
What are other available data mapping and processing solutions available and how do they work?
- Allow data integration from various sources to blend, map, cleanse and diversify data
- Pentaho Data Integration
- Talend Data Integration
- IBM InfoSphere
- Informatica PowerCenter
- Microsoft SQL Server Integration Services (SSIS)
- IBM InfoSphere DataStage
- Oracle Data Integrator
- Feature Manipulation Engine
- Altova MapForce Platform
- Apache Nifi
What data mapping and processing solutions can be used as standalone data reconciliation tools?
- OpenRefine
- TIBCO Clarity
- WinPure
What data mapping and processing solutions can carry out data enrichment?
- Clearbit
- Pipl
- FullContact
What is Apache Nifi?
An open source solution which can be used for distributed data processing by defining processing pipelines
What connectors does Apache Nifi offer?
A number of connectors compatible with different sources
What interface does Apache Nifi offer?
Uses a web-based GUI to construct dataflow pipelines
What is the set up of Apache Nifi?
- Uses parallel Java Virtual Machines (JVMs)
- Has a flow-based distributed architecture
What are the elements that make up the Apache Nifi architecture?
- Flow controller nodes: supervise the execution of threads
- Processor nodes: perform the ETL processing
What are the responsibilities of the processors in Apache Nifi?
- Pulling data from external sources
- Publishing data to external systems
- Routing data to downstream processors or external destinations
- Transforming and extracting information from Flowfiles
What are Apache Nifi’s Flowfiles?
- The basic unit of information
- Data objects that move through Nifi, holding the data content together with attributes stored as key-value pairs
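Purely to illustrate the concept (this is not Nifi's actual API), a Flowfile can be pictured as a content payload plus a dictionary of attributes:

```python
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Conceptual stand-in for a Nifi Flowfile: content plus key-value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)

ff = FlowFile(
    content=b'{"sensor": 7, "reading": 20.1}',
    attributes={"filename": "reading-0001.json", "mime.type": "application/json"},
)

# A "processor" step might read the attributes to decide how to route the Flowfile
if ff.attributes.get("mime.type") == "application/json":
    print("route to the JSON transformation processor")
```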
What does Apache Nifi use to keep its cluster of machines organised and consistent?
Zookeeper
What repositories are used to manage workflow in Apache Nifi?
- Flowfile repository: tracks workflow by recording metadata about how each data object is processed
- Content repository: stores the transferred data
- Provenance repository: holds data transfer events (i.e. Flowfile history)
What are the benefits of using Apache Nifi?
- Highly scalable
- Can process large amounts of data in a reasonable time through parallel processing
- Can be run as a single instance or operated within a cluster, managed by an orchestrator such as Zookeeper
What is Azure Data Factory?
A service for the orchestration of data movement and transformation on the Azure platform
How many connections can Azure Data Factory use?
90+
What are some of the connections that can be used by Azure Data Factory?
- Power Automate
- Kafka
- Apache Spark
- Logstash
How does Azure Data Factory increase data quality?
Applies transformation logic to data
What are the two ways that ETL can be set up in Azure Data Factory?
- Using the GUI
- Specifying the data processing pipeline programmatically (such as using JSON files for configuration)
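To give a flavour of the programmatic route, the sketch below builds a simplified pipeline definition as a Python dictionary and serialises it to JSON; the structure loosely follows the shape of an Azure Data Factory pipeline definition (name, properties, activities) but is an illustrative simplification rather than the exact ADF schema:

```python
import json

# Simplified, illustrative pipeline definition (not the complete ADF schema)
pipeline = {
    "name": "CopySalesData",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                "inputs": [{"referenceName": "SalesBlobDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SalesSqlDataset", "type": "DatasetReference"}],
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```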
What do unified data stores do?
Hold the outputs from data integration processes
What are the different types of unified data stores?
- Data Warehouses
- Data Lakes
- Master Data Management
- Data Federation
- Data Virtualisation
How do Data Warehouses work as a unified data store?
- Centralised repository
- Store large amounts of data in a standardised format (usually in tables) to enable efficient data analysis
What do Data Warehouses rely on ETL processes for?
Periodically copying data physically to the centralised storage
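A toy sketch of that periodic copy: one ETL batch run that extracts raw rows, standardises them, and loads them into an illustrative warehouse table (an in-memory sqlite database stands in for a real warehouse; schema and field names are assumptions):

```python
import sqlite3

# Illustrative in-memory "warehouse" (a real warehouse would be a dedicated service)
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, sale_date TEXT)")

def etl_batch(raw_rows):
    """Extract -> transform (standardise types) -> load one periodic batch."""
    transformed = [(r["id"], float(r["amount"]), r["date"]) for r in raw_rows]
    warehouse.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", transformed)
    warehouse.commit()

# One periodic run (in practice this would be scheduled, e.g. nightly)
etl_batch([{"id": 1, "amount": "19.99", "date": "2024-12-31"}])
print(warehouse.execute("SELECT * FROM sales").fetchall())
```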
How do Data Lakes work as a unified data store?
- Centralised repository
- Stores data from different sources in a raw format (data is not transformed to a standardised format)
What is the key difference between data lakes and data warehouses when acting as unified storage?
Data Lakes store data in a raw, unordered manner while Data Warehouses store data in a standardised, ordered format
Why is Master Data Management important for business?
- Stores an organisation’s critical data
- Provides data reconciliation and standardisation tools
Where does data in Master Data Management systems come from?
Multiple sources
What are the two key characteristics of a Master Data Management system and why?
- “General Truth”: acts as the single, authoritative version of the organisation’s core data
- Less volatile: records are updated infrequently (employee ID numbers, for example)
What is Data Federation?
Combines data from multiple sources to create a virtual, logical view (i.e. data model) without copying the data to centralised storage
What is Data Virtualisation?
Offers a virtual layer providing a unified view of one or more data sources without copying the data to centralised storage
What is the difference between data federation and data virtualisation?
Data virtualisation does not need to integrate multiple sources; it can be implemented on a single data source, providing a personalised view through abstraction and transformation of data
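As a rough illustration of the federation idea, the sketch below answers a query by pulling from two in-memory "sources" on demand, combining them into a virtual view without copying either into a central store (all names and data are assumptions):

```python
# Two illustrative sources that stay where they are (no central copy is made)
crm_source = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
billing_source = {1: {"balance": 120.0}, 2: {"balance": 0.0}}

def federated_view(customer_id: int) -> dict:
    """Build a combined, virtual view of one customer at query time."""
    return {
        "id": customer_id,
        **crm_source.get(customer_id, {}),
        **billing_source.get(customer_id, {}),
    }

print(federated_view(1))  # {'id': 1, 'name': 'Alice', 'balance': 120.0}
```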