15 Data Quality and Management Flashcards
What is quality control?
The process of testing data to ensure data integrity
Quality control is essential because bad data can lead to misleading results.
When should you check for data quality?
Any time there is a major change, such as:
* Data acquisition
* Data transformation
* Data manipulation
* Final product review
Beyond these events, regular checks are also important as routine maintenance.
What is data acquisition?
The process of obtaining new data
It requires checking for bias and the current state of the data.
What does data transformation involve?
Changing data from one form to another, including:
* Intrahops
* Pass-throughs
* Conversions
Transformations should ideally be done in new variables.
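A minimal sketch of keeping a transformation in a new variable: the conversion below writes to a new field and leaves the original untouched, so the result can still be checked against the source value. The record layout and the names `temp_f` and `temp_c` are assumptions made for illustration.

```python
# Hypothetical records; the field names are illustrative only.
records = [{"id": 1, "temp_f": 32.0}, {"id": 2, "temp_f": 212.0}]

def add_celsius(rows):
    """Return copies of rows with a new temp_c field; temp_f is left unchanged."""
    converted = []
    for row in rows:
        new_row = dict(row)  # copy so the original record is not modified
        new_row["temp_c"] = (row["temp_f"] - 32.0) * 5.0 / 9.0
        converted.append(new_row)
    return converted

transformed = add_celsius(records)
```

Because `temp_f` survives alongside `temp_c`, a bad conversion can be spotted and redone without reacquiring the data.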
What is data manipulation?
Changing the shape of the data without altering its content
Examples include breaking down or combining variables.
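Breaking down and combining variables might look like the sketch below. The record and its field names are hypothetical, and the split assumes a name with exactly one first name followed by a last name.

```python
# Hypothetical record; field names are assumptions for illustration.
person = {"full_name": "Jane Doe", "year": 2024, "month": 3, "day": 5}

def split_name(record):
    """Break one variable (full_name) into two (first_name, last_name)."""
    first, last = record["full_name"].split(" ", 1)
    return {"first_name": first, "last_name": last}

def combine_date(record):
    """Combine three variables (year, month, day) into one ISO date string."""
    return f"{record['year']:04d}-{record['month']:02d}-{record['day']:02d}"

name_parts = split_name(person)
event_date = combine_date(person)
```

In both cases the shape changes while the underlying information stays the same.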
What are the data quality dimensions?
Key dimensions include:
* Data consistency
* Data accuracy
* Data completeness
* Data integrity
* Data attribute limitations
These dimensions help assess the quality of data.
What is data consistency?
Ensuring data is uniform and reported the same way across different levels
This applies to both individual variables and broader databases.
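One way to spot inconsistent reporting within a single variable is to test every value against the expected format. This is only a sketch: the sample dates and the choice of `YYYY-MM-DD` as the expected format are assumptions.

```python
import re

# Hypothetical column of dates; the expected format (YYYY-MM-DD) is an assumption.
dates = ["2024-01-15", "2024-02-01", "01/03/2024", "2024-04-09"]

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def inconsistent_values(values, pattern):
    """Return the values that do not fully match the expected pattern."""
    return [v for v in values if not pattern.fullmatch(v)]

bad_dates = inconsistent_values(dates, ISO_DATE)
```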
What is data accuracy?
Whether the data is correct
Checking data accuracy often involves verifying it against an outside source.
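Verifying against an outside source can be sketched as a keyed comparison. Both dictionaries below are made-up examples, and treating `reference` as the trusted source is an assumption.

```python
# Hypothetical data: our records and a trusted outside reference, keyed by ID.
our_data = {"A1": 120, "A2": 95, "A3": 40}
reference = {"A1": 120, "A2": 90, "A3": 40}

def mismatches(data, trusted):
    """Return {key: (our_value, trusted_value)} where the two sources disagree."""
    return {
        key: (value, trusted[key])
        for key, value in data.items()
        if key in trusted and value != trusted[key]
    }

disagreements = mismatches(our_data, reference)
```

Keys missing from the reference are skipped here; whether to flag them instead is a design choice.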
What is data completeness?
Checking for gaps in data, such as missing values or entire variables
This is essential for valid analyses.
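A completeness check covers both kinds of gap: missing values within a variable and variables that are absent altogether. In this sketch the rows, the use of `None` for a missing value, and the list of expected columns are all assumptions.

```python
# Hypothetical rows; None marks a missing value, and "email" is absent entirely.
rows = [
    {"id": 1, "name": "Ana", "age": 34},
    {"id": 2, "name": None, "age": 29},
    {"id": 3, "name": "Raj", "age": None},
]
expected_columns = ["id", "name", "age", "email"]

def completeness_report(data, columns):
    """Count missing values per column; flag columns absent from every row."""
    report = {}
    for col in columns:
        if not any(col in row for row in data):
            report[col] = "missing variable"
        else:
            report[col] = sum(1 for row in data if row.get(col) is None)
    return report

report = completeness_report(rows, expected_columns)
```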
What does data integrity encompass?
It includes consistency, accuracy, completeness, and security
Data integrity is crucial in regulated fields like pharmaceuticals.
What are data quality rules and metrics?
Guidelines that define acceptable data standards and formats
These include cutoff scores and conformity rules.
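A conformity rule and a cutoff score can be combined into a simple metric. The five-digit ZIP code rule and the 90% cutoff below are illustrative assumptions, not standard values.

```python
# Hypothetical ZIP codes; the five-digit rule and the 90% cutoff are assumptions.
zip_codes = ["30301", "9021", "10001", "60614", "ABCDE"]

def conforms(value):
    """Conformity rule: the value must be exactly five digits."""
    return len(value) == 5 and value.isdigit()

def pass_rate(values, rule):
    """Metric: the fraction of values that satisfy the rule."""
    return sum(1 for v in values if rule(v)) / len(values)

CUTOFF = 0.90  # minimum acceptable pass rate (assumed)
rate = pass_rate(zip_codes, conforms)
meets_cutoff = rate >= CUTOFF
```

Here three of five values conform, so the column falls below the cutoff and would be flagged for review.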
What is cross-validation?
A statistical analysis that checks if results can be generalized
It helps assess model effectiveness and estimate test error.
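A minimal k-fold cross-validation sketch in plain Python. The data are made up, and the "model" simply predicts the mean of the training folds, which is a stand-in chosen to keep the example short. Real workflows usually shuffle the data first and rely on a library implementation.

```python
# Hypothetical target values; the "model" simply predicts the training mean.
values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k):
    """Return the mean squared error on each held-out fold."""
    errors = []
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        train = [x for i, x in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]
        prediction = sum(train) / len(train)  # fit on training folds only
        errors.append(sum((x - prediction) ** 2 for x in test) / len(test))
    return errors

fold_errors = cross_validate(values, k=3)
average_error = sum(fold_errors) / len(fold_errors)
```

Each value is held out exactly once, so the average error reflects performance on data the model did not see during fitting.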
What are sample/spot checks?
Quick checks focusing on one or two data quality dimensions
They are often prompted by unusual data observations.
What are reasonable expectations in data quality?
Assessing whether data values make sense based on historical norms
This can involve formalized processes for flagging outliers.
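One formalized way to flag outliers is to compare new values with the historical mean and standard deviation. The counts below are invented, and the three-standard-deviation threshold is a common convention used here as an assumption.

```python
from statistics import mean, stdev

# Hypothetical historical daily order counts and new observations.
history = [100, 102, 98, 101, 99, 100, 103, 97]
new_values = [101, 150, 96]

def flag_outliers(past, current, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the historical mean."""
    center = mean(past)
    spread = stdev(past)
    return [x for x in current if abs(x - center) > threshold * spread]

flagged = flag_outliers(history, new_values)
```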
What is data profiling?
A formal process that checks data quality across entire databases
It usually includes structure, content, and relationship discovery.
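The three kinds of discovery can be sketched with one small function each. The tables, column names, and the `customer_id` link between them are hypothetical. A real profiling tool does far more than this.

```python
# Hypothetical tables; the names and the customer_id link are assumptions.
customers = [{"customer_id": 1, "name": "Ana"}, {"customer_id": 2, "name": "Raj"}]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 3, "total": None},
]

def structure(table):
    """Structure discovery: which Python types appear in each column."""
    types = {}
    for row in table:
        for col, value in row.items():
            types.setdefault(col, set()).add(type(value).__name__)
    return types

def content(table, column):
    """Content discovery: count of missing values and range of present ones."""
    present = [row[column] for row in table if row.get(column) is not None]
    return {
        "missing": len(table) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

def orphans(child, parent, key):
    """Relationship discovery: child key values with no match in the parent table."""
    parent_keys = {row[key] for row in parent}
    return sorted({row[key] for row in child} - parent_keys)

order_structure = structure(orders)
total_profile = content(orders, "total")
orphan_ids = orphans(orders, customers, "customer_id")
```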
What is a data audit?
A systematic check to see if a dataset meets specific goals
Audits are often scheduled and performed at all stages of the data lifecycle.
What is master data management (MDM)?
The process of creating and managing a centralized data system
MDM aims to create a ‘golden record’ for improved data quality.
When should MDM be used?
During:
* Mergers and acquisitions
* Compliance checks
* Streamlining data access
MDM helps integrate disparate data sources and manage protected data.
What is the benefit of having a golden record?
It provides a single source of truth with clean, standardized data
This facilitates faster access and higher data quality.
What challenges are associated with implementing MDM?
It can be labor-intensive and expensive to set up
Many companies implement MDM only for specific data types.
What is policy in the context of data management?
Policy refers to compliance: keeping all records organized so that regulatory checks are easier.
What does streamlining data access mean?
Streamlining data access means making data faster to retrieve, from a single table and without complex queries.
What is the first step in the MDM process?
Consolidation
What does consolidation involve in MDM?
Consolidation involves creating the golden record by combining data from multiple sources into one place.
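A rough sketch of consolidation: records from two source systems are merged into one record per key. The source names, fields, and the rule that the first non-missing value wins are all assumptions. Real MDM tools apply more elaborate matching and precedence rules.

```python
# Hypothetical source systems keyed by customer ID; field names are assumptions.
crm = {"C1": {"name": "Ana Silva", "email": None}}
billing = {
    "C1": {"name": "Ana Silva", "email": "ana@example.com"},
    "C2": {"name": "Raj Patel", "email": None},
}

def consolidate(*sources):
    """Merge sources into one record per key, keeping the first non-missing value."""
    golden = {}
    for source in sources:
        for key, record in source.items():
            merged = golden.setdefault(key, {})
            for field, value in record.items():
                if merged.get(field) is None:
                    merged[field] = value
    return golden

golden_records = consolidate(crm, billing)
```

The gap in the CRM record for `C1` is filled from billing, and `C2` appears even though only one source knows about it.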
What is the purpose of standardization in MDM?
Standardization makes data uniform, ensuring all data works together and is consistent.
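Standardization can be illustrated with phone numbers that arrive in different formats. The sample values and the target format `XXX-XXX-XXXX` are assumptions chosen for the example.

```python
# Hypothetical raw values from different sources; the target format is an assumption.
raw_phones = ["(555) 123-4567", "555.123.4567", "5551234567"]

def standardize_phone(value):
    """Keep digits only and format ten-digit numbers as XXX-XXX-XXXX."""
    digits = "".join(ch for ch in value if ch.isdigit())
    if len(digits) != 10:
        return None  # cannot be standardized under this rule
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

standard_phones = [standardize_phone(p) for p in raw_phones]
```

After this step all three values are identical, so they can be matched and compared across sources.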
What is a data dictionary?
A data dictionary is a document that defines variables, their attributes, structure, and relationships.
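A data dictionary is usually a document, but its contents can be sketched as a data structure that also drives simple checks. Every variable, attribute name, and the `check_record` helper below is hypothetical.

```python
# Hypothetical data dictionary; the variables and attributes are illustrative.
data_dictionary = {
    "customer_id": {
        "definition": "Unique identifier for a customer",
        "type": int,
        "required": True,
        "relates_to": "orders.customer_id",
    },
    "signup_date": {
        "definition": "Date the account was created (YYYY-MM-DD)",
        "type": str,
        "required": False,
        "relates_to": None,
    },
}

def check_record(record, dictionary):
    """Return a list of problems found when checking a record against the dictionary."""
    problems = []
    for name, spec in dictionary.items():
        value = record.get(name)
        if value is None:
            if spec["required"]:
                problems.append(f"{name}: required value is missing")
        elif not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
    return problems

issues = check_record({"customer_id": "42", "signup_date": None}, data_dictionary)
```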
Why are data dictionaries important?
They help ensure that multiple users understand the data and its usage.
What does data quality control involve?
Data quality control involves checking for accuracy, consistency, and reliability of data.
When should data quality be checked?
Data quality should be checked after data manipulation, after data transformation, and before the final report.
Which of the following is a data quality dimension: Data completeness, Data retention, Rows passed, or Data manipulation?
Data completeness
What is data profiling?
Data profiling is a structured formal process for assessing the quality and efficiency of an entire database.
True or False: Acquisitions are an appropriate time to institute MDM.
True
Creating a document that explains variables in a dataset represents which part of the MDM process?
Data dictionary
Fill in the blank: _______ is the process of combining data from multiple sources into one place.
Consolidation
Fill in the blank: A data dictionary provides definitions for every variable, as well as how they are used and how they _______.
relate to other variables
What should be included in a data dictionary?
Definitions and attributes for every variable, plus the data's structure, relationships, and organization.
What is the significance of having a data dictionary in a collaborative database environment?
It ensures that all users understand the data and its usage, preventing confusion.
List three circumstances where data quality should be checked.
* After data manipulation
* After data transformation
* Before the final report
What is the main goal of standardization in data management?
To ensure all data works together and is consistent across different sources.