Data Quality Flashcards
What is the DMBoK definition of data quality management?
The planning, implementation, and control of activities that apply quality management techniques to data, in order to assure it is fit for consumption and meets the needs of data consumers.
What are the 4 business drivers for establishing a formal Data Quality Management program?
- Increasing the value of organizational data and the opportunities to use it
- Reducing risks and costs associated with poor quality data
- Improving organizational efficiency and productivity
- Protecting and enhancing the organization’s reputation
7 direct costs are associated with poor quality data. Name 4.
- Inability to invoice correctly
- Increased customer service calls and decreased ability to resolve them
- Revenue loss due to missed business opportunities
- Delay of integration during mergers and acquisitions
- Increased exposure to fraud
- Loss due to bad business decisions driven by bad data
- Loss of business due to lack of good credit standing
What are the 4 goals Data Quality programs focus on?
- Developing a governed approach to make data fit for purpose based on data consumers’ requirements
- Defining standards and specifications for data quality controls as part of the data lifecycle
- Defining and implementing processes to measure, monitor, and report on data quality levels
- Identifying and advocating for opportunities to improve the quality of data, through changes to processes and systems and engaging in activities that measurably improve the quality of data based on data consumer requirements
Data Quality programs should be guided by these 10 principles
- Criticality
- Lifecycle management
- Prevention
- Root cause remediation
- Governance
- Standards-driven
- Objective measurement and transparency
- Embedded in business processes
- Systematically enforced
- Connected to service levels
Which principle of Data Quality Management is to focus improvement efforts on data that is most important to the organization and its customers?
Criticality or Critical Data
What are the six core dimensions of data quality?
- Completeness: The proportion of data stored against the potential for 100%.
- Uniqueness: No entity instance (thing) will be recorded more than once based upon how that thing is identified.
- Timeliness: The degree to which data represent reality from the required point in time.
- Validity: Data is valid if it conforms to the syntax (format, type, range) of its definition.
- Accuracy: The degree to which data correctly describes the ‘real world’ object or event being described.
- Consistency: The absence of difference, when comparing two or more representations of a thing against a definition.
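Three of these dimensions lend themselves to simple ratio measures. The sketch below is one possible way to score completeness, uniqueness, and validity; the sample records, the email pattern, and the function names are illustrative assumptions, not DMBoK definitions.

```python
import re

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "bob@example"},
    {"id": 4, "email": "cy@example.org"},
]

# Assumed syntax rule for validity; real rules come from the data definition.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, field):
    """Share of distinct values among the populated values of the field."""
    values = [r[field] for r in rows if r[field] is not None]
    return len(set(values)) / len(values)


def validity(rows, field, pattern):
    """Share of populated values that conform to the syntax rule."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(bool(pattern.fullmatch(v)) for v in values) / len(values)


email_completeness = completeness(records, "email")        # 3 of 4 populated
id_uniqueness = uniqueness(records, "id")                  # 3 distinct of 4
email_validity = validity(records, "email", EMAIL_PATTERN) # 2 of 3 conform
```

Accuracy, timeliness, and consistency usually need a reference source or a second representation to compare against, so they are left out of this sketch.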
The _________ cycle is a problem-solving model known as “plan-do-check-act.”
Shewhart / Deming
In the ________ stage of the DQ Improvement Life Cycle, the Data Quality team assesses the scope, impact, and priority of known issues, and evaluates alternatives to address them.
Plan
In the ________ stage of the DQ Improvement Life Cycle, the DQ team leads efforts to address the root causes of issues and plan for ongoing monitoring of data.
Do
In the ________ stage of the DQ Improvement Life Cycle, the team actively monitors the quality of data as measured against requirements. As long as data meets defined thresholds for quality, additional actions are not required.
Check
In the ________ stage of the DQ Improvement Life Cycle, activities occur to address and resolve emerging data quality issues.
Act
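The Check stage compares measured quality against agreed thresholds and only hands off to Act when something falls short. A minimal sketch of that comparison, assuming hypothetical metric names and threshold values:

```python
# Hypothetical minimum acceptable scores agreed with data consumers.
thresholds = {"completeness": 0.98, "validity": 0.95}


def check(measurements, thresholds):
    """Return the metrics whose measured score falls below its threshold."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        # A metric with no measurement is treated as a breach.
        if measurements.get(name, 0.0) < minimum
    )


breaches = check({"completeness": 0.99, "validity": 0.91}, thresholds)
# An empty list means no further action; otherwise the Act stage begins.
```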
Which framework focuses on data consumers’ perceptions of data and describes 15 dimensions across four general categories of data quality?
Strong-Wang Framework
What 4 general categories are described in the Strong-Wang framework?
- Intrinsic DQ
- Contextual DQ
- Representational DQ
- Accessibility DQ
In the Strong-Wang framework, what 4 dimensions make up Intrinsic Data Quality?
o Accuracy
o Objectivity
o Believability
o Reputation
In the Strong-Wang framework, which of these dimensions is not part of Contextual Data Quality?
o Value-added
o Interpretability
o Timeliness
o Completeness
o Appropriate amount of data
Interpretability. It should be Relevancy.
As part of the Strong-Wang Framework, to which data quality category do these dimensions belong?
o Interpretability
o Ease of understanding
o Representational consistency
o Concise representation
Representational DQ
In the Strong-Wang Framework, the Accessibility DQ category has two dimensions. What are they?
o Accessibility
o Access security
There are 8 DQ issues caused by poor system design. Name 6 of them.
- Failure to enforce referential integrity
- Failure to enforce uniqueness constraints
- Coding inaccuracies and gaps
- Data model inaccuracies
- Field overloading: Re-use of fields over time for different purposes
- Temporal data mismatches: In the absence of a consolidated data dictionary, multiple systems could implement disparate date formats or timings, which in turn lead to data mismatch and data loss when data synchronization takes place between different source systems.
- Weak Master Data Management
- Data duplication:
o Single Source / Multiple Local Instances
o Multiple Sources / Single Instance
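The first two issues in this list are the kind a database can prevent when constraints are declared and enforced. The sketch below uses Python's built-in sqlite3 module with hypothetical `customer` and `orders` tables to show a uniqueness constraint and a foreign key rejecting bad rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless it is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customer(id))"
)
conn.execute("INSERT INTO customer VALUES (1, 'ana@example.com')")


def try_insert(sql):
    """Run an INSERT and report whether the constraints accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# Same email as customer 1: violates the UNIQUE constraint.
duplicate_email = try_insert("INSERT INTO customer VALUES (2, 'ana@example.com')")
# Customer 42 does not exist: violates referential integrity.
orphan_order = try_insert("INSERT INTO orders VALUES (1, 42)")
# Customer 1 exists, so this order is accepted.
valid_order = try_insert("INSERT INTO orders VALUES (2, 1)")
```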
________________ is a form of data analysis used to inspect data and assess quality. It uses statistical techniques to discover the true structure, content, and quality of a collection of data.
Data Profiling
Name the 2 activities prevalent in Data Quality Management
Maturity Assessment and Profiling
What are the 5 statistical techniques used to inspect data and assess quality in data profiling?
- Counts of nulls
- Max/Min value
- Max/Min length
- Frequency distribution of values for individual columns
- Data type and format
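The five statistics above can be gathered in a single pass over a column. This is a minimal sketch, assuming a column is a plain Python list with None for nulls; `profile_column` and the sample state codes are hypothetical.

```python
from collections import Counter


def profile_column(values):
    """Basic profile of one column; None is treated as null.

    Min/max assume the populated values are mutually comparable
    (for example all strings or all numbers).
    """
    present = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in present]
    return {
        "null_count": len(values) - len(present),
        "min_value": min(present) if present else None,
        "max_value": max(present) if present else None,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "frequencies": Counter(present),
        "types": Counter(type(v).__name__ for v in present),
    }


# Hypothetical column of state codes with one null and one off-format value.
profile = profile_column(["NY", "CA", None, "NY", "Texas"])
```

Here a maximum length of 5 against a minimum of 2 hints that "Texas" does not follow the two-letter format used by the other values.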
Profiling also includes __________ analysis, which can identify overlapping or duplicate columns and expose embedded value dependencies.
cross-column
_____________ analysis explores overlapping values sets and helps identify foreign key relationships.
Inter-table
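One way to picture inter-table analysis is as a comparison of value sets: if nearly every value in one column also appears in a key column of another table, the first column is a candidate foreign key. A small sketch under that assumption, with hypothetical column data:

```python
def overlap_ratio(child_values, parent_values):
    """Share of distinct non-null child values that also appear in the parent column."""
    child = {v for v in child_values if v is not None}
    parent = set(parent_values)
    if not child:
        return 0.0
    return len(child & parent) / len(child)


# Hypothetical columns taken from two tables.
orders_customer_id = [10, 11, 11, 12, 99]
customers_id = [10, 11, 12, 13]

ratio = overlap_ratio(orders_customer_id, customers_id)
# Values with no match in the parent column are orphan candidates.
orphans = sorted(set(orders_customer_id) - set(customers_id))
```

A high ratio suggests a relationship worth confirming with the data owners; the unmatched values are the records to investigate.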
______________ is the process of adding attributes to a data set to increase its quality and usability
Data enhancement or enrichment
There are 8 types of data enrichment or enhancement listed in the DMBoK. Name 6 of them.
- Time/Date stamps
- Audit data
- Reference vocabularies
- Contextual information
- Geographic information
- Demographic information
- Psychographic information
- Valuation information
Which data enrichment or enhancement type adds context such as location, environment, or access methods and tags data for review and analysis?
Contextual Information
Which data enrichment or enhancement type is used to segment the target populations by specific behaviors, habits, or preferences, such as product and brand preferences, organization memberships, leisure activities, commuting transportation style, shopping time preferences, etc.?
Psychographic information
Which data enrichment or enhancement type would include data lineage?
Audit data
Which data enrichment or enhancement type uses business specific terminology, ontologies, and glossaries to enhance understanding and control while bringing customized business context?
Reference vocabularies
Which data enrichment or enhancement is used for asset valuation, inventory, and sale?
Valuation Information
_____________ is the process of analyzing data using pre-determined rules to define its content or value. This process enables the data analyst to define sets of patterns that feed into a rule engine used to distinguish between valid and invalid data values. Matching specific pattern(s) triggers actions.
Data Parsing
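A rough illustration of pattern-based parsing: each value is tested against a set of named patterns, and the matching name (or the absence of one) decides what happens next. The patterns and the `parse` function here are illustrative assumptions, not a prescribed rule engine.

```python
import re

# Hypothetical pattern set; each name identifies a recognized value format.
PATTERNS = {
    "us_phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
    "us_zip": re.compile(r"\d{5}(-\d{4})?"),
}


def parse(value):
    """Return the name of the first pattern the whole value matches, or 'invalid'."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return "invalid"


results = [parse(v) for v in ["212-555-0100", "10001-1234", "abc"]]
```

In a fuller rule engine, the returned name would select the action to trigger, such as standardizing the value or routing it for review.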
In Data Quality, having proven that the improvement process can work, the next goal is to apply it strategically. Doing so requires _______ and ______ potential improvements.
identifying and prioritizing
Provide continuous monitoring by incorporating ________ and __________ processes into the information processing flow.
control and measurement
Data quality incident tracking requires staff be trained on how issues should be _______, _________, and _________.
logged, classified, and tracked
Data quality reporting should focus on these 7 areas.
- Data quality scorecard, which provides a high-level view of the scores associated with various metrics, reported to different levels of the organization within established thresholds
- Data quality trends, which show over time how the quality of data is measured, and whether trending is up or down
- SLA Metrics, such as whether operational data quality staff diagnose and respond to data quality incidents in a timely manner
- Data quality issue management, which monitors the status of issues and resolutions
- Conformance of the Data Quality team to governance policies
- Conformance of IT and business teams to Data Quality policies
- Positive effects of improvement projects
Name 6 ways to prevent poor quality data from entering an organization.
- Establish data entry controls
- Train data producers
- Define and enforce rules: Create a ‘data firewall,’ which has a table with all the business data quality rules used to check if the quality of data is good, before being used in an application such as a data warehouse.
- Demand high quality data from data suppliers
- Implement Data Governance and Stewardship: Ensure roles and responsibilities are defined that describe and enforce rules of engagement, decision rights, and accountabilities for effective management of data and information assets (McGilvray, 2008). Work with data stewards to revise the process of, and mechanisms for, generating, sending, and receiving data.
- Institute formal change control: Ensure all changes to stored data are defined and tested before being implemented. Prevent changes directly to data outside of normal processing by establishing gating processes.
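The ‘data firewall’ idea in the list above, a table of rules checked before data reaches a downstream application, can be sketched as follows. The rule table, field names, and `firewall` function are hypothetical and only meant to show the shape of the check.

```python
# Hypothetical rule table: (rule name, field, check function).
RULES = [
    ("customer_id present", "customer_id", lambda v: v is not None),
    (
        "amount non-negative",
        "amount",
        lambda v: isinstance(v, (int, float)) and v >= 0,
    ),
]


def firewall(rows):
    """Split rows into those passing every rule and those rejected, with reasons."""
    accepted, rejected = [], []
    for row in rows:
        failures = [name for name, field, ok in RULES if not ok(row.get(field))]
        if failures:
            rejected.append((row, failures))
        else:
            accepted.append(row)
    return accepted, rejected


incoming = [
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": None, "amount": -1},
]
accepted, rejected = firewall(incoming)
```

Keeping the rules in one table means new checks can be added without touching the loading logic, and each rejected row carries the names of the rules it failed.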
____________ actions are implemented after a problem has occurred and been detected.
Corrective
What are 3 corrective actions that can be taken against poor data?
- Automated correction
- Manually-directed correction
- Manual correction
Which corrective action uses automated tools to remediate and correct data but requires manual review before committing the corrections to persistent storage?
Manually-directed correction
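Manually-directed correction separates proposing a fix from committing it. The sketch below assumes a hypothetical standardization map for state codes: the automated step only produces proposals, and nothing changes until a reviewer approves specific rows.

```python
# Hypothetical standardization map used by the automated step.
STATE_FIXES = {"N.Y.": "NY", "Calif.": "CA"}


def propose_corrections(rows):
    """Automated step: suggest changes without touching the stored rows."""
    return [
        (index, row["state"], STATE_FIXES[row["state"]])
        for index, row in enumerate(rows)
        if row["state"] in STATE_FIXES
    ]


def commit(rows, proposals, approved_indexes):
    """Apply only the proposals a reviewer approved; return a corrected copy."""
    corrected = [dict(row) for row in rows]
    for index, _old, new in proposals:
        if index in approved_indexes:
            corrected[index]["state"] = new
    return corrected


stored = [{"state": "N.Y."}, {"state": "TX"}, {"state": "Calif."}]
proposals = propose_corrections(stored)
# The reviewer approves only the first proposal.
reviewed = commit(stored, proposals, approved_indexes={0})
```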