Data Quality Flashcards
Data Quality Indicators (TARMAC)
Trackability, Acceptability, Relevance, Measurability, Accountability, Controllability
TARMAC - Trackability
Can measure data quality over time
TARMAC - Acceptability
be able to define what good looks like
TARMAC - Relevance
Make sure you are measuring something relevant to the business
TARMAC - Measurability
What will actually be measured? How can it be measured?
TARMAC - Accountability/Stewardship
Who will be held accountable if it goes wrong?
TARMAC - Controllability
Defining remedial actions in advance of the thing going wrong
What must you know to be able to define data quality?
the purpose of use of the data
How can you manage data quality when you don’t know the purpose?
don’t over assume - stick to basic validity
Reference Data
data not subject to change e.g., identifiers
Master Data
Descriptive attributes of business entities
What can be defined as data standards?
data types, acceptable values, attribute domains, metadata format
What are the two types of data quality management?
- Governing/Strategic
- Tactical
What is tactical data quality management?
short-term fixing of problems
What is governing/strategic data quality management?
Overarching long term goals e.g., root cause analysis
Common DQ mistakes
- failing to consider the intended use of the data
- Confusing Validity and accuracy
- treating it as a one time activity
- not fixing at the source
- applying software quality principles
- laziness, blaming the system
- believing good data quality is the end goal (not the use of that data)
- believing that quantity beats quality
Data quality firewall
taking external data and applying data cleansing before it is stored in the DB
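A minimal sketch of the firewall idea, assuming incoming records arrive as dictionaries. The field names and the rules in `RULES` are hypothetical; a real firewall would take its rules from the agreed data standards.

```python
# Hypothetical rules: each field maps to a check that returns True
# when the value is acceptable for storage.
RULES = {
    "customer_id": lambda v: isinstance(v, str) and v != "",
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
}


def firewall(records, rules=RULES):
    """Cleanse external records, then split them into (accepted, quarantined)."""
    accepted, quarantined = [], []
    for record in records:
        # Basic cleansing step: trim stray whitespace from text values.
        cleaned = {k: v.strip() if isinstance(v, str) else v
                   for k, v in record.items()}
        failures = [field for field, check in rules.items()
                    if not check(cleaned.get(field))]
        if failures:
            quarantined.append((cleaned, failures))  # never reaches the DB
        else:
            accepted.append(cleaned)
    return accepted, quarantined


incoming = [
    {"customer_id": " C001 ", "age": 34},
    {"customer_id": "", "age": 200},
]
accepted, quarantined = firewall(incoming)
```

Only `accepted` would be written to the database; `quarantined` keeps the failed field names so the supplier can be told what to fix at the source.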
Impacts of poor data
- aggravation
- loss of reputation
- loss of business
- regulatory risk
- loss of life
Why is it important to communicate the cost of poor quality data?
to raise awareness
Data Quality Management Cycle
Plan -> Deploy -> Monitor -> Action
What are the four data quality governance steps?
- standardisation
- assignment
- escalation
- completion
Causes of data issues
Human causes
Organisational (system)
Physical causes
Roles of the data quality oversight board
- setting data quality improvement priorities
- establishing communications & feedback mechanisms
- producing certification & compliance policies
- approving data quality strategies
Data Quality Service Level Agreement (SLA) will include
defining roles & responsibilities for data quality
What is a key process of defining data quality business rules?
separating data that does not meet business needs from the data that does
Why is top down and bottom up profiling best done together?
it balances the business relevance and the actual state of the data
Steps in root cause analysis
- define the problem
- collect data
- identify all possible causal factors
- identify root cause(s)
- recommend and implement solutions
Dimensions of data quality
Completeness
Consistency
Currency
Reasonableness
Integrity
Timeliness
Validity
Accuracy
Uniqueness
(privacy)
(precision)
Completeness Data Quality Dimension
All mandatory values are present
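As a rough illustration, completeness can be scored as the share of records with every mandatory field populated. The sketch assumes records are dictionaries and treats `None` and blank strings as missing; both choices are assumptions, not part of the definition above.

```python
def completeness(records, mandatory_fields):
    """Return the share of records whose mandatory fields are all populated."""
    if not records:
        return 1.0  # assumption: an empty dataset has nothing missing

    def populated(value):
        return value is not None and str(value).strip() != ""

    complete = sum(
        all(populated(r.get(f)) for f in mandatory_fields) for r in records
    )
    return complete / len(records)


people = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": ""},
    {"name": "Cy", "email": None},
    {"name": "Di", "email": "di@example.com"},
]
score = completeness(people, ["name", "email"])  # 2 of 4 records are complete
```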
Consistency Data Quality Dimension
Data of one concept corresponds with the same concept in another system
Currency Data Quality Dimension
Is the data up to date?
Reasonableness Data Quality Dimension
Business rules: does it feel right / is it in line with what is expected?
Integrity Data Quality Dimension
Child data must have a parent
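One way to test this rule is to look for child records whose reference does not resolve to any parent. The sketch below assumes in-memory lists of dictionaries; the table and key names are hypothetical.

```python
def find_orphans(children, parents, foreign_key, parent_key):
    """Return child records whose foreign key matches no parent record."""
    parent_ids = {p[parent_key] for p in parents}
    return [c for c in children if c.get(foreign_key) not in parent_ids]


customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": "A", "customer_id": 1},
    {"order_id": "B", "customer_id": 3},     # invalid reference
    {"order_id": "C", "customer_id": None},  # missing reference
]
orphans = find_orphans(orders, customers, "customer_id", "id")
```

Here orders B and C are returned: one points at a customer that does not exist, the other has no customer at all.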
Timeliness Data Quality Dimension
Accessibility/ availability
Validity Data Quality Dimension
Is the value in the correct domain?
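A small sketch of a domain check, assuming each field's domain can be written as a predicate. The three domains shown (an enumerated list, a numeric range and a pattern) are hypothetical examples.

```python
import re

# Hypothetical value domains, one predicate per field.
DOMAINS = {
    "status": lambda v: v in {"active", "suspended", "closed"},
    "quantity": lambda v: isinstance(v, int) and 1 <= v <= 999,
    "order_ref": lambda v: isinstance(v, str)
    and re.fullmatch(r"ORD-\d{6}", v) is not None,
}


def invalid_fields(record, domains=DOMAINS):
    """Return the names of fields whose value falls outside its domain."""
    return [
        field
        for field, in_domain in domains.items()
        if field in record and not in_domain(record[field])
    ]


bad = invalid_fields({"status": "active", "quantity": 0, "order_ref": "ORD-12345"})
```

Passing this check only shows that a value is valid. It says nothing about whether the value is accurate.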
Accuracy Data Quality Dimension
Does the data correctly represent the real life model
Uniqueness Data Quality Dimension
business concept must not be duplicated.
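Duplicates of a business concept can be found by grouping records on a business key. The sketch assumes that trimming and lower-casing is enough normalisation; real matching is often fuzzier.

```python
from collections import defaultdict


def find_duplicates(records, key_fields):
    """Group records that share the same normalised business key."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        groups[key].append(record)
    # Keep only keys that occur more than once.
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


contacts = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann lee ", "email": "ANN@example.com"},
    {"name": "Raj Rao", "email": "raj@example.com"},
]
dupes = find_duplicates(contacts, ["name", "email"])
```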
How to measure data quality?
Stats (sampling / basic summaries/ process control charts)
Profiling (manual, tools, columnar, intra-table, cross-table)
information flow diagrams
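The process control chart mentioned above can be sketched with three-sigma limits around a baseline. This is a simplification: it takes sigma as the plain standard deviation of the baseline, whereas textbook individuals charts usually estimate it from moving ranges. The daily failure percentages are made-up illustration values.

```python
from statistics import mean, pstdev


def control_limits(baseline):
    """Return (lower, centre, upper) three-sigma limits for a baseline series."""
    centre = mean(baseline)
    sigma = pstdev(baseline)
    return centre - 3 * sigma, centre, centre + 3 * sigma


def out_of_control(values, baseline):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, _, upper = control_limits(baseline)
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical daily percentage of records failing a quality rule.
baseline = [2.0, 2.2, 1.8, 2.0, 2.0]
latest = [2.1, 1.9, 4.0]
flagged = out_of_control(latest, baseline)  # only the 4.0 reading is flagged
```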
process of profiling
- identify subset of data
- understand business use
- put into profiling tool
- list potential anomalies
- prioritise by criticality
What's the output of a profiling tool?
counts, summaries, data types, PKs, percentage of completeness, identification (e.g., duplicated records, out of range values).
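To make those outputs concrete, here is a toy column profiler. It is a sketch rather than a model of any particular tool, and it assumes `None` and empty strings count as missing.

```python
from collections import Counter


def profile_column(values):
    """Summarise one column: counts, completeness, data types, repeated values."""
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "count": len(values),
        "completeness_pct": 100 * len(present) / len(values) if values else 0.0,
        "distinct": len(counts),
        "types": sorted({type(v).__name__ for v in present}),
        "duplicated_values": [v for v, n in counts.items() if n > 1],
    }


ages = [34, 27, None, 34, "unknown"]
report = profile_column(ages)
```

For `ages`, the report shows 80% completeness, mixed `int` and `str` types, and 34 as a repeated value. Each of these is a potential anomaly to review against the business use.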
Inspecting the quality of data using statistical techniques is called
data profiling
Advantages of defining data quality rules upfront
- setting clear expectations for data quality
- creating the foundation for ongoing data quality measurement
- providing the requirements for system control to prevent quality issues
- provide data quality requirements to external parties
What 3 levels of data granularity should you measure for data quality
Data element value, record & dataset
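The three levels can be derived from one another: element results roll up into a record result, and record results roll up into a dataset score. The sketch below assumes a record passes only when all of its checked elements pass; the checks themselves are hypothetical.

```python
def measure(dataset, checks):
    """Return quality results at element, record and dataset level."""
    # Element level: one True/False per checked field per record.
    element = [{f: check(r.get(f)) for f, check in checks.items()} for r in dataset]
    # Record level: a record passes only if all its elements pass.
    record = [all(e.values()) for e in element]
    # Dataset level: share of passing records.
    dataset_score = sum(record) / len(record) if record else 1.0
    return element, record, dataset_score


checks = {
    "name": lambda v: bool(v),
    "age": lambda v: isinstance(v, int) and v >= 0,
}
rows = [
    {"name": "Ada", "age": 36},
    {"name": "", "age": 41},
    {"name": "Cy", "age": -1},
    {"name": "Di", "age": 20},
]
element, record, dataset_score = measure(rows, checks)
```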
How are data quality management and data governance linked?
- both are essential for organisational success
- both ongoing efforts
- governance supports DQ
- DQ sustains governance
Where to focus for DQ
- focus on critical data
- focus on preventing errors (not just fixing)
- address the root cause of problems
- enforce quality standards
Shewhart data quality cycle stages
plan do check act
(plan deploy monitor act)
When measuring data quality which three levels of granularity should you measure?
Data element, record, dataset
Data profiling software
Data profiling software investigates data to understand its structure, content and quality. It helps us find patterns and problems in the data.
Data Quality dimensions
Accuracy
Completeness
Integrity
Uniqueness
Consistency
Data Accuracy
How closely data represents reality
(hard to measure, compared to trusted sources e.g. check that postcodes match real postcodes)
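A sketch of the trusted-source comparison, assuming a reference lookup keyed by customer id exists. The lookup and all of its values are hypothetical, and records that cannot be verified are left out of the score instead of being counted as wrong.

```python
def accuracy(records, trusted, key, field):
    """Share of verifiable records whose field matches the trusted source."""
    checked = [r for r in records if r[key] in trusted]
    if not checked:
        return None  # nothing could be verified
    matches = sum(r[field] == trusted[r[key]] for r in checked)
    return matches / len(checked)


# Hypothetical trusted reference: customer id -> postcode.
trusted_postcodes = {"C1": "SW1A 1AA", "C2": "M1 1AE"}
crm = [
    {"id": "C1", "postcode": "SW1A 1AA"},
    {"id": "C2", "postcode": "EH1 1YZ"},  # well-formed, but disagrees with the source
    {"id": "C3", "postcode": "LS1 4AP"},  # not in the trusted source, so skipped
]
score = accuracy(crm, trusted_postcodes, "id", "postcode")
```

The C2 record is valid but inaccurate: its postcode is well-formed yet does not match the trusted source.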
Complete data
all data is present, no gaps
(depends on mandatory or optional fields)
Data consistency
Making sure 2 or more representations of something are the same
Data Integrity
Making sure data is complete, accurate and consistent
(making sure data objects are connected properly)
Referential Integrity
the connections between data objects are consistent
Internal Consistency Problem Example
list of names and emails, where 2 people have the same name, or some names don’t have emails.
Orphan
A data object with a missing or invalid reference to another data object
Data Quality Oversight Board
Provides strategic direction with policies & activities.
Data Value Domain
A set of rules that describe the set of values that can be taken.
Business rules are not required for….
critical data improvement