Chapter 4 - Data Quality Flashcards
integrating multiple systems and inappropriate database design are two causes of what? What is meant by inappropriate design?
Data redundancy
Inappropriate design refers to transactional databases that aren’t in third normal form (3NF).
What types of data do the below statements describe?
1) Same or similar data elements exist in multiple places
2) Identical copies of the same information exist in multiple places
1) Redundant Data
2) Duplicate Data
Can you list the 8 Data Quality challenges?
1) Duplicate Data
2) Redundant Data
3) Invalid Data
4) Data Type Validation
5) Missing Values
6) Specification Mismatch
7) Nonparametric Data
8) Data Outliers
what must you watch out for on the data quality challenge around nonparametric data?
if the rank order of the values is significant
which data quality issue will you get if you don’t ensure that inbound data consistently maps to its target data type?
Specification Mismatch
What do you need to do to ensure you don’t get specification mismatch?
you need to ensure that inbound data is correctly mapped to its target data type
What data manipulation technique is helpful when you have numeric data you want to split into subcategories to facilitate analysis?
Recoding
increasing the ages in an age column by 4 years is also an example of recoding
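A minimal pandas sketch of recoding numeric data into categories (the column name and bin edges are illustrative assumptions, not from the book):

    import pandas as pd

    # Hypothetical data with a numeric age column
    df = pd.DataFrame({"age": [12, 25, 37, 44, 61, 73]})

    # Recode the numeric ages into ordinal subcategories (bins)
    df["age_group"] = pd.cut(
        df["age"],
        bins=[0, 17, 34, 54, 120],
        labels=["0-17", "18-34", "35-54", "55+"],
    )
    print(df)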
Regarding data manipulation, what technique describes creating a new variable from a calculation on an existing table?
Derived variable
Why is it not a good idea to use a column to store derived variables from another column? What should you do instead?
if the nature of the variables changes over time, the stored column would need constant updates. Instead, derived variables should be embedded as code so the value is calculated only when needed.
The book doesn’t spell out that a derived variable stored in a column does not auto-update, which is what causes the issue. It is best to embed the calculation in the query or code itself.
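A minimal pandas sketch of a derived variable calculated at analysis time rather than stored in a column (the table and column names are illustrative assumptions); in SQL the same idea is a computed expression in the SELECT clause rather than a stored column:

    import pandas as pd

    orders = pd.DataFrame({
        "quantity": [2, 5, 1],
        "unit_price": [9.99, 4.50, 20.00],
    })

    # Derive the total when it is needed, so it always reflects the
    # current quantity and unit_price values instead of going stale
    with_total = orders.assign(order_total=orders["quantity"] * orders["unit_price"])
    print(with_total)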
Which data manipulation technique would you use if you wanted a holistic view of a particular subject?
Data merging
What data manipulation technique helps to ensure data is consistent, complete and accurate?
Data Merging
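A minimal pandas sketch of merging two data sets on a shared key to build a more holistic view of a subject (the tables and key are illustrative assumptions):

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Babs"]})
    orders = pd.DataFrame({"customer_id": [1, 1, 2], "order_total": [10.0, 25.0, 7.5]})

    # Merge on the shared key so each order row carries its customer details
    merged = customers.merge(orders, on="customer_id", how="left")
    print(merged)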
What’s the difference between ETL and data blending when it comes to combining data?
ETL combines multiple sources of data into a single data set in a data warehouse database, whereas data blending only does this at the reporting layer.
If the IT infrastructure was struggling to do ETL effectively, what other technique can you use to combine datasets that has less impact on IT?
Data blending using a data visualization tool
A data analyst MUST understand what if they’re to use data blending techniques?
they must understand how data maps across systems
if you needed to combine the variables of several columns into a single variable column, what data manipulation technique would you use?
Concatenation
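A minimal pandas sketch of concatenation, combining several columns into a single variable column (the column names are illustrative assumptions):

    import pandas as pd

    df = pd.DataFrame({"first_name": ["Ada", "Babs"],
                       "last_name": ["Lovelace", "Jones"]})

    # Concatenate the two columns into one variable
    df["full_name"] = df["first_name"] + " " + df["last_name"]
    print(df)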
a data append combines data from different sources into a single data set, but how does it differ from data merge?
It differs in that the source data structures must be the SAME, i.e. if combining two data sources, those sources contain exactly the same data attributes/columns.
Whereas with data merge, the sources have different structures, each contributing different attributes about the same subject.
you have the same data being recorded in different locations and you need to combine them into a single data set, what manipulation technique does this?
Data Append
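A minimal pandas sketch of a data append, stacking two identically structured sources into one data set (the sites and readings are illustrative assumptions):

    import pandas as pd

    # Two sources recording exactly the same attributes/columns
    site_a = pd.DataFrame({"site": ["A", "A"], "reading": [20.1, 19.8]})
    site_b = pd.DataFrame({"site": ["B", "B"], "reading": [22.4, 21.9]})

    # Append stacks the rows into a single data set
    combined = pd.concat([site_a, site_b], ignore_index=True)
    print(combined)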
Imputation is a data manipulation technique to deal with what problem? [IMPORTANT]
missing values in a data set.
List the 5 data imputation methods for dealing with missing numeric values
1) Removing rows containing missing values
2) Replace with zero
3) Replace with overall average
4) Replace with Most Frequent (mode)
5) Closest Value average
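A minimal pandas sketch of the five methods above (the series values are illustrative assumptions; “closest value average” is approximated here with linear interpolation):

    import pandas as pd

    s = pd.Series([4.0, None, 6.0, None, 5.0])

    dropped   = s.dropna()            # 1) remove rows containing missing values
    zeroed    = s.fillna(0)           # 2) replace with zero
    mean_fill = s.fillna(s.mean())    # 3) replace with overall average
    mode_fill = s.fillna(s.mode()[0]) # 4) replace with most frequent value (mode)
    closest   = s.interpolate()       # 5) closest value average (linear interpolation)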
reduction is the process of what?
shrinking a dataset WITHOUT negatively impacting its analytical value
Removing attributes to reduce a dataset’s overall size is known as what? Why would this be done?
It’s known as dimensionality reduction. It’s done to make data analysis on big datasets more efficient
Histograms are a method of __________ __________
numerosity reduction
this reduces QUANTITATIVE data
list the 3 methods to reduce a dataset to make big data analysis more efficient?
numerosity reduction
dimensionality reduction
sampling
what manipulation technique summarizes data, saving you from having to search through it row by row?
Data aggregation calculations
if you wanted to present data but maintain privacy, which manipulation technique allows you to do this?
Data Aggregation allows you to do this by summarizing the data instead.
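A minimal pandas sketch of data aggregation, summarizing detail rows so individual records (and any identifying detail) are not exposed (the sales data is an illustrative assumption):

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["North", "North", "South", "South"],
        "amount": [100, 150, 90, 120],
    })

    # Summarize detail rows into regional totals
    summary = sales.groupby("region", as_index=False)["amount"].sum()
    print(summary)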
What data manipulation technique re-arranges the data for better visibility and prepares it for use in data visualization tools
Transposition
what data manipulation technique helps to reveal patterns hidden in the original format of the data?
Transposition
What combination of data manipulation techniques greatly enhances ability to see trends, efficiently calculate dimensions, and can provide a more holistic view?
Transposition and Aggregation
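A minimal pandas sketch combining aggregation with transposition via a pivot table, which makes the trend across quarters visible per region (the sales data is an illustrative assumption):

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "amount":  [100, 150, 90, 120],
    })

    # Aggregate the amounts and transpose quarters into columns
    pivot = sales.pivot_table(index="region", columns="quarter",
                              values="amount", aggfunc="sum")
    print(pivot)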
pick the right word for the sentences:
Use ______ to address composite structural issues
Use ______ to address distributed structural issues
Words: Parsing / String Manipulation
Use parsing to address composite structural issues
Use string manipulation to address distributed structural issues
(When would you use string manipulation rather than concatenation?)
What stage do you need to perform string manipulation, before or after you start analysis?
before analysis
data quality: you should aim to influence data quality as early as possible in WHAT part of the analytics life-cycle?
the Acquisition phase.
Data quality issues can be introduced at all stages of the data life-cycle; these stages are also the places to introduce quality control checks. Where are they?
Data Acquisition phase
Data Manipulation Phase:
–Data Transformation (ETL)
–Data Conversion (ETL)
–Data Manipulation (ETL)
Data visualization phase:
(Final Product Preparation)
You’re implementing some automated validation, what do you need to understand first?
how source data fields map to their corresponding database columns and specifically, their data-types!
what type of validation automation would you implement to control input form errors?
you would use data-type validation automation to ensure the correct data-type is used in the input form.
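A minimal Python sketch of data-type validation on a form input (the field name and allowed range are illustrative assumptions):

    def validate_age(raw_value: str) -> int:
        # Reject anything that cannot be converted to the target data type
        try:
            age = int(raw_value)
        except ValueError:
            raise ValueError(f"'{raw_value}' is not a whole number")
        # Simple range check on top of the type check
        if not 0 <= age <= 120:
            raise ValueError(f"{age} is outside the expected range 0-120")
        return age

    validate_age("42")       # returns 42
    # validate_age("forty")  # would raise ValueError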
What two examples of automated validation does the author provide?
Data-type validation
Validation of number of data points
How might you prevent missing data (as in, complete rows of data and not just missing values) from flowing into your analytics environment?
e.g. if the source temperature was measured 24 times a day, you should always get 24 rows per day.
Implement automated validation on the verification of the number of data points expected for a given attribute
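A minimal pandas sketch of validating the number of data points per day against the 24 readings expected (the feed and values are illustrative assumptions):

    import pandas as pd

    # Hypothetical hourly temperature feed: the second day is one reading short
    temps = pd.DataFrame({
        "day":  ["2024-01-01"] * 24 + ["2024-01-02"] * 23,
        "temp": range(47),
    })

    counts = temps.groupby("day").size()
    short_days = counts[counts != 24]
    if not short_days.empty:
        print("Days with missing readings:")
        print(short_days)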
List the 6-dimensions of data quality
1) Accuracy
2) Completeness
3) Consistency
4) Timeliness
5) Uniqueness
6) Validity
What data quality dimension denotes how closely a given attribute matches how you intended to use it?
data accuracy
Match the data quality dimension to the questions asked:
1) Accuracy
2) Completeness
3) Consistency
4) Timeliness
5) Uniqueness
6) Validity
a) what do I intend to use this attribute for?
b) is my attribute within its expected range?
c) is it relevant when I obtain this data?
d) how reliable is this data, i.e. is it correct, or the same across different systems?
e) Does my data have the minimum number of attributes in order to carry out the purpose the data is collected for?
f) Can we reduce the number of systems this data attribute exists on?
1) Accuracy / a) what do I intend to use this attribute for?
2) Completeness / e) Does my data have the minimum number of attributes in order to carry out the purpose the data is collected for?
3) Consistency / d) how reliable is this data, i.e. is it the same across different systems?
4) Timeliness / c) is it relevant when I obtain this data?
5) Uniqueness / f) Can we reduce the number of systems this data attribute exists on?
6) Validity / b) is my attribute within its expected range?
when is the best time to check the validity of data?
At the moment of creation.
Taking the count of rows that pass or fail during an ETL load is one method to measure what?
data conformity
the author lists 5 methods to validate quality, list them:
These methods are broader than data-type validation, at-creation checks, or manipulation during ETL.
1) Reasonable Expectations
2) Data Profiling
3) Data Audits
4) Sampling
5) Cross-Validation
Babs expects roughly 30 million records to be loaded monthly into the DWH. What data validation check could she do to detect issues with the ETL load?
Implement a Reasonable Expectation check by creating an exception report that will alert if the number of records is below a reasonably expected amount.
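A minimal sketch of such an exception report (the threshold, tolerance, and loaded row count are illustrative assumptions):

    # Reasonable-expectations check after the monthly ETL load
    EXPECTED_MONTHLY_ROWS = 30_000_000
    TOLERANCE = 0.10                  # alert if more than 10% below expectation

    loaded_row_count = 21_500_000     # e.g. the result of a row count after the load

    if loaded_row_count < EXPECTED_MONTHLY_ROWS * (1 - TOLERANCE):
        print(f"ALERT: only {loaded_row_count:,} rows loaded; "
              f"expected roughly {EXPECTED_MONTHLY_ROWS:,}")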
whether the data in your analytic environment looks as it should. A.K.A?
Reasonable Expectations
Which data quality validation uses statistical measures to check for discrepancies and is good at identifying irregular patterns within your data?
Data Profiling
Results of data profiling can be used as a Reasonable Expectations test, true or false?
TRUE. The results can highlight patterns that can be defined as being within or outside of reasonable expectations
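A minimal pandas sketch of data profiling checks that surface data types, missing values, and duplicates (the sample frame is an illustrative assumption):

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 4],
        "spend": [100.0, None, 250.0, 90.0],
    })

    print(df.dtypes)              # data type per column
    print(df.describe())          # summary statistics for numeric columns
    print(df.isna().sum())        # missing values per column
    print(df.duplicated().sum())  # count of duplicate rows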
which data quality validation method also uses Data Profiling techniques to help validate data integrity, but also helps in identifying security issues?
Data Audits
Regular _______ ensures that data quality remains high throughout the data life-cycle
data profiling
data profiling is often used at which part of the data life-cycle?
At the Acquisition phase
which data quality validation method is a cost effective way to check the quality of data?
Data Sampling
Cross-validation, whilst not directly used for data quality validation, indirectly helps identify issues with another data quality validation method. Which is it?
Cross-validation helps detect bias in the method of data SAMPLING
what do you need to be mindful of when it comes to data outliers?
Their impact on statistical analysis
what is the difference between the two data quality challenges of duplicate data and redundant data?
1) Duplicate data is identical data or values that appear more than once (maybe considered necessary/valid duplication)
2) Whereas redundant data is unnecessary duplication that can be removed safely.
think about redundancy in terms of table normalization: instead of recording the same value multiple times, it can be ‘removed’ by creating a reference table.
what data manipulation technique would you use if you wanted to create a master customer profile made from different database subjects about a particular customer?
Data Merge
when does data merging usually occur?
At the Transformation part of ETL process
What data manipulation technique is useful when you have numeric data you want to analyse by category?
Recoding
regarding recoding data, what is meant by when the values are ordinal and nominal?
Ordinal values where the category has an inherent rank, like T-shirt size for example
Nominal values are when the categories have no particular order, like hair colour for instance.
An analyst ideally needs to automate quality improvements at which part of the data life-cycle?
the data acquisition phase (collection and preparation)
What step must you undertake before applying Imputation to deal with null values?
You must understand the context of the null values by explicitly checking them.
Bizarrely, the data quality dimension of “data validity” specifically identifies what? The author also says it is interchangeable with another term which I disagree with, what is it?
Whether a given value falls within an expected range
The author also says data validity is AKA data integrity!
Data derived from categorical or ordinal rank
nonparametric data
two reasons responsible for creating duplicate data
In order:
1) Humans
2) Multiple data sources
Two ways to deal with duplicate data
1) Prevent creation in first place (best)
2) Employ duplicate resolution process
Use warning prompts in software to prevent creation
causes of redundant data
Integrating multiple systems that use shared data elements
Tables not in 3NF
Options to treat redundant data
Synchronize changes to shared data elements
Tables in 3NF
values outside the valid range for a given attribute
Invalid data
Invalid values are easy to check for what data?
Numeric and date data
It’s trickier to check invalid data when it is what data?
What would you implement to avoid invalid data of this kind?
Text data
Implementing good referential integrity resolves this
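A minimal pandas sketch of checking numeric data against its valid range (the attribute and range are illustrative assumptions):

    import pandas as pd

    readings = pd.DataFrame({"humidity_pct": [45, 67, 131, -5, 80]})

    # Flag values that fall outside the valid range for the attribute
    invalid = readings[~readings["humidity_pct"].between(0, 100)]
    print(invalid)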
when the source column has a different datatype to the destination column
Specification Mismatch
how to resolve specification mismatch
validate data mapping of columns
data-type inconsistency (within the same column) during data load causes what?
Database load process to fail/stop
prevention of database load failures caused by inconsistent data-types
Validate data-type consistency before data load
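A minimal pandas sketch of validating data-type consistency before a load, routing failures aside instead of letting the load stop (the inbound values are illustrative assumptions):

    import pandas as pd

    # Inbound column that should be numeric but contains mixed content
    incoming = pd.Series(["12.5", "13.1", "n/a", "12.9"])

    # Coerce to the target type; values that fail become NaN and can be
    # moved to a staging/remediation area rather than failing the load
    converted = pd.to_numeric(incoming, errors="coerce")
    bad_rows = incoming[converted.isna()]
    print(bad_rows)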
combines multiple sources of data into a single dataset and presents it at the reporting layer
data blending
This is done through data visualization software.
Aggregating a range of values into a bin and then counting how frequently values within that bin occur.
What is this and what is it an example of?
A histogram
An example of numerosity reduction
Because it summarizes values into bins and counts their frequency instead of keeping the individual data points themselves, thereby reducing the data.
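A minimal pandas sketch of this numerosity reduction, where bin counts stand in for the individual data points (the values and bin edges are illustrative assumptions):

    import pandas as pd

    ages = pd.Series([12, 25, 25, 37, 44, 44, 61, 73])

    # Bin the raw values and count how often each bin occurs
    bins = pd.cut(ages, bins=[0, 20, 40, 60, 80])
    print(bins.value_counts().sort_index())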
a raw data source has multiple, distinct values combined within a single character column
e.g. m014 (sex = m, age = 0-14)
Composite column
considered a composite structural issue
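A minimal pandas sketch of parsing such a composite column into its distinct parts (the codes follow the m014 example; the split positions are an illustrative assumption):

    import pandas as pd

    raw = pd.DataFrame({"sex_age": ["m014", "f1524", "m65"]})

    # Parse the composite value: first character = sex, remainder = age group
    raw["sex"] = raw["sex_age"].str[0]
    raw["age_group"] = raw["sex_age"].str[1:]
    print(raw)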
These are all part of the data cleaning and manipulation phase. What process typically handles them?
1) Data Validation: Ensuring data consistency and accuracy during extraction.
2) Data Screening: Identifying issues during the extraction phase.
3) De-duplication: Removing duplicate records.
4) Handling Invalid Data: Transforming invalid entries into valid ones.
5) Handling Missing Data: Addressing missing values during transformation.
ETL process
verify limitations of data source AND present drop-down lists on form entry
are two ways to check for data quality at what phase of the data analysis life-cycle?
Data Acquisition
Where data comes from more than one source and you’re combining these into one table, what quality issue can arise that you should check for?
What phase of the data analysis life-cycle does this align with and what process handles this?
Conversion issues (data consistency)
You would check that the source data doesn’t differ in value types, and if it does, convert the data from one source so the column contains consistent data.
This occurs in the Data Manipulation/Cleaning phase and the ETL process handles it
Identifying WHEN data is acquired and HOW it is transformed can act as a quality check in what process?
What phase of the data analysis life-cycle does this happen?
At ETL process
Data Manipulation phase
At any ETL stage, whether an intermediary hop or the final hop into the Data Warehouse.
How can data quality issues be introduced right at the last phase of the data analysis life-cycle (Data Visualization)? How might you mitigate / prevent this?
If the data analyst doesn’t have good knowledge of how the data is mapped, they may end up joining tables wrongly and producing errors.
To avoid this, the analyst must clearly understand the data and how it maps across the source systems.
elements of data accuracy, consistency, uniqueness and validity taken altogether are a measure of what?
Data Conformity
Moving failed rows into a staging area for remediation is an efficient way of dealing with what issue? Why?
Data Conformity issues
It prevents the entire load process from failing
which of the 5 broader data quality methods can detect data inconsistencies like missing, duplicate or incorrect data types?
Data Profiling
If you have to deal with a very large dataset and need to assess the quality of it, which of the 5 broader methods to validate data quality mentioned in the book would you use?
Sampling
and then running descriptive statistics on the sample rather than on the entire dataset
in which part of the ETL process does data conversion take place?
Extract process
Which data collection methods are a valuable source of research and insight?
Human-in-the-loop methods (surveys and observation)
Which is the more modern variant - ELT or ETL?
ELT
sampling occurs where data is broken down into subgroups, like gender, and then randomly sampled from each of the groups.
Stratified
sampling method where each record of data has an equal chance of being selected for the data set used in the analysis
Simple Random
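A minimal pandas sketch of both sampling methods (the frame, group column, and sample fraction are illustrative assumptions):

    import pandas as pd

    people = pd.DataFrame({
        "gender": ["F", "F", "F", "M", "M", "M"],
        "score":  [70, 82, 90, 65, 88, 75],
    })

    # Simple random sampling: every record has an equal chance of selection
    simple = people.sample(frac=0.5, random_state=1)

    # Stratified sampling: sample separately within each subgroup
    stratified = people.groupby("gender").sample(frac=0.5, random_state=1)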
which of the below methods is suited for minute-by-minute updates into the data warehouse?
ELT or ETL
ELT
data that is repeated (identical) within the same dataset
duplicate data
nonparametric data can have what effect on the normal distribution?
it can skew it left or right
leading spaces, trailing spaces and non-printable characters in a field will mean the data is what?
invalid data
it combines fields from multiple datasets into a single dataset
it allows analysts to consolidate data sets into a single dataset
DATA MERGE
to combine data from ONE set with another data set
DATA APPEND
a variable is one that’s created from existing data through calculations, transformations, or other operations.
DERIVED
function to join two datasets with the same columns and types, deleting the original table afterwards?
INLINE APPEND
What is a field property setting that tells the database that a field needs to be tabulated?
INDEXING
internal processing field property setting that controls how much data must be looked at when processing the data for queries involving commonly sorted or filtered data?
INDEX FIELD
data is entered the same way each time and as intended based on the business rules. Example of what quality dimension?
Consistency
what is the most common method to prepare data for a DWH?
ETL or ELT?
ETL