Chapter 4 - Data Quality Flashcards
integrating multiple systems and inappropriate database design are two causes of what? What is meant by inappropriate design?
Data redundancy
Inappropriate design refers to transactional databases that aren't in third normal form (3NF).
What types of data do the below statements describe?
1) Same or similar data elements exist in multiple places
2) Identical copies of the same information exist in multiple places
1) Redundant Data
2) Duplicate Data
Can you list the 8 Data Quality challenges?
1) Duplicate Data
2) Redundant Data
3) Invalid Data
4) Data Type Validation
5) Missing Values
6) Specification Mismatch
7) Nonparametric Data
8) Data Outliers
What must you watch out for with the data quality challenge of nonparametric data?
Whether the rank order of the values is significant.
Which data quality issue will you get if you don't validate that inbound data consistently maps to its target data type?
Specification Mismatch
What do you need to do to ensure you don’t get specification mismatch?
you need to ensure that inbound data is correctly mapped to its target data type
What data manipulation technique is helpful when you have numeric data you want to split into subcategories to facilitate analysis?
Recoding
increasing the ages in an age column by 4 years is also an example of recoding
Regarding data manipulation, what technique describes creating a new variable from a calculation on one or more existing variables (columns)?
Derived variable
Why is it not a good idea to use a column to store derived variables from another column? What should you do instead?
If the nature of the variables changes over time, the stored column would need constant updates. Instead, derived variables should be embedded as code so the value is calculated only when needed.
(Note: the book doesn't spell out that a derived variable stored in a column does not auto-update, which is why the issue arises. It is best to embed the calculation in the query or code itself.)
Which data manipulation technique would you use if you wanted a holistic view of a particular subject?
Data merging
What data manipulation technique helps to ensure data is consistent, complete and accurate?
Data Merging
What’s the difference between ETL and data blending when it comes to combining data?
ETL combines multiple data sources into a single data set stored in a data warehouse, whereas data blending combines them only at the reporting layer.
If the IT infrastructure was struggling to do ETL effectively, what other technique can you use to combine datasets that has less impact on IT?
Data blending using a data visualization tool
A data analyst MUST understand what if they’re to use data blending techniques?
they must understand how data maps across systems
if you needed to combine the variables of several columns into a single variable column, what data manipulation technique would you use?
Concatenation
A data append combines data from different sources into a single data set, but how does it differ from a data merge?
It differs in that the source data structures must be the SAME, i.e. when combining two data sources, both contain exactly the same attributes/columns.
With a data merge, by contrast, the source data comes from data sets with different structures/attributes.
You have the same data being recorded in different locations and need to combine it into a single data set; what manipulation technique does this?
Data Append
Imputation is a data manipulation technique to deal with what problem? [IMPORTANT]
missing values in a data set.
List the 5 imputation methods for dealing with missing numeric values.
1) Removing rows containing missing values
2) Replace with zero
3) Replace with overall average
4) Replace with Most Frequent (mode)
5) Closest Value average
reduction is the process of what?
shrinking a dataset WITHOUT negatively impacting its analytical value
Removing attributes to reduce a dataset’s overall size is known as what? Why would this be done?
It’s known as dimensionality reduction. It’s done to make data analysis on big datasets more efficient
Histograms are a method of __________ __________
numerosity reduction
this reduces QUANTITATIVE data
List the 3 methods to reduce a dataset and make big data analysis more efficient.
numerosity reduction
dimensionality reduction
sampling
What manipulation technique summarizes data, saving you from having to search through it?
Data aggregation calculations
if you wanted to present data but maintain privacy, which manipulation technique allows you to do this?
Data Aggregation allows you to do this by summarizing the data instead.
What data manipulation technique re-arranges the data for better visibility and prepares it for use in data visualization tools?
Transposition
what data manipulation technique helps to reveal patterns hidden in the original format of the data?
Transposition
What combination of data manipulation techniques greatly enhances ability to see trends, efficiently calculate dimensions, and can provide a more holistic view?
Transposition and Aggregation
pick the right word for the sentences:
Use ______ to address composite structural issues
Use ______ to address distributed structural issues
Words: Parsing / String Manipulation
Use parsing to address composite structural issues
Use string manipulation to address distributed structural issues
(When would you use string manipulation over concatenation?)
At what stage do you need to perform string manipulation: before or after you start analysis?
before analysis
Data quality: you should influence data quality as early as possible in WHICH part of the analytics life-cycle?
the Acquisition phase.
Data quality issues can be introduced at all stages of the data life-cycle; these stages are where quality control checks should be introduced. Where are they?
Data Acquisition phase
Data Manipulation Phase:
–Data Transformation (ETL)
–Data Conversion (ETL)
–Data Manipulation (ETL)
Data Visualization Phase:
(Final Product Preparation)
(TODO: add more questions about data quality issues.)
You're implementing some automated validation; what do you need to understand first?
How source data fields map to their corresponding database columns and, specifically, their data types!
what type of validation automation would you implement to control input form errors?
you would use data-type validation automation to ensure the correct data-type is used in the input form.
What two examples of automated validation does the author provide?
Data-type validation
Validation of number of data points
How might you prevent missing data (as in, complete missing rows and not just missing values) from flowing into your analytics environment?
(E.g. if the source day-temperature is measured 24 times a day, you should always get 24 rows.)
Implement automated validation that verifies the number of data points expected for a given attribute.
List the 6-dimensions of data quality
1) Accuracy
2) Completeness
3) Consistency
4) Timeliness
5) Uniqueness
6) Validity
What data quality dimension denotes how closely a given attribute matches how you intended to use it?
data accuracy
Match the data quality dimension to the questions asked:
1) Accuracy
2) Completeness
3) Consistency
4) Timeliness
5) Uniqueness
6) Validity
a) What do I intend to use this attribute for?
b) Is my attribute within its expected range?
c) Is it relevant when I obtain this data?
d) How reliable is this data, i.e. is it correct and the same across different systems?
e) Does my data have the minimum number of attributes needed to carry out the purpose for which it is collected?
f) Can we reduce the number of systems this data attribute exists on?
1) Accuracy / a) What do I intend to use this attribute for?
2) Completeness / e) Does my data have the minimum number of attributes needed to carry out the purpose for which it is collected?
3) Consistency / d) How reliable is this data, i.e. is it correct and the same across different systems?
4) Timeliness / c) Is it relevant when I obtain this data?
5) Uniqueness / f) Can we reduce the number of systems this data attribute exists on?
6) Validity / b) Is my attribute within its expected range?
when is the best time to check the validity of data?
At the moment of creation.