Chapter 4 - Data Quality Flashcards
integrating multiple systems and inappropriate database design are two causes of what? What is meant by inappropriate design?
Data redundancy
Inappropriate design refers to transactional databases that aren't in third normal form (3NF).
What types of data do the below statements describe?
1) Same or similar data elements exist in multiple places
2) Identical copies of the same information exist in multiple places
1) Redundant Data
2) Duplicate Data
Can you list the 8 Data Quality challenges?
1) Duplicate Data
2) Redundant Data
3) Invalid Data
4) Data Type Validation
5) Missing Values
6) Specification Mismatch
7) Nonparametric Data
8) Data Outliers
What must you watch out for with the data quality challenge of nonparametric data?
Whether the rank order of the values is significant.
Which data quality issue will you get if you don't validate that inbound data consistently maps to its target data type?
Specification Mismatch
What do you need to do to ensure you don’t get specification mismatch?
you need to ensure that inbound data is correctly mapped to its target data type
What data manipulation technique is helpful when you have numeric data you want to split into subcategories to facilitate analysis?
Recoding
increasing the ages in an age column by 4 years is also an example of recoding
Regarding data manipulation, what technique describes creating a new variable from a calculation on one or more existing variables (columns)?
Derived variable
Why is it not a good idea to use a column to store derived variables from another column? What should you do instead?
If the nature of the variables changes over time, the stored column would need constant updates. Instead, derived variables should be embedded as code so the value is calculated only when needed.
(Note: the book doesn't spell out that a derived variable stored in a column does not auto-update, which is why the issue arises. It is best to embed the calculation in the query or code itself.)
Which data manipulation technique would you use if you wanted a holistic view of a particular subject?
Data merging
What data manipulation technique helps to ensure data is consistent, complete and accurate?
Data Merging
What’s the difference between ETL and data blending when it comes to combining data?
ETL combines multiple data sources into a single data set stored in a data warehouse, whereas data blending combines them only at the reporting layer.
If the IT infrastructure was struggling to do ETL effectively, what other technique can you use to combine datasets that has less impact on IT?
Data blending using a data visualization tool
A data analyst MUST understand what if they’re to use data blending techniques?
they must understand how data maps across systems
if you needed to combine the variables of several columns into a single variable column, what data manipulation technique would you use?
Concatenation
A data append combines data from different sources into a single data set, but how does it differ from a data merge?
It differs in that the source data structures must be the SAME, i.e. when combining two data sources, both contain exactly the same attributes/columns.
With a data merge, by contrast, the source data comes from data sets with different structures/attributes.
You have the same data being recorded in different locations and need to combine it into a single data set; what manipulation technique does this?
Data Append
Imputation is a data manipulation technique to deal with what problem? [IMPORTANT]
missing values in a data set.
List the 5 imputation methods for dealing with missing numeric values.
1) Removing rows containing missing values
2) Replace with zero
3) Replace with overall average
4) Replace with Most Frequent (mode)
5) Closest Value average
reduction is the process of what?
shrinking a dataset WITHOUT negatively impacting its analytical value
Removing attributes to reduce a dataset’s overall size is known as what? Why would this be done?
It’s known as dimensionality reduction. It’s done to make data analysis on big datasets more efficient
Histograms are a method of __________ __________
numerosity reduction
this reduces QUANTITATIVE data
List the 3 methods to reduce a dataset and make big data analysis more efficient.
numerosity reduction
dimensionality reduction
sampling
What manipulation technique summarizes data, saving you from having to search through it?
Data aggregation calculations
if you wanted to present data but maintain privacy, which manipulation technique allows you to do this?
Data Aggregation allows you to do this by summarizing the data instead.
What data manipulation technique re-arranges the data for better visibility and prepares it for use in data visualization tools?
Transposition
what data manipulation technique helps to reveal patterns hidden in the original format of the data?
Transposition
What combination of data manipulation techniques greatly enhances ability to see trends, efficiently calculate dimensions, and can provide a more holistic view?
Transposition and Aggregation
pick the right word for the sentences:
Use ______ to address composite structural issues
Use ______ to address distributed structural issues
Words: Parsing / String Manipulation
Use parsing to address composite structural issues
Use string manipulation to address distributed structural issues
(When would you use string manipulation over concatenation?)
At what stage do you need to perform string manipulation: before or after you start analysis?
before analysis
Data quality: you should influence data quality as early as possible in WHICH part of the analytics life-cycle?
the Acquisition phase.
Data quality issues can be introduced at all stages of the data life-cycle; these stages are where quality control checks should be introduced. Where are they?
Data Acquisition phase
Data Manipulation Phase:
–Data Transformation (ETL)
–Data Conversion (ETL)
–Data Manipulation (ETL)
Data Visualization Phase:
(Final Product Preparation)
(TODO: add more questions about data quality issues.)
You're implementing some automated validation; what do you need to understand first?
How source data fields map to their corresponding database columns and, specifically, their data types!
what type of validation automation would you implement to control input form errors?
you would use data-type validation automation to ensure the correct data-type is used in the input form.
What two examples of automated validation does the author provide?
Data-type validation
Validation of number of data points
How might you prevent missing data (as in, complete missing rows and not just missing values) from flowing into your analytics environment?
(E.g. if the source day-temperature is measured 24 times a day, you should always get 24 rows.)
Implement automated validation that verifies the number of data points expected for a given attribute.
List the 6-dimensions of data quality
1) Accuracy
2) Completeness
3) Consistency
4) Timeliness
5) Uniqueness
6) Validity
What data quality dimension denotes how closely a given attribute matches how you intended to use it?
data accuracy
Match the data quality dimension to the questions asked:
1) Accuracy
2) Completeness
3) Consistency
4) Timeliness
5) Uniqueness
6) Validity
a) What do I intend to use this attribute for?
b) Is my attribute within its expected range?
c) Is it relevant when I obtain this data?
d) How reliable is this data, i.e. is it correct and the same across different systems?
e) Does my data have the minimum number of attributes needed to carry out the purpose for which it is collected?
f) Can we reduce the number of systems this data attribute exists on?
1) Accuracy / a) What do I intend to use this attribute for?
2) Completeness / e) Does my data have the minimum number of attributes needed to carry out the purpose for which it is collected?
3) Consistency / d) How reliable is this data, i.e. is it correct and the same across different systems?
4) Timeliness / c) Is it relevant when I obtain this data?
5) Uniqueness / f) Can we reduce the number of systems this data attribute exists on?
6) Validity / b) Is my attribute within its expected range?
when is the best time to check the validity of data?
At the moment of creation.