2.4: Preparing Data for Analysis Flashcards

1
Q

What are some considerations for ensuring data quality regarding data types?

A

Ensure that data types for each attribute are appropriate, and check all date, numerical, and text data types.
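For illustration, a minimal Python/pandas sketch for reviewing attribute types; the file name and the order_date column are placeholders, not from the source:

    import pandas as pd

    # Load the extracted data; the file name is a placeholder.
    df = pd.read_csv("extracted_data.csv")

    # Show the data type pandas inferred for each attribute so that date,
    # numerical, and text columns can each be reviewed.
    print(df.dtypes)

    # Example fix: a date column that arrived as text (hypothetical column name).
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")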

2
Q

Why is it important to check date data types, and what are some common date format issues?

A

Checking date data types is important because date formats can vary widely.

Common date format issues include different representations of the same date (e.g., 07/06/2022 can mean July 6 or June 7, 2022, depending on regional convention). These can be addressed by formatting dates according to an international standard such as ISO 8601 (e.g., 2022-07-06).
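As a hedged illustration in Python, the dateutil parser can normalize mixed representations to ISO 8601; the sample values are assumptions:

    from dateutil import parser

    raw_dates = ["July 6, 2022", "07/06/2022", "2022-07-06"]

    # Parse each representation and reformat to ISO 8601 (YYYY-MM-DD).
    # dayfirst should match the convention the source system actually used.
    iso_dates = [parser.parse(d, dayfirst=False).strftime("%Y-%m-%d") for d in raw_dates]
    print(iso_dates)  # ['2022-07-06', '2022-07-06', '2022-07-06']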

3
Q

What should you look for when checking numerical data types?

A

When checking numerical data types, watch for misinterpretations of numbers, invalid number formats, and numerical artifacts (e.g., dollar signs, commas) that should be removed.
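A minimal pandas sketch for removing such artifacts before converting to numbers; the values are illustrative:

    import pandas as pd

    amounts = pd.Series(["$1,250.00", "$87.50", "3,402"])

    # Remove currency symbols and thousands separators, then convert to numeric.
    # errors="coerce" turns anything that still isn't a valid number into NaN.
    cleaned = pd.to_numeric(amounts.str.replace(r"[$,]", "", regex=True), errors="coerce")
    print(cleaned)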

4
Q

How should you handle text data types with international characters, language differences, and measures?

A

For text data types with international characters and encoding issues, you can use find and replace or place characters in quotation marks to address problems.

Standardize words and units of measure to a common format, and ensure that differing units do not change the meaning of a comparison (e.g., values in U.S. dollars should not be compared directly with values in euros).
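One hedged approach in Python uses the standard unicodedata module to reconcile differing encodings of the same text; stripping accents is only one option and may not suit every dataset, so treat this as a sketch:

    import unicodedata

    values = ["São Paulo", " Sao Paulo ", "SAO PAULO"]

    def standardize(text):
        # Decompose accented characters, drop the accents, and apply a
        # common case so the same name is always spelled one way.
        decomposed = unicodedata.normalize("NFKD", text)
        ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
        return ascii_only.strip().title()

    print([standardize(v) for v in values])  # ['Sao Paulo', 'Sao Paulo', 'Sao Paulo']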

5
Q

What is the second step in preparing data for analysis?

A

The second step is to validate the data for completeness and integrity.

6
Q

What does “data completeness” refer to in the context of data preparation?

A

Data completeness means ensuring that the data you wish to analyze were fully extracted from their original source without any missing records or information.

7
Q

What does “data integrity” mean in the context of data preparation?

A

Data integrity means verifying that none of the data were manipulated or tampered with during the extraction process, ensuring that the data remain accurate and reliable.

8
Q

How can you use a checksum in data preparation?

A

A checksum is a value computed from the data (for example, the sum of the digits in a field), calculated on both the source data and the extracted data. Comparing the two figures helps detect errors introduced during extraction or transmission, confirming that the data remain intact and accurate.
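A minimal sketch of a checksum-style comparison, assuming a control figure is available from the source system; the file name, column name, and expected value are placeholders:

    import pandas as pd

    df = pd.read_csv("extracted_data.csv")  # placeholder file name

    # A simple digit-sum checksum: add up every digit in a field and compare
    # the result to the same figure computed in the source system.
    def digit_checksum(series):
        return sum(int(ch) for value in series.astype(str) for ch in value if ch.isdigit())

    extracted_checksum = digit_checksum(df["invoice_amount"])
    source_checksum = 1234567  # placeholder: supplied by the source system

    print("Checksums match:", extracted_checksum == source_checksum)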

9
Q

What are the four steps to validate data after extraction?

A

Compare the number of records extracted to the number of records in the source data.

Compare descriptive statistics for numerical fields, including minimums and maximums.

Validate date/time fields by identifying the earliest and latest valid dates.

Compare string limits for text fields to ensure no characters were cut off during extraction.
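A hedged pandas sketch of these four checks, assuming the corresponding figures from the source system are known; all file names, column names, and expected values below are placeholders:

    import pandas as pd

    df = pd.read_csv("extracted_data.csv")              # placeholder extract
    source = {"rows": 10_000,                           # placeholder figures
              "amount_min": 0.0, "amount_max": 99_999.99,
              "earliest": "2020-01-01", "latest": "2022-12-31",
              "max_name_len": 50}

    # 1. Record counts
    print("Row count matches:", len(df) == source["rows"])

    # 2. Descriptive statistics for a numerical field
    print("Min/max match:",
          df["amount"].min() == source["amount_min"],
          df["amount"].max() == source["amount_max"])

    # 3. Earliest and latest dates in a date/time field
    dates = pd.to_datetime(df["order_date"], errors="coerce")
    print("Date range matches:",
          dates.min() == pd.Timestamp(source["earliest"]),
          dates.max() == pd.Timestamp(source["latest"]))

    # 4. String limits for a text field
    print("No truncation suspected:",
          df["customer_name"].str.len().max() <= source["max_name_len"])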

10
Q

What is the process for comparing string limits for text fields and validating data after extraction?

A

Compare the string limits of text fields in the extracted data to the source database’s limits per field.

Ensure that no characters were cut off during the extraction process.
If errors are found:

For small datasets, visually inspect the data for missing or erroneous entries.

For large datasets or challenging errors, examine how the data were extracted, particularly when using Structured Query Language (SQL) for extraction.

Assess the SQL code for errors, fix them, and re-run the extraction if necessary.

This process helps verify the completeness and integrity of the extracted data, especially regarding text fields. If discrepancies are identified, steps are taken to rectify them.
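As a hedged illustration of the string-limit check, one way to flag possible truncation is to compare the longest extracted value in each text column against the width declared in the source schema; the widths below are placeholders you would look up in the source database:

    import pandas as pd

    df = pd.read_csv("extracted_data.csv")              # placeholder extract
    declared_widths = {"customer_name": 50,             # placeholder: from the
                       "shipping_address": 100}         # source table definition

    for column, width in declared_widths.items():
        longest = int(df[column].astype(str).str.len().max())
        # A longest value at or above the declared width suggests the
        # extraction may have cut characters off; inspect or re-run the SQL.
        if longest >= width:
            print(f"Possible truncation in '{column}' (longest value = {longest})")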

11
Q

What are the key steps to cleanse data after extraction and validation?

A

Remove headings and subtotals.

Remove leading zeroes and nonprintable characters.

Format negative numbers correctly.

Correct inconsistencies in data formatting.

Decide how to handle missing values: leave as is, remove records, or impute values based on the analysis requirements and the proportion of missing data.

12
Q

What is the first technique used to cleanse data after extraction and validation?

A

Remove headings and subtotals to eliminate unnecessary elements that may have been included in the data.
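A minimal pandas sketch of this step; the marker text "total" and the column name are assumptions about how the source report was laid out:

    import pandas as pd

    df = pd.read_csv("extracted_report.csv")  # placeholder file name

    # Drop rows that are report headings or subtotals rather than data,
    # e.g. rows whose description column contains "Total" or "Subtotal".
    is_subtotal = df["description"].astype(str).str.contains("total", case=False, na=False)
    df = df[~is_subtotal].reset_index(drop=True)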

13
Q

How can you address issues related to leading zeroes and nonprintable characters in data?

A

Remove leading zeroes from numbers or dates and identify/remove nonprintable characters using functions like TRIM and CLEAN in software packages like Excel, Power BI, and Tableau.
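TRIM and CLEAN have rough pandas equivalents; a hedged sketch with illustrative values:

    import pandas as pd

    codes = pd.Series(["00042\t", " 007 ", "0150\x0b"])

    # Strip whitespace and nonprintable control characters (similar in spirit
    # to Excel's TRIM and CLEAN), then drop leading zeroes by converting to numbers.
    cleaned_text = codes.str.replace(r"[\x00-\x1f\x7f]", "", regex=True).str.strip()
    as_numbers = pd.to_numeric(cleaned_text, errors="coerce")
    print(as_numbers)  # 42, 7, 150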

14
Q

What is the recommended formatting for negative numbers to ensure proper analysis?

A

Format negative numbers using the negative sign (e.g., -5.50) instead of parentheses (e.g., (5.50)).
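A short pandas sketch that converts accounting-style parenthesized negatives into signed numbers; the values are illustrative:

    import pandas as pd

    amounts = pd.Series(["(5.50)", "12.00", "(130.25)"])

    # Rewrite accounting-style "(5.50)" as "-5.50", then convert to numeric.
    signed = amounts.str.replace(r"^\((.*)\)$", r"-\1", regex=True)
    print(pd.to_numeric(signed, errors="coerce"))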

15
Q

How should inconsistencies in data formatting (e.g., state abbreviations) be addressed?

A

Choose a common value and make formatting consistent across the dataset.
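A minimal pandas sketch of standardizing state values to one common abbreviation; the variant mapping is an assumption and covers only the values shown:

    import pandas as pd

    states = pd.Series(["California", "CA", "Calif.", "NY", "New York"])

    # Map every known variant to one common value; unmapped entries are kept
    # so they can be reviewed rather than silently dropped.
    variants = {"California": "CA", "Calif.": "CA", "New York": "NY"}
    standardized = states.replace(variants)
    print(standardized)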

16
Q

What are the three options for dealing with missing values in data, and when might each be appropriate?

A

Leave missing values as they are when they provide meaningful insights or when not having data in certain fields is acceptable.

Remove records with missing values when analysis requires complete data and removing a small proportion of records won’t significantly impact results.

Impute values to replace missing data when a substantial amount of data is missing, but avoid imputation when more than 60% of values are missing to prevent distorting the data shape.
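A hedged pandas sketch of the three options; the file and column names are placeholders, and median imputation is just one possible choice:

    import pandas as pd

    df = pd.read_csv("extracted_data.csv")  # placeholder file name

    # Option 1: leave missing values as they are (do nothing).

    # Option 2: remove records with missing values in fields required for analysis.
    complete_only = df.dropna(subset=["amount", "order_date"])

    # Option 3: impute values, e.g. fill missing amounts with the column median,
    # but only if the share of missing values is not excessive.
    missing_share = df["amount"].isna().mean()
    if missing_share <= 0.60:
        df["amount"] = df["amount"].fillna(df["amount"].median())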

17
Q

What are the trim functions in Tableau?

A

Tableau's trim functions remove unwanted spaces from text fields: TRIM removes leading and trailing spaces, LTRIM removes leading spaces, and RTRIM removes trailing spaces.

18
Q

What are the trim and replace functions in Tableau?

A

Use TRIM, LTRIM, or RTRIM to strip unwanted leading or trailing spaces, and REPLACE(string, substring, replacement) to substitute every occurrence of a substring, for example to remove artifact characters from a text field.

19
Q

What is the final step in preparing data for analysis?

A

The final step is to perform preliminary exploratory analysis, which involves exploring the data to ensure data quality, validate data completeness and integrity, and assess the relevance of the initial question identified in the SOAR model.

This step may include examining descriptive statistics and data attributes to gain insights into the data’s shape and relationships.
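A minimal pandas sketch of such a preliminary pass; the file and column names are placeholders:

    import pandas as pd

    df = pd.read_csv("extracted_data.csv")  # placeholder file name

    # Descriptive statistics and attribute overview to understand the
    # data's shape before deeper analysis.
    print(df.describe(include="all"))
    df.info()

    # A simple relationship check between two numeric attributes.
    print(df[["amount", "quantity"]].corr())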

20
Q

What are the four steps of preparing data for analysis?

A

Ensure data quality by checking that date, numerical, and text data types are appropriate.

Validate the data for completeness and integrity.

Cleanse the data (remove headings, subtotals, leading zeroes, and nonprintable characters; format negative numbers; fix formatting inconsistencies; handle missing values).

Perform preliminary exploratory analysis.