4 Cleaning and Processing Data Flashcards

Question 1

Q

What is the primary issue with duplicate data in a dataset?

Answer

A

Duplicate data can skew or bias your results, or invalidate your analysis entirely.

Question 2

Q

Define duplicate data.

Answer

A

Duplicate data is when a specific data point recurs multiple times within a dataset.

Question 3

Q

What is the impact of duplicate data on descriptive statistics?

Answer

A

It can distort averages and percentages, leading to incorrect conclusions about the dataset.
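
A minimal sketch of that distortion, using made-up order totals (the values and variable names are illustrative assumptions, not data from the chapter). One order entered three times shifts the mean noticeably:

```python
from statistics import mean

# Hypothetical order totals; the 500 order was accidentally entered three times.
orders_with_duplicates = [100, 120, 500, 500, 500, 80]
orders_clean = [100, 120, 500, 80]

mean_with_duplicates = mean(orders_with_duplicates)  # 1800 / 6 = 300
mean_clean = mean(orders_clean)                      # 800 / 4 = 200

print(mean_with_duplicates, mean_clean)
```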

Question 4

Q

What is redundant data?

Answer

A

Redundant data refers to columns that can be used to perfectly predict other columns.

Question 5

Q

How does redundant data differ from duplicate data?

Answer

A

Duplicate data is a copy of a row, whereas redundant data is a copy of a column.

Question 6

Q

What is multicollinearity?

Answer

A

Multicollinearity occurs when multiple independent variables in a model are highly correlated.
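
As a rough illustration, a correlation coefficient close to 1 or -1 between two independent variables is one warning sign. The sketch below computes Pearson's r by hand for two hypothetical temperature columns, where one is an exact conversion of the other (the column names and values are assumptions for the example):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical redundant columns: Fahrenheit is fully determined by Celsius.
temp_celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
temp_fahrenheit = [c * 9 / 5 + 32 for c in temp_celsius]

r = pearson_r(temp_celsius, temp_fahrenheit)
print(round(r, 6))  # effectively 1.0, so one column adds no new information
```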

Question 7

Q

What is a common approach to handle duplicate data?

Answer

A

The most common approach is to delete the duplicate rows, keeping one copy of each.
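
One possible sketch of row deduplication, assuming the first copy of each row is kept and only exact matches count as duplicates. The rows and the helper name `drop_duplicate_rows` are hypothetical:

```python
# Hypothetical employee rows: (id, department, salary).
rows = [
    ("A001", "Sales", 52000),
    ("A002", "HR", 48000),
    ("A001", "Sales", 52000),  # exact duplicate of the first row
]

def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique_rows.append(row)
    return unique_rows

# The original list is left untouched; work happens on a new list.
deduplicated = drop_duplicate_rows(rows)
print(deduplicated)
```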

Question 8

Q

What are some potential issues with having redundant data in a statistical model?

Answer

A

It can make results harder to interpret and can lead to inaccurate models when applied to the population.

Question 9

Q

What is missing data?

Answer

A

Missing data refers to gaps in a dataset where no information is available for certain entries.

Question 10

Q

Why is missing data problematic for data analysts?

Answer

A

Most analyses won’t run with null values, leading to errors and reduced statistical power.

Question 11

Q

What are the three main categories of missing data?

Answer

A

Missing Completely at Random (MCAR)
Missing at Random (MAR)
Missing Not at Random (MNAR)

Question 12

Q

What does Missing Completely at Random (MCAR) mean?

Answer

A

Data is MCAR when there is no connection between the missing values and the present values.

Question 13

Q

What does Missing at Random (MAR) imply?

Answer

A

MAR means the missingness is related to another recorded variable, not to the missing values themselves.

Question 14

Q

Describe Missing Not at Random (MNAR).

Answer

A

MNAR occurs when the missing data is related to some unrecorded variable or factor.

Question 15

Q

What is a recommended practice when working with datasets?

Answer

A

It is generally good practice to work on a copy of your data instead of the original.

Question 16

Q

What can happen if too much redundant data is included in a dataset?

Answer

A

It can lead to multicollinearity, complicating the interpretation of statistical models.

Question 17

Q

Fill in the blank: Redundant data can lead to _______ in statistical models.

Answer

A

multicollinearity

Question 18

Q

True or False: All methods for dealing with missing data are universally accepted.

Answer

A

False.

Question 19

Q

How can one create a subset of data excluding redundant columns?

Answer

A

By using functions like drop() to exclude the redundant variables.
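
A short sketch of this, assuming the pandas library and its `DataFrame.drop` method. The DataFrame contents and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical data where temp_fahrenheit is redundant with temp_celsius.
df = pd.DataFrame({
    "temp_celsius": [10.0, 20.0, 30.0],
    "temp_fahrenheit": [50.0, 68.0, 86.0],
    "sales": [120, 150, 90],
})

# drop() returns a new DataFrame by default, so the original df is unchanged.
subset = df.drop(columns=["temp_fahrenheit"])
print(list(subset.columns))
```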

Question 20

Q

What is the consequence of having missing data that is not random?

Answer

A

It can introduce bias into the results.

Question 21

Q

What is the main reason for identifying the type of missing data?

Answer

A

It helps determine how much the missing data will influence the outcome and potential bias.

Question 22

Q

What does MNAR stand for?

Answer

A

Missing Not at Random.

Question 23

Q

What is a key characteristic of MNAR data?

Answer

A

It has a connection to some variable or type of information that was not recorded.

Question 24

Q

Why is MNAR data considered problematic?

Answer

A

It is the most likely to cause bias in results.

Question 25

Q

What is the easiest approach to handle missing data?

Answer

A

Deletion.

Question 26

Q

What is a critical note to remember when working with deletion methods?

Answer

A

Always work on a copy of your data.

Question 27

Q

What is listwise deletion?

Answer

A

Deleting an entire observation if a single value is missing.
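
A minimal sketch of listwise deletion on hypothetical survey rows, with `None` standing in for a missing value (an assumption of this example):

```python
# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 49000},
]

# Keep a row only if every one of its values is present.
complete_rows = [
    row for row in rows
    if all(value is not None for value in row.values())
]
print(len(complete_rows))  # 2 of the 4 rows survive
```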

Question 28

Q

What is pairwise deletion?

Answer

A

Excluding only the specific missing values from each calculation while retaining the rest of the data in the row.
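
A sketch of the idea on hypothetical survey rows, with `None` marking a missing value. Each statistic uses every row that has a value for that column, so a row missing `age` still contributes its `income`. The helper name `available` is an assumption of this example:

```python
from statistics import mean

# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 49000},
]

def available(rows, column):
    """Values of one column, skipping only the rows where it is missing."""
    return [row[column] for row in rows if row[column] is not None]

mean_age = mean(available(rows, "age"))        # uses rows 1, 3, 4
mean_income = mean(available(rows, "income"))  # uses rows 1, 2, 4
print(mean_age, mean_income)
```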

Question 29

Q

When is variable deletion appropriate?

Answer

A

When over half of the values for a specific variable are missing.
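
A sketch of that rule of thumb, assuming columns stored as lists with `None` for missing values. The column names and the 0.5 threshold variable are illustrative:

```python
# Hypothetical columns; None marks a missing value.
columns = {
    "age": [34, None, 45, 29],
    "fax_number": [None, None, None, "555-0100"],
}

MAX_MISSING_FRACTION = 0.5  # drop a variable when more than half is missing

def missing_fraction(values):
    """Share of entries in a column that are missing."""
    return sum(value is None for value in values) / len(values)

kept = {
    name: values
    for name, values in columns.items()
    if missing_fraction(values) <= MAX_MISSING_FRACTION
}
print(list(kept))
```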

Question 30

Q

What is filtering in the context of missing data?

Answer

A

Removing values to create a subset of data that has no missing data.

Question 31

Q

What is imputation?

Answer

A

Filling in missing data instead of removing it.

Question 32

Q

What does mean, median, or mode imputation involve?

Answer

A

Filling in gaps with a measure of the variable's center: its mean, median, or mode.
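
A small sketch with made-up numbers, using `None` for the gap. The three statistics are computed from the observed values only, and the mean is then used as the fill value (choosing the mean here is just for illustration):

```python
from statistics import mean, median, mode

# Hypothetical measurements; None marks the missing entry.
values = [4, 8, None, 8, 20]
observed = [v for v in values if v is not None]

fill_options = {
    "mean": mean(observed),      # 40 / 4 = 10
    "median": median(observed),  # middle of 4, 8, 8, 20 is 8
    "mode": mode(observed),      # most frequent value is 8
}

# Mean imputation: replace each gap with the mean of the observed values.
imputed = [v if v is not None else fill_options["mean"] for v in values]
print(imputed)
```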

Question 33

Q

What is hot deck imputation?

Answer

A

Filling in missing values with observed values taken, often at random, from other records in the same dataset.
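
A sketch of the simplest variant, where each gap is filled with a randomly chosen observed value from the same column. The function name `hot_deck_impute`, the data, and the fixed seed are assumptions of this example; practical versions often restrict donors to similar records:

```python
import random

def hot_deck_impute(values, rng):
    """Fill each None with a randomly chosen observed value from the same list."""
    donors = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(donors) for v in values]

# Hypothetical ratings; None marks a missing response.
ratings = [3, None, 5, 4, None, 2]
rng = random.Random(0)  # seeded only so the example is repeatable
imputed = hot_deck_impute(ratings, rng)
print(imputed)
```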

Question 34

Q

What is interpolation?

Answer

A

Estimating specific values for missing data points using other values as reference.
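
One common form is linear interpolation, sketched below for evenly spaced readings. The function name and data are hypothetical, and gaps at the very start or end are left as `None` because they have no neighbor on one side:

```python
def interpolate_linear(values):
    """Fill interior None gaps on a straight line between the nearest known values."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

# Hypothetical sensor readings taken at equal intervals.
readings = [10.0, None, None, 16.0, 18.0]
print(interpolate_linear(readings))
```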

Question 35

Q

Why is MNAR data difficult to handle with common methods?

Answer

A

There is an unknown reason for the missing data.

Question 36

Q

What is invalid data?

Answer

A

Data that does not match expected values or ranges.

Question 37

Q

What causes specification mismatch?

Answer

A

A value having a different data type than the other values in a variable.

Question 38

Q

What is data type validation?

Answer

A

Checking the data type of a variable to avoid specification mismatches.
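
A minimal sketch of such a check, assuming a column that should contain only integers. The data and variable names are made up:

```python
# Hypothetical "age" column that should hold integers only.
ages = [34, 29, "forty-one", 52, "38"]

# Record the position and value of every entry with the wrong type.
mismatches = [
    (index, value)
    for index, value in enumerate(ages)
    if not isinstance(value, int)
]
print(mismatches)
```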

Question 39

Q

What is non-parametric data?

Answer

A

Data that does not follow a normal or well-known distribution.

Question 40

Q

What is an example of a common distribution in statistics?

Answer

A

Normal distribution.

Question 41

Q

What can happen if a single value in a variable has the wrong data type?

Answer

A

It can cause errors in data analysis.

Question 42

Q

What should you do if you identify invalid data?

Answer

A

Generate a list of unique values and check for discrepancies.
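
A sketch of that check for a categorical column, assuming you know the set of valid values in advance. The department names and the `expected` set are illustrative:

```python
from collections import Counter

# Hypothetical department column with one misspelled entry.
departments = ["Sales", "HR", "Sale", "Sales", "IT", "HR"]
expected = {"Sales", "HR", "IT"}  # assumed list of valid values

counts = Counter(departments)                     # how often each value occurs
unexpected = sorted(set(departments) - expected)  # values not on the valid list
print(counts)
print(unexpected)
```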

Question 43

Q

Fill in the blank: Listwise deletion is also known as _______.

Answer

A

casewise deletion.

Question 44

Q

True or False: Pairwise deletion is less likely to introduce bias compared to listwise deletion.

Question 45

Q

What do you risk by deleting data that is not MCAR?

Answer

A

Introducing bias.

Question 46

Q

What is a common method to reduce invalid data during data entry?

Answer

A

Using a drop-down menu.

Question 47

Q

What is a normal distribution?

Answer

A

A symmetric, bell-shaped distribution centered on the mean, used to estimate the probability that a new value will fall within a given range.

Question 48

Q

What characterizes non-parametric data?

Answer

A

Non-parametric data does not follow any of the common distributions.

Question 49

Q

Why can non-parametric data be problematic?

Answer

A

The majority of common statistical analyses are inherently parametric and assume that the data follows a specific distribution.

Question 50

Q

What are distribution-free tests?

Answer

A

Statistical analyses that do not assume any specific distribution.

Question 51

Q

What are outliers?

Answer

A

Data points that are significantly larger or smaller than the rest and can skew the entire dataset.

Question 52

Q

What issue can a single outlier cause?

Answer

A

It can artificially pull all of the results up and to the right.

Question 53

Q

How are outliers typically identified?

Answer

A

By calculating the standard deviation or the interquartile range (IQR).

Question 54

Q

What is the common cutoff for identifying outliers using standard deviation?

Answer

A

More than three standard deviations away from the mean.
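
A sketch of this cutoff on made-up data, using the sample standard deviation. Note that very small datasets cannot produce a value more than three standard deviations from the mean, so the example uses 21 points:

```python
from statistics import mean, stdev

# Hypothetical measurements: 20 typical values plus one extreme value.
values = [50, 51, 49, 52, 48] * 4 + [200]

center = mean(values)
spread = stdev(values)  # sample standard deviation

# Flag anything more than three standard deviations from the mean.
outliers = [v for v in values if abs(v - center) > 3 * spread]
print(outliers)
```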

Question 55

Q

What is the common cutoff for identifying outliers using IQR?

Answer

A

More than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
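
A sketch of the IQR rule on made-up data, using `statistics.quantiles` from the standard library. Different tools compute quartiles slightly differently, so the exact fences can vary:

```python
from statistics import quantiles

# Hypothetical delivery times in minutes, with one extreme value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]

q1, _, q3 = quantiles(values, n=4)  # first, second, and third quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is treated as an outlier.
outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(outliers)
```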

Question 56

Q

What is the pragmatic approach to handling outliers in data analytics?

Answer

A

Creating a range; anything outside of that range is considered an outlier and is deleted.

Question 57

Q

What types of data issues are addressed in data cleaning?

Answer

A

Duplicate data
Redundant data
Missing data
Invalid data
Specification mismatch
Data type validation

Question 58

Q

What is the importance of cleaning data?

Answer

A

It removes elements that will cause errors in analysis.

Question 59

Q

What is the next chapter about after cleaning data?

Answer

A

Data wrangling and manipulation.

Question 60

Q

Fill in the blank: Non-parametric data is a problem because you can’t use _______ analyses on it.

Answer

A

parametric

Question 61

Q

True or False: All statistical analyses are suitable for non-parametric data.

Answer

A

False.

Question 62

Q

When identifying outliers, what should you avoid doing?

Answer

A

Eyeballing it and guessing.

Question 63

Q

What are the typical methods for deleting missing data?

Answer

A

Listwise
Pairwise
Variable

Question 64

Q

What error type is indicated by a mismatch in department names, like ‘Sales’ and ‘Sale’?

Answer

A

Invalid data

Answer 62

A

It is probably an outlier, and you should check your ranges to be sure.