4 Cleaning and Processing Data Flashcards

1
Q

What is the primary issue with duplicate data in a dataset?

A

Duplicate data can skew or bias your results, or invalidate your analysis entirely.

2
Q

Define duplicate data.

A

Duplicate data is when a specific data point recurs multiple times within a dataset.

3
Q

What is the impact of duplicate data on descriptive statistics?

A

It can distort averages and percentages, leading to incorrect conclusions about the dataset.
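
A small illustration of the distortion, as a sketch only: the order totals below are made-up numbers, and the duplicated value is assumed to be a data-entry accident.

```python
from statistics import mean

# Hypothetical order totals; the 500 order was accidentally recorded three times.
orders = [100, 120, 500, 500, 500, 110]
deduplicated = [100, 120, 500, 110]

mean_with_duplicates = mean(orders)           # 305
mean_without_duplicates = mean(deduplicated)  # 207.5
```

The two extra copies of a single order raise the average by almost half.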

4
Q

What is redundant data?

A

Redundant data refers to columns that can be used to perfectly predict other columns.

5
Q

How does redundant data differ from duplicate data?

A

Duplicate data is a repeated row, whereas redundant data is a repeated column, or one whose values another column fully determines.

6
Q

What is multicollinearity?

A

Multicollinearity occurs when multiple independent variables in a model are highly correlated.
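
One rough way to spot highly correlated predictors is to compute the Pearson correlation between pairs of them. The sketch below assumes two hypothetical variables (the same height stored in two units); the `pearson` helper is written here for illustration and is not from the source.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical predictors: height recorded in both centimetres and inches.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [cm / 2.54 for cm in height_cm]

r = pearson(height_cm, height_in)  # very close to 1.0
```

A coefficient near 1 or -1 suggests the two variables carry nearly the same information.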

7
Q

What is a common approach to handle duplicate data?

A

The most common approach is to delete the duplicate rows, keeping one copy of each.
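
A minimal sketch of row de-duplication in plain Python, assuming each row is a hashable tuple; the customer rows are invented for the example.

```python
# Hypothetical rows of (customer_id, city); the row for customer 2 appears twice.
rows = [(1, "Leeds"), (2, "York"), (2, "York"), (3, "Hull")]

# dict.fromkeys keeps the first occurrence of each row and preserves order.
unique_rows = list(dict.fromkeys(rows))
```

Because `rows` itself is left untouched, this also follows the advice to work on a copy.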

8
Q

What are some potential issues with having redundant data in a statistical model?

A

It can make results harder to interpret and can lead to inaccurate models when applied to the population.

9
Q

What is missing data?

A

Missing data refers to gaps in a dataset where no information is available for certain entries.

10
Q

Why is missing data problematic for data analysts?

A

Most analyses won’t run with null values, so missing data leads to errors, and removing the affected records reduces statistical power.

11
Q

What are the three main categories of missing data?

A
  • Missing Completely at Random (MCAR)
  • Missing at Random (MAR)
  • Missing Not at Random (MNAR)
12
Q

What does Missing Completely at Random (MCAR) mean?

A

Data is MCAR when there is no connection between which values are missing and any other values in the dataset, observed or missing.

13
Q

What does Missing at Random (MAR) imply?

A

MAR means the missing data is related to another recorded variable.

14
Q

Describe Missing Not at Random (MNAR).

A

MNAR occurs when the missing data is related to some unrecorded variable or factor.

15
Q

What is a recommended practice when working with datasets?

A

It is generally good practice to work on a copy of your data instead of the original.

16
Q

What can happen if too much redundant data is included in a dataset?

A

It can lead to multicollinearity, complicating the interpretation of statistical models.

17
Q

Fill in the blank: Redundant data can lead to _______ in statistical models.

A

multicollinearity

18
Q

True or False: All methods for dealing with missing data are universally accepted.

A

False

19
Q

How can one create a subset of data excluding redundant columns?

A

By using functions like drop() to exclude the redundant variables.
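
The `drop()` named here is presumably a data-frame method (pandas has `DataFrame.drop`, for instance), but the card does not say which library is meant. The sketch below shows the same idea in plain Python, with hypothetical column names.

```python
# Hypothetical records where "temp_f" is fully determined by "temp_c".
records = [
    {"city": "Oslo", "temp_c": 10.0, "temp_f": 50.0},
    {"city": "Cairo", "temp_c": 30.0, "temp_f": 86.0},
]
redundant = {"temp_f"}

# Build a new list so the original records stay untouched.
subset = [{k: v for k, v in row.items() if k not in redundant} for row in records]
```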

20
Q

What is the consequence of having missing data that is not random?

A

It can introduce bias into the results.

21
Q

What is the main reason for identifying the type of missing data?

A

It helps determine how much the missing data will influence the outcome and potential bias.

22
Q

What does MNAR stand for?

A

Missing Not at Random.

23
Q

What is a key characteristic of MNAR data?

A

It has a connection to some variable or type of information that was not recorded.

24
Q

Why is MNAR data considered problematic?

A

It is the most likely to cause bias in results.

25
Q

What is the easiest approach to handle missing data?

A

Deletion.

26
Q

What is a critical note to remember when working with deletion methods?

A

Always work on a copy of your data.

27
Q

What is listwise deletion?

A

Deleting an entire observation if a single value is missing.
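
A minimal sketch of listwise deletion, assuming missing values are stored as `None`; the survey rows are invented.

```python
# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41, "income": 48000},
]

# Listwise deletion: keep only rows with no missing values at all.
complete_rows = [row for row in rows if None not in row.values()]
```

Two of the four rows are lost here even though each is missing only one value.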

28
Q

What is pairwise deletion?

A

Excluding a missing value only from the calculations that need it, while retaining the rest of the data in the row.
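
One way to picture pairwise deletion, as a sketch: each statistic skips only the rows where its own variable is missing. The rows and the `pairwise_mean` helper are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41, "income": 48000},
]

def pairwise_mean(rows, column):
    """Mean of one column, skipping only the rows where that column is missing."""
    return mean(row[column] for row in rows if row[column] is not None)

mean_age = pairwise_mean(rows, "age")        # uses 3 of the 4 rows
mean_income = pairwise_mean(rows, "income")  # uses a different 3 of the 4 rows
```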

29
Q

When is variable deletion appropriate?

A

When over half of the values for a specific variable are missing.
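
The over-half rule can be sketched as a simple filter. The column names are made up, `missing_share` is a helper defined only for this example, and `None` is assumed to mark a missing value.

```python
# Hypothetical columns; None marks a missing value.
columns = {
    "age": [34, None, 29, 41],
    "fax_number": [None, None, None, "555-0100"],
}

def missing_share(values):
    """Fraction of values in a column that are missing."""
    return sum(v is None for v in values) / len(values)

# Keep a column only if no more than half of its values are missing.
kept = {name: vals for name, vals in columns.items() if missing_share(vals) <= 0.5}
```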

30
Q

What is filtering in the context of missing data?

A

Removing values to create a subset of data that has no missing data.

31
Q

What is imputation?

A

Filling in missing data instead of removing it.

32
Q

What does mean, median, or mode imputation involve?

A

Estimating the middle of a variable (its mean, median, or mode) and using that value to fill in the gaps.
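
A short sketch of the three variants using Python's `statistics` module, assuming missing values are stored as `None`; the numbers and the `impute` helper are illustrative only.

```python
from statistics import mean, median, mode

values = [2, 4, 4, None, 10]
observed = [v for v in values if v is not None]  # [2, 4, 4, 10]

def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]

mean_filled = impute(values, mean(observed))      # gap filled with 5
median_filled = impute(values, median(observed))  # gap filled with 4.0
mode_filled = impute(values, mode(observed))      # gap filled with 4
```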

33
Q

What is hot deck imputation?

A

Using random values from elsewhere in the dataset to fill in missing values.
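
A minimal sketch of the random variant described here, assuming `None` marks a missing value. The `hot_deck` helper is hypothetical, and a seeded generator is used only to make the run repeatable.

```python
import random

def hot_deck(values, rng):
    """Replace each missing value with a randomly chosen observed value."""
    donors = [v for v in values if v is not None]
    return [rng.choice(donors) if v is None else v for v in values]

values = [7, None, 3, 9, None]
filled = hot_deck(values, random.Random(0))  # seeded so the result is repeatable
```

Every filled value is one that actually occurs in the data, which is what separates this from imputing a computed average.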

34
Q

What is interpolation?

A

Estimating specific values for missing data points using other values as reference.
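
Linear interpolation is one common form of this. The sketch below assumes evenly spaced readings with `None` for gaps, and fills only gaps that have a known value on both sides; the `interpolate` helper is written for this example.

```python
def interpolate(values):
    """Linearly interpolate interior gaps (None) between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

readings = [10.0, None, None, 16.0, 18.0]
result = interpolate(readings)  # [10.0, 12.0, 14.0, 16.0, 18.0]
```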

35
Q

Why is MNAR data difficult to handle with common methods?

A

There is an unknown reason for the missing data.

36
Q

What is invalid data?

A

Data that does not match expected values or ranges.

37
Q

What causes specification mismatch?

A

A value having a different data type than the other values in a variable.

38
Q

What is data type validation?

A

Checking the data type of a variable to avoid specification mismatches.
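
A rough sketch of such a check, assuming a hypothetical column that should contain only integers.

```python
# Hypothetical "age" column; one value was entered as text.
ages = [34, 29, "forty-one", 52]

# Collect the position and value of every entry that is not an integer.
mismatches = [(i, v) for i, v in enumerate(ages) if not isinstance(v, int)]
```

An empty `mismatches` list would mean every value has the expected type.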

39
Q

What is non-parametric data?

A

Data that does not follow a normal or well-known distribution.

40
Q

What is an example of a common distribution in statistics?

A

Normal distribution.

41
Q

What can happen if a single value in a variable has the wrong data type?

A

It can cause errors in data analysis.

42
Q

What should you do if you identify invalid data?

A

Generate a list of unique values and check for discrepancies.
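
A sketch of the unique-values check, assuming a hypothetical department column and an assumed list of valid names to compare against.

```python
from collections import Counter

departments = ["Sales", "Sales", "Sale", "HR", "Sales", "HR"]
counts = Counter(departments)  # how often each distinct value occurs

expected = {"Sales", "HR"}  # assumed list of valid department names
unexpected = sorted(set(counts) - expected)
```

Rare values such as a lone "Sale" are often typos of a more frequent neighbour.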

43
Q

Fill in the blank: Listwise deletion is also known as _______.

A

casewise deletion.

44
Q

True or False: Pairwise deletion is less likely to introduce bias compared to listwise deletion.

45
Q

What do you risk by deleting data that is not MCAR?

A

Introducing bias.

46
Q

What is a common method to reduce invalid data during data entry?

A

Using a drop-down menu.

47
Q

What is a normal distribution?

A

A symmetric, bell-shaped distribution centered on the mean, used to estimate the probability that a new value will fall within a given range.

48
Q

What characterizes non-parametric data?

A

Non-parametric data does not follow any of the common distributions.

49
Q

Why can non-parametric data be problematic?

A

The majority of common statistical analyses are inherently parametric and assume that the data follows a specific distribution.

50
Q

What are distribution-free tests?

A

Statistical analyses that do not assume any specific distribution.

51
Q

What are outliers?

A

Data points that are significantly larger or smaller than the rest and can skew the entire dataset.

52
Q

What issue can a single outlier cause?

A

It can artificially pull the results toward itself, for example dragging a trend line up and to the right.

53
Q

How are outliers typically identified?

A

By calculating the standard deviation or the interquartile range (IQR).

54
Q

What is the common cutoff for identifying outliers using standard deviation?

A

More than three standard deviations away from the mean.
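
The three-standard-deviation rule can be sketched with the `statistics` module. The values are invented, and the sample standard deviation is an assumption, since the card does not say which variant to use.

```python
from statistics import mean, stdev

# Hypothetical measurements with one extreme value.
values = [9, 10, 11] * 6 + [100]

mu, sigma = mean(values), stdev(values)
outliers = [v for v in values if abs(v - mu) > 3 * sigma]
```

In very small samples a single extreme value inflates the standard deviation so much that it may never cross the cutoff, so this rule works better on larger datasets.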

55
Q

What is the common cutoff for identifying outliers using IQR?

A

More than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
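
A sketch of the IQR rule using `statistics.quantiles`, with made-up values. Quartile conventions differ slightly between tools, so the exact fences may vary.

```python
from statistics import quantiles

# Hypothetical measurements with one extreme value.
values = [12, 13, 14, 15, 15, 16, 17, 18, 45]

q1, _, q3 = quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]
```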

56
Q

What is the pragmatic approach to handling outliers in data analytics?

A

Creating a range; anything outside of that range is considered an outlier and is deleted.

57
Q

What types of data issues are addressed in data cleaning?

A
  • Duplicate data
  • Redundant data
  • Missing data
  • Invalid data
  • Specification mismatch
  • Data type validation
58
Q

What is the importance of cleaning data?

A

It removes elements that will cause errors in analysis.

59
Q

What is the next chapter about after cleaning data?

A

Data wrangling and manipulation.

60
Q

Fill in the blank: Non-parametric data is a problem because you can’t use _______ analyses on it.

A

parametric

61
Q

True or False: All statistical analyses are suitable for non-parametric data.

A

False

62
Q

When identifying outliers, what should you avoid doing?

A

Eyeballing it and guessing.

63
Q

What are the typical methods for deleting missing data?

A
  • Listwise
  • Pairwise
  • Variable
64
Q

What error type is indicated by a mismatch in department names, like ‘Sales’ and ‘Sale’?

A

Invalid data

65
Q

If you find a value of 8,000 lb in a dataset of human baby weights, what should you consider?

A

It is probably an outlier, and you should check your ranges to be sure.