4 Cleaning and Processing Data Flashcards
What is the primary issue with duplicate data in a dataset?
Duplicate data can skew or bias your results, or invalidate your analysis entirely.
Define duplicate data.
Duplicate data is when a specific data point recurs multiple times within a dataset.
What is the impact of duplicate data on descriptive statistics?
It can distort averages and percentages, leading to incorrect conclusions about the dataset.
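A small worked example may make the distortion concrete. The sketch below uses made-up order totals (the IDs and amounts are illustrative assumptions) in which one order was recorded three times, and compares the mean before and after removing the repeats.

```python
from statistics import mean

# Hypothetical order rows (order_id, total); order "A-2" was entered three times.
orders = [
    ("A-1", 100),
    ("A-2", 500),
    ("A-2", 500),
    ("A-2", 500),
    ("A-3", 120),
]

# Mean computed on the raw rows, repeats included.
mean_with_duplicates = mean(total for _, total in orders)

# dict.fromkeys keeps the first occurrence of each identical row, in order.
unique_orders = list(dict.fromkeys(orders))
mean_without_duplicates = mean(total for _, total in unique_orders)

print(mean_with_duplicates)     # 344
print(mean_without_duplicates)  # 240
```

The repeated row pulls the average up by more than 40 percent in this toy data.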
What is redundant data?
Redundant data refers to columns that can be used to perfectly predict other columns.
How does redundant data differ from duplicate data?
Duplicate data is a repeated row (observation), whereas redundant data is a column (variable) whose values can be derived from other columns.
What is multicollinearity?
Multicollinearity occurs when multiple independent variables in a model are highly correlated.
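One rough way to spot highly correlated variables is a correlation matrix. The sketch below assumes pandas and invented column names; the 0.9 cutoff is an illustrative assumption, not a fixed rule.

```python
import pandas as pd

# Hypothetical data: height is stored twice, in centimeters and in inches.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.06, 62.99, 66.93, 70.87],
    "age": [31, 25, 47, 38],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Assumed cutoff: treat |r| above 0.9 as "highly correlated".
threshold = 0.9
columns = list(corr.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(columns)
    for b in columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Here only the two height columns are flagged, which suggests that one of them adds no new information.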
What is a common approach to handle duplicate data?
The most common approach is to delete the duplicate rows, keeping one copy of each.
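As a minimal sketch of that approach, assuming the data is in a pandas DataFrame (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ana", "Ben", "Ben", "Cy"],
    "amount": [10, 20, 20, 30],
})

# duplicated() flags every repeat of a row after its first occurrence.
n_duplicates = int(df.duplicated().sum())

# Work on a copy; keep="first" retains one row from each group of duplicates.
deduped = df.copy().drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)  # 1
print(len(deduped))  # 3
```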
What are some potential issues with having redundant data in a statistical model?
It can make results harder to interpret and can lead to inaccurate models when applied to the population.
What is missing data?
Missing data refers to gaps in a dataset where no information is available for certain entries.
Why is missing data problematic for data analysts?
Most analyses won’t run with null values and will raise errors, and dropping the affected rows reduces statistical power.
What are the three main categories of missing data?
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)
What does Missing Completely at Random (MCAR) mean?
Data is MCAR when there is no connection between the missing values and the present values.
What does Missing at Random (MAR) imply?
MAR means the missing data is related to another recorded variable.
Describe Missing Not at Random (MNAR).
MNAR occurs when the missing data is related to some unrecorded variable or factor.
What is a recommended practice when working with datasets?
It is generally good practice to work on a copy of your data instead of the original.
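A short sketch of that practice, assuming pandas (the column name and values are invented): changes made to the copy leave the original untouched.

```python
import pandas as pd

# Hypothetical raw data with one missing score.
raw = pd.DataFrame({"score": [90.0, None, 75.0]})

# copy() makes a deep copy by default, so edits do not reach `raw`.
working = raw.copy()
working["score"] = working["score"].fillna(0)

print(int(raw["score"].isna().sum()))      # 1, the original still has its gap
print(int(working["score"].isna().sum()))  # 0
```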
What can happen if too much redundant data is included in a dataset?
It can lead to multicollinearity, complicating the interpretation of statistical models.
Fill in the blank: Redundant data can lead to _______ in statistical models.
multicollinearity
True or False: All methods for dealing with missing data are universally accepted.
False
How can one create a subset of data excluding redundant columns?
By using functions like drop() to exclude the redundant variables.
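For instance, assuming the drop() in question is the pandas DataFrame method, a subset without a redundant column could be built like this (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150.0, 160.0],
    "height_in": [59.06, 62.99],  # redundant: derivable from height_cm
    "age": [31, 25],
})

# drop() returns a new DataFrame; the original keeps all of its columns.
subset = df.drop(columns=["height_in"])

print(list(subset.columns))  # ['height_cm', 'age']
```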
What is the consequence of having missing data that is not random?
It can introduce bias into the results.
What is the main reason for identifying the type of missing data?
It helps determine how much the missing data will influence the outcome and potential bias.
What does MNAR stand for?
Missing Not At Random.
What is a key characteristic of MNAR data?
It has a connection to some variable or type of information that was not recorded.
Why is MNAR data considered problematic?
It is the most likely to cause bias in results.
What is the easiest approach to handle missing data?
Deletion.
What is a critical note to remember when working with deletion methods?
Always work on a copy of your data.
What is listwise deletion?
Deleting an entire observation if a single value is missing.
What is pairwise deletion?
Excluding only the missing values from each calculation while retaining the rest of the data in the row.
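The difference between the two deletion styles can be sketched with pandas on a made-up table. Listwise deletion drops any row with a gap; the pairwise-style calculation here simply lets each column's mean use every value that column has.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, 30.0, None, 40.0],
    "income": [50.0, None, 70.0, 80.0],
})

# Listwise: a row is removed if any of its values is missing.
listwise = df.dropna()
listwise_means = listwise.mean()

# Pairwise-style: mean() skips missing values column by column,
# so each statistic uses all values available for that column.
pairwise_means = df.mean()

print(len(listwise))          # 2 rows survive
print(listwise_means["age"])  # 32.5
print(pairwise_means["age"])  # mean of 25, 30 and 40
```

Listwise deletion keeps two of four rows here, while the per-column calculation uses three values for each variable.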
When is variable deletion appropriate?
When over half of the values for a specific variable are missing.
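The rule of thumb above could be applied roughly as follows, assuming pandas and invented column names:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "fax": [None, None, None, "555-0100"],  # 75% missing
    "city": ["Lima", None, "Oslo", "Pune"],  # 25% missing
})

# Share of missing values per column (the mean of True/False flags).
missing_share = df.isna().mean()

# Drop every variable with more than half of its values missing.
to_drop = missing_share[missing_share > 0.5].index
trimmed = df.drop(columns=to_drop)

print(list(trimmed.columns))  # ['id', 'city']
```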
What is filtering in the context of missing data?
Removing values to create a subset of data that has no missing data.
What is imputation?
Filling in missing data instead of removing it.
What does mean, median, or mode imputation involve?
Filling gaps with a measure of central tendency (the mean, median, or mode) of the observed values.
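A minimal sketch of all three variants, assuming a pandas Series with made-up values:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, None, 4.0, 10.0])

# Each statistic is computed from the observed (non-missing) values only.
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())
mode_filled = s.fillna(s.mode().iloc[0])  # mode() can return ties; take the first

print(mean_filled.iloc[2])    # 5.0
print(median_filled.iloc[2])  # 4.0
print(mode_filled.iloc[2])    # 4.0
```

The mean is pulled upward by the 10, which is one reason the median is often preferred for skewed data.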
What is hot deck imputation?
Using random values from elsewhere in the dataset to fill in missing values.
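The simplest form of this idea can be sketched with the standard library. The values are made up, and this version draws donors from all observed values; practical hot deck methods often restrict donors to similar records.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical ages with two gaps.
ages = [34, None, 29, 41, None, 38]

# Donor pool: every observed value in the same variable.
donors = [a for a in ages if a is not None]

# Replace each gap with a randomly chosen donor value.
imputed = [a if a is not None else random.choice(donors) for a in ages]

print(imputed)
```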
What is interpolation?
Estimating specific values for missing data points using other values as reference.
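As an illustration, linear interpolation fills each gap along a straight line between its nearest known neighbors. The sketch assumes a pandas Series of evenly spaced readings with invented values.

```python
import pandas as pd

# Hypothetical evenly spaced readings with three gaps.
s = pd.Series([10.0, None, 14.0, None, None, 20.0])

# interpolate() defaults to linear interpolation between known points.
filled = s.interpolate()

print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```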
Why is MNAR data difficult to handle with common methods?
There is an unknown reason for the missing data.
What is invalid data?
Data that does not match expected values or ranges.
What causes specification mismatch?
A value having a different data type than the other values in a variable.
What is data type validation?
Checking the data type of a variable to avoid specification mismatches.
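One possible way to run such a check in pandas is to attempt a numeric conversion and see which entries fail. The column contents below are invented for illustration.

```python
import pandas as pd

# Hypothetical weights column where one entry was typed as text.
weights = pd.Series([7.2, 8.1, "unknown", 6.9])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(weights, errors="coerce")

# Entries that were present originally but failed the conversion.
bad_rows = weights[numeric.isna() & weights.notna()]

print(bad_rows.tolist())  # ['unknown']
```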
What is non-parametric data?
Data that does not follow a normal or well-known distribution.
What is an example of a common distribution in statistics?
Normal distribution.
What can happen if a single value in a variable has the wrong data type?
It can cause errors in data analysis.
What should you do if you identify invalid data?
Generate a list of unique values and check for discrepancies.
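A sketch of that check, assuming pandas and a made-up department column; the set of expected labels is an assumption you would supply from your own data dictionary.

```python
import pandas as pd

# Hypothetical department column with a typo and a stray-case entry.
dept = pd.Series(["Sales", "Sale", "HR", "Sales", "hr "])

# Frequency of each distinct value; rare spellings often stand out.
counts = dept.value_counts()

# Assumed list of valid labels to compare against.
expected = {"Sales", "HR"}
unexpected = sorted(set(dept.unique()) - expected)

print(unexpected)  # ['Sale', 'hr ']
```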
Fill in the blank: Listwise deletion is also known as _______.
casewise deletion.
True or False: Pairwise deletion is less likely to introduce bias compared to listwise deletion.
True.
What do you risk by deleting data that is not MCAR?
Introducing bias.
What is a common method to reduce invalid data during data entry?
Using a drop-down menu.
What is a normal distribution?
A symmetric, bell-shaped distribution described by its mean and standard deviation, used to estimate the probability that a new value falls within a given range.
What characterizes non-parametric data?
Non-parametric data does not follow any of the common distributions.
Why can non-parametric data be problematic?
The majority of common statistical analyses are inherently parametric and assume that the data follows a specific distribution.
What are distribution-free tests?
Statistical analyses that do not assume any specific distribution.
What are outliers?
Data points that are significantly larger or smaller than the rest and can skew the entire dataset.
What issue can a single outlier cause?
It can artificially pull summary results, such as the mean or a trend line, toward the outlier.
How are outliers typically identified?
By calculating the standard deviation or the interquartile range (IQR).
What is the common cutoff for identifying outliers using standard deviation?
More than three standard deviations away from the mean.
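The three-standard-deviation rule can be sketched with the standard library. The readings are made up, and the sample standard deviation is used here as an assumption; note that with very small samples no point can ever be three deviations out.

```python
from statistics import mean, stdev

# Hypothetical readings clustered near 50, plus one extreme value.
values = [48, 52, 50, 49, 51, 50, 47, 53, 50, 49,
          51, 50, 48, 52, 50, 49, 51, 50, 50, 200]

mu = mean(values)
sigma = stdev(values)  # sample standard deviation

# Flag anything more than three standard deviations from the mean.
outliers = [v for v in values if abs(v - mu) > 3 * sigma]

print(outliers)  # [200]
```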
What is the common cutoff for identifying outliers using IQR?
Values more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
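A rough sketch of the IQR rule using the standard library. The data is invented, and quartile values depend on the calculation method, so other tools may place the fences slightly differently.

```python
from statistics import quantiles

values = [12, 14, 15, 15, 16, 17, 18, 19, 45]

# quantiles(n=4) returns the three quartile cut points (Q1, median, Q3).
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Fences sit 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(outliers)  # [45]
```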
What is the pragmatic approach to handling outliers in data analytics?
Creating a range; anything outside of that range is considered an outlier and is deleted.
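The range-based approach could look like the following in pandas. The weights and the 1 to 15 lb limits are illustrative assumptions, not recommended cutoffs.

```python
import pandas as pd

# Hypothetical baby weights in pounds, with one impossible entry.
weights_lb = pd.Series([6.8, 7.5, 8000.0, 5.9, 9.1])

# Assumed plausible range for this variable.
lower, upper = 1.0, 15.0

# between() is inclusive on both ends by default.
in_range = weights_lb.between(lower, upper)
cleaned = weights_lb[in_range]
removed = weights_lb[~in_range]

print(removed.tolist())  # [8000.0]
```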
What types of data issues are addressed in data cleaning?
- Duplicate data
- Redundant data
- Missing data
- Invalid data
- Specification mismatch
- Data type validation
What is the importance of cleaning data?
It removes elements that will cause errors in analysis.
What is the next chapter about after cleaning data?
Data wrangling and manipulation.
Fill in the blank: Non-parametric data is a problem because you can’t use _______ analyses on it.
parametric
True or False: All statistical analyses are suitable for non-parametric data.
False
When identifying outliers, what should you avoid doing?
Eyeballing it and guessing.
What are the typical methods for deleting missing data?
- Listwise
- Pairwise
- Variable
What error type is indicated by a mismatch in department names, like ‘Sales’ and ‘Sale’?
Invalid data
If you find a value of 8,000 lb in a dataset of human baby weights, what should you consider?
It is probably an outlier, and you should check your ranges to be sure.