Lesson 2 Flashcards

1
Q

What happens during data cleaning?

A
  • Ensuring an analyzable format (dummy coding).
  • Ensuring the data takes legal values.
  • Outliers are located and treated.
  • Missing data is located and treated.
2
Q

What missing data patterns are there?

A
  • Univariate (only on one variable/column)
  • Monotone (can be arranged into a staircase pattern)
  • Arbitrary (randomly scattered)
3
Q

What 5 rates of nonresponse are there?

A
  • Percent/proportion missing (screening, apply on each variable)
  • Attrition rate (proportion of participants who drop out of the study)
  • Percent/proportion of complete cases (percent without missing data)
  • Covariance coverage (pairwise proportion)
  • Fraction of missing information (proportion of a parameter estimate's sampling variance that is due to the missing data)
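
As a rough illustration, the three simplest rates in this list can be computed directly from a data table. The sketch below uses a made-up toy dataset in which `None` marks a missing value; the variable names are hypothetical.

```python
# Toy data (hypothetical): each dict is one case, None marks a missing value.
rows = [
    {"age": 31, "income": 52000, "score": 7},
    {"age": 45, "income": None, "score": 5},
    {"age": None, "income": None, "score": 6},
    {"age": 28, "income": 39000, "score": None},
]
variables = ["age", "income", "score"]
n = len(rows)

# Proportion missing, computed separately for each variable.
prop_missing = {v: sum(r[v] is None for r in rows) / n for v in variables}

# Proportion of complete cases: rows with no missing value on any variable.
prop_complete = sum(all(r[v] is not None for v in variables) for r in rows) / n

# Covariance coverage: proportion of cases observed on BOTH variables of a pair.
coverage = {
    (a, b): sum(r[a] is not None and r[b] is not None for r in rows) / n
    for i, a in enumerate(variables)
    for b in variables[i + 1:]
}

print(prop_missing, prop_complete, coverage)
```

The attrition rate and the fraction of missing information are not shown: the first needs repeated measurements, and the second is estimated from a fitted model.
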
4
Q

What is Missing Completely at Random (MCAR)?

A

Missingness is unrelated to any data, observed or missing: the missing values are spread completely at random.

P(R | Y_mis, Y_obs) = P(R)

This is the best-case scenario.

5
Q

What is Missing at Random (MAR)?

A

Missingness has a cause, but it depends only on observed data, not on the missing values themselves: P(R | Y_mis, Y_obs) = P(R | Y_obs).

Example: some frogs can't be caught because they are too smart, but the research is about length, not smartness.

6
Q

What is Missing Not at Random (MNAR)?

A

Data is missing because a certain group doesn't answer for a reason that is directly related to the research: missingness depends on the missing values themselves.

Example: rich people don't talk about their salary in a study about income.

7
Q

What are 6 Bad methods to handle missing data?

A
  • Listwise deletion (only using complete rows)
  • Pairwise Deletion (only use complete pairs of observations)
  • Mean Substitution (use mean for every missing value)
  • Deterministic Regression Imputation (replace each missing value with its regression prediction)
  • Averaging Available Items (data must be MCAR)
  • Last Observation Carried Forward (LOCF)
8
Q

What are 2 Good methods to handle missing data?

A
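  • Stochastic Regression Imputation (fill missing data with regression predictions plus random noise)
  • Multiple Imputation (stochastic regression imputation with extra steps: several imputed datasets are created and analyzed; computationally expensive)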
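
To make the difference with deterministic regression imputation concrete, here is a minimal sketch on made-up data. It assumes a single fully observed predictor `x`, a simple linear regression fitted on the complete cases, and normally distributed residual noise; all names and numbers are illustrative.

```python
import random

# Hypothetical toy data: y is missing (None) for some cases, x is fully observed.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 12.1]

# Fit ordinary least squares of y on x using the complete cases only.
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = sum(xi for xi, _ in obs) / len(obs)
my = sum(yi for _, yi in obs) / len(obs)
slope = sum((xi - mx) * (yi - my) for xi, yi in obs) / sum((xi - mx) ** 2 for xi, _ in obs)
intercept = my - slope * mx

# Residual standard deviation (n - 2 degrees of freedom).
resid = [yi - (intercept + slope * xi) for xi, yi in obs]
resid_sd = (sum(e ** 2 for e in resid) / (len(obs) - 2)) ** 0.5

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Deterministic regression imputation: plug in the bare prediction.
deterministic = [yi if yi is not None else intercept + slope * xi
                 for xi, yi in zip(x, y)]

# Stochastic regression imputation: prediction plus random residual noise.
stochastic = [yi if yi is not None else intercept + slope * xi + rng.gauss(0, resid_sd)
              for xi, yi in zip(x, y)]

print(deterministic)
print(stochastic)
```

Multiple imputation would, roughly speaking, repeat the stochastic step several times, analyze each completed dataset, and pool the results; that part is not shown here.
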
9
Q

What are univariate outliers?

A

Extreme values with respect to the distribution of a variable’s other observations

10
Q

What causes an outlier?

A
  • Data entry errors
  • Legal but extreme values

It is important to keep outliers that are correct values and belong to the population the findings should generalize to.

11
Q

What are the 4 most popular methods to diagnose outliers?

A
  1. Internally studentized residuals (AKA Z-score method)
  2. Externally studentized residuals
  3. Median absolute deviation method
  4. Tukey’s boxplot method
12
Q

How can you calculate Internally Studentized Residuals?

A

Each value is standardized using the mean and the SD of the sample: z = (x - mean) / SD. A cut-off on |z| is then used to determine which values are outliers.

It is important that the sample is normally distributed.
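
A minimal sketch of the Z-score method, assuming a cut-off of 3 (a common convention, not something this card specifies) and made-up data:

```python
import statistics

def z_score_outliers(values, cutoff=3.0):
    """Flag values whose |z| exceeds the cut-off (cut-off of 3 is an assumption)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return [v for v in values if abs(v - mean) / sd > cutoff]

# Hypothetical sample: 19 ordinary values plus one extreme value.
data = [9, 10, 11] * 6 + [10, 50]
print(z_score_outliers(data))
```

Because the outlier itself inflates both the mean and the SD, this method can miss outliers in small samples.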

13
Q

How can you calculate Externally Studentized Residuals?

A

It uses the same calculation as the Internally Studentized Residuals, but the observation being evaluated is left out when computing the mean and SD.

It uses the deletion mean and deletion SD.
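
A minimal sketch of the deletion idea, assuming the same cut-off of 3 as is commonly used for the Z-score method (an assumption) and made-up data:

```python
import statistics

def deletion_z_outliers(values, cutoff=3.0):
    """Flag values using the mean and SD of all OTHER observations."""
    flagged = []
    for i, v in enumerate(values):
        rest = values[:i] + values[i + 1:]   # leave the evaluated case out
        deletion_mean = statistics.mean(rest)
        deletion_sd = statistics.stdev(rest)
        if abs(v - deletion_mean) / deletion_sd > cutoff:
            flagged.append(v)
    return flagged

# Hypothetical small sample with one extreme value.
data = [9, 10, 11, 10, 9, 11, 10, 50]
print(deletion_z_outliers(data))
```

In this small sample the ordinary z-score of 50 stays below 3, because 50 inflates the mean and SD it is compared against; leaving it out removes that effect.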

14
Q

How can you calculate Median Absolute Deviation Method?

A

It uses the median instead of the mean, and the median absolute deviation (MAD) instead of the SD. This way it is more resistant to outliers.
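
A minimal sketch of this method on made-up data. The scaling constant 1.4826 makes the MAD comparable to the SD under normality; the cut-off of 3 is an assumption.

```python
import statistics

def mad_outliers(values, cutoff=3.0):
    """Flag values far from the median, measured in scaled-MAD units."""
    med = statistics.median(values)
    # Median of the absolute deviations from the median, rescaled.
    mad = 1.4826 * statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; more than half the values are identical")
    return [v for v in values if abs(v - med) / mad > cutoff]

# Hypothetical small sample with one extreme value.
data = [9, 10, 11, 10, 9, 11, 10, 50]
print(mad_outliers(data))
```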

15
Q

What is the Boxplot Method?

A

It is a visualization-based method that computes an inner fence and an outer fence around the quartiles.

Values outside the inner fence = possible outliers; values outside the outer fence = probable outliers.
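
A minimal sketch of the fences, assuming Tukey's usual multipliers (1.5 × IQR for the inner fence, 3 × IQR for the outer fence) and the default quartile definition of Python's `statistics.quantiles`; other quartile definitions shift the fences slightly.

```python
import statistics

def boxplot_outliers(values):
    """Return (possible, probable) outliers based on inner and outer fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    # Probable: beyond the outer fence.
    probable = [v for v in values if v < outer[0] or v > outer[1]]
    # Possible: beyond the inner fence but not beyond the outer fence.
    possible = [v for v in values
                if (v < inner[0] or v > inner[1]) and v not in probable]
    return possible, probable

# Hypothetical sample with one mild and one extreme value.
data = [10, 9, 11, 50, 10, 8, 12, 9, 17, 11, 10]
possible, probable = boxplot_outliers(data)
print(possible, probable)
```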

16
Q

What is a Breakdown Point?

A

The minimum proportion of cases that must be replaced by infinity to cause the value of the statistic to go to infinity
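
A small sketch of the idea on made-up data: replacing a single case with a huge value (standing in for infinity) drags the mean along, while the median does not move.

```python
import statistics

clean = [9, 10, 11, 10, 9, 11, 10]
# Replace one case with an enormous value (a stand-in for infinity).
contaminated = clean[:-1] + [1e12]

mean_clean, mean_bad = statistics.mean(clean), statistics.mean(contaminated)
median_clean, median_bad = statistics.median(clean), statistics.median(contaminated)

print(mean_clean, mean_bad)      # the mean explodes after one replacement
print(median_clean, median_bad)  # the median is unchanged
```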

17
Q

What is a multivariate outlier?

A

A combination of values that is very unlikely

Example: someone who is among the tallest people but also among the lightest.

18
Q

What is the Mahalanobis Distance?

A

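This is a method for detecting multivariate outliers that is based on the internally studentized residual: it measures how far a case lies from the mean of all variables, taking their variances and covariances into account.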
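
A minimal sketch of the classical (non-robust) squared Mahalanobis distance for two variables, written out by hand so it needs no external library. The height and weight values are made up; the last case is tall but light.

```python
# Hypothetical data: heights in cm and weights in kg.
heights = [160, 165, 170, 175, 180, 185, 190, 188]
weights = [55, 60, 65, 70, 75, 80, 85, 56]
n = len(heights)

mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Sample variances and covariance (n - 1 in the denominator).
s_hh = sum((h - mean_h) ** 2 for h in heights) / (n - 1)
s_ww = sum((w - mean_w) ** 2 for w in weights) / (n - 1)
s_hw = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights)) / (n - 1)
det = s_hh * s_ww - s_hw ** 2  # determinant of the 2x2 covariance matrix

def squared_mahalanobis(h, w):
    """d^2 = (x - mean)' S^-1 (x - mean), using the closed-form 2x2 inverse."""
    dh, dw = h - mean_h, w - mean_w
    return (dh ** 2 * s_ww - 2 * dh * dw * s_hw + dw ** 2 * s_hh) / det

d2 = [squared_mahalanobis(h, w) for h, w in zip(heights, weights)]
most_unusual = d2.index(max(d2))
print(most_unusual, d2)
```

Neither 188 cm nor 56 kg is extreme on its own; the combination is what produces the largest distance.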

18
Q

What is the difference between Classical and Robust Mahalanobis Distance?

A

The robust version is less influenced by outliers, because it uses robust estimates of the mean and covariance matrix.

19
Q

What is Winsorization?

A

Replacing an outlying value with the nearest non-outlying value.
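
A minimal sketch, assuming the outlier limits have already been decided by some detection method; the function name, the limits, and the data are hypothetical.

```python
def winsorize(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest non-outlying value."""
    inside = [v for v in values if lower <= v <= upper]
    lowest_ok, highest_ok = min(inside), max(inside)
    return [lowest_ok if v < lower else highest_ok if v > upper else v
            for v in values]

# Hypothetical sample; limits as they might come from a boxplot's inner fences.
data = [8, 9, 10, 11, 12, 17, 50]
print(winsorize(data, lower=4.5, upper=16.5))
```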

20
Q

Why do we want outlier detection methods to be Robust?

A

A robust method is less sensitive to the very outliers it is supposed to detect.

21
Q

Sort these detection methods by breakdown point:
- Boxplot method
- externally studentized residual method
- internally studentized residual method
- median absolute deviation method

A

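From lowest to highest breakdown point: internally studentized residual method (1/N, uses the mean), externally studentized residual method (2/N, uses the deletion mean), boxplot method (25%), median absolute deviation method (50%, uses the median).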