Module 1-5 - Data Analysis Flashcards

1
Q

Ways to deal with missing data? (3)

A

1) Replace missing values with a typical value that doesn’t bias the data (mean, median, or mode)
- Assumes the values are missing completely at random

2) Replace missing values with an assigned category/value
- Assign a special “code” for the missing values

3) Predict the missing value using the other variables
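
A minimal sketch of options 1 and 2 using only the Python standard library. The `ages` column, the use of `None` for a missing value, and the `MISSING_CODE` sentinel are all made up for illustration. Option 3 (predicting the value from other variables) needs a fitted model and is not shown.

```python
from statistics import mean, median, mode

# Hypothetical toy column; None marks a missing value.
ages = [34, None, 29, 41, None, 29]
observed = [a for a in ages if a is not None]

# Option 1: replace with a summary of the observed values
# (reasonable only if values are missing completely at random).
fill_mean = [a if a is not None else mean(observed) for a in ages]
fill_median = [a if a is not None else median(observed) for a in ages]
fill_mode = [a if a is not None else mode(observed) for a in ages]

# Option 2: replace with a special code that marks "missing".
MISSING_CODE = -1  # assumed sentinel; pick one that cannot be a real value
fill_code = [a if a is not None else MISSING_CODE for a in ages]

print(fill_mean)
print(fill_median)
print(fill_mode)
print(fill_code)
```

For a categorical column, the mode or a dedicated "Missing" level would play the role that the mean and the sentinel play here.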

2
Q

Random sampling, description?

A

Draw records at random (without replacement) from the dataset until you have the required number
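
As a rough illustration, the standard library's `random.sample` draws without replacement. The record IDs, sample size, and seed below are assumptions chosen only to make the sketch repeatable.

```python
import random

records = list(range(1, 101))  # hypothetical dataset of 100 record IDs
rng = random.Random(42)        # fixed seed only so the sketch is repeatable

# random.sample draws without replacement, so no record appears twice.
sample = rng.sample(records, k=10)
print(sample)
```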

3
Q

Stratified sampling, description?

A

Independently drawing a set of random records from each stratum in your data
- Oversampling and undersampling
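
One possible sketch of the idea, assuming a made-up dataset in which "claim" is the rare stratum. Drawing the same number of records from each stratum oversamples the rare group and undersamples the common one relative to their share of the data.

```python
import random
from collections import defaultdict

# Hypothetical records: (record_id, stratum label); 20 "claim", 180 "no_claim".
records = [(i, "claim" if i % 10 == 0 else "no_claim") for i in range(200)]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# Group records by stratum.
strata = defaultdict(list)
for rec in records:
    strata[rec[1]].append(rec)

# Draw independently, without replacement, from each stratum.
per_stratum = 15  # assumed target count per stratum
sample = []
for label, members in strata.items():
    sample.extend(rng.sample(members, k=min(per_stratum, len(members))))

print(len(sample))
```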

4
Q

Systematic sampling, description?

A

Drawing samples according to a pattern (ex: every 5th record)
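
The "every 5th record" example can be sketched with a list slice. The record IDs are hypothetical, and starting at the 5th record (rather than at a random offset) is an assumption.

```python
records = list(range(1, 51))  # hypothetical dataset of 50 record IDs
step = 5

# Start at the 5th record (index 4) and take every 5th record after it.
sample = records[step - 1::step]
print(sample)
```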

5
Q

Univariate data exploration, checklist? (4)

A

1) Understand basic relationships in data
- Common sense checks on model output

2) Identify potential data errors that could cause misleading models

3) Identify outliers
- Understand their potential effects on the model

4) Understand how target/response varies for different predictors

6
Q

Ways of gaining insight into variable distributions? (2)

A

1) Calculate numeric statistics/summaries of the data

2) Visualize the values of a variable graphically
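
A small sketch of both approaches with the standard library, on made-up values. The bin width of 20 and the text histogram are assumptions standing in for a real plotting library.

```python
from collections import Counter
from statistics import mean, median, quantiles, stdev

values = [12, 15, 15, 18, 21, 22, 22, 22, 30, 95]  # hypothetical numeric variable

# 1) Numeric summaries.
summary = {
    "n": len(values),
    "min": min(values),
    "max": max(values),
    "mean": mean(values),
    "median": median(values),
    "stdev": stdev(values),
    "quartiles": quantiles(values, n=4),
}
for name, value in summary.items():
    print(f"{name}: {value}")

# 2) Crude text histogram: bucket values into bins of width 20.
bins = Counter((v // 20) * 20 for v in values)
for start in sorted(bins):
    print(f"{start:>3}-{start + 19:<3} {'#' * bins[start]}")
```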

7
Q

Types of variable combinations we can consider when examining bivariate relationships? (3)

A

1) Categorical vs Categorical
2) Categorical vs numeric
3) Numeric vs Numeric

8
Q

Why are outliers bad? (3)

A

1) Can hinder the model’s ability to find the patterns in the data that predict an outcome
2) The model tries to optimize its predictive accuracy by minimizing prediction error
3) If there are too many outliers, the model will bias its predictions toward them in order to bring down the overall prediction error
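
To make the mechanism concrete, here is a sketch with the simplest possible "model": a single constant prediction chosen to minimize squared error, which is the mean. The data are invented, and squared error is an assumed loss. Two outliers drag the best constant far from the bulk of the data.

```python
from statistics import mean, median

clean = [10, 11, 9, 10, 12, 10, 11, 9]  # hypothetical well-behaved responses
with_outliers = clean + [100, 120]      # same data plus two outliers

def sse(prediction, data):
    """Sum of squared errors for a constant prediction."""
    return sum((y - prediction) ** 2 for y in data)

# The mean is the constant that minimizes squared error.
fit_clean = mean(clean)
fit_outliers = mean(with_outliers)

print(fit_clean, fit_outliers)
# Total error is lower with the outlier-biased fit, even though that fit
# is far from every typical point.
print(sse(fit_clean, with_outliers), sse(fit_outliers, with_outliers))
print(median(with_outliers))  # the median barely moves
```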

9
Q

Visualization of bivariate errors, which type of plot to use:

1) Numeric vs Numeric
2) Categorical vs numeric
3) Categorical vs Categorical

A

1) Scatterplot
2) Box Plot
3) Frequency Table
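
A hedged sketch of what sits behind two of these displays, using invented `(region, claimed, premium)` records: joint counts for a frequency table, and per-category medians, which are the center lines a box plot would draw. Actual plotting would need a charting library and is left out.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical records: (region, claimed?, premium).
rows = [
    ("north", "yes", 520), ("north", "no", 410), ("north", "no", 430),
    ("south", "yes", 610), ("south", "yes", 590), ("south", "no", 450),
]

# Categorical vs categorical: frequency table of joint counts.
freq = Counter((region, claimed) for region, claimed, _ in rows)
for key in sorted(freq):
    print(key, freq[key])

# Categorical vs numeric: group the numeric values by category,
# then summarize each group as a box plot would.
by_region = defaultdict(list)
for region, _, premium in rows:
    by_region[region].append(premium)
medians = {region: median(vals) for region, vals in by_region.items()}
print(medians)
```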

10
Q

Checklist for when you do univariate data exploration?

A

1) Range of the variable, different factor levels
2) Skewness: mean vs median (numeric)
3) Any transformations to do with it? Any new variables to create?
4) Show the graphs

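
A quick sketch of the skewness and transformation checks on invented income values. The log transform is only one common choice for right-skewed positive data and is an assumption here, not something the card prescribes.

```python
import math
from statistics import mean, median

incomes = [20, 22, 25, 27, 30, 32, 35, 40, 60, 400]  # hypothetical, right-skewed

# Range of the variable.
print(min(incomes), max(incomes))

# Skewness check: a mean well above the median suggests a right skew.
right_skewed = mean(incomes) > median(incomes)
print(mean(incomes), median(incomes), right_skewed)

# Candidate transformation: logs pull in the long right tail.
log_incomes = [math.log(x) for x in incomes]
gap_raw = mean(incomes) - median(incomes)
gap_log = mean(log_incomes) - median(log_incomes)
print(gap_raw, gap_log)
```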
11
Q

Visualization of bivariate errors, which type of plot to use:

1) Numeric vs Numeric

A

Scatterplot

12
Q

Visualization of bivariate errors, which type of plot to use:

2) Categorical vs numeric

A

Boxplot

13
Q

Visualization of bivariate errors, which type of plot to use:

3) Categorical vs Categorical

A

Frequency Table
