Module 1-5 - Data Analysis Flashcards
Ways to deal with missing data? (3)
1) Replace missing values with valid values that don’t bias the data (mean, median, mode)
- Assumes the values are missing completely at random
2) Replace missing values with an assigned category/value
- Assign a special “code” for the missing values
3) Predict the missing value using the other variables
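A minimal sketch of the three approaches above, using only the Python standard library. The records, the field names (region, age) and the helper predict_age are hypothetical, and the "prediction" in step 3 is just a same-region average standing in for whatever model would actually be used.

```python
from statistics import mean, median

# Hypothetical toy records; None marks a missing value.
records = [
    {"region": "north", "age": 34},
    {"region": "south", "age": None},
    {"region": "south", "age": 30},
    {"region": None, "age": 40},
    {"region": "south", "age": 28},
]

observed = [r["age"] for r in records if r["age"] is not None]

# 1) Replace with a summary value (reasonable only if values are
#    missing completely at random).
mean_filled = [r["age"] if r["age"] is not None else mean(observed) for r in records]
median_filled = [r["age"] if r["age"] is not None else median(observed) for r in records]

# 2) Replace with an assigned code so "missing" becomes its own category.
region_coded = [r["region"] if r["region"] is not None else "MISSING" for r in records]

# 3) Predict the missing value from another variable. Here the stand-in
#    "model" is the mean age of records in the same region (an assumption).
def predict_age(region):
    same = [r["age"] for r in records
            if r["region"] == region and r["age"] is not None]
    return mean(same) if same else mean(observed)

predicted = [r["age"] if r["age"] is not None else predict_age(r["region"])
             for r in records]
```

Note that the three fills disagree for the same record, which is one reason the choice of method should be stated alongside the results.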
Random sampling, description?
Draw records at random (without replacement) from the dataset until you have the required number
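As a rough illustration, Python's random.sample draws without replacement. The dataset of record IDs and the required size below are made up, and the seed is fixed only so the sketch is repeatable.

```python
import random

random.seed(0)  # fixed seed only so the sketch is repeatable

dataset = list(range(1, 101))  # hypothetical record IDs 1..100
required = 10                  # hypothetical required sample size

# random.sample draws without replacement, so no record appears twice.
sample = random.sample(dataset, k=required)
```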
Stratified sampling, description?
Independently draw a set of random records from each stratum in your data
- Oversampling and undersampling
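One possible sketch of stratified sampling with the standard library. The records, the label values and the per-stratum sizes are all hypothetical. Drawing equal counts from each stratum oversamples the rare "yes" group relative to its 10% share and undersamples the common "no" group.

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed only so the sketch is repeatable

# Hypothetical records: (id, label) with 10 "yes" and 90 "no".
records = [(i, "yes" if i < 10 else "no") for i in range(100)]

# Split the records into strata by label.
strata = defaultdict(list)
for rec in records:
    strata[rec[1]].append(rec)

# Assumed sample size per stratum; equal sizes oversample "yes"
# and undersample "no" relative to their shares in the data.
sizes = {"yes": 5, "no": 5}

# Draw independently (without replacement) from each stratum.
sample = []
for label, rows in strata.items():
    sample.extend(random.sample(rows, k=sizes[label]))
```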
Systematic sampling, description?
Drawing samples according to a pattern (ex: every 5th record)
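For the "every 5th record" example, a slice with a step is enough. The dataset below is hypothetical.

```python
dataset = list(range(1, 26))  # hypothetical record IDs 1..25
step = 5

# Start at index step - 1 so the 5th, 10th, 15th, ... records are kept.
sample = dataset[step - 1::step]
print(sample)  # [5, 10, 15, 20, 25]
```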
Univariate data exploration, checklist? (4)
1) Understand basic relationships in data
- Common sense checks on model output
2) Identify potential data errors that could cause misleading models
3) Identify outliers
- Understand their potential effects on the model
4) Understand how target/response varies for different predictors
Ways of gaining insight into variable distributions? (2)
1) Calculate numeric statistics/summaries of the data
2) Visualize the values of a variable graphically
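Both approaches can be sketched on a small made-up variable. The values are hypothetical, and the text histogram is only a stand-in for a real chart drawn with a plotting library.

```python
from collections import Counter
from statistics import mean, median, stdev

values = [2, 3, 3, 4, 5, 5, 5, 9]  # hypothetical numeric variable

# 1) Numeric summaries of the distribution.
summary = {
    "n": len(values),
    "min": min(values),
    "max": max(values),
    "mean": mean(values),
    "median": median(values),
    "stdev": stdev(values),
}

# 2) A crude text histogram: one row per distinct value, one '#' per record.
counts = Counter(values)
hist_lines = [f"{v:>2} | {'#' * counts[v]}" for v in sorted(counts)]
print("\n".join(hist_lines))
```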
Types of variable combinations we can consider when examining bivariate relationships? (3)
1) Categorical vs Categorical
2) Categorical vs Numeric
3) Numeric vs Numeric
Why are outliers bad? (3)
1) Can hinder the model's ability to find the patterns in the data needed to predict an outcome
2) The model will try to optimize its predictive accuracy by minimizing prediction error
3) If there are too many outliers, the model will bias its predictions toward them in order to bring down the overall prediction error
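A small sketch of this effect, assuming an ordinary least-squares line as the model (the flashcard does not name a specific model). The data and the helper fit_line are hypothetical. A single extreme point is enough to pull the fitted slope away from the pattern in the other records.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
clean_ys = [2, 4, 6, 8, 10]    # hypothetical data following y = 2x exactly
outlier_ys = [2, 4, 6, 8, 40]  # same data with the last value made extreme

clean_slope, clean_intercept = fit_line(xs, clean_ys)
outlier_slope, outlier_intercept = fit_line(xs, outlier_ys)

# Minimizing squared error pulls the whole line toward the outlier,
# so predictions for the ordinary records get worse.
print(clean_slope, clean_intercept)
print(outlier_slope, outlier_intercept)
```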
Visualization of bivariate relationships, which type of plot to use:
1) Numeric vs Numeric
2) Categorical vs Numeric
3) Categorical vs Categorical
1) Scatterplot
2) Box plot
3) Frequency table
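A rough sketch of the numbers behind two of these displays, using made-up records with hypothetical fields (smoker, region, claim amount). The scatterplot case is not shown because it needs a plotting library, and the per-level tuple here is only a simplified version of what a box plot draws.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical records: (smoker, region, claim_amount)
rows = [
    ("yes", "north", 420.0),
    ("no", "north", 150.0),
    ("no", "south", 130.0),
    ("yes", "south", 510.0),
    ("no", "south", 170.0),
    ("no", "north", 110.0),
]

# Categorical vs categorical: frequency table of counts per pair of levels.
freq = Counter((smoker, region) for smoker, region, _ in rows)

# Categorical vs numeric: (min, median, max) per level, a simplified
# version of the summary a box plot displays.
by_smoker = defaultdict(list)
for smoker, _, amount in rows:
    by_smoker[smoker].append(amount)
box_stats = {level: (min(v), median(v), max(v)) for level, v in by_smoker.items()}
```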
Checklist for when you do univariate data exploration?
- Range of the variable, different factor levels
- Skewness: mean vs median (numeric)
- Any transformations to apply? Any new variables to create?
- Show the graphs
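The non-graphical items on this checklist could be sketched as follows. The variables incomes and plan are hypothetical, and the log transform is just one common choice for a right-skewed variable, not something the checklist prescribes.

```python
import math
from statistics import mean, median

incomes = [30, 32, 35, 38, 40, 45, 250]  # hypothetical numeric variable
plan = ["a", "b", "a", "c", "b", "a", "a"]  # hypothetical factor variable

# Range of the numeric variable and the distinct factor levels.
value_range = (min(incomes), max(incomes))
levels = sorted(set(plan))

# Skewness check: a mean well above the median suggests right skew.
skew_gap = mean(incomes) - median(incomes)

# Candidate transformation: a log transform (an assumed choice) often
# reduces right skew, which shows up as a smaller mean-median gap.
log_incomes = [math.log(x) for x in incomes]
log_skew_gap = mean(log_incomes) - median(log_incomes)
```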
Visualization of bivariate relationships, which type of plot to use:
1) Numeric vs Numeric
Scatterplot
Visualization of bivariate relationships, which type of plot to use:
2) Categorical vs Numeric
Box plot
Visualization of bivariate relationships, which type of plot to use:
3) Categorical vs Categorical
Frequency table