Section B.1: Data Exploration/Data Cleansing Flashcards
Data Exploration
The process of examining and understanding data in order to identify patterns, relationships, and trends that can be used to generate insights and support decision-making.
Data exploration involves
Data exploration involves the collection, cleaning/pre-processing, visualisation, and analysis of data.
Data Exploration processes (VMOVVCUBM)
Variable identification
Missing values treatment
Outlier treatment
Variable transformation
Variable creation
Check data structure
Univariate analysis
Bivariate analysis
Multivariate analysis
Variable identification
Identifying: Variable type (dependent or independent), data type (numeric or character), and variable category (categorical or continuous)
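As a minimal sketch, base R can report each variable's data type and category. The data frame `weather` and its columns are hypothetical, invented for illustration; which variable is dependent or independent is a modelling decision that code cannot detect.

```r
# Hypothetical data frame, invented for illustration
weather <- data.frame(
  temp = c(12.5, 15.1, 9.8),                      # numeric, continuous
  sky  = factor(c("clear", "cloudy", "clear")),   # factor, categorical
  city = c("Leeds", "York", "Hull"),              # character, categorical
  stringsAsFactors = FALSE
)

# Data type of each column
col_types <- sapply(weather, class)

# Treat numeric columns as continuous and everything else as categorical
col_category <- ifelse(sapply(weather, is.numeric), "continuous", "categorical")

print(col_types)
print(col_category)
```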
Data Cleansing/Pre-processing
Data cleansing is the process of detecting and correcting anomalies within the data to ensure it is accurate, complete, and useful for analysis.
Missing values deletion
Missing values are deleted either:
List-wise: Removing the whole row that contains a missing value, for simplicity
Pair-wise: Analysing each pair of variables using every case where both values are present, to keep as many cases as possible
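The two deletion approaches might look like this in base R. This is a sketch on a hypothetical data frame `df`, using a correlation matrix as the example analysis.

```r
# Hypothetical data with missing values (NA)
df <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(2, NA, 6, 8, 10),
  z = c(5, 3, 4, NA, 1)
)

# List-wise deletion: drop every row that contains at least one NA
listwise <- na.omit(df)

# Pair-wise deletion: each correlation uses all rows where that pair is complete
pairwise_cor <- cor(df, use = "pairwise.complete.obs")

print(nrow(listwise))
print(pairwise_cor)
```

Here list-wise deletion keeps only the two fully complete rows, while each pair-wise correlation is computed from three rows.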
Missing values
Missing values are data that are absent because of data collection or extraction problems, and they can distort estimates. They are classified into three types: MAR (Missing At Random), MNAR (Missing Not At Random), and MCAR (Missing Completely At Random).
Methods for treating missing values (3)
Mean/mode/median imputation
Prediction modelling
KNN imputation
What is Mean/median/mode imputation for missing value treatment?
Mean/mode/median imputation - estimating and replacing missing values based on the mean, median, or mode of all known values of that variable
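A minimal base R sketch of the three variants, using hypothetical vectors `age` and `colour`. Base R has no built-in function for the statistical mode, so the sketch counts values with table().

```r
# Hypothetical numeric and categorical variables with missing values
age    <- c(23, 31, NA, 45, 31, NA)
colour <- c("red", "blue", NA, "red", "red")

# Mean imputation: replace NA with the mean of the known values
age_mean <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)

# Median imputation: replace NA with the median of the known values
age_median <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# Mode imputation: the most frequent known value (table() ignores NA by default)
mode_value  <- names(which.max(table(colour)))
colour_mode <- ifelse(is.na(colour), mode_value, colour)

print(age_mean)
print(age_median)
print(colour_mode)
```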
What is prediction modelling for missing values treatment?
Predicting the missing values based on regression modelling
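One way this could be done is with a simple linear regression. The data frame `df` and the choice of lm() are assumptions for illustration; other regression models would follow the same pattern.

```r
# Hypothetical data: y has missing values, x is fully observed
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, NA, 8.2, NA, 12.1)
)

# Fit a linear regression on the rows where y is known
# (lm() drops rows containing NA under the default na.action)
fit <- lm(y ~ x, data = df)

# Predict y for the rows where it is missing and fill them in
missing_rows <- is.na(df$y)
df$y[missing_rows] <- predict(fit, newdata = df[missing_rows, , drop = FALSE])

print(df)
```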
What is KNN imputation?
kNN (k-nearest neighbours) imputation fills in a missing value using the values of the k observations that are nearest, or most similar, to the observation with the missing value.
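The idea can be sketched by hand in base R. The data, the choice of k = 2, the Euclidean distance, and averaging the neighbours' values are all assumptions for illustration; in practice the features would usually be scaled first, and a dedicated package would handle more than one missing value.

```r
# Hypothetical data: 'income' has one missing value; 'age' and 'years' are complete
df <- data.frame(
  age    = c(25, 27, 45, 47, 26),
  years  = c(2, 3, 20, 22, 2),
  income = c(30, 32, 60, 62, NA)
)

k <- 2                                # number of neighbours (arbitrary choice)
target <- which(is.na(df$income))     # the single row to impute
donors <- which(!is.na(df$income))    # rows with a known income

# Euclidean distance from the target row to each donor, using the complete columns
features <- as.matrix(df[, c("age", "years")])
diffs <- sweep(features[donors, , drop = FALSE], 2, features[target, ])
dists <- sqrt(rowSums(diffs^2))

# Impute with the mean income of the k nearest donors
nearest <- donors[order(dists)[1:k]]
df$income[target] <- mean(df$income[nearest])

print(nearest)
print(df$income[target])
```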
What is an outlier?
An outlier is an extreme value within a dataset that diverges from the overall sample pattern, which can distort statistics, measures, models and graphs.
They are commonly defined as values more than 1.5 x IQR below the first quartile or above the third quartile, or more than 3 x σ from the mean (three-sigma limit).
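Both rules can be sketched in base R on a hypothetical sample `x`. In very small samples the three-sigma rule may flag nothing, because a single extreme value inflates the standard deviation.

```r
# Hypothetical sample: twenty typical values and one extreme value
x <- c(rep(10:14, times = 4), 95)

# IQR rule: flag values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
q <- quantile(x, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(x)
fence_high <- q[[2]] + 1.5 * IQR(x)
iqr_outliers <- x[x < fence_low | x > fence_high]

# Three-sigma rule: flag values more than 3 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
sigma_outliers <- x[abs(z) > 3]

print(c(fence_low, fence_high))
print(iqr_outliers)
print(sigma_outliers)
```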
How would you detect outliers in a dataset?
Detect outliers by graphing the features or data points using a scatterplot or boxplot.
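As a sketch of the boxplot approach, boxplot() returns the points it would draw beyond the whiskers. The sample `values` is hypothetical, and plot = FALSE is used here only so the example runs without a graphics device; removing it draws the plot.

```r
# Hypothetical sample: twenty typical values and one extreme value
values <- c(rep(10:14, times = 4), 95)

# boxplot() computes the box and whiskers; 'out' holds the points beyond the whiskers
bp <- boxplot(values, plot = FALSE)
flagged <- bp$out

print(flagged)
```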
What is Feature/Variable creation?
Feature or Variable Creation is the process of generating a new variable or feature based on existing variables.
E.g. data$windchill <- data$windspeed / data$temp
What are the Data Structures (in R)?
All data in R is stored in objects. Five common types of object are: Vector, Matrix, List, Data Frame, and Factor.
Vector
A vector is R's equivalent of the one-dimensional array found in many other programming languages - a collection of cells with a fixed size where all cells are the same type (e.g. numeric or character).
Created using an expression such as 1:7 (below) or the c() function.
> x = 1:7
> x
[1] 1 2 3 4 5 6 7
Matrix
A matrix is a two-dimensional structure with a fixed size in which all cells are the same type. It is created using the matrix() function.
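For example, a small matrix could be built as follows (a sketch; the name `m` is arbitrary):

```r
# Fill a 2 x 3 matrix from the values 1 to 6, column by column (R's default)
m <- matrix(1:6, nrow = 2, ncol = 3)

print(m)
print(dim(m))     # number of rows, then number of columns
print(m[2, 3])    # element in row 2, column 3
```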
List
A list is a data structure that can hold different types of data and is not a fixed size, and is created using the list() function.
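A short sketch of a list holding mixed types and growing after creation (the name `person` and its contents are made up):

```r
# A list can mix types: a character string, a number and a numeric vector
person <- list(name = "Ada", age = 36, scores = c(90, 85))

# Lists are not fixed in size: add a new element by name
person$city <- "London"

print(length(person))
print(person$scores[2])
```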
Factors
A factor is a data structure that can use levels to store predefined categorical data, and is created using the factor() function.
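A minimal sketch of a factor with predefined levels (the variable `size` and its categories are hypothetical):

```r
# Store categorical data with an explicit, ordered set of levels
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"))

print(levels(size))       # the predefined categories
print(table(size))        # counts per level
print(as.integer(size))   # each value is stored as an index into the levels
```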
Data Frame
A data frame is a data structure that is a table with rows and columns, where each column stores data of the same type, but different columns can store different data types. It is created using the data.frame() function.
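A small sketch of a data frame whose columns have different types (the name `people` and its values are invented):

```r
# Each column has one type, but the types differ between columns
people <- data.frame(
  name   = c("Ada", "Ben", "Cy"),    # character
  age    = c(36, 41, 29),            # numeric
  member = c(TRUE, FALSE, TRUE),     # logical
  stringsAsFactors = FALSE           # keep 'name' as character in older R versions
)

str(people)
print(sapply(people, class))
```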
Data cleansing methods? (5)
Duplicate removal, missing values treatment, outlier treatment, consistency checking, format correction.
Duplicate removal
Duplicate removal is the process of removing duplicate entries from a dataset. Duplicates should be removed because they may introduce bias, reducing the accuracy of estimates made from the data.
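In base R this could be sketched with duplicated(), here on a hypothetical `orders` table. Only rows that match an earlier row in every column are treated as duplicates.

```r
# Hypothetical data: the third row repeats the second exactly
orders <- data.frame(
  id     = c(1, 2, 2, 3, 3),
  amount = c(10, 20, 20, 30, 35)
)

# duplicated() marks rows that are identical to an earlier row
dup_flags <- duplicated(orders)

# Keep only the first occurrence of each row
deduped <- orders[!dup_flags, ]

print(sum(dup_flags))
print(deduped)
```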
What are the causes of missing values within a dataset? (2)
Data can have missing values for many reasons, which fall into two categories: data collection problems and data extraction problems.
What are the causes of outliers within a dataset?
- Bad data entry - unintentional or intentional error
- Measurement error
- Experimental/sampling error
- Processing error
- Natural outliers
Methods for treating outliers? (4 DOST)
- Delete observations
- Overwrite (mean, median, or mode imputation)
- Split the model into two (one with the outliers, one without)
- Check for Typos
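Two of these treatments, deleting and overwriting, can be sketched in base R. The sample `x`, the use of boxplot.stats() to flag outliers, and the choice of the median as the replacement are assumptions for illustration.

```r
# Hypothetical sample: twenty typical values and one extreme value
x <- c(rep(10:14, times = 4), 95)

# Flag outliers with the boxplot (1.5 * IQR) rule
is_outlier <- x %in% boxplot.stats(x)$out

# Delete: drop the flagged observations
x_deleted <- x[!is_outlier]

# Overwrite: replace flagged values with the median of the remaining values
x_overwritten <- x
x_overwritten[is_outlier] <- median(x[!is_outlier])

print(length(x_deleted))
print(tail(x_overwritten, 1))
```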