Section B.1: Data Exploration/Data Cleansing Flashcards
Data Exploration
The process of examining and understanding data in order to identify patterns, relationships, and trends that can be used to generate insights and support decision-making.
Data exploration involves
Data exploration involves the collection, cleaning/pre-processing, visualisation, and analysis of data.
Data Exploration processes (VMOVVCUBM)
Variable identification
Missing values treatment
Outlier treatment
Variable transformation
Variable creation
Check data structure
Univariate analysis
Bivariate analysis
Multivariate analysis
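The last three steps of the list above can be sketched in base R as follows. This is only a rough illustration: it uses R's built-in mtcars dataset, and the choice of columns (mpg, wt, hp) is an assumption made for the example.

```r
# Built-in dataset, used purely for illustration
df <- mtcars[, c("mpg", "wt", "hp")]

# Univariate analysis: describe one variable at a time
mpg_summary <- summary(df$mpg)

# Bivariate analysis: relationship between two variables
mpg_wt_cor <- cor(df$mpg, df$wt)

# Multivariate analysis: several variables considered together
cor_matrix <- cor(df)
fit <- lm(mpg ~ wt + hp, data = df)

print(mpg_summary)
print(mpg_wt_cor)
print(round(cor_matrix, 2))
print(coef(fit))
```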
Variable identification
Identifying the variable type (dependent or independent), the data type (numeric or character), and the variable category (categorical or continuous).
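A minimal sketch of variable identification in base R. The data frame and its column names are hypothetical, and the numeric-means-continuous rule is a deliberate simplification. Whether a variable is dependent or independent is decided by the analyst from the question being asked, so it is not derived in code.

```r
# Hypothetical data frame, for illustration only
data <- data.frame(
  temp = c(12.5, 15.1, 9.8, 20.3),
  windspeed = c(10, 7, 22, 5),
  season = c("spring", "spring", "winter", "summer"),
  stringsAsFactors = FALSE
)

# Data type of each column (numeric or character)
col_classes <- sapply(data, class)

# Variable category, using a crude rule: numeric columns are treated
# as continuous and everything else as categorical
col_category <- ifelse(sapply(data, is.numeric), "continuous", "categorical")

str(data)            # compact overview of the structure
print(col_classes)
print(col_category)
```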
Data Cleansing/Pre-processing
Data cleaning is the process of detecting and correcting anomalies within the data to ensure it is accurate, complete, and useful for analysis.
Missing values deletion
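Missing values are deleted in one of two ways:
List-wise: Removing every row (case) that contains a missing value. Simple, but it reduces the sample size.
Pair-wise: Each analysis uses all cases that have values for the variables involved, which keeps as many cases as possible.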
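The two deletion strategies can be sketched in base R as below. The data frame is made up for illustration. Pair-wise deletion is shown through a correlation matrix, which is one common place it appears.

```r
# Hypothetical data with one missing value in each of three rows
data <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(2, NA, 6, 8, 10),
  z = c(5, 3, 4, NA, 1)
)

# List-wise deletion: drop every row that has at least one NA
listwise <- na.omit(data)

# Pair-wise deletion: each correlation uses all rows that are complete
# for that particular pair of variables
pairwise_cor <- cor(data, use = "pairwise.complete.obs")

print(nrow(listwise))   # only rows 1 and 5 are fully complete
print(pairwise_cor)
```

Here list-wise deletion keeps 2 of the 5 rows, while each pair-wise correlation is still computed from 3 rows.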
Missing values
Missing values are data points that were not recorded, for example because of problems during data extraction or collection. They can distort estimates. They are classified into three types: MCAR (Missing Completely At Random), MAR (Missing At Random), and MNAR (Missing Not At Random).
Methods for treating missing values (3)
Mean/mode/median imputation
Prediction modelling
KNN imputation
What is mean/median/mode imputation for missing value treatment?
Mean/median/mode imputation - replacing each missing value with the mean, median, or mode of all known values of that variable. The mean or median is typically used for continuous variables and the mode for categorical variables.
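A small sketch of the three imputations in base R, using made-up vectors. Base R has no built-in function for the statistical mode, so stat_mode below is a hypothetical helper written for this example.

```r
temp <- c(10, 12, NA, 14, 100)
season <- c("winter", "summer", NA, "winter", "spring")

# Mean imputation (sensitive to the extreme value 100)
temp_mean <- temp
temp_mean[is.na(temp_mean)] <- mean(temp, na.rm = TRUE)

# Median imputation (more robust to extreme values)
temp_median <- temp
temp_median[is.na(temp_median)] <- median(temp, na.rm = TRUE)

# Mode imputation for a categorical variable
# stat_mode is a hypothetical helper; table() ignores NA by default
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[which.max(counts)]
}
season_mode <- season
season_mode[is.na(season_mode)] <- stat_mode(season)

print(temp_mean)     # missing value replaced by 34
print(temp_median)   # missing value replaced by 13
print(season_mode)   # missing value replaced by "winter"
```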
What is prediction modelling for missing values treatment?
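Fitting a model (e.g. a regression) on the cases where the variable is known, using the other variables as predictors, and then using that model to predict the missing values.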
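As a rough sketch, regression-based imputation could look like this in base R. The data are contrived so that temp is an exact linear function of windspeed. Real data would not fit this cleanly, and the choice of predictor is an assumption made for the example.

```r
# Hypothetical data: temp is missing in row 4
data <- data.frame(
  windspeed = c(5, 10, 15, 20, 25, 30),
  temp = c(20, 18, 16, NA, 12, 10)
)

known <- !is.na(data$temp)

# Fit the regression only on rows where temp is known
fit <- lm(temp ~ windspeed, data = data[known, ])

# Predict temp for the rows where it is missing
data$temp[!known] <- predict(fit, newdata = data[!known, ])

print(data)
```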
What is KNN imputation?
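KNN (k-nearest neighbours) imputation replaces a missing value with a value derived from the k observations most similar to the one with the missing value, where similarity is measured by a distance computed on the other variables.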
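A minimal base-R sketch of the idea, not a reference implementation. knn_impute is a hypothetical helper written for this example. It assumes a numeric target, predictors with no missing values, Euclidean distance on standardised predictors, and the mean of the neighbours as the imputed value.

```r
# Hypothetical data: temp is missing in the last row
data <- data.frame(
  windspeed = c(5, 6, 7, 20, 21, 6.5),
  temp      = c(20, 19, 18, 5, 4, NA)
)

knn_impute <- function(df, target, predictors, k = 3) {
  missing <- which(is.na(df[[target]]))
  known <- which(!is.na(df[[target]]))
  # Standardise predictors so that no single variable dominates the distance
  X <- scale(as.matrix(df[, predictors, drop = FALSE]))
  for (i in missing) {
    # Euclidean distance from row i to every row with a known target
    diffs <- sweep(X[known, , drop = FALSE], 2, X[i, ])
    d <- sqrt(rowSums(diffs^2))
    # Indices of the k closest rows (fewer if not enough rows are known)
    nearest <- known[order(d)[seq_len(min(k, length(known)))]]
    # Impute with the mean target value of those neighbours
    df[[target]][i] <- mean(df[[target]][nearest])
  }
  df
}

imputed <- knn_impute(data, target = "temp", predictors = "windspeed", k = 3)
print(imputed)
```

In this example the three nearest neighbours of the incomplete row are the rows with windspeed 5, 6 and 7, so the imputed temp is the mean of 20, 19 and 18.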
What is an outlier?
An outlier is an extreme value within a dataset that diverges from the overall sample pattern, which can distort statistics, measures, models and graphs.
They are commonly defined as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, or as values more than 3 × σ from the mean (three sigma limit).
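Both rules of thumb can be sketched in base R as follows. The sample vector is made up for illustration, and quantile() is used with its default settings (other quartile definitions can give slightly different fences).

```r
# Hypothetical sample: 20 ordinary values and one extreme value
x <- c(rep(10:14, times = 4), 50)

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
iqr_outliers <- x[x < lower | x > upper]

# Three-sigma rule: flag values more than 3 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
sigma_outliers <- x[abs(z) > 3]

print(c(lower = lower, upper = upper))
print(iqr_outliers)
print(sigma_outliers)
```

For this sample both rules flag only the value 50.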
How would you detect outliers in a dataset?
Detect outliers by graphing the features or data points using a scatterplot or boxplot.
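A short sketch of the graphical approach in base R, with made-up data. boxplot() is called with plot = FALSE so that only the statistics behind the plot are returned. The commented lines show how the plots themselves could be drawn.

```r
# Hypothetical temperature readings with one suspicious value
temp <- c(rep(15:19, times = 4), -30)

# Points that a boxplot would draw beyond the whiskers
bp <- boxplot(temp, plot = FALSE)
flagged <- bp$out
print(flagged)

# To draw the plots (this opens a graphics device):
# boxplot(temp, main = "Boxplot of temp")
# plot(seq_along(temp), temp, main = "temp by observation index")
```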
What is Feature/Variable creation?
Feature or Variable Creation is the process of generating a new variable or feature based on existing variables.
E.g. data$windchill <- data$windspeed / data$temp
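A runnable version of the idea, assuming a hypothetical data frame with windspeed and temp columns. The ratio is the flashcard's illustrative formula, not a meteorological wind-chill formula. The temp_f column is an extra made-up example that assumes temp is in degrees Celsius.

```r
# Hypothetical data frame
data <- data.frame(
  windspeed = c(10, 20, 30),
  temp = c(5, 10, 15)
)

# New variable derived from two existing variables
data$windchill <- data$windspeed / data$temp

# transform() is an alternative way to add a derived column
data <- transform(data, temp_f = temp * 9 / 5 + 32)

print(data)
```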
What are the Data Structures (in R)?
All data in R is stored in objects. Five commonly used types of object are: Vector, Matrix, List, Data Frame, and Factor.
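One object of each type can be created in base R as sketched below. The variable names and values are arbitrary.

```r
v <- c(1.5, 2.5, 3.5)                      # vector: elements of one type
m <- matrix(1:6, nrow = 2)                 # matrix: two-dimensional, one type
l <- list(name = "a", values = v)          # list: elements may differ in type
df <- data.frame(id = 1:3, score = v)      # data frame: columns of equal length
f <- factor(c("low", "high", "low"))       # factor: categorical values with levels

# First class label of each object
classes <- sapply(
  list(vector = v, matrix = m, list = l, data_frame = df, factor = f),
  function(obj) class(obj)[1]
)
print(classes)
print(levels(f))
```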