Section B.1: Data Exploration/Data Cleansing Flashcards by Holly Hayes

Data Exploration

The process of examining and understanding data in order to identify patterns, relationships, and trends that can be used to generate insights and support decision-making.

Data exploration involves

Data exploration involves the collection, cleaning/pre-processing, visualisation, and analysis of data.

Data Exploration processes (VMOVVCUBM)

Variable identification
Missing values treatment
Outlier treatment
Variable transformation
Variable creation
Check data structure
Univariate analysis
Bivariate analysis
Multivariate analysis
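
The three analysis steps at the end of this list can be sketched in base R. This is only an illustration: the built-in mtcars dataset and the chosen columns are assumptions made for the example, and a regression is just one of several ways to look at more than two variables at once.

```r
# Illustrative only: mtcars ships with R and is not part of the flashcards.
uni <- summary(mtcars$mpg)                   # univariate: one variable on its own
bi <- cor(mtcars$mpg, mtcars$wt)             # bivariate: relationship between two variables
multi <- lm(mpg ~ wt + hp, data = mtcars)    # several variables considered together

print(uni)
print(bi)
print(coef(multi))
```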

Variable identification

Identifying: Variable type (dependent or independent), Data type (numeric or character), and Variable category (categorical or continuous)
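
A minimal sketch of checking data type and variable category in base R. The data frame and its column names are hypothetical. Whether a variable is dependent or independent comes from the question being asked, so it cannot be read off the data.

```r
# Hypothetical data frame, used only for illustration.
df <- data.frame(
  age = c(23, 35, 41),
  gender = factor(c("F", "M", "F")),
  name = c("Ann", "Bob", "Cy"),
  stringsAsFactors = FALSE
)

str(df)                                   # compact overview of every column
types <- sapply(df, class)                # data type of each column
is_continuous <- sapply(df, is.numeric)   # rough split: numeric vs everything else
print(types)
print(is_continuous)
```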

Data Cleansing/Pre-processing

Data cleaning is the process of detecting and correcting anomalies within the data to ensure it is accurate, complete, and useful for analysis.

Missing values deletion

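Missing values can be deleted in one of two ways:
List-wise: Removing the whole row whenever any of its values is missing. This is simple but reduces the sample size.
Pair-wise: For each analysis, using every case that has values for the variables involved. This keeps as many cases as possible.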
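
The two deletion strategies can be sketched in base R as follows. The data frame is made up for illustration, and a correlation matrix is used as the example analysis for the pair-wise case.

```r
# Hypothetical data with one missing value in x and one in y.
df <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(2, 4, 6, NA, 10),
  z = c(5, 3, 4, 2, 1)
)

# List-wise: drop every row that contains at least one NA.
listwise <- na.omit(df)

# Pair-wise: each correlation uses all rows where both variables are present.
pairwise_cor <- cor(df, use = "pairwise.complete.obs")

print(nrow(listwise))
print(pairwise_cor)
```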

Missing values

Missing values are data points that were not recorded or were lost, for example because of data collection or data extraction problems. They can distort estimates. They are classified into three types: MCAR (Missing Completely At Random), MAR (Missing At Random), and MNAR (Missing Not At Random).
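
Before treating missing values it helps to count them. A small base R sketch, using a made-up data frame:

```r
# Hypothetical data frame with missing entries (NA).
df <- data.frame(
  age = c(25, NA, 31, 40),
  income = c(NA, NA, 52000, 61000)
)

has_missing <- anyNA(df)                   # TRUE if any value is missing
na_per_column <- colSums(is.na(df))        # number of NAs in each column
complete_rows <- sum(complete.cases(df))   # rows with no NA at all

print(na_per_column)
print(complete_rows)
```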

Methods for treating missing values (3)

Mean/mode/median imputation
Prediction modelling
KNN imputation

What is Mean/median/mode imputation for missing value treatment?

Mean/mode/median imputation - estimating and replacing missing values based on the mean, median, or mode of all known values of that variable
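
A minimal sketch of the three variants in base R, on made-up vectors. Base R has no built-in function for the statistical mode, so stat_mode below is a hypothetical helper written for this example.

```r
x <- c(2, 4, NA, 6, 100)

# Replace the NA with the mean or the median of the known values.
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Hypothetical helper: most frequent non-missing value, for categorical data.
stat_mode <- function(v) {
  v <- v[!is.na(v)]
  names(which.max(table(v)))
}

colour <- c("red", "blue", NA, "red")
colour_mode <- ifelse(is.na(colour), stat_mode(colour), colour)

print(x_mean)
print(x_median)
print(colour_mode)
```

The median is less affected than the mean by the extreme value 100, which is one reason to prefer it for skewed data.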

What is prediction modelling for missing values treatment?

Predicting the missing values with a model (for example, a regression) fitted on the cases where the value is known, using the other variables as predictors.
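
A sketch of regression-based imputation with lm() and predict(). The data frame and the choice of a simple linear model are assumptions for illustration.

```r
# Hypothetical data: y is missing in the third row.
df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, NA, 8, 10)
)

# Fit on the rows where y is known (lm drops rows with NA by default).
fit <- lm(y ~ x, data = df)

# Predict y for the rows where it is missing and fill the gaps.
missing <- is.na(df$y)
df$y[missing] <- predict(fit, newdata = df[missing, , drop = FALSE])

print(df)
```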

What is KNN imputation?

kNN (k-nearest neighbours) imputation fills in a missing value using the values of the k observations that are nearest, or most similar, to the observation with the missing value.
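
A minimal base R sketch of the idea. knn_impute is a hypothetical helper written for this example, not a library function. It assumes the predictor columns are numeric, complete, and not constant, and it averages the target over the k nearest rows by Euclidean distance.

```r
# Hypothetical helper: impute NAs in `target` from the k nearest complete rows.
knn_impute <- function(df, target, k = 2) {
  predictors <- setdiff(names(df), target)
  # Scale predictors so that no single variable dominates the distance.
  X <- scale(as.matrix(df[, predictors, drop = FALSE]))
  known <- which(!is.na(df[[target]]))
  for (i in which(is.na(df[[target]]))) {
    # Euclidean distance from row i to every row with a known target value.
    d <- sqrt(rowSums(sweep(X[known, , drop = FALSE], 2, X[i, ])^2))
    nearest <- known[order(d)[seq_len(min(k, length(known)))]]
    df[[target]][i] <- mean(df[[target]][nearest])
  }
  df
}

df <- data.frame(
  height = c(150, 152, 180, 182, 151),
  weight = c(50, 52, 80, 82, NA)
)
imputed <- knn_impute(df, "weight", k = 2)
print(imputed)
```

Here the row with height 151 is closest to the rows with heights 150 and 152, so its weight is filled with the average of their weights.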

What is an outlier?

An outlier is an extreme value within a dataset that diverges from the overall sample pattern and can distort statistics, measures, models, and graphs.
Common rules of thumb define outliers as values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, or values more than 3 x σ from the mean (three-sigma limit).
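
Both rules of thumb can be sketched in base R. The vector is made up, and quantile() is used with its default method, which is one of several quartile definitions.

```r
x <- c(10, 12, 11, 13, 12, 11, 10, 13, 12, 50)

# IQR rule: fences are measured from the quartiles, not from the mean.
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
iqr_outliers <- x[x < lower | x > upper]

# Three-sigma rule: more than 3 standard deviations from the mean.
z <- (x - mean(x)) / sd(x)
sigma_outliers <- x[abs(z) > 3]

print(iqr_outliers)
print(sigma_outliers)
```

In this small sample the IQR rule flags 50 but the three-sigma rule flags nothing, because the extreme value itself inflates the standard deviation.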

How would you detect outliers in a dataset?

Detect outliers by graphing the features or data points using a scatterplot or boxplot.
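
A short sketch using boxplot() on a made-up vector. With plot = FALSE the function only returns the statistics it would draw; leaving that argument out draws the plot, where the points beyond the whiskers are the candidate outliers.

```r
x <- c(10, 12, 11, 13, 12, 11, 10, 13, 12, 50)

# Compute the boxplot statistics without opening a graphics device.
bp <- boxplot(x, plot = FALSE)
flagged <- bp$out      # values that would be drawn as separate points

print(bp$stats)        # lower whisker, lower hinge, median, upper hinge, upper whisker
print(flagged)
```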

What is Feature/Variable creation?

Feature or Variable Creation is the process of generating a new variable or feature based on existing variables.

E.g. data$windchill <- data$windspeed / data$temp

What are the Data Structures (in R)?

All data in R is stored in objects. The five data structures covered here are: Vector, Matrix, List, Data Frame, and Factor.

Vector

A vector is R's counterpart to a one-dimensional array in many other programming languages: a collection of cells with a fixed size where all cells are the same type (for example, numeric or character).
It is created using an assignment (below) or the c() function.

> x = 1:7
> x
[1] 1 2 3 4 5 6 7

Matrix

A matrix is a two-dimensional structure with a fixed size in which all cells are the same type. It is created using the matrix() function.
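
A small example of matrix(), with values chosen arbitrarily for illustration:

```r
# matrix() fills column by column unless byrow = TRUE is given.
m <- matrix(1:6, nrow = 2, ncol = 3)

print(m)
print(dim(m))      # number of rows, number of columns
print(m[2, 3])     # element in row 2, column 3
```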

List

A list is a data structure that can hold different types of data and is not a fixed size, and is created using the list() function.
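
A short sketch of list(), with made-up element names and values:

```r
# Elements can have different types and lengths.
person <- list(name = "Ada", age = 36, scores = c(90, 85))

# A list can grow after it has been created.
person$city <- "London"

print(names(person))
print(person$scores[2])
```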

Factors

A factor is a data structure that can use levels to store predefined categorical data, and is created using the factor() function.
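
A short sketch of factor() with explicitly chosen levels. The category names are made up for illustration.

```r
# The levels argument fixes the set of categories and their order.
sizes <- factor(
  c("small", "large", "medium", "small"),
  levels = c("small", "medium", "large")
)

print(levels(sizes))       # the predefined categories
print(table(sizes))        # count of observations per level
print(as.integer(sizes))   # internal integer codes, one per observation
```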

Data Frame

A data frame is a data structure that is a table with rows and columns, where each column stores data of the same type, but different columns can store different data types. It is created using the data.frame() function.
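
A short sketch of data.frame(), with made-up columns. stringsAsFactors = FALSE is set explicitly so that the text column stays character in older versions of R as well.

```r
# Each column has a single type; types differ between columns.
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

print(df)
print(sapply(df, class))   # one type per column
print(df$name[2])          # columns are accessed with $
```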

Data cleansing methods? (5)

Duplicate removal, missing values treatment, outlier treatment, consistency checking, format correction.

Duplicate removal

Duplicate removal is the process of removing duplicate entries from a dataset. Duplicates should be removed because they can introduce bias, hurting the accuracy of estimates made from the data.
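
A minimal sketch with duplicated() in base R. The data frame is made up for illustration, and rows count as duplicates only when every column matches.

```r
# Hypothetical data: the third row repeats the second.
df <- data.frame(
  id = c(1, 2, 2, 3),
  value = c("a", "b", "b", "c")
)

dup_rows <- duplicated(df)    # TRUE for rows that repeat an earlier row
deduped <- df[!dup_rows, ]    # keep the first occurrence of each row

print(dup_rows)
print(deduped)
```

unique(df) gives the same rows in one step.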

What are the causes of missing values within a dataset? (2)

Data can have missing values for many reasons, which fall into two categories: data collection problems and data extraction problems.

What are the causes of outliers within a dataset?

Bad data entry - unintentional or intentional error
Measurement error
Experimental/sampling error
Processing error
Natural outliers

Methods for treating outliers? (4 DOST)

1. Delete observations
2. Split the model into two (one with the outliers, one without)
3. Overwrite (mean, median, or mode imputation)
4. Check for typos
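
Two of these treatments, deleting and overwriting, can be sketched in base R. The vector is made up, outliers are flagged with boxplot.stats(), and the median of the remaining values is used as the replacement. All of these are choices made for the example.

```r
x <- c(10, 12, 11, 13, 12, 11, 10, 13, 12, 50)

# Flag the values that a boxplot would show as outliers.
is_out <- x %in% boxplot.stats(x)$out

# Delete: drop the flagged observations.
x_deleted <- x[!is_out]

# Overwrite: replace flagged values with the median of the other values.
x_overwritten <- replace(x, is_out, median(x[!is_out]))

print(x_deleted)
print(x_overwritten)
```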

(25 cards)