Section B.1: Data Exploration/Data Cleansing Flashcards

1
Q

Data Exploration

A

The process of examining and understanding data in order to identify patterns, relationships, and trends that can be used to generate insights and support decision-making.

2
Q

Data exploration involves

A

Data exploration involves the collection, cleaning/pre-processing, visualisation, and analysis of data.

3
Q

Data Exploration processes (VMOVVCUBM)

A

Variable identification
Missing values treatment
Outlier treatment
Variable transformation
Variable creation
Check data structure
Univariate analysis
Bivariate analysis
Multivariate analysis

4
Q

Variable identification

A

Identifying the variable type (dependent or independent), the data type (numeric or character), and the variable category (categorical or continuous).
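
As a rough illustration of variable identification in R, the sketch below inspects a small made-up data frame. The data frame `df` and its column names are assumptions for illustration only, and treating every numeric column as continuous is a simplification.

```r
# Hypothetical toy data set; the column names are assumptions for illustration.
df <- data.frame(
  age    = c(23, 35, 41, 29),            # numeric, continuous
  gender = c("F", "M", "F", "M"),        # character, categorical
  spend  = c(120.5, 80.0, 200.1, 95.3),  # numeric, continuous (dependent variable here)
  stringsAsFactors = FALSE
)

# Data type of every column
col_types <- sapply(df, class)
print(col_types)

# Variable category: treat numeric columns as continuous, the rest as categorical
col_category <- ifelse(sapply(df, is.numeric), "continuous", "categorical")
print(col_category)

# Compact overview of the structure and types
str(df)
```

Which variable is dependent and which are independent depends on the question being asked, so that part cannot be read off the data automatically.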

5
Q

Data Cleansing/Pre-processing

A

Data cleansing (also called data cleaning) is the process of detecting and correcting anomalies within the data to ensure it is accurate, complete, and useful for analysis.

6
Q

Missing values deletion

A

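Missing values can be deleted in one of two ways:
List-wise: removing every row that contains a missing value. This is simple but reduces the sample size.
Pair-wise: each analysis uses all cases that have values for the variables involved, which keeps as many cases as possible.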
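
A minimal sketch of the two deletion strategies in base R, using a made-up data frame (`df` and its values are assumptions for illustration):

```r
# Hypothetical data with one missing value in each of three different rows
df <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(2, NA, 6, 8, 10),
  z = c(5, 3, 4, NA, 1)
)

# List-wise deletion: drop every row that has at least one NA
listwise <- na.omit(df)
print(nrow(listwise))

# Pair-wise deletion: each correlation uses all rows that are
# complete for that particular pair of variables
pairwise_cor <- cor(df, use = "pairwise.complete.obs")
print(pairwise_cor)
```

Here list-wise deletion keeps only the fully complete rows, while the pair-wise correlation of `x` and `y` can still use a row whose only missing value is in `z`.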

7
Q

Missing values

A

Missing values are data points that are absent from a dataset, for example because of data collection or data extraction problems. They can distort estimates and are classified into three types: MCAR (Missing Completely At Random), MAR (Missing At Random), and MNAR (Missing Not At Random).

8
Q

Methods for treating missing values (3)

A

Mean/mode/median imputation
Prediction modelling
KNN imputation

9
Q

What is Mean/median/mode imputation for missing value treatment?

A

Mean/median/mode imputation: replacing each missing value with the mean, median, or mode of all known values of that variable.
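
A short sketch of the three variants in base R. The vector `income` is made-up data, and `stat_mode` is a hypothetical helper written for this example, since base R's `mode()` returns the storage mode rather than the most frequent value.

```r
# Hypothetical variable with two missing values
income <- c(30, 45, NA, 52, 38, NA, 45)

# Mean imputation
income_mean <- income
income_mean[is.na(income_mean)] <- mean(income, na.rm = TRUE)

# Median imputation
income_median <- income
income_median[is.na(income_median)] <- median(income, na.rm = TRUE)

# Mode imputation: stat_mode is a hypothetical helper that returns
# the most frequent non-missing value of a numeric vector
stat_mode <- function(v) {
  v <- v[!is.na(v)]
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
income_mode <- income
income_mode[is.na(income_mode)] <- stat_mode(income)

print(income_mean)
print(income_median)
print(income_mode)
```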

10
Q

What is prediction modelling for missing values treatment?

A

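Predicting the missing values with a model (for example, a regression) that is fitted on the observations where the variable is known, using the other variables as predictors.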
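
One possible sketch of this idea uses a simple linear regression in base R. The data frame and the choice of `height` as the only predictor are assumptions for illustration.

```r
# Hypothetical data: weight is missing for the last observation
df <- data.frame(
  height = c(150, 160, 170, 180, 190, 165),
  weight = c(50, 60, 70, 80, 90, NA)
)

missing <- is.na(df$weight)

# Fit the model only on rows where the target is known
fit <- lm(weight ~ height, data = df[!missing, ])

# Predict the missing values from the fitted model
df$weight[missing] <- predict(fit, newdata = df[missing, , drop = FALSE])
print(df)
```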

11
Q

What is KNN imputation?

A

kNN (k-nearest neighbours) imputation fills in a missing value using the values of the k observations that are nearest, or most similar, to the observation with the missing value.
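
To make the idea concrete, here is a deliberately simplified sketch with a single predictor and absolute distance. `knn_impute` is a hypothetical helper written for this example, not a library function; in practice a dedicated imputation package would normally be used.

```r
# Hypothetical data: weight is missing for the last observation
df <- data.frame(
  height = c(150, 152, 170, 172, 190, 171),
  weight = c(50, 52, 70, 72, 90, NA)
)

# knn_impute (hypothetical helper): for each missing target value, average
# the target over the k rows whose predictor value is closest
knn_impute <- function(target, predictor, k = 2) {
  filled <- target
  known <- which(!is.na(target))
  for (i in which(is.na(target))) {
    dist <- abs(predictor[known] - predictor[i])  # distance on the predictor
    nearest <- known[order(dist)[seq_len(k)]]     # the k closest known rows
    filled[i] <- mean(target[nearest])            # average their target values
  }
  filled
}

df$weight <- knn_impute(df$weight, df$height, k = 2)
print(df)
```

With several predictors, the distance would usually be computed over scaled variables so that no single variable dominates.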

12
Q

What is an outlier?

A

An outlier is an extreme value within a dataset that diverges from the overall sample pattern and can distort statistics, measures, models, and graphs.
Common rules of thumb flag values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, or values more than 3σ from the mean (the three-sigma limit).
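
Both rules of thumb can be sketched in a few lines of base R. The vector `x` is made-up data with one deliberately extreme value.

```r
# Hypothetical sample: 20 ordinary values plus one extreme value
x <- c(rep(c(10, 11, 12, 13), times = 5), 40)

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q <- quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr
iqr_outliers <- x[x < lower | x > upper]

# Three-sigma rule: flag values more than 3 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
sigma_outliers <- x[abs(z) > 3]

print(iqr_outliers)
print(sigma_outliers)
```

In very small samples the three-sigma rule may never trigger, because a single extreme value also inflates the standard deviation.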

13
Q

How would you detect outliers in a dataset?

A

Detect outliers by graphing the features or data points using a scatterplot or boxplot.
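
A small sketch of the graphical approach in base R, using made-up data. `boxplot.stats()` returns the same points that `boxplot()` draws beyond the whiskers, so it can be used to list them.

```r
# Hypothetical sample: 20 ordinary values plus one extreme value
x <- c(rep(c(10, 11, 12, 13), times = 5), 40)

# Boxplot: points beyond the whiskers are drawn individually
boxplot(x, main = "Boxplot of x")

# Scatterplot of the values against their index
plot(x, main = "Scatterplot of x")

# The points the boxplot treats as outliers
flagged <- boxplot.stats(x)$out
print(flagged)
```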

14
Q

What is Feature/Variable creation?

A

Feature or Variable Creation is the process of generating a new variable or feature based on existing variables.

E.g. data$windchill <- data$windspeed / data$temp

15
Q

What are the Data Structures (in R)?

A

All data in R is stored in objects. The five types of object covered here are: Vector, Matrix, List, Data Frame, and Factor.

16
Q

Vector

A

A vector is R's equivalent of a one-dimensional array in most other programming languages: a collection of cells with a fixed size where all cells are the same type (e.g. numeric or character).
It is created using the colon operator (below) or the c() function.

> x <- 1:7
> x
[1] 1 2 3 4 5 6 7

17
Q

Matrix

A

A matrix is a two-dimensional vector with a fixed size in which all cells are the same type. It is created using the matrix() function.
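
A brief sketch of creating and indexing a matrix (the values are arbitrary):

```r
# matrix() fills column by column unless byrow = TRUE is given
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)
print(dim(m))    # number of rows, number of columns
print(m[2, 3])   # element in row 2, column 3

# Mixing types coerces every cell to a common type (here: character)
m_chr <- matrix(c(1, "a", 3, 4), nrow = 2)
print(class(m_chr[1, 1]))
```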

18
Q

List

A

A list is a data structure that can hold elements of different types and has no fixed size. It is created using the list() function.
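
A brief sketch of a list holding mixed types (the names and values are made up):

```r
# Elements can have different types and lengths
person <- list(name = "Ada", age = 36, scores = c(90, 85))

# A list can grow after creation
person$email <- "ada@example.com"

print(length(person))
print(class(person$name))
print(class(person$scores))
```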

19
Q

Factors

A

A factor is a data structure that stores categorical data as a set of predefined levels. It is created using the factor() function.
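
A brief sketch of a factor with explicitly ordered levels (the category names are made up):

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

print(levels(sizes))      # the predefined categories
print(table(sizes))       # counts per level
print(as.integer(sizes))  # underlying integer codes

# Assigning a value that is not a predefined level yields NA
# (R normally emits a warning here; it is suppressed for a clean run)
bad <- sizes
suppressWarnings(bad[2] <- "huge")
print(bad)
```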

20
Q

Data Frame

A

A data frame is a table with rows and columns, where each column stores data of a single type but different columns can store different types. It is created using the data.frame() function.
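
A brief sketch of a data frame with columns of different types (the column names and values are made up):

```r
df <- data.frame(
  id     = 1:3,
  name   = c("a", "b", "c"),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

print(sapply(df, class))  # one type per column
print(df$name)            # access a column by name
print(df[df$passed, ])    # filter rows with a logical condition
```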

21
Q

Data cleansing methods? (5)

A

Duplicate removal, missing values treatment, outlier treatment, consistency checking, format correction.

22
Q

Duplicate removal

A

Duplicate removal is the process of removing duplicate entries from a dataset. They should be removed as they may lead to bias, hurting the accuracy of estimates made about the data.
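
A minimal sketch of duplicate removal in base R, on a made-up data frame. Only rows that match on every column are treated as duplicates here.

```r
# Hypothetical data: the third row repeats the second exactly
df <- data.frame(
  id    = c(1, 2, 2, 3, 3),
  value = c("a", "b", "b", "c", "d")
)

# duplicated() marks rows that repeat an earlier row
dup_rows <- duplicated(df)
print(dup_rows)

# Keep only the first occurrence of each row (unique(df) is equivalent)
deduped <- df[!dup_rows, ]
print(deduped)
```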

23
Q

What are the causes of missing values within a dataset? (2)

A

Data can have missing values for many reasons, which fall into two categories: data collection problems and data extraction problems.

24
Q

What are the causes of outliers within a dataset?

A

  1. Bad data entry - unintentional or intentional error
  2. Measurement error
  3. Experimental/sampling error
  4. Processing error
  5. Natural outliers

25
Q

Methods for treating outliers? (4: DOST)

A

  1. Delete observations
  2. Overwrite (mean, median, or mode imputation)
  3. Split the model into two (one with the outliers, one without)
  4. Typos: check for them and correct them
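
Three of these treatments can be sketched in base R on made-up data. Using `boxplot.stats()` to flag the outliers and the median as the overwrite value are assumptions chosen for illustration.

```r
# Hypothetical sample: 20 ordinary values plus one extreme value
x <- c(rep(c(10, 11, 12, 13), times = 5), 40)
is_out <- x %in% boxplot.stats(x)$out

# Delete: drop the flagged observations
x_deleted <- x[!is_out]

# Overwrite: replace flagged values with the median of the remaining values
x_overwritten <- x
x_overwritten[is_out] <- median(x[!is_out])

# Split: run the same analysis with and without the outliers and compare
mean_with <- mean(x)
mean_without <- mean(x_deleted)
print(c(with = mean_with, without = mean_without))
```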