Data Wrangling Flashcards

1
Q

What is Data Wrangling?

A

The process of exploring, transforming, and validating raw data to prepare it for a clearly defined purpose. Raw data at this stage is data that has been collated from various data sources into a data repository.

2
Q

4 Step Process for Data Wrangling

A
  • Discovery.
  • Transformation.
  • Validation.
  • Publishing.
3
Q

Discovery Phase

A

Examining and understanding your data based on your use case, and creating a plan for cleaning, structuring, organizing, and mapping it.

4
Q

Transformation Phase

A

Forms the bulk of the data wrangling process. It involves the tasks you undertake to transform the data, such as structuring, normalizing, denormalizing, cleaning, and enriching the data.

5
Q

(Part of Transformation Phase) Structuring Data

A

This task includes actions that change the form and schema of your data, for example through joins and unions.

6
Q

Joins

A

Combine columns. Columns from the first source table are combined with columns from the second source table in the same row.
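
As an illustration, here is a minimal plain-Python sketch of an inner join. The tables `customers` and `orders`, the key `customer_id`, and the helper `inner_join` are hypothetical names made up for this example.

```python
# Hypothetical sample tables, each stored as a list of row dictionaries.
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
]
orders = [
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 2, "total": 12.5},
    {"customer_id": 1, "total": 7.0},
]


def inner_join(left, right, key):
    """Combine the columns of rows from both tables that share the same key."""
    # Index the right table by key so each left row can find its matches.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            # One output row carries the columns of both source rows.
            joined.append({**left_row, **right_row})
    return joined


joined_rows = inner_join(customers, orders, "customer_id")
```

Each row of `joined_rows` holds the customer columns and the order columns side by side.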

7
Q

Unions

A

Combine rows. Rows of data from the first source table are combined with rows of data from the second source table into a single table.
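
As an illustration, a minimal plain-Python sketch of a union, assuming both tables have the same columns. The sample tables and the helpers `union_all` and `union_distinct` are hypothetical names made up for this example.

```python
# Hypothetical sample tables with identical columns.
january_sales = [{"region": "North", "sales": 100}, {"region": "South", "sales": 80}]
february_sales = [{"region": "North", "sales": 120}, {"region": "South", "sales": 80}]


def union_all(first, second):
    """Stack the rows of two tables, keeping duplicates."""
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("both tables must have the same columns")
    return first + second


def union_distinct(first, second):
    """Stack the rows of two tables and keep each distinct row once."""
    seen = set()
    result = []
    for row in union_all(first, second):
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(row)
    return result


all_rows = union_all(january_sales, february_sales)
distinct_rows = union_distinct(january_sales, february_sales)
```

The first helper keeps repeated rows; the second drops them.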

8
Q

(Part of Transformation Phase) Normalizing Data

A

Includes cleaning unused data, reducing redundancy, and reducing inconsistency.
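
One way to picture reducing redundancy is to move repeated values into their own table and keep only a reference to them. The sketch below is a plain-Python illustration; the table `employees` and its column names are hypothetical.

```python
# Hypothetical flat table: department details are repeated on every row.
employees = [
    {"employee": "Ana", "dept_name": "Sales", "dept_city": "Lima"},
    {"employee": "Ben", "dept_name": "Sales", "dept_city": "Lima"},
    {"employee": "Cy", "dept_name": "Ops", "dept_city": "Quito"},
]

dept_ids = {}        # (dept_name, dept_city) -> dept_id
dept_table = []      # one row per distinct department
employee_table = []  # one row per employee, pointing at a department

for row in employees:
    key = (row["dept_name"], row["dept_city"])
    if key not in dept_ids:
        dept_ids[key] = len(dept_ids) + 1
        dept_table.append(
            {"dept_id": dept_ids[key], "dept_name": key[0], "dept_city": key[1]}
        )
    employee_table.append({"employee": row["employee"], "dept_id": dept_ids[key]})
```

Each department is now stored once, so a change to a department's city has to be made in only one place.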

9
Q

(Part of Transformation Phase) Denormalizing Data

A

The process of combining data from multiple tables into a single table for faster querying of data for reports and analysis.

10
Q

(Part of Transformation Phase) Cleaning Data

A

Actions that fix irregularities in data in order to produce a credible and accurate analysis.

11
Q

(Part of Transformation Phase) Enriching Data

A

Adding data points to make your analysis more meaningful.
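
A minimal plain-Python sketch of adding a data point from a second source. The table `orders`, the lookup `country_region`, and the new column `region` are hypothetical names made up for this example.

```python
# Hypothetical order records and a hypothetical reference lookup.
orders = [
    {"order_id": 1, "country": "PE", "total": 30.0},
    {"order_id": 2, "country": "EC", "total": 12.5},
    {"order_id": 3, "country": "XX", "total": 7.0},
]
country_region = {"PE": "South America", "EC": "South America"}

# Add a "region" column; countries missing from the lookup are marked "Unknown".
enriched = [
    {**order, "region": country_region.get(order["country"], "Unknown")}
    for order in orders
]
```

The original columns are untouched; each row simply gains one more data point that an analysis can group or filter by.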

12
Q

Validation Phase

A

Validation rules are programming steps used to verify the consistency, quality, and security of the data after it has been structured, normalized, denormalized, cleaned, and enriched.
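
A minimal plain-Python sketch of what such rules might look like. The three rules (unique id, email contains "@", age between 0 and 120) and the sample `records` are assumptions chosen for illustration, not rules taken from any particular tool.

```python
# Hypothetical records to check.
records = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": "not-an-email", "age": 29},
    {"id": 2, "email": "cy@example.com", "age": -5},
]


def validate(rows):
    """Return a list of (row position, problem) pairs for rows that break a rule."""
    errors = []
    seen_ids = set()
    for position, row in enumerate(rows):
        if row["id"] in seen_ids:           # consistency: ids must be unique
            errors.append((position, "duplicate id"))
        seen_ids.add(row["id"])
        if "@" not in row["email"]:         # quality: crude email shape check
            errors.append((position, "invalid email"))
        if not 0 <= row["age"] <= 120:      # quality: plausible age range
            errors.append((position, "age out of range"))
    return errors


errors = validate(records)
```

An empty `errors` list would mean every row passed every rule.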

13
Q

Publishing Phase

A

Delivering the transformed and validated version of the dataset, along with metadata about the dataset, for downstream project needs.
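
A minimal sketch of packaging a dataset together with its metadata, using only the Python standard library. The helper `publish`, the sample `dataset`, and the metadata fields (`description`, `columns`, `row_count`) are hypothetical choices for this example.

```python
import csv
import io
import json

# Hypothetical validated dataset.
dataset = [{"region": "North", "sales": 100}, {"region": "South", "sales": 80}]


def publish(rows, description):
    """Return the dataset as CSV text plus a JSON document describing it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    metadata = {
        "description": description,
        "columns": list(rows[0]),
        "row_count": len(rows),
    }
    return buffer.getvalue(), json.dumps(metadata, indent=2)


csv_text, metadata_json = publish(dataset, "Monthly sales by region")
```

In practice the two strings would be written to files or a data store that downstream users can reach.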

14
Q

Some popular Data Wrangling software and tools:

A
  • Excel Spreadsheets.
  • OpenRefine.
  • Google DataPrep.
  • Watson Studio Refinery.
  • Trifacta Wrangler.
  • Python.
  • R.
15
Q

Excel Spreadsheets

A

Microsoft Excel and Google Sheets have features and built-in formulas that can help identify issues, clean, and transform data. They allow you to import data from several different sources and to clean and transform it as needed.

16
Q

OpenRefine

A

An open-source tool that allows you to import and export data in a variety of formats, such as TSV, CSV, XLS, XML, and JSON, and to clean and transform it. It offers menu-based operations, so you don't need to memorize commands or syntax.

17
Q

Google DataPrep

A

An intelligent cloud data service that allows you to visually explore, clean, and prepare structured and unstructured data for analysis. It offers suggestions on ideal next steps and automatically detects schemas, data types, and anomalies.

18
Q

Watson Studio Refinery

A

This software detects data types, automatically enforces data governance policies, cleanses data, and transforms large amounts of raw data into a form that is consumable for analysis.

19
Q

Trifacta Wrangler

A

A cloud-based service that cleans messy real-world data, arranges data into data tables, transforms data, and allows multiple team members to work simultaneously.

20
Q

Python

A

Has a huge set of libraries and packages that offer powerful data manipulation capabilities, such as:
- Jupyter.
- NumPy.
- Pandas.

21
Q

R

A

Offers a series of libraries and packages that are explicitly created for wrangling messy data. Using these libraries you can investigate, manipulate, and analyze data. Some of the libraries are:
- dplyr.
- data.table.
- jsonlite.

22
Q

Cleaning workflow includes

A
  • Inspection.
  • Cleaning.
  • Verification.
23
Q

(Cleaning workflow) Inspection

A

The process of detecting the different types of errors that the database may have. It includes figuring out the structure and content of the data, visualizing the data, and using scripts to validate it.
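
A minimal plain-Python sketch of an inspection script that profiles each column. The helper `profile` and the sample `rows` are hypothetical, and treating `None` and the empty string as "missing" is an assumption of this example.

```python
from collections import Counter

# Hypothetical raw rows with a gap in each column and mixed types in "age".
rows = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},
    {"name": "", "age": "41"},
]


def profile(table):
    """Count missing values and the value types found in each column."""
    report = {}
    for column in table[0]:
        values = [row.get(column) for row in table]
        present = [v for v in values if v is not None and v != ""]
        report[column] = {
            "missing": len(values) - len(present),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    return report


report = profile(rows)
```

Here the report would show that `age` mixes integers and strings, a hint that a type conversion is needed later.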

24
Q

(Cleaning workflow) Cleaning

A

The techniques used for cleaning datasets depend on the use case and the type of issues encountered. Some of the common data issues are:
- Missing Values.
- Duplicate Data.
- Data type conversion.
- Irrelevant Data.
- Standardizing Data.
- Syntax Errors.
- Outliers.
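
For the first two issues in the list, a minimal plain-Python sketch might look like this. The helpers `drop_duplicates` and `fill_missing`, the sample `rows`, and the choice to fill gaps with "Unknown" are assumptions of this example; whether to fill, drop, or keep missing values depends on the use case.

```python
# Hypothetical rows containing an exact duplicate and a missing value.
rows = [
    {"name": "Ana", "city": "Lima"},
    {"name": "Ana", "city": "Lima"},
    {"name": "Ben", "city": None},
]


def drop_duplicates(table):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in table:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def fill_missing(table, column, default):
    """Replace None in one column with a default value."""
    return [
        {**row, column: default if row[column] is None else row[column]}
        for row in table
    ]


cleaned = fill_missing(drop_duplicates(rows), "city", "Unknown")
```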

25
Q

Data type conversion

A

Needed to ensure that values in a field are stored as the data type of that field.
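
A minimal plain-Python sketch, assuming every raw value arrives as text. The `schema` mapping, the helper `convert`, and the decision to turn unparseable values into `None` are assumptions of this example.

```python
# Hypothetical raw rows where every value was read in as a string.
raw = [
    {"age": "34", "price": "19.90"},
    {"age": " 29 ", "price": "5"},
    {"age": "n/a", "price": "7.5"},
]

# Intended data type for each field.
schema = {"age": int, "price": float}


def convert(row, types):
    """Cast each field to its intended type; values that cannot be parsed become None."""
    converted = {}
    for column, caster in types.items():
        try:
            converted[column] = caster(row[column].strip())
        except ValueError:
            converted[column] = None
    return converted


typed = [convert(row, schema) for row in raw]
```

Values that become `None` can then be handled as missing values.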

26
Q

Standardizing Data

A

Needed to ensure date-time formats are standard across the dataset.
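
A minimal sketch that rewrites dates into one format (ISO 8601, `YYYY-MM-DD`) with the standard `datetime` module. The list `KNOWN_FORMATS` and the helper `to_iso_date` are hypothetical, and the sketch assumes slash- and dot-separated dates are day-first.

```python
from datetime import datetime

# Input formats this sketch recognizes (assumption: day comes before month).
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"]


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


dates = ["2023-01-15", "15/01/2023", "15.01.2023", "soon"]
standardized = [to_iso_date(value) for value in dates]
```

Values that come back as `None` need manual review rather than a guess.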

27
Q

Syntax errors

A

Syntax errors

28
Q

Outliers

A

Values that are vastly different from other observations in the dataset. They need to be examined for accuracy because they may be correct or incorrect.
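
One common way to flag candidates for examination is the interquartile range (IQR) rule, which is an assumption of this sketch rather than something the card prescribes. The helper `iqr_outliers` and the sample `values` are hypothetical.

```python
import statistics

# Hypothetical measurements with one suspicious value.
values = [10, 12, 11, 13, 12, 11, 95]


def iqr_outliers(data, k=1.5):
    """Return values lying more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in data if value < low or value > high]


flagged = iqr_outliers(values)
```

Flagged values are only candidates: each one still has to be checked to decide whether it is an error or a genuine observation.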

29
Q

(Cleaning workflow) Verification

A

In this step you inspect the results to establish the effectiveness and accuracy achieved by data cleaning. You need to re-inspect the data to make sure the rules and constraints applicable to the data still hold after the corrections you made.
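
A minimal plain-Python sketch of re-inspecting data after a correction. The constraint (age between 0 and 120), the helper `violations`, and the correction itself (dropping impossible ages) are assumptions made for illustration.

```python
def violations(rows):
    """Return the positions of rows that break the constraint 0 <= age <= 120."""
    return [i for i, row in enumerate(rows) if not 0 <= row["age"] <= 120]


# Hypothetical data before cleaning.
before = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": -5},
    {"name": "Cy", "age": 290},
]

# Hypothetical correction: drop rows whose age is impossible.
after = [row for row in before if 0 <= row["age"] <= 120]

# Re-run the same check on both versions to see what the cleaning achieved.
problems_before = violations(before)
problems_after = violations(after)
rows_removed = len(before) - len(after)
```

Comparing the two results shows whether the constraint now holds and how many rows the correction cost.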

30
Q

Documentation

A

The process of writing down all changes, and the reasons behind them, in a document for future reference in similar cases.