Data Wrangling Flashcards
What is Data Wrangling?
It is the process of exploring, transforming, and validating raw data to prepare it for a clearly defined purpose. Raw data at this stage is data that has been collated from various data sources into a data repository.
4 Step Process for Data Wrangling
- Discovery.
- Transformation.
- Validation.
- Publishing.
Discovery Phase
Is about examining and understanding your data based on your use case, and creating a plan for cleaning, structuring, organizing, and mapping it.
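As a minimal sketch of discovery in pandas (the file name sales.csv and its columns are hypothetical assumptions):

```python
import pandas as pd

# Load raw data from a hypothetical collated source file
df = pd.read_csv("sales.csv")

# Examine structure and content before planning any transformations
print(df.head())      # first few rows
df.info()             # column names, dtypes, non-null counts
print(df.describe())  # summary statistics for numeric columns
```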
Transformation Phase
Forms the core of the data wrangling process. It involves the tasks you undertake to transform the data, such as structuring, normalizing, denormalizing, cleaning, and enriching it.
(Part of Transformation Phase) Structuring Data
This task includes actions that change the form and schema of your data, such as Joins and Unions.
Joins
Combine columns. Columns from the first source table are combined with columns from the second source table, so that each resulting row holds columns from both tables.
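A minimal pandas sketch of a join, using two hypothetical tables that share a key column:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ana", "Ben"]})
orders = pd.DataFrame({"cust_id": [1, 2], "total": [250.0, 99.5]})

# Join: match rows on the shared key so each resulting row
# holds columns from both source tables
joined = customers.merge(orders, on="cust_id")
print(joined)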
Unions
Combine rows. Rows of data from the first source table are combined with rows of data from the second source table to form a single table.
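And a matching sketch of a union, stacking rows from two hypothetical tables that have the same columns:

```python
import pandas as pd

jan = pd.DataFrame({"cust_id": [1, 2], "total": [250.0, 99.5]})
feb = pd.DataFrame({"cust_id": [3], "total": [42.0]})

# Union: append rows from both tables into a single table
combined = pd.concat([jan, feb], ignore_index=True)
print(combined)
```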
(Part of Transformation Phase) Normalizing Data
Includes cleaning the database of unused data and reducing redundancy and inconsistency.
(Part of Transformation Phase) Denormalizing Data
Is the process of combining data from multiple tables into a single table so it can be queried faster for reports and analysis.
(Part of Transformation Phase) Cleaning Data
Are actions that fix irregularities in data in order to produce a credible and accurate analysis.
(Part of Transformation Phase) Enriching Data
Is the adding of data points to make your analysis more meaningful.
Validation Phase
Validation rules refer to programming steps used to verify the consistency, quality, and security of the data after it has been structured, normalized, denormalized, cleaned, and enriched.
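For illustration, validation rules can be scripted as small checks; the specific rules below (unique key, non-null values, non-negative totals) are assumptions for a hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({"cust_id": [1, 2, 3], "total": [250.0, 99.5, 42.0]})

# Consistency: the key column must be unique and non-null
assert df["cust_id"].is_unique
assert df["cust_id"].notna().all()

# Quality: totals must be numeric and non-negative
assert pd.api.types.is_numeric_dtype(df["total"])
assert (df["total"] >= 0).all()

print("All validation rules passed")
```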
Publishing Phase
Is the delivery of the transformed and validated dataset, along with metadata about the dataset, for downstream project needs.
Some popular Data Wrangling software and tools:
- Excel Spreadsheets.
- OpenRefine.
- Google DataPrep.
- Watson Studio Refinery.
- Trifacta Wrangler.
- Python.
- R.
Excel Spreadsheets
Microsoft Excel and Google Sheets have features and built-in formulas that can help identify issues, clean, and transform data. They allow you to import data from several different sources and to clean and transform it as needed.
OpenRefine
It is an open-source tool that lets you clean, transform, import, and export data in a variety of formats, such as TSV, CSV, XLS, XML, and JSON. It offers menu-based operations, so you do not need to memorize commands or syntax.
Google DataPrep
Is an intelligent cloud data service that lets you visually explore, clean, and prepare structured and unstructured data for analysis. The system suggests ideal next steps and automatically detects schemas, data types, and anomalies.
Watson Studio Refinery
This software is able to detect data types, automatically enforce data governance policies, cleanse data, and transform large amounts of raw data into consumable information ready for analysis.
Trifacta Wrangler
It is a cloud-based service that cleans messy real-world data, arranges it into data tables, transforms it, and allows multiple team members to work simultaneously.
Python
Offers a huge set of libraries and packages with powerful data manipulation capabilities, such as:
- Jupyter.
- NumPy.
- Pandas.
R
Offers a series of libraries and packages explicitly created for wrangling messy data. Using these libraries, you can investigate, manipulate, and analyze data. Some of these libraries are:
- dplyr.
- data.table.
- jsonlite.
Cleaning workflow includes
- Inspection.
- Cleaning.
- Verification.
(Cleaning workflow) Inspection
It is the process of detecting the different types of errors a dataset may contain, figuring out the structure and content of the data, visualizing the data, and using scripts to validate it.
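A small pandas sketch of inspection on a hypothetical DataFrame, counting common error types before deciding how to clean them:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Ben"],
    "total": [250.0, 99.5, 42.0, 99.5],
})

# Detect issues before cleaning: missing values, duplicates, types
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # count of fully duplicated rows
print(df.dtypes)              # stored data type of each column
```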
(Cleaning workflow) Cleaning
The techniques used for cleaning datasets depend on the use case and the types of issues encountered; some common data issues are listed below, followed by a minimal pandas sketch that addresses the first two.
- Missing Values.
- Duplicate Data.
- Data type conversion.
- Irrelevant Data.
- Standardizing Data.
- Syntax Errors.
- Outliers.
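The sketch below handles missing values and duplicate data (the fill strategy and column names are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Ben"],
    "total": [250.0, None, 42.0, 250.0],
})

# Missing values: fill numeric gaps with the column median
# (one of several possible strategies), drop rows missing a name
df["total"] = df["total"].fillna(df["total"].median())
df = df.dropna(subset=["name"])

# Duplicate data: keep only the first occurrence of identical rows
df = df.drop_duplicates()
print(df)
```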
Data type conversion
Needed to ensure that values in a field are stored as the data type of that field.
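For example, numbers read in as strings can be converted in pandas (the column name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"total": ["250.0", "99.5", "n/a"]})

# Convert strings to numbers; unparseable entries become NaN
df["total"] = pd.to_numeric(df["total"], errors="coerce")
print(df.dtypes)
```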
Standardizing Data
Needed to ensure date-time formats are standard across the dataset.
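A sketch of standardizing mixed date formats (the sample values are made up; format="mixed" assumes pandas 2.0 or later):

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2023-01-05", "01/06/2023", "Jan 7, 2023"]})

# Parse the mixed formats, then render one standard format
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")
df["order_date"] = df["order_date"].dt.strftime("%Y-%m-%d")
print(df)
```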
Syntax errors
Include issues such as extra white spaces, string padding, and typos that need to be fixed so values follow a consistent format.
Outliers
Are values that are vastly different from other observations in the dataset. They need to be examined for accuracy because they may be correct or incorrect.
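One common (though not the only) way to flag outliers is the interquartile range rule, sketched here on made-up values:

```python
import pandas as pd

totals = pd.Series([250.0, 99.5, 42.0, 105.0, 9800.0])

# IQR rule: flag values far outside the middle 50% of the data
q1, q3 = totals.quantile(0.25), totals.quantile(0.75)
iqr = q3 - q1
outliers = totals[(totals < q1 - 1.5 * iqr) | (totals > q3 + 1.5 * iqr)]

# Flagged values may be correct or incorrect, so they still
# need a manual accuracy check
print(outliers)
```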
(Cleaning workflow) Verification
In this step we inspect the results to establish the effectiveness and accuracy achieved by data cleaning. We re-inspect the data to make sure the rules and constraints that apply to it still hold after the corrections made.
Documentation
It is the process of writing down all changes, and the reasons behind them, in a document for future reference in similar cases.