Data Wrangling Flashcards
What is Data Wrangling?
It is the process of exploring, transforming, and validating raw data to prepare it for a clearly defined purpose. Raw data at this stage is data that has been collated from various data sources into a data repository.
4 Step Process for Data Wrangling
- Discovery.
- Transformation.
- Validation.
- Publishing.
Discovery Phase
Is about examining and understanding your data based on your use case, and creating a plan for cleaning, structuring, organizing, and mapping it.
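As a minimal sketch of discovery in pandas (the file name sales.csv and its columns are hypothetical assumptions):

```python
import pandas as pd

# Load raw data from a hypothetical collated source file
df = pd.read_csv("sales.csv")

# Examine structure and content before planning any transformations
print(df.head())      # first few rows
df.info()             # column names, dtypes, non-null counts
print(df.describe())  # summary statistics for numeric columns
```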
Transformation Phase
Forms the core of the data wrangling process. It involves the tasks you undertake to transform the data, such as structuring, normalizing, denormalizing, cleaning, and enriching it.
(Part of Transformation Phase) Structuring Data
This task includes actions that change the form and schema of your data, such as Joins and Unions.
Joins
Combine columns. Columns from the first source table are combined with columns from the second source table, so that each resulting row holds columns from both tables.
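A minimal pandas sketch of a join, using two hypothetical tables that share a key column:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ana", "Ben"]})
orders = pd.DataFrame({"cust_id": [1, 2], "total": [250.0, 99.5]})

# Join: match rows on the shared key so each resulting row
# holds columns from both source tables
joined = customers.merge(orders, on="cust_id")
print(joined)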
Unions
Combine rows. Rows of data from the first source table are combined with rows of data from the second source table to form a single table.
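And a matching sketch of a union, stacking rows from two hypothetical tables that have the same columns:

```python
import pandas as pd

jan = pd.DataFrame({"cust_id": [1, 2], "total": [250.0, 99.5]})
feb = pd.DataFrame({"cust_id": [3], "total": [42.0]})

# Union: append rows from both tables into a single table
combined = pd.concat([jan, feb], ignore_index=True)
print(combined)
```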
(Part of Transformation Phase) Normalizing Data
Includes cleaning the database of unused data and reducing redundancy and inconsistency.
(Part of Transformation Phase) Denormalizing Data
Is the process of combining data from multiple tables into a single table so it can be queried faster for reports and analysis.
(Part of Transformation Phase) Cleaning Data
Are actions that fix irregularities in data in order to produce a credible and accurate analysis.
(Part of Transformation Phase) Enriching Data
Is the adding of data points to make your analysis more meaningful.
Validation Phase
Validation rules refer to programming steps used to verify the consistency, quality, and security of the data after it has been structured, normalized, denormalized, cleaned, and enriched.
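For illustration, validation rules can be scripted as small checks; the specific rules below (unique key, non-null values, non-negative totals) are assumptions for a hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({"cust_id": [1, 2, 3], "total": [250.0, 99.5, 42.0]})

# Consistency: the key column must be unique and non-null
assert df["cust_id"].is_unique
assert df["cust_id"].notna().all()

# Quality: totals must be numeric and non-negative
assert pd.api.types.is_numeric_dtype(df["total"])
assert (df["total"] >= 0).all()

print("All validation rules passed")
```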
Publishing Phase
Is the delivery of the transformed and validated dataset, along with metadata about the dataset, for downstream project needs.
Some popular Data Wrangling software and tools:
- Excel Spreadsheets.
- OpenRefine.
- Google DataPrep.
- Watson Studio Refinery.
- Trifacta Wrangler.
- Python.
- R.
Excel Spreadsheets
Microsoft Excel and Google Sheets have features and built-in formulas that can help identify issues, clean, and transform data. They allow you to import data from several different sources and to clean and transform it as needed.
OpenRefine
It is an open-source tool that lets you clean, transform, import, and export data in a variety of formats, such as TSV, CSV, XLS, XML, and JSON. It offers menu-based operations, so you do not need to memorize commands or syntax.
Google DataPrep
Is an intelligent cloud data service that lets you visually explore, clean, and prepare structured and unstructured data for analysis. The system suggests ideal next steps and automatically detects schemas, data types, and anomalies.
Watson Studio Refinery
This software is able to detect data types, automatically enforce data governance policies, cleanse data, and transform large amounts of raw data into consumable information ready for analysis.
Trifacta Wrangler
It is a cloud-based service that cleans messy real-world data, arranges it into data tables, transforms it, and allows multiple team members to work simultaneously.
Python
Offers a huge set of libraries and packages with powerful data manipulation capabilities, such as:
- Jupyter.
- NumPy.
- Pandas.
R
Offers a series of libraries and packages explicitly created for wrangling messy data. Using these libraries, you can investigate, manipulate, and analyze data. Some of these libraries are:
- dplyr.
- data.table.
- jsonlite.
Cleaning workflow includes
- Inspection.
- Cleaning.
- Verification.
(Cleaning workflow) Inspection
It is the process of detecting the different types of errors a dataset may contain, figuring out the structure and content of the data, visualizing the data, and using scripts to validate it.
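A small pandas sketch of inspection on a hypothetical DataFrame, counting common error types before deciding how to clean them:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Ben"],
    "total": [250.0, 99.5, 42.0, 99.5],
})

# Detect issues before cleaning: missing values, duplicates, types
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # count of fully duplicated rows
print(df.dtypes)              # stored data type of each column
```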
(Cleaning workflow) Cleaning
The techniques used for cleaning datasets depend on the use case and the types of issues encountered; some common data issues are listed below, followed by a minimal pandas sketch that addresses the first two.
- Missing Values.
- Duplicate Data.
- Data type conversion.
- Irrelevant Data.
- Standardizing Data.
- Syntax Errors.
- Outliers.
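The sketch below handles missing values and duplicate data (the fill strategy and column names are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Ben"],
    "total": [250.0, None, 42.0, 250.0],
})

# Missing values: fill numeric gaps with the column median
# (one of several possible strategies), drop rows missing a name
df["total"] = df["total"].fillna(df["total"].median())
df = df.dropna(subset=["name"])

# Duplicate data: keep only the first occurrence of identical rows
df = df.drop_duplicates()
print(df)
```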
Data type conversion
Needed to ensure that values in a field are stored as the data type of that field.
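For example, numbers read in as strings can be converted in pandas (the column name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"total": ["250.0", "99.5", "n/a"]})

# Convert strings to numbers; unparseable entries become NaN
df["total"] = pd.to_numeric(df["total"], errors="coerce")
print(df.dtypes)
```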
Standardizing Data
Needed to ensure date-time formats are standard across the dataset.
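A sketch of standardizing mixed date formats (the sample values are made up; format="mixed" assumes pandas 2.0 or later):

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2023-01-05", "01/06/2023", "Jan 7, 2023"]})

# Parse the mixed formats, then render one standard format
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")
df["order_date"] = df["order_date"].dt.strftime("%Y-%m-%d")
print(df)
```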
Syntax errors
Include issues such as extra white spaces, string padding, and typos that need to be fixed so values follow a consistent format.
Outliers
Are values that are vastly different from other observations in the dataset. They need to be examined for accuracy because they may be correct or incorrect.
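One common (though not the only) way to flag outliers is the interquartile range rule, sketched here on made-up values:

```python
import pandas as pd

totals = pd.Series([250.0, 99.5, 42.0, 105.0, 9800.0])

# IQR rule: flag values far outside the middle 50% of the data
q1, q3 = totals.quantile(0.25), totals.quantile(0.75)
iqr = q3 - q1
outliers = totals[(totals < q1 - 1.5 * iqr) | (totals > q3 + 1.5 * iqr)]

# Flagged values may be correct or incorrect, so they still
# need a manual accuracy check
print(outliers)
```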
(Cleaning workflow) Verification
In this step we inspect the results to establish the effectiveness and accuracy achieved by data cleaning. We re-inspect the data to make sure the rules and constraints that apply to it still hold after the corrections made.
Documentation
It is the process of writing down all changes, and the reasons behind them, in a document for future reference in similar cases.