Data Wrangling Flashcards
What is Data Wrangling?
The process of exploring, transforming, and validating raw data to prepare it for a clearly defined purpose. Raw data at this stage is data that has been collated from various data sources into a data repository.
4-Step Process for Data Wrangling
- Discovery.
- Transformation.
- Validation.
- Publishing.
Discovery Phase
Is about examining and understanding your data based on your use case, and creating a plan for cleaning, structuring, organizing, and mapping your data.
Transformation Phase
Forms the bulk of the data wrangling process. It involves the tasks you undertake to transform the data, such as structuring, normalizing, denormalizing, cleaning, and enriching the data.
(Part of Transformation Phase) Structuring Data
This task includes actions that change the form and schema of your data, for example through joins and unions.
Joins
Combine columns. Columns from the first source table are combined with columns from the second source table in the same row, so each resulting row contains columns from both tables.
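A join can be sketched in plain Python as follows. This is a minimal illustration of an inner join, not a full implementation; the tables `customers` and `orders`, the key `customer_id`, and the helper `inner_join` are all hypothetical names chosen for the example.

```python
# Hypothetical sample tables, each a list of row dictionaries.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 2, "total": 12.5},
    {"customer_id": 1, "total": 7.0},
]


def inner_join(left, right, key):
    """Pair each left row with every right row that shares the same key value."""
    # Index the right table by key so each lookup is a dictionary access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            # Merge the columns of both tables into one wider row.
            joined.append({**left_row, **right_row})
    return joined


joined_rows = inner_join(customers, orders, "customer_id")
```

Each row in `joined_rows` carries `name` from the first table and `total` from the second, which is the "combine columns" behavior described on this card.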
Unions
Combine rows. Rows of data from the first source table are combined with rows of data from the second source table, producing a single table in which each row comes from one source or the other.
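A union can be sketched the same way. The example below assumes both tables share the same columns; `q1_sales`, `q2_sales`, and `union_all` are hypothetical names used only for illustration.

```python
# Hypothetical sample tables with identical columns.
q1_sales = [
    {"region": "North", "revenue": 100},
    {"region": "South", "revenue": 80},
]
q2_sales = [
    {"region": "North", "revenue": 120},
    {"region": "South", "revenue": 95},
]


def union_all(first, second):
    """Stack the rows of two tables that share the same columns."""
    # Compare the column names of the first row of each table as a basic check.
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("tables must have the same columns to be unioned")
    return first + second


all_sales = union_all(q1_sales, q2_sales)
```

The result has the same columns as the inputs but more rows, in contrast to a join, which produces wider rows.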
(Part of Transformation Phase) Normalizing Data
Includes removing unused data, reducing redundancy, and reducing inconsistency.
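One way to picture reducing redundancy is to split a flat table, in which customer details repeat on every order, into two smaller tables. This is only a sketch under assumed column names (`order_id`, `customer_id`, `customer_city`, `total`), all of which are hypothetical.

```python
# Hypothetical flat table: the customer's city is repeated on every order.
flat_orders = [
    {"order_id": 1, "customer_id": 10, "customer_city": "Lagos", "total": 30.0},
    {"order_id": 2, "customer_id": 10, "customer_city": "Lagos", "total": 7.0},
    {"order_id": 3, "customer_id": 11, "customer_city": "Oslo", "total": 12.5},
]

# Store each customer's details once, keyed by customer_id.
customers = {}
for row in flat_orders:
    customers[row["customer_id"]] = {"customer_city": row["customer_city"]}

# Keep only order-specific columns plus the key that links back to customers.
orders = [
    {"order_id": row["order_id"], "customer_id": row["customer_id"], "total": row["total"]}
    for row in flat_orders
]
```

After the split, a customer's city is stored in exactly one place, so it cannot become inconsistent across orders.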
(Part of Transformation Phase) Denormalizing Data
Is the process of combining data from multiple tables into a single table for faster querying of data for reports and analysis.
(Part of Transformation Phase) Cleaning Data
Are actions that fix irregularities in data in order to produce a credible and accurate analysis.
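A few typical cleaning actions can be sketched as below: trimming whitespace, standardizing case, converting types, marking missing values, and dropping exact duplicates. The sample rows and the `clean` helper are hypothetical, and real cleaning rules depend on the dataset.

```python
# Hypothetical raw rows with common irregularities.
raw_rows = [
    {"name": "  Ada ", "country": "uk", "age": "36"},
    {"name": "Grace", "country": "US", "age": ""},
    {"name": "  Ada ", "country": "uk", "age": "36"},  # exact duplicate
]


def clean(rows):
    """Trim text, standardize case, convert types, and drop duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        fixed = {
            "name": row["name"].strip(),
            "country": row["country"].strip().upper(),
            # An empty age becomes None rather than an invalid number.
            "age": int(row["age"]) if row["age"].strip() else None,
        }
        fingerprint = tuple(fixed.items())
        if fingerprint in seen:
            continue  # skip rows already seen after cleaning
        seen.add(fingerprint)
        cleaned.append(fixed)
    return cleaned


clean_rows = clean(raw_rows)
```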
(Part of Transformation Phase) Enriching Data
Is the addition of data points to make your analysis more meaningful.
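Enrichment can be as simple as looking up an extra attribute from a second source. In this sketch, the `orders` rows and the `country_names` lookup are hypothetical sample data.

```python
# Hypothetical dataset and lookup table.
orders = [
    {"order_id": 1, "country": "NG", "total": 30.0},
    {"order_id": 2, "country": "NO", "total": 12.5},
]
country_names = {"NG": "Nigeria", "NO": "Norway"}

# Add a new data point (country_name) to every row without changing the originals.
enriched = [
    {**row, "country_name": country_names.get(row["country"], "Unknown")}
    for row in orders
]
```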
Validation Phase
Validation rules are programming steps used to verify the consistency, quality, and security of the data after it has been structured, normalized, denormalized, cleaned, and enriched.
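Validation rules can be expressed as small checks that run over every row. The two rules below (unique `order_id`, non-negative numeric `total`) are hypothetical examples; the actual rules depend on the use case.

```python
def validate(rows):
    """Return a list of rule violations; an empty list means the data passed."""
    errors = []
    seen_ids = set()
    for position, row in enumerate(rows):
        order_id = row.get("order_id")
        # Rule 1: order_id must be unique across the dataset.
        if order_id in seen_ids:
            errors.append(f"row {position}: duplicate order_id {order_id}")
        seen_ids.add(order_id)
        # Rule 2: total must be a non-negative number.
        total = row.get("total")
        if not isinstance(total, (int, float)) or total < 0:
            errors.append(f"row {position}: total must be a non-negative number")
    return errors


# Hypothetical sample data that breaks both rules on the second row.
rows = [
    {"order_id": 1, "total": 30.0},
    {"order_id": 1, "total": -5},
]
problems = validate(rows)
```

Returning the violations instead of raising on the first one makes it possible to report every problem in a single pass.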
Publishing Phase
Involves delivering the output of data wrangling for downstream project needs: the transformed and validated version of the dataset, along with the metadata about the dataset.
Some popular Data Wrangling software and tools:
- Excel Spreadsheets.
- OpenRefine.
- Google DataPrep.
- Watson Studio Refinery.
- Trifacta Wrangler.
- Python.
- R.
Excel Spreadsheets
Spreadsheets such as Microsoft Excel and Google Sheets have features and built-in formulas that can help you identify issues in, clean, and transform data. They allow you to import data from several different sources, then clean and transform it as needed.