Module 2 Dirty to Clean Data Flashcards
Data Mapping
The process of matching fields from one data source to another.
Data mapping lets you trace how data changes as it moves from one database to another. It is integral to the success of data migration and integration.
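A minimal Python sketch of field mapping, for illustration only. The field names (cust_name, zip, and so on) and the helper map_record are hypothetical, not taken from any particular system.

```python
# Hypothetical mapping from source field names to target field names.
FIELD_MAP = {"cust_name": "customer_name", "zip": "postal_code"}

def map_record(record, field_map):
    """Rename the keys of a source record to the target field names.

    Fields without an entry in field_map keep their original name.
    """
    return {field_map.get(key, key): value for key, value in record.items()}

source_row = {"cust_name": "Ada", "zip": "90210", "email": "ada@example.com"}
target_row = map_record(source_row, FIELD_MAP)
print(target_row)
```

Keeping the mapping in one dictionary makes it easy to review which source field feeds which target field before a migration runs.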
Compatibility
Describes how well two or more datasets are able to work together.
Schema
A way of describing how something is organised.
Primary key
References a column in which each value is unique.
Foreign Key
A field within a table that is a primary key in another table.
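A small Python sketch of how the two kinds of key relate, assuming two hypothetical tables (customers and orders) held as lists of dictionaries. It checks that a primary key column is unique and finds rows whose foreign key has no match.

```python
# Hypothetical tables: customer_id is the primary key of customers
# and a foreign key in orders.
customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Grace"}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 3}]

def is_unique(rows, column):
    """Return True if every value in the column appears exactly once."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

def orphaned_rows(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key value is missing from the parent table."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]

print(is_unique(customers, "customer_id"))
print(orphaned_rows(orders, "customer_id", customers, "customer_id"))
```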
CONCATENATE
A function that joins together two or more text strings.
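For comparison, a rough Python equivalent of the idea. The helper name concatenate is made up for this sketch and is not a built-in.

```python
def concatenate(*parts):
    """Join any number of values into one text string, in order."""
    return "".join(str(part) for part in parts)

full_name = concatenate("Ada", " ", "Lovelace")
print(full_name)
```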
.csv files
.CSV files are plain text files with an organised table structure that includes rows and columns. The values in each row are separated by commas. This table structure makes them easy to understand, edit, manipulate, and use for data analysis.
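A short sketch of reading comma-separated text with Python's standard csv module. The sample data is invented, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Invented sample content; the first line is the header row.
csv_text = "name,city\nAda,London\nGrace,New York\n"

# DictReader uses the header row as keys for each following row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row["name"], row["city"])
```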
COUNTIF
A function that returns the number of cells in a range that match a specified condition.
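A rough Python analogue, assuming the condition is supplied as a function rather than as a spreadsheet criterion string. The name countif and the price list are hypothetical.

```python
def countif(cells, condition):
    """Count the cells for which condition(cell) is true."""
    return sum(1 for cell in cells if condition(cell))

prices = [12, 0, 7, 0, 30]
# Count suspicious zero prices, a common data cleaning check.
zero_count = countif(prices, lambda value: value == 0)
print(zero_count)
```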
SPLIT
The SPLIT function divides text around a specified character or string, and puts each fragment of text into a separate cell in the row.
=SPLIT(F2, "-")
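The same idea in Python, for a single-character delimiter such as "-". The date value is an invented example of what cell F2 might hold.

```python
cell_value = "2023-07-15"  # invented example value

# str.split returns the fragments as a list, one item per piece of text.
fragments = cell_value.split("-")
print(fragments)
```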
Your cleaning checklist
- Determine size of the dataset
- Determine the number of categories or labels
- Identify missing data
- Identify unformatted data
- Explore the different data types
Data cleaning - Determine size of the dataset
Large datasets may have more data quality issues and take longer to process. This may impact your choice of data cleaning techniques and how much time to allocate to the project.
Determine number of categories or labels
By understanding the number and nature of categories and labels in a dataset, you can better understand the diversity of the dataset. This understanding also helps inform data merging and migration strategies.
Identify missing data
Recognizing missing data helps you understand data quality so you can take appropriate steps to remediate the problem. Data integrity is important for accurate and unbiased analysis.
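One way to count missing values per column, sketched in Python. Which markers count as missing ("", "n/a", "null") is an assumption here and depends on the dataset; the sample rows are invented.

```python
# Assumed set of text markers that mean "no value" in this dataset.
MISSING_MARKERS = {"", "n/a", "null"}

def is_missing(value):
    """Treat None and the assumed marker strings as missing."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS

def missing_counts(rows):
    """Return a dictionary of column name to number of missing values."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if is_missing(value) else 0)
    return counts

rows = [
    {"name": "Ada", "city": "London"},
    {"name": "", "city": "Paris"},
    {"name": "Grace", "city": None},
]
print(missing_counts(rows))
```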
Identify unformatted data
Identifying improperly or inconsistently formatted data helps analysts ensure data uniformity. This is essential for accurate analysis and visualization.
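A small Python sketch that flags values not matching an expected format. The expected format (dates written as YYYY-MM-DD) and the sample values are assumptions for illustration.

```python
import re

# Assumed target format: four-digit year, two-digit month, two-digit day.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def badly_formatted(values, pattern):
    """Return the values that do not match the pattern in full."""
    return [value for value in values if not pattern.fullmatch(value)]

dates = ["2023-07-15", "15/07/2023", "2023-7-1", "2023-12-01"]
flagged = badly_formatted(dates, ISO_DATE)
print(flagged)
```

The check only tests the shape of the text, so a value such as 2023-99-99 would still pass; validating the date itself is a separate step.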
Explore the different data types
Understanding the types of data in your dataset (for instance, numerical, categorical, text) helps you select appropriate cleaning methods and apply relevant data analysis techniques.
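A quick profiling sketch in Python covering three checklist items: dataset size, category counts, and a coarse guess at data types. The sample rows and the helper infer_type are hypothetical, and the type check is deliberately simple.

```python
from collections import Counter

# Invented sample rows, as they might look after reading a .csv file.
rows = [
    {"product": "pen", "price": "1.50", "color": "blue"},
    {"product": "book", "price": "12", "color": "red"},
    {"product": "pen", "price": "n/a", "color": "blue"},
]

def infer_type(value):
    """Classify a raw text value as 'numerical' or 'text' (a coarse guess)."""
    try:
        float(value)
        return "numerical"
    except ValueError:
        return "text"

row_count = len(rows)                                  # size: number of rows
column_count = len(rows[0])                            # size: number of columns
color_counts = Counter(row["color"] for row in rows)   # categories and their frequency
price_types = Counter(infer_type(row["price"]) for row in rows)  # mixed types signal dirty data

print(row_count, column_count)
print(color_counts)
print(price_types)
```

A column that reports more than one inferred type, like price here, is a good candidate for closer inspection.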