Data Cleaning Flashcards
What is data cleaning and why is it important?
- Data cleaning is the process of identifying and then deleting or replacing inconsistent, incorrect or incomplete information in a dataset. It ensures high-quality data and minimises the risk of drawing wrong or inaccurate conclusions.
‘Garbage in - garbage out’
What are the stages of data cleaning?
- Importing data
- Merging datasets
- Rebuilding missing data
- Standardization
- Normalization
- Deduplication
- Verification and enrichment
- Exporting data
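A minimal pandas sketch of this pipeline; the file names (customers.csv, orders.csv) and column names (customer_id, country) are illustrative assumptions only:

import pandas as pd

# Importing data (hypothetical file names)
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Merging datasets on an assumed shared key
data = customers.merge(orders, on="customer_id", how="left")

# Rebuilding missing data, e.g. filling gaps with a default value
data["country"] = data["country"].fillna("unknown")

# Standardisation / normalisation of a text column
data["country"] = data["country"].str.strip().str.lower()

# Deduplication
data = data.drop_duplicates()

# Verification: check remaining nulls before exporting
print(data.isnull().sum())

# Exporting data
data.to_csv("cleaned.csv", index=False)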
What are the key properties of data cleaning?
- Accuracy
- Completeness
- Uniformity
- Consistency
- Relevance
- Timeliness
- Validity
How can we deal with missing data?
- Add in a default value, e.g. an empty string or the mean value of the column (.fillna())
- Get rid of all the rows with missing data (.dropna())
- Get rid of rows with all missing data (.dropna(how='all'))
- Keep only rows with at least a given number of non-null values (df.dropna(thresh=10))
- Apply the same to columns using the parameter axis=1 (see the sketch below)
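A short sketch of these options on a toy DataFrame (note that fillna/dropna return new objects, so in practice you reassign the result or pass inplace=True):

import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 12.0], "qty": [1, None, None]})

df.fillna(0)                            # replace every NaN with a default value
df["price"].fillna(df["price"].mean())  # fill one column with its mean
df.dropna()                             # drop rows containing any NaN
df.dropna(how="all")                    # drop rows where every value is NaN
df.dropna(thresh=2)                     # keep rows with at least 2 non-null values
df.dropna(axis=1)                       # apply the same logic to columns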
How do we normalise data?
- Ensure data is stored in the correct type (int, string, etc.)
- Correct casing, get rid of whitespace, rename columns (see the sketch below)
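A small sketch of these steps, assuming a toy DataFrame with hypothetical ' Name ' and 'Age' columns:

import pandas as pd

df = pd.DataFrame({" Name ": [" Alice ", "BOB"], "Age": ["30", "25"]})

df.columns = df.columns.str.strip().str.lower()  # tidy/rename column labels
df["age"] = df["age"].astype(int)                # store data in the correct type
df["name"] = df["name"].str.strip().str.lower()  # fix casing and whitespace
print(df)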
Techniques to identify missing and irregular data?
- Visualise via heatmap
- Make a list of missing data % for each feature
- Create a missing data histogram
- Identify outliers using a histogram and box plot or bar chart
- Create a list of duplicate features
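A sketch of some of these checks on a toy DataFrame; seaborn/matplotlib are one common choice for the plots, not a requirement:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"price": [10, 12, None, 500, 12],
                   "city": ["A", "B", None, "A", "B"]})

# Missing data % for each feature
print((df.isnull().mean() * 100).sort_values(ascending=False))

# Heatmap of missing values
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# Box plot to spot outliers in a numeric column
df.boxplot(column="price")
plt.show()

# List duplicate rows
print(df[df.duplicated()])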
Fill empty values with mean value
ave_price = df["price"].mean()               # mean of the price column
print(ave_price)
df["price"] = df["price"].fillna(ave_price)  # fill only that column and assign the result back
Check how many values are null in the whole dataframe?
df.isnull().sum()        # null count per column
df.isnull().sum().sum()  # total null count for the whole DataFrame
Is it best to fill with mean, median or mode?
- Mean: preferred if the data is numeric and not skewed.
- Median: preferred if the data is numeric and skewed.
- Mode: preferred if the data is a string (object) or categorical numeric.
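A quick sketch with hypothetical price and colour columns (note that .mode() returns a Series, hence the [0]):

import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 12.0, 200.0],
                   "colour": ["red", None, "red", "blue"]})

df["price"] = df["price"].fillna(df["price"].mean())        # numeric, not skewed
# df["price"] = df["price"].fillna(df["price"].median())    # numeric, skewed
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])  # string (object)
print(df)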
Datetime object in Pandas
data["date"] = pd.to_datetime(data["date"])
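Once converted, the .dt accessor gives access to the date parts; a small sketch with an illustrative date column:

import pandas as pd

data = pd.DataFrame({"date": ["2021-01-05", "2021-02-10"]})
data["date"] = pd.to_datetime(data["date"])
print(data["date"].dt.year)        # extract the year
print(data["date"].dt.month)       # extract the month
print(data["date"].dt.day_name())  # day of the week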