Data Cleaning Flashcards
What is data cleaning and why is it important?
- Data cleaning is the process of identifying and then deleting or replacing inconsistent, incorrect or incomplete information in a dataset. It ensures high-quality data and minimises the risk of drawing wrong or inaccurate conclusions.
‘Garbage in - garbage out’
What are the stages of data cleaning?
- Importing data
- Merging datasets
- Rebuilding missing data
- Standardization
- Normalization
- Deduplication
- Verification and enrichment
- Exporting data
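A minimal pandas sketch of this pipeline; the file names (customers.csv, orders.csv) and column names (customer_id, country) are illustrative assumptions only:

import pandas as pd

# Importing data (hypothetical file names)
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Merging datasets on an assumed shared key
data = customers.merge(orders, on="customer_id", how="left")

# Rebuilding missing data, e.g. filling gaps with a default value
data["country"] = data["country"].fillna("unknown")

# Standardisation / normalisation of a text column
data["country"] = data["country"].str.strip().str.lower()

# Deduplication
data = data.drop_duplicates()

# Verification: check remaining nulls before exporting
print(data.isnull().sum())

# Exporting data
data.to_csv("cleaned.csv", index=False)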
What are the key properties of data cleaning?
- Accuracy
- Completeness
- Uniformity
- Consistency
- Relevance
- Timeliness
- Validity
How can we deal with missing data?
- Add in a default value, e.g. an empty string or the mean value of the column (.fillna())
- Get rid of all the rows with missing data (.dropna())
- Get rid of rows with all missing data (.dropna(how='all'))
- Keep only rows with at least a given number of non-null values (df.dropna(thresh=10))
- Apply the same to columns using the parameter axis=1 (see the sketch below)
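A short sketch of these options on a toy DataFrame (note that fillna/dropna return new objects, so in practice you reassign the result or pass inplace=True):

import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 12.0], "qty": [1, None, None]})

df.fillna(0)                            # replace every NaN with a default value
df["price"].fillna(df["price"].mean())  # fill one column with its mean
df.dropna()                             # drop rows containing any NaN
df.dropna(how="all")                    # drop rows where every value is NaN
df.dropna(thresh=2)                     # keep rows with at least 2 non-null values
df.dropna(axis=1)                       # apply the same logic to columns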
How do we normalise data?
- Ensure data is stored in the correct type (int, string, etc.)
- Correct casing, get rid of whitespace, rename columns (see the sketch below)
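A small sketch of these steps, assuming a toy DataFrame with hypothetical ' Name ' and 'Age' columns:

import pandas as pd

df = pd.DataFrame({" Name ": [" Alice ", "BOB"], "Age": ["30", "25"]})

df.columns = df.columns.str.strip().str.lower()  # tidy/rename column labels
df["age"] = df["age"].astype(int)                # store data in the correct type
df["name"] = df["name"].str.strip().str.lower()  # fix casing and whitespace
print(df)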
Techniques to identify missing and irregular data?
- Visualise via heatmap
- Make a list of missing data % for each feature
- Create a missing data histogram
- Identify outliers using a histogram and box plot or bar chart
- Create a list of duplicate features
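A sketch of some of these checks on a toy DataFrame; seaborn/matplotlib are one common choice for the plots, not a requirement:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"price": [10, 12, None, 500, 12],
                   "city": ["A", "B", None, "A", "B"]})

# Missing data % for each feature
print((df.isnull().mean() * 100).sort_values(ascending=False))

# Heatmap of missing values
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# Box plot to spot outliers in a numeric column
df.boxplot(column="price")
plt.show()

# List duplicate rows
print(df[df.duplicated()])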
Fill empty values with mean value
ave_price = df["price"].mean()               # mean of the price column
print(ave_price)
df["price"] = df["price"].fillna(ave_price)  # fill only that column and assign the result back
Check how many values are null in the whole dataframe?
df.isnull().sum()        # null count per column
df.isnull().sum().sum()  # total null count for the whole DataFrame
Is it best to fill with mean, median or mode?
- Mean: preferred if the data is numeric and not skewed.
- Median: preferred if the data is numeric and skewed.
- Mode: preferred if the data is a string (object) or categorical numeric.
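A quick sketch with hypothetical price and colour columns (note that .mode() returns a Series, hence the [0]):

import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 12.0, 200.0],
                   "colour": ["red", None, "red", "blue"]})

df["price"] = df["price"].fillna(df["price"].mean())        # numeric, not skewed
# df["price"] = df["price"].fillna(df["price"].median())    # numeric, skewed
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])  # string (object)
print(df)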
Datetime object in Pandas
data["date"] = pd.to_datetime(data["date"])
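Once converted, the .dt accessor gives access to the date parts; a small sketch with an illustrative date column:

import pandas as pd

data = pd.DataFrame({"date": ["2021-01-05", "2021-02-10"]})
data["date"] = pd.to_datetime(data["date"])
print(data["date"].dt.year)        # extract the year
print(data["date"].dt.month)       # extract the month
print(data["date"].dt.day_name())  # day of the week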