Data Cleaning Flashcards

1
Q

What is data cleaning and why is it important?

A
  • Data cleaning is the process of identifying and then deleting or replacing inconsistent and incorrect information in a dataset. This ensures high-quality data and minimises the risk of wrong or inaccurate conclusions.
    ‘Garbage in, garbage out’
2
Q

What are the stages of data cleaning?

A
  • Importing data
  • Merging datasets
  • Rebuilding missing data
  • Standardization
  • Normalization
  • Deduplication
  • Verification and enrichment
  • Exporting data
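A minimal pandas sketch of these stages, using small in-memory tables in place of real files. The table and column names (customers, orders, customer_id, city, amount) are made up for illustration, and filling with the median is only one possible way to rebuild missing data.

```python
import pandas as pd

# Importing: hypothetical in-memory tables stand in for pd.read_csv(...)
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": [" London", "paris ", "PARIS"],
})
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [10.0, None, None, 30.0],
})

# Merging datasets on the shared key
data = orders.merge(customers, on="customer_id", how="left")

# Rebuilding missing data: fill missing amounts with the column median
data["amount"] = data["amount"].fillna(data["amount"].median())

# Standardization / normalization: consistent whitespace and casing
data["city"] = data["city"].str.strip().str.title()

# Deduplication: drop rows that are now exact copies of each other
data = data.drop_duplicates().reset_index(drop=True)

# Verification: a simple sanity check before exporting
assert data["amount"].notna().all()

# Exporting: to_csv() with no path returns the CSV as a string
csv_text = data.to_csv(index=False)
```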
3
Q

What are the key properties of data quality that data cleaning aims for?

A
  • Accuracy
  • Completeness
  • Uniformity
  • Consistency
  • Relevance
  • Timeliness
  • Validity
4
Q

How can we deal with missing data?

A
  • Fill in a default value, e.g. an empty string or the mean value of the column (.fillna())
  • Drop every row that has any missing data (.dropna())
  • Drop only the rows where all values are missing (.dropna(how='all'))
  • Keep only the rows that have at least a minimum number of non-null values (data.dropna(thresh=10))
  • Apply the same to columns by passing the parameter axis=1
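The options above can be sketched on a small example. The table and its column names (name, price, qty) are hypothetical, and the thresholds are chosen only to fit this four-row table.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with gaps
data = pd.DataFrame({
    "name": ["Ann", "Bo", "Cy", None],
    "price": [1.0, np.nan, 3.0, np.nan],
    "qty": [5.0, np.nan, np.nan, np.nan],
})

# Default values per column: empty string for text, column mean for numbers
filled = data.fillna({"name": "", "price": data["price"].mean()})

complete_rows = data.dropna()                # keep rows with no missing values
not_empty = data.dropna(how="all")           # drop rows where every value is missing
at_least_two = data.dropna(thresh=2)         # keep rows with >= 2 non-null values
dense_columns = data.dropna(axis=1, thresh=3)  # same idea, applied to columns
```

None of these calls modifies data in place; each returns a new DataFrame, so the result has to be assigned to be kept.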
5
Q

How do we normalise data?

A

Ensure data is stored in the correct type (int, string, etc.)
Correct casing, remove stray whitespace, rename columns
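A short pandas sketch of these steps, assuming a hypothetical table whose column names and values are invented for illustration:

```python
import pandas as pd

# Hypothetical raw table: numbers stored as text, messy casing and whitespace
df = pd.DataFrame({
    " Product Name ": ["  apple", "BANANA  ", "Cherry"],
    "Qty": ["3", "10", "7"],
})

# Rename columns: strip whitespace, lower-case, underscores instead of spaces
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Store each column in the correct type
df["qty"] = df["qty"].astype(int)

# Correct casing and remove surrounding whitespace in the values
df["product_name"] = df["product_name"].str.strip().str.title()
```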

6
Q

Techniques to identify missing and irregular data?

A
  • Visualise via heatmap
  • Make a list of missing data % for each feature
  • Create a missing data histogram
  • Identify outliers using a histogram and box plot or bar chart
  • Create a list of duplicate features
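The non-graphical checks can be sketched in pandas as below. The data is hypothetical, the duplicate check here looks for repeated rows, and the 1.5 × IQR rule is used as a stand-in for reading outliers off a box plot (it is the convention most box plots use for their whiskers).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value, one duplicated row, one extreme price
df = pd.DataFrame({
    "price": [10.0, 12.0, 11.0, 12.0, 500.0, np.nan],
    "size": [1, 2, 3, 2, 4, 5],
})

# Percentage of missing values for each feature
missing_pct = df.isnull().mean() * 100

# Rows that repeat an earlier row exactly
duplicate_rows = df[df.duplicated()]

# Outliers in "price" by the 1.5 * IQR rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
```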
7
Q

Fill empty values with mean value

A

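ave_price = df["price"].mean()  # NaN values are skipped by default
print(ave_price)

# Fill only the price column and assign the result (fillna returns a new object)
df["price"] = df["price"].fillna(ave_price)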

8
Q

Check how many values are null in the whole dataframe?

A

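df.isnull().sum()        # number of nulls in each column
df.isnull().sum().sum()  # total number of nulls in the whole dataframe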

9
Q

Is it best to fill with mean, median or mode?

A

Mean: preferred if the data is numeric and not skewed.
Median: preferred if the data is numeric and skewed.
Mode: preferred if the data is a string (object); it can also be used for numeric data.
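A small sketch of the three choices, assuming hypothetical columns: height (roughly symmetric), income (skewed by one large value) and colour (strings).

```python
import numpy as np
import pandas as pd

# Hypothetical columns, each with one missing value
df = pd.DataFrame({
    "height": [160.0, 170.0, np.nan, 180.0],
    "income": [20.0, 25.0, 1000.0, np.nan],
    "colour": ["red", "blue", None, "red"],
})

df["height"] = df["height"].fillna(df["height"].mean())     # not skewed: mean
df["income"] = df["income"].fillna(df["income"].median())   # skewed: median
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])  # strings: mode
```

mode() returns a Series because several values can tie for most frequent; taking [0] picks the first of them.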

10
Q

Datetime object in Pandas

A

data["date"] = pd.to_datetime(data["date"])
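A runnable sketch of the conversion, assuming a hypothetical date column that contains one unparseable entry. Passing errors="coerce" is an optional choice made here, not part of the card above.

```python
import pandas as pd

# Hypothetical data: dates stored as text, with one bad entry
data = pd.DataFrame({"date": ["2023-01-15", "2023-02-20", "not a date"]})

# errors="coerce" turns unparseable entries into NaT instead of raising
data["date"] = pd.to_datetime(data["date"], errors="coerce")

# The .dt accessor exposes datetime parts once the column is converted
data["year"] = data["date"].dt.year
data["month_name"] = data["date"].dt.month_name()
```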
