Data Cleaning & Preprocessing Flashcards

This section covers standard data cleaning and preprocessing techniques.

1
Q

What is data cleaning?

A

The process of identifying and correcting errors in a dataset.

Ensures data quality and reliability.

2
Q

True or False:

Data cleaning is an optional step in data analysis.

A

False.

It is a crucial step before analysis.

3
Q

What does handling missing values mean?

A

Filling, removing, or imputing missing data.

Common methods: mean imputation, forward fill, etc.
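
A minimal sketch of the two methods named above, assuming a small hypothetical Series of readings (the name `readings` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical sensor readings with two gaps.
readings = pd.Series([10.0, None, 14.0, None, 18.0])

# Mean imputation: replace each gap with the mean of the observed values.
mean_filled = readings.fillna(readings.mean())

# Forward fill: carry the last observed value forward into each gap.
forward_filled = readings.ffill()
```

Mean imputation ignores the order of the data, while forward fill depends on it, so forward fill tends to suit ordered data such as time series.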

4
Q

How do you check for missing values in pandas?

A

df.isnull().sum()

Displays the count of missing values per column.
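
As a quick illustration on a hypothetical three-row DataFrame (column names and values are made up for the example):

```python
import pandas as pd

# Hypothetical data: one gap in "age", one in "city", none in "id".
df = pd.DataFrame({
    "age": [25, None, 31],
    "city": ["Oslo", "Lima", None],
    "id": [1, 2, 3],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```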

5
Q

What is the difference between .dropna() and .fillna()?

A

.dropna() removes rows (or columns) that contain missing values; .fillna() replaces the missing values with a value you specify.

Example: df.fillna(0).
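
A short side-by-side sketch, assuming a hypothetical DataFrame with two numeric columns:

```python
import pandas as pd

# Hypothetical data with one gap in each column.
df = pd.DataFrame({"score": [90.0, None, 75.0], "hours": [5.0, 3.0, None]})

# dropna() keeps only the rows that have no missing values at all.
dropped = df.dropna()

# fillna(0) keeps every row and writes 0 into each gap.
filled = df.fillna(0)
```

Both calls return a new DataFrame and leave `df` unchanged.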

6
Q

Fill in the blank:

The process of converting raw data into a clean, structured format suitable for analysis is called _______.

A

Data preprocessing.

Includes cleaning, transformation, and normalization.

7
Q

What does .duplicated() in pandas return?

A

A Boolean Series that is True for each row that repeats an earlier row (by default the first occurrence is marked False).

Used for duplicate detection.

8
Q

How do you remove duplicate rows in pandas?

A

df.drop_duplicates()

Helps eliminate redundant data.
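
A small sketch of detecting and then dropping duplicates, using a hypothetical table in which the third row repeats the first:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "user": ["ana", "ben", "ana"],
    "plan": ["pro", "free", "pro"],
})

# With the default keep="first", only later repeats are flagged True.
is_dup = df.duplicated()

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = df.drop_duplicates()
```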

9
Q

Which technique is used to standardize numerical values?

A

Normalization or standardization.

Ensures data has a common scale.

10
Q

What is the difference between normalization and standardization?

A

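Normalization (min-max scaling) rescales values to the range 0 to 1; standardization (z-score) rescales them to have mean 0 and standard deviation 1.

Normalization is often used when a bounded range is needed; standardization is often used when features should be centered and outliers should not compress the rest of the range.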
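
The two rescalings can be compared on the same data. This is a sketch on a hypothetical Series; the choice of `ddof=0` (population standard deviation) is an assumption, since pandas defaults to `ddof=1`:

```python
import pandas as pd

# Hypothetical feature values.
values = pd.Series([2.0, 4.0, 6.0, 8.0])

# Normalization (min-max): result lies between 0 and 1.
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): result has mean 0 and standard deviation 1.
# ddof=0 uses the population standard deviation (an assumption here).
standardized = (values - values.mean()) / values.std(ddof=0)
```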

11
Q

How do you remove whitespace from a string column in pandas?

A

df["column"] = df["column"].str.strip()

Removes leading and trailing spaces.
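
For instance, on a hypothetical "city" column with stray spaces:

```python
import pandas as pd

# Hypothetical column with leading and trailing spaces.
df = pd.DataFrame({"city": ["  Oslo", "Lima  ", " Cairo "]})

# .str.strip() trims whitespace at both ends of every string in the column.
df["city"] = df["city"].str.strip()
```

Spaces inside a string are left alone; only the ends are trimmed.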

12
Q

What function converts all text to lowercase in pandas?

A

.str.lower()

Example: df["column"] = df["column"].str.lower().

13
Q

True or False:

Encoding categorical variables is essential for machine learning models.

A

True.

Most machine learning algorithms require numerical input, so categories must be converted to numbers first.

14
Q

What does pd.get_dummies() do?

A

Converts categorical variables into dummy variables.

Example: pd.get_dummies(df["Gender"]).
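
A sketch of what the result looks like, assuming a hypothetical "Gender" column with two categories. Depending on the pandas version, the indicator columns hold booleans or 0/1 integers:

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"Gender": ["F", "M", "F"]})

# One indicator column is created per distinct category.
dummies = pd.get_dummies(df["Gender"])
print(dummies)
```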

15
Q

How do you detect outliers in a dataset?

A

Using boxplots, IQR, or Z-scores.

Outliers can distort statistical results.

16
Q

What is the interquartile range (IQR)?

A

The range between the 25th and 75th percentiles.

IQR = Q3 - Q1, used for outlier detection.
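
A minimal sketch of IQR-based outlier detection on hypothetical values. The 1.5 × IQR fences are a common convention, assumed here rather than taken from the card:

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed convention: flag points more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
```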

17
Q

How can you handle outliers in a dataset?

A
  • Capping
  • Transformation
  • Removal

Example: Winsorization limits extreme values.
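
Capping can be sketched with percentile limits. The 5th and 95th percentiles below are an assumed choice, and this is percentile capping in the spirit of winsorization rather than a reference implementation of it:

```python
import pandas as pd

# Hypothetical values with one extreme point.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Assumed limits: the 5th and 95th percentiles.
lower = values.quantile(0.05)
upper = values.quantile(0.95)

# clip() pulls anything outside the limits back to the nearest limit.
capped = values.clip(lower=lower, upper=upper)
```

Values inside the limits are unchanged, so the row count stays the same, unlike removal.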

18
Q

What is the purpose of feature scaling?

A

To bring all numerical features to a similar scale.

Helps improve model convergence in ML.

19
Q

How do you normalize data using Min-Max Scaling?

A

(X - min) / (max - min)

Scales values between 0 and 1.
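
The formula can be written as a small function. `min_max_scale` is a hypothetical helper, and returning zeros for a constant input is an assumed convention to avoid dividing by zero:

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range 0..1 using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant column carries no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([50, 60, 80, 100])
```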

20
Q

Which pandas function replaces values in a column?

A

.replace()

Example: df["col"].replace("old", "new").
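
A short sketch on a hypothetical "status" column. .replace() returns a new Series, so the result is assigned back to the column:

```python
import pandas as pd

# Hypothetical column that uses "N/A" as a placeholder.
df = pd.DataFrame({"status": ["N/A", "active", "N/A", "inactive"]})

# Without regex=True, only cells equal to "N/A" are replaced.
df["status"] = df["status"].replace("N/A", "unknown")
```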

21
Q

What is feature engineering?

A

Creating new features from existing data.

Helps improve model performance.

22
Q

True or False:

Parsing dates in a dataset improves time-based analysis.

A

True.

Convert dates using pd.to_datetime().

23
Q

How do you handle inconsistent data formats?

A

Standardizing formats using string operations or parsing functions.

Example: df["date"] = pd.to_datetime(df["date"]).
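
A sketch of date parsing on a hypothetical column that contains one unparseable entry. Passing errors="coerce" is an assumed choice that turns bad entries into NaT instead of raising an error:

```python
import pandas as pd

# Hypothetical date strings, one of them invalid.
df = pd.DataFrame({"date": ["2024-01-15", "2024-02-20", "not a date"]})

# Parse to datetime; entries that cannot be parsed become NaT.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Once parsed, date parts are available through the .dt accessor.
df["year"] = df["date"].dt.year
```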

24
Q

What does .apply() do in pandas?

A

Applies a function along the rows or columns of a DataFrame, or to each value of a Series.

Example: df["col"].apply(lambda x: x * 2).
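
Both uses can be sketched on a hypothetical two-column DataFrame:

```python
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({"price": [10, 20], "qty": [3, 1]})

# On a Series, the function receives one value at a time.
df["double_price"] = df["price"].apply(lambda x: x * 2)

# On a DataFrame with axis=1, the function receives one row at a time.
df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)
```

For simple arithmetic like this, the vectorized form (df["price"] * 2) is usually faster; .apply() is most useful when no vectorized equivalent exists.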

25
Q

How do you handle text data inconsistencies?

A
  • Lowercasing
  • Removing special characters
  • Stemming
  • Lemmatization

Ensures uniformity in text analysis.
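
The first two steps in the list can be sketched with the standard library. `clean_text` is a hypothetical helper, and keeping only letters, digits, and spaces is an assumed definition of "special characters"; stemming and lemmatization need an NLP library and are not shown:

```python
import re


def clean_text(text):
    """Lowercase the text, drop special characters, and collapse extra spaces."""
    text = text.lower()
    # Assumed rule: keep only a-z, 0-9, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


cleaned = clean_text("  Hello, World!!  It's 2024. ")
```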

26
Q

What is lemmatization in text preprocessing?

A

Reducing words to their dictionary base form (lemma).

Example: “running” → “run”.

27
Q

How do you handle imbalanced datasets?

A

Resampling techniques like oversampling or undersampling.

Helps in classification problems.
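
A minimal sketch of random oversampling with pandas alone, on a hypothetical dataset with 8 majority and 2 minority rows. The column names and the fixed random_state are assumptions for the example:

```python
import pandas as pd

# Hypothetical imbalanced data: 8 rows of class 0, 2 rows of class 1.
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Oversampling: draw minority rows with replacement until the classes match.
oversampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, oversampled], ignore_index=True)
```

Resampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.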

28
Q

Which pandas method allows replacing missing values using interpolation?

A

.interpolate()

Fills gaps using linear or polynomial methods.
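
For example, with the default linear method on a hypothetical temperature series:

```python
import pandas as pd

# Hypothetical readings with two consecutive gaps.
temps = pd.Series([20.0, None, None, 26.0])

# The default method is linear: gaps are filled at equal steps between neighbors.
filled = temps.interpolate()
```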

29
Q

How do you detect incorrect data entries?

A
  • Using domain knowledge
  • Range checks
  • Anomaly detection

Example: A negative age value is invalid.
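
A range check can be sketched as follows. The plausible age bounds of 0 to 120 are an assumption standing in for real domain knowledge:

```python
import pandas as pd

# Hypothetical ages, two of them implausible.
df = pd.DataFrame({"age": [34, -5, 27, 140]})

# between() is inclusive at both ends by default.
valid = df["age"].between(0, 120)

# Rows that fail the check can be inspected before deciding how to fix them.
invalid_rows = df[~valid]
```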

30
Q

What is the purpose of data augmentation?

A

Expanding a dataset by generating new variations.

Common in image processing and NLP.