Data Cleaning & Preprocessing Flashcards by Ikuro Njung'e

What is data cleaning?

The process of identifying and correcting errors in a dataset.

Ensures data quality and reliability.

True or False:

Data cleaning is an optional step in data analysis.

False.

It is a crucial step before analysis.

What does handling missing values mean?

Deciding what to do with missing data: removing the affected rows or columns, or filling (imputing) the gaps with estimated values.

Common methods: mean imputation, forward fill, etc.
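
As a minimal sketch of the two methods named above, the snippet below applies mean imputation and forward fill to a small made-up column. The `temperature` column and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the column name and values are illustrative only.
df = pd.DataFrame({"temperature": [20.0, np.nan, 24.0, np.nan]})

# Mean imputation: replace each missing value with the column mean (22.0 here).
mean_filled = df["temperature"].fillna(df["temperature"].mean())

# Forward fill: carry the last observed value forward into the gap.
forward_filled = df["temperature"].ffill()
```

Mean imputation keeps the column average unchanged, while forward fill usually only makes sense when the rows are ordered, for example in a time series.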

How do you check for missing values in pandas?

df.isnull().sum()

Returns the number of missing values in each column.
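
A short sketch of the check on a made-up DataFrame; the column names are assumptions. It also shows `.isnull().mean()`, which gives the share of missing values instead of the count.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in "age" and one in "name".
df = pd.DataFrame({
    "age": [25, np.nan, 31],
    "name": ["Ann", "Ben", None],
    "score": [88, 92, 79],
})

# Count of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Fraction of missing values per column (between 0 and 1).
missing_share = df.isnull().mean()
```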

What is the difference between .dropna() and .fillna()?

.dropna() removes rows (or columns) that contain missing values; .fillna() replaces the missing values and keeps the rows.

Example: df.fillna(0).
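
The contrast can be sketched on a small made-up DataFrame. The column names and the fill values (`0` and `"unknown"`) are assumptions chosen for illustration. Both methods return a new object by default and leave `df` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "category": ["a", "b", None],
})

# dropna: keep only the rows that have a value in "price".
dropped = df.dropna(subset=["price"])

# fillna: replace gaps, with a different fill value per column.
filled = df.fillna({"price": 0, "category": "unknown"})
```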

Fill in the blank:

The process of converting data into a structured format is called _______.

Data preprocessing.

Includes cleaning, transformation, and normalization.

What does .duplicated() in pandas return?

A Boolean Series that is True for each row repeating an earlier row (by default, the first occurrence is marked False).

Used for duplicate detection.

How do you remove duplicate rows in pandas?

df.drop_duplicates()

Helps eliminate redundant data.
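
A minimal sketch of duplicate detection and removal, using a made-up DataFrame in which the third row repeats the second:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 1.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": ["x", "y", "y", "z"],
})

# Boolean Series: True only for the repeated row.
is_duplicate = df.duplicated()

# New DataFrame with the repeated row removed; df itself is unchanged.
deduplicated = df.drop_duplicates()
```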

Which technique is used to standardize numerical values?

Normalization or Standardization.

Ensures data has a common scale.

What is the difference between normalization and standardization?

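Normalization (min-max scaling) rescales values to a fixed range, typically 0 to 1; standardization (z-score) rescales them to a mean of 0 and a standard deviation of 1.

Which one fits depends on the algorithm and the data; min-max scaling, for example, is sensitive to outliers.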
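
Both rescalings can be sketched with plain pandas arithmetic. The values are made up, and the use of the population standard deviation (`ddof=0`) is an assumption; pandas defaults to the sample version (`ddof=1`).

```python
import pandas as pd

# Hypothetical numeric feature.
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max): result lies between 0 and 1.
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): result has mean 0 and standard deviation 1.
standardized = (values - values.mean()) / values.std(ddof=0)
```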

How do you remove whitespace from a string column in pandas?

df["column"] = df["column"].str.strip()

Removes leading and trailing spaces.

What function converts all text to lowercase in pandas?

.str.lower()

Example: df["column"] = df["column"].str.lower().
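
The string methods can be chained. The sketch below trims whitespace and lowercases a made-up `city` column in one step; the column name and values are assumptions.

```python
import pandas as pd

# Hypothetical column with stray spaces and mixed case.
df = pd.DataFrame({"city": ["  Nairobi ", "MOMBASA", " kisumu"]})

# Strip leading/trailing whitespace, then convert to lowercase.
df["city"] = df["city"].str.strip().str.lower()
```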

True or False:

Encoding categorical variables is essential for machine learning models.

True.

Most machine learning algorithms require numerical input, so categorical values must be encoded before training.

What does pd.get_dummies() do?

Converts categorical variables into dummy (one-hot) indicator columns, one per category.

Example: pd.get_dummies(df["Gender"]).
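
A small sketch of the result on made-up data. Passing the whole DataFrame with `columns=[...]` is an alternative form that prefixes the new columns with the original column name.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"Gender": ["Female", "Male", "Female"]})

# One indicator column per category: "Female" and "Male".
dummies = pd.get_dummies(df["Gender"])

# Same encoding applied in place of the original column,
# producing "Gender_Female" and "Gender_Male".
encoded = pd.get_dummies(df, columns=["Gender"])
```

The indicator columns may be Boolean or integer depending on the pandas version; `dtype=int` can be passed if 0/1 integers are required.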

How do you detect outliers in a dataset?

Using boxplots, IQR, or Z-scores.

Outliers can distort statistical results.

What is the interquartile range (IQR)?

The range between the 25th and 75th percentiles.

IQR = Q3 - Q1, used for outlier detection.
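
A minimal sketch of IQR-based outlier detection on made-up values. The 1.5 × IQR fences are a common convention and an assumption here; the card itself does not specify a multiplier.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles (assumed multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]
```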

How can you handle outliers in a dataset?

Capping
Transformation
Removal

Example: Winsorization limits extreme values.
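
Capping can be sketched with `Series.clip`, which limits values to chosen bounds in the spirit of Winsorization. The 5th and 95th percentile bounds and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values with one extreme observation.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Assumed bounds: the 5th and 95th percentiles of the data.
lower = values.quantile(0.05)
upper = values.quantile(0.95)

# Values below/above the bounds are set to the bounds; others are unchanged.
capped = values.clip(lower=lower, upper=upper)
```

Unlike removal, capping keeps every row, so the dataset size does not change.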

What is the purpose of feature scaling?

To bring all numerical features to a similar scale.

Helps gradient-based models converge faster and keeps distance-based models from being dominated by features with large values.

How do you normalize data using Min-Max Scaling?

(X - min) / (max - min)

Scales values between 0 and 1.
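
The formula can be wrapped in a small helper and applied to every numeric column. The function name `min_max_scale`, the sample columns, and the choice to return zeros for a constant column (where max equals min and the formula is undefined) are all assumptions of this sketch.

```python
import pandas as pd


def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1] using (X - min) / (max - min)."""
    span = column.max() - column.min()
    if span == 0:
        # Constant column: the formula would divide by zero.
        # Returning zeros is an assumption, not a standard rule.
        return pd.Series(0.0, index=column.index)
    return (column - column.min()) / span


# Hypothetical features on very different scales.
df = pd.DataFrame({"age": [20, 30, 40], "income": [1000, 3000, 5000]})

# DataFrame.apply passes each column to the helper in turn.
scaled = df.apply(min_max_scale)
```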

Which pandas function replaces values in a column?

.replace()

Example: df["col"].replace("old", "new").
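
A short sketch on made-up data. Note that `.replace()` returns a new Series, so the result has to be assigned back, and that it matches whole values rather than substrings unless `regex=True` is passed.

```python
import pandas as pd

# Hypothetical column.
df = pd.DataFrame({"col": ["old", "keep", "old"]})

# Replace one value with another and assign the result back.
df["col"] = df["col"].replace("old", "new")

# A dict replaces several values in one call.
df["col"] = df["col"].replace({"keep": "kept"})
```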

What is feature engineering?

Creating new features from existing data.

Helps improve model performance.
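
As one illustration, new columns can be derived from existing ones. The `orders` table, its columns, and the two derived features are assumptions made up for this sketch.

```python
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-06"]),
    "price": [100.0, 250.0],
    "quantity": [2, 5],
})

# New feature combining two existing columns.
orders["total"] = orders["price"] * orders["quantity"]

# New feature extracted from a date: Saturday (5) and Sunday (6) are weekend days.
orders["is_weekend"] = orders["order_date"].dt.dayofweek >= 5
```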

True or False:

Parsing dates in a dataset improves time-based analysis.

True.

Convert dates using pd.to_datetime().

How do you handle inconsistent data formats?

Standardizing formats using string operations or parsing functions.

Example: df["date"] = pd.to_datetime(df["date"]).
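
A minimal sketch of date parsing on made-up strings. Using `errors="coerce"` is an assumption of this example: it turns unparseable entries into `NaT` instead of raising an error, which makes bad values easy to find afterwards.

```python
import pandas as pd

# Hypothetical column with one invalid entry.
df = pd.DataFrame({"date": ["2024-01-15", "2024-02-20", "not a date"]})

# Parse strings into datetimes; invalid entries become NaT.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Once parsed, date parts are available through the .dt accessor.
df["month"] = df["date"].dt.month
```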

What does .apply() do in pandas?

Applies a function along the rows or columns of a DataFrame, or to each element of a Series.

Example: df["col"].apply(lambda x: x * 2).
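
The three common uses can be sketched side by side. The `width` and `height` columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"width": [1, 2, 3], "height": [4, 5, 6]})

# On a Series: the function receives one element at a time.
doubled = df["width"].apply(lambda x: x * 2)

# On a DataFrame with axis=1: the function receives one row at a time.
area = df.apply(lambda row: row["width"] * row["height"], axis=1)

# On a DataFrame with the default axis=0: the function receives one column at a time.
column_totals = df.apply(lambda col: col.sum())
```

For simple arithmetic like these examples, vectorized expressions such as `df["width"] * 2` are usually faster than `.apply()`.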

How do you handle **text data inconsistencies**?

Lowercasing
Removing special characters
Stemming
Lemmatization

Ensures uniformity in text analysis.
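
The first two steps can be sketched with pandas string methods alone; stemming and lemmatization need an NLP library and are not shown. The sample reviews and the choice of which characters count as "special" (anything other than lowercase letters, digits, and whitespace) are assumptions.

```python
import pandas as pd

# Hypothetical text entries that differ only in case, spacing, and punctuation.
reviews = pd.Series(["Great product!!!", "  great PRODUCT ", "Great-product?"])

cleaned = (
    reviews.str.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    .str.replace(r"[^a-z0-9\s]", " ", regex=True)
    # Collapse runs of whitespace into a single space.
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
```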

What is **lemmatization** in text preprocessing?

Reducing words to their dictionary base form (lemma).

Example: "running" → "run".

How do you handle imbalanced datasets?

Resampling techniques like **oversampling** or **undersampling**.

Helps in classification problems.
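
Random oversampling can be sketched with pandas alone by duplicating minority-class rows until the classes match. The data, the binary `label` column, and the fixed `random_state` are assumptions; dedicated libraries offer more refined methods.

```python
import pandas as pd

# Hypothetical binary dataset: 8 rows of class 0, 2 rows of class 1.
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

counts = df["label"].value_counts()
majority_label = counts.idxmax()

# Rows belonging to the smaller class (assumes exactly two classes).
minority = df[df["label"] != majority_label]

# Draw minority rows with replacement until both classes have equal size.
extra = minority.sample(
    n=counts.max() - len(minority), replace=True, random_state=0
)
balanced = pd.concat([df, extra], ignore_index=True)
```

Resampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.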

Which pandas method allows replacing missing values using **interpolation**?

**.interpolate()**

Estimates missing values from neighboring values, using linear (the default) or polynomial methods, among others.
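
A minimal sketch of the default linear method on made-up values. Other methods, such as polynomial, are selected with the `method` argument and rely on SciPy being installed.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-value gap.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation spaces the missing values evenly between 1.0 and 4.0.
linear = s.interpolate()
```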

How do you detect **incorrect data entries**?

Using domain knowledge
Range checks
Anomaly detection

Example: A negative age value is invalid.
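
A range check for the age example can be sketched with `Series.between`. The bounds of 0 and 120 are an assumption based on domain knowledge, and the sample values are made up.

```python
import pandas as pd

# Hypothetical ages, including a negative and an implausibly large value.
df = pd.DataFrame({"age": [34, -2, 51, 140]})

# True where the age falls inside the assumed plausible range (inclusive).
valid = df["age"].between(0, 120)

# Rows that fail the check and need review.
invalid_rows = df[~valid]
```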

What is the purpose of **data augmentation**?

Expanding a dataset by generating new variations.

Common in image processing and NLP.

This section covers the standard Data Cleaning and Preprocessing approach (30 cards)