Data Cleaning & Preprocessing Flashcards
This section covers the standard approach to data cleaning and preprocessing.
What is data cleaning?
The process of identifying and correcting errors in a dataset.
Ensures data quality and reliability.
True or False:
Data cleaning is an optional step in data analysis.
False.
It is a crucial step before analysis.
What does handling missing values mean?
Filling, removing, or imputing missing data.
Common methods: mean imputation, forward fill, etc.
How do you check for missing values in pandas?
df.isnull().sum()
Displays the count of missing values per column.
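A minimal sketch of the check above, assuming a small made-up DataFrame (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Oslo", "Lima", None, "Pune"],
})

missing_per_column = df.isnull().sum()          # missing values per column
total_missing = int(missing_per_column.sum())   # missing values overall
print(missing_per_column)
```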
What is the difference between .dropna() and .fillna()?
.dropna() removes rows (or columns) that contain missing values; .fillna() replaces the missing values with a value you choose.
Example: df.fillna(0).
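The two methods side by side, as a sketch on made-up data. The fill values (0 and "unknown") are assumptions chosen for illustration; both calls return a new DataFrame and leave the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one gap in each column
df = pd.DataFrame({"score": [10.0, np.nan, 30.0], "label": ["a", "b", None]})

dropped = df.dropna()                                  # keep only fully populated rows
filled = df.fillna({"score": 0, "label": "unknown"})   # per-column replacement values
```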
Fill in the blank:
The process of converting raw data into a clean, structured format ready for analysis is called _______.
Data preprocessing.
Includes cleaning, transformation, and normalization.
What does .duplicated() in pandas return?
A Boolean Series that is True for each row repeating an earlier row (by default the first occurrence is marked False).
Used for duplicate detection.
How do you remove duplicate rows in pandas?
df.drop_duplicates()
Helps eliminate redundant data.
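Detection and removal together, sketched on a made-up table with one repeated row:

```python
import pandas as pd

# Hypothetical data: the row (2, "Bob") appears twice
df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["Ann", "Bob", "Bob", "Cy"]})

dup_mask = df.duplicated()       # True only for repeats after the first occurrence
deduped = df.drop_duplicates()   # keeps the first occurrence of each row
```

Both methods accept a subset argument if only some columns should define a duplicate.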
Which techniques are used to rescale numerical values?
Normalization or Standardization.
Ensures data has a common scale.
What is the difference between normalization and standardization?
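Normalization rescales values to a fixed range, usually 0 to 1; standardization rescales them to mean 0 and standard deviation 1 (z-score).
Min-max bounds are set by the extreme values, so normalization is the more outlier-sensitive of the two.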
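The z-score side of this card, as a sketch on made-up values. Using the population standard deviation (ddof=0) is an assumption here; pandas defaults to the sample version (ddof=1).

```python
import pandas as pd

# Hypothetical measurements
heights = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])

# z-score: subtract the mean, then divide by the standard deviation
z = (heights - heights.mean()) / heights.std(ddof=0)
```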
How do you remove whitespace from a string column in pandas?
df["column"] = df["column"].str.strip()
Removes leading and trailing spaces.
What function converts all text to lowercase in pandas?
.str.lower()
Example: df["column"] = df["column"].str.lower().
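The whitespace and lowercase steps chain naturally. A sketch on a made-up column:

```python
import pandas as pd

# Hypothetical column with stray spaces and mixed case
df = pd.DataFrame({"city": ["  Oslo ", "LIMA", " pune"]})

# Trim leading/trailing whitespace, then lowercase
df["city"] = df["city"].str.strip().str.lower()
```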
True or False:
Encoding categorical variables is essential for machine learning models.
True.
Most algorithms require numerical input, so categories must be encoded as numbers first.
What does pd.get_dummies() do?
Converts categorical variables into dummy variables.
Example: pd.get_dummies(df["Gender"]).
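A sketch of the output shape, on made-up data. The prefix and dtype arguments are optional choices made here for readable, integer-valued columns.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"Gender": ["F", "M", "F"]})

# One indicator column per category
dummies = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)
print(dummies)
```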
How do you detect outliers in a dataset?
Using boxplots, IQR, or Z-scores.
Outliers can distort statistical results.
What is the interquartile range (IQR)?
The range between the 25th and 75th percentiles.
IQR = Q3 - Q1, used for outlier detection.
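A sketch of IQR-based detection on made-up values. The 1.5 × IQR fences are the conventional rule of thumb, not something this deck specifies.

```python
import pandas as pd

# Hypothetical measurements with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile (an assumption)
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
```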
How can you handle outliers in a dataset?
- Capping
- Transformation
- Removal
Example: Winsorization limits extreme values.
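Capping can be sketched with Series.clip. The 5th and 95th percentile limits are an assumption; the right thresholds depend on the data.

```python
import pandas as pd

# Hypothetical measurements with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Percentile limits chosen for illustration only
low, high = values.quantile(0.05), values.quantile(0.95)

# Values outside [low, high] are pulled back to the nearest limit
capped = values.clip(lower=low, upper=high)
```

Unlike removal, capping keeps every row, so the dataset size does not change.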
What is the purpose of feature scaling?
To bring all numerical features to a similar scale.
Helps gradient-based models converge faster and keeps large-valued features from dominating distance-based models.
How do you normalize data using Min-Max Scaling?
(X - min) / (max - min)
Scales values between 0 and 1.
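The formula applied to a made-up Series:

```python
import pandas as pd

# Hypothetical values
x = pd.Series([20.0, 30.0, 50.0])

# (X - min) / (max - min): the minimum maps to 0, the maximum to 1
x_scaled = (x - x.min()) / (x.max() - x.min())
```

A constant column has max equal to min, so the denominator is zero; such columns need separate handling.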
Which pandas function replaces values in a column?
.replace()
Example: df["col"].replace("old", "new").
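.replace() also accepts a dictionary, which helps when several spellings should map to one value. A sketch with made-up labels:

```python
import pandas as pd

# Hypothetical column with inconsistent spellings
df = pd.DataFrame({"country": ["USA", "U.S.", "Canada"]})

# replace() returns a new Series, so assign it back to keep the change
df["country"] = df["country"].replace({"USA": "United States", "U.S.": "United States"})
```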
What is feature engineering?
Creating new features from existing data.
Helps improve model performance.
True or False:
Parsing dates in a dataset improves time-based analysis.
True.
Convert dates using pd.to_datetime().
How do you handle inconsistent data formats?
Standardizing formats using string operations or parsing functions.
Example: df["date"] = pd.to_datetime(df["date"]).
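A sketch of date parsing on made-up strings. The explicit format and errors="coerce" are choices made here so that unparseable entries become NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical column: one valid date, one impossible date, one junk entry
df = pd.DataFrame({"date": ["2024-01-15", "2024-02-30", "not a date"]})

df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
```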
What does .apply() do in pandas?
Applies a function to each element of a Series, or to each row or column of a DataFrame.
Example: df[“col”].apply(lambda x: x * 2).
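Both uses of .apply(), sketched on a made-up table:

```python
import pandas as pd

# Hypothetical order data
df = pd.DataFrame({"price": [10, 20], "qty": [3, 4]})

# On a Series: the function receives one element at a time
doubled = df["price"].apply(lambda x: x * 2)

# On a DataFrame with axis=1: the function receives one row at a time
totals = df.apply(lambda row: row["price"] * row["qty"], axis=1)
```

For simple arithmetic like this, vectorized expressions such as df["price"] * df["qty"] are usually faster.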
How do you handle text data inconsistencies?
- Lowercasing
- Removing special characters
- Stemming
- Lemmatization
Ensures uniformity in text analysis.
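The first two steps can be sketched with pandas string methods. The regular expressions are assumptions suited to plain English text; stemming and lemmatization need an NLP library and are not shown.

```python
import pandas as pd

# Hypothetical reviews that differ only in case, spacing, and punctuation
reviews = pd.Series(["Great Product!!!", "  great   product ", "GREAT-product?"])

cleaned = (
    reviews.str.lower()
    .str.replace(r"[^a-z0-9\s]", " ", regex=True)   # special characters -> space
    .str.replace(r"\s+", " ", regex=True)           # collapse repeated whitespace
    .str.strip()
)
```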
What is lemmatization in text preprocessing?
Reducing words to their dictionary base form (lemma), using vocabulary and grammar rather than just trimming suffixes.
Example: “running” → “run”.
How do you handle imbalanced datasets?
Resampling techniques like oversampling or undersampling.
Helps in classification problems.
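Random oversampling, sketched in plain pandas on a made-up 8-to-2 class split. The column names and random_state are illustrative assumptions.

```python
import pandas as pd

# Hypothetical dataset: 8 rows of class 0, 2 rows of class 1
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Sample the minority class with replacement until it matches the majority size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_upsampled], ignore_index=True)
```

Resampling is normally applied to the training split only, so that duplicated rows do not leak into evaluation data.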
Which pandas method allows replacing missing values using interpolation?
.interpolate()
Fills gaps using linear or polynomial methods.
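A sketch of the default (linear) behavior on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a two-value gap
s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled = s.interpolate()   # linear by default: evenly spaced between 1.0 and 4.0
```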
How do you detect incorrect data entries?
- Using domain knowledge
- Range checks
- Anomaly detection
Example: A negative age value is invalid.
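A range check for the age example, as a sketch. The 0 to 120 bounds are an assumption standing in for whatever domain knowledge applies.

```python
import pandas as pd

# Hypothetical ages, two of them implausible
df = pd.DataFrame({"age": [34, -5, 210, 58]})

# between() is inclusive on both ends by default
valid = df["age"].between(0, 120)
invalid_rows = df[~valid]
```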
What is the purpose of data augmentation?
Expanding a dataset by generating new variations.
Common in image processing and NLP.