Exploratory Data Analysis (EDA) Flashcards by Ikuro Njung'e

What is Exploratory Data Analysis (EDA)?

A process of analyzing and summarizing datasets to uncover patterns, detect anomalies, and check assumptions before modeling.

EDA helps understand the structure and quality of data.

Which library is most commonly used for EDA in Python?

Pandas

Pandas provides powerful tools for data manipulation and summary statistics.

What function in pandas displays the first few rows of a dataset?

df.head()

What attribute is used to check the number of rows and columns in a dataset?

df.shape

Returns a tuple (rows, columns).

What method provides an overview of the dataset, including column data types and missing values?

df.info()

Lists each column's data type and non-null count, which makes missing values easy to spot.

Fill in the blank:

The method to calculate summary statistics for numerical columns is ___.

df.describe()

This method reports the count, mean, standard deviation, min, max, and quartiles; the 50% quartile is the median.
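
The first-look cards above (df.head(), df.shape, df.info(), df.describe()) can be tried together. This is a minimal sketch on a small made-up DataFrame; the column names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy dataset, invented for illustration only.
df = pd.DataFrame({
    "age": [23, 35, 31, None, 52],
    "city": ["Nairobi", "Mombasa", "Nairobi", "Kisumu", "Nairobi"],
})

first_rows = df.head(3)      # first 3 rows (the default is 5)
n_rows, n_cols = df.shape    # an attribute, so no parentheses
summary = df.describe()      # numeric columns only by default

df.info()                    # prints dtypes and non-null counts
print(first_rows)
print(n_rows, n_cols)
print(summary)
```

Note that df.describe() skips the text column here, and its count row excludes the missing age.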

Which pandas function counts unique values in a categorical column?

df["column"].nunique()

The .nunique() method tells how many unique values exist.

What method checks for missing values in a dataset?

df.isnull().sum()

It returns the count of missing values per column.

True or False:

The .dropna() method removes all rows with missing values.

TRUE

By default, .dropna() drops every row that contains at least one missing value. Be cautious when using it, as it may remove important data.

How do you replace missing values in a column with the mean?

df["column"] = df["column"].fillna(df["column"].mean())

This fills missing values with the column’s mean.
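
The three missing-value cards above (counting, dropping, filling) can be compared side by side. This is a minimal sketch on a made-up DataFrame; the column names are hypothetical. It assigns the result of fillna back to the column rather than relying on inplace=True, which is an editorial choice here.

```python
import pandas as pd

# Hypothetical toy data with gaps in both columns.
df = pd.DataFrame({
    "score": [10.0, None, 30.0, None],
    "group": ["a", "b", None, "b"],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Drop every row that has at least one missing value.
dropped = df.dropna()

# Fill missing scores with the column mean, leaving the original untouched.
filled = df.copy()
filled["score"] = filled["score"].fillna(filled["score"].mean())

print(missing_per_column)
print(len(df), "rows before dropna,", len(dropped), "after")
print(filled)
```

Only one of the four rows survives dropna here, which shows how much data the default behaviour can discard.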

What function in pandas detects duplicate rows?

df.duplicated()

Returns a Boolean series indicating duplicate rows.
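
A short sketch of duplicate detection, using invented data. By default df.duplicated() marks repeats after the first occurrence; drop_duplicates() is the companion method for removing them.

```python
import pandas as pd

# Hypothetical data: the third row repeats the second.
df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["x", "y", "y", "z"]})

dup_mask = df.duplicated()          # True for each repeated row
n_duplicates = int(dup_mask.sum())  # how many rows are repeats
deduped = df.drop_duplicates()      # keeps the first occurrence

print(dup_mask.tolist())
print(n_duplicates)
print(deduped)
```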

Which visualization is used to check the distribution of a numerical column?

Histogram

Histograms show the frequency distribution of numerical values.

What is the purpose of a boxplot in EDA?

To visualize the spread of data and detect outliers.

Boxplots display quartiles and outliers.

Which measure of central tendency is most affected by outliers?

Mean

The mean is pulled in the direction of extreme values.
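
To see the effect on made-up numbers, the sketch below adds one extreme value to a small series and compares the mean and median before and after. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical measurements, then the same data plus one extreme value.
clean = pd.Series([10, 12, 11, 13, 12])
with_outlier = pd.Series([10, 12, 11, 13, 12, 500])

mean_clean, median_clean = clean.mean(), clean.median()
mean_out, median_out = with_outlier.mean(), with_outlier.median()

print("mean:  ", mean_clean, "->", mean_out)
print("median:", median_clean, "->", median_out)
```

The mean jumps from 11.6 to 93.0, while the median stays at 12.0.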

What statistical measure is used to detect skewness?

Skewness coefficient

A skewness value > 0 indicates right-skewed data; < 0 indicates left-skewed data.
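
In pandas, the sample skewness of a column is available through Series.skew(). A minimal sketch on invented data, where the left-skewed series is simply the mirror image of the right-skewed one:

```python
import pandas as pd

# Hypothetical data with a long right tail, and its mirror image.
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])
left_skewed = -right_skewed

skew_right = right_skewed.skew()  # positive for a right tail
skew_left = left_skewed.skew()    # negative for a left tail

print(skew_right, skew_left)
```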

True or False:

A correlation value of 0 means two variables are unrelated.

FALSE

A correlation of 0 indicates no linear relationship, but they may still be related in other ways.
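
A small sketch of that caveat, with invented values: y is fully determined by x (y = x squared), yet the Pearson correlation is essentially zero because the relationship is not linear.

```python
import pandas as pd

# Symmetric x values; y depends on x exactly, but not linearly.
x = pd.Series([-2, -1, 0, 1, 2])
y = x ** 2

r = x.corr(y)  # Pearson correlation by default

print(r)
```

A scatter plot of x against y would reveal the U-shaped dependence that the coefficient misses.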

What visualization is commonly used to display correlations?

Heatmap

A heatmap visualizes correlation coefficients using colors.

What does a scatter plot show?

The relationship between two numerical variables.

Useful for identifying trends and correlations.

Fill in the blank:

A categorical variable is best visualized using a ___.

Bar plot

A bar plot shows the count or proportion of categories.

How do you create a scatter plot using Seaborn?

sns.scatterplot(x="col1", y="col2", data=df)

Scatter plots help visualize trends between two numerical variables.

Which method groups data by a categorical column?

df.groupby("column")

Allows aggregation of data by categories.
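
On its own, groupby only defines the groups; an aggregation such as mean() produces the summary. A minimal sketch with hypothetical column names and values:

```python
import pandas as pd

# Hypothetical salary data, invented for illustration.
df = pd.DataFrame({
    "dept": ["ops", "ops", "eng", "eng", "eng"],
    "salary": [40, 50, 70, 80, 90],
})

# Group rows by department, then average the salary within each group.
mean_by_dept = df.groupby("dept")["salary"].mean()

print(mean_by_dept)
```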

What is the difference between a histogram and a bar plot?

A histogram is for numerical data, while a bar plot is for **categorical** data.

Histograms group data into bins; bar plots show distinct categories.

True or False:

Outliers should always be removed.

FALSE

Outliers should be analyzed before deciding to remove them.

What statistical method detects outliers based on quartiles?

Interquartile Range (IQR)

Outliers are values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.
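
The IQR rule can be sketched in a few lines of pandas. The data are invented, and the quartiles come from Series.quantile() with its default linear interpolation.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
s = pd.Series([10, 12, 11, 13, 12, 14, 100])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]

print(q1, q3, iqr)
print(lower, upper)
print(outliers.tolist())
```

Flagged values are candidates for review rather than automatic removal.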

What is the **range** of correlation values?

**-1 to 1**

A correlation of -1 is a perfect negative relationship, while 1 is a perfect positive relationship.

Which type of chart is best for showing **time-series** data?

**Line plot**

Line plots show trends over time.

Fill in the blank:

___ is a technique used to reduce the number of features while preserving information.

Principal Component Analysis (PCA)

PCA helps reduce dimensionality and improve model efficiency.

Which function in pandas returns the **most frequent** values in a column?

df["column"].value_counts()

Returns the count of each unique value, sorted from most to least frequent. Useful for analyzing categorical variables.
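
A quick sketch of value_counts() on an invented categorical series. The most frequent value comes first in the result.

```python
import pandas as pd

# Hypothetical category labels.
s = pd.Series(["a", "b", "a", "c", "a", "b"])

counts = s.value_counts()      # sorted by count, descending
most_frequent = counts.index[0]

print(counts)
print(most_frequent)
```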

What is the purpose of *feature engineering* in EDA?

To create new meaningful features from raw data.

Helps improve model performance.

Which function converts **categorical** variables into **numerical** format?

pd.get_dummies()

Creates one-hot encoded variables.
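
A minimal sketch of one-hot encoding with pd.get_dummies(), using a hypothetical "color" column. Each category becomes its own indicator column, named with the original column as a prefix.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "blue", "red"]})

# One indicator column per category.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded)
```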

What is **multicollinearity**?

When two or more independent variables are highly correlated.

Multicollinearity can distort regression models.

Which statistical test is used to check **normality**?

Shapiro-Wilk test

A p-value < 0.05 suggests non-normal data.

What does *kurtosis* measure?

The "**tailedness**" of a distribution.

High kurtosis = heavy tails, low kurtosis = light tails.

How do you detect **missing** patterns in data?

Using a missing-value heatmap: **sns.heatmap(df.isnull(), cmap="viridis")**.

Helps visualize missing data structure.

What is a **QQ** plot used for?

Checking if data follows a **normal** distribution.

Points that fall along a straight line indicate normality.

True or False:

**Normalization** scales data to have a mean of 0 and standard deviation of 1.

**FALSE**

Normalization scales data between 0 and 1; standardization sets mean = 0, std = 1.
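
The two rescalings can be written out by hand to see the difference. This sketch uses invented values and takes the population standard deviation (ddof=0) for standardization, which is an assumption; using the sample standard deviation would give slightly different numbers.

```python
import pandas as pd

# Hypothetical feature values.
s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Min-max normalization: rescale to the range [0, 1].
normalized = (s - s.min()) / (s.max() - s.min())

# Standardization (z-score): mean 0, standard deviation 1.
standardized = (s - s.mean()) / s.std(ddof=0)

print(normalized.tolist())
print(standardized.tolist())
```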

What is **data leakage** in EDA?

When training data contains **future information** that shouldn't be available.

Can cause unrealistic model performance.

What does a **pairplot** show?

Pairwise relationships between multiple numerical variables.

Created using sns.pairplot(df).

What is the purpose of **dimensionality reduction**?

To **reduce** the number of variables while retaining information.

Helps improve computation efficiency.

True or False:

**Log transformation** can help with skewed data.

**TRUE**

Log transformations reduce right skewness.
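
A small sketch of the effect on invented, right-skewed data. It uses np.log1p (the log of 1 + x), which is one common choice because it also handles zeros; a plain np.log would work for strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values with one very large observation.
s = pd.Series([1, 2, 3, 5, 8, 13, 200])

skew_before = s.skew()
skew_after = np.log1p(s).skew()  # skewness after the log transform

print(skew_before, skew_after)
```

The skewness drops after the transform but does not necessarily reach zero.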

These flashcards cover various EDA methods (40 cards).