Exploratory Data Analysis (EDA) Flashcards
These flashcards review common EDA methods.
What is Exploratory Data Analysis (EDA)?
A process of analyzing and summarizing datasets to uncover patterns, detect anomalies, and check assumptions before modeling.
EDA helps understand the structure and quality of data.
Which library is most commonly used for EDA in Python?
Pandas
Pandas provides powerful tools for data manipulation and summary statistics.
What function in pandas displays the first few rows of a dataset?
df.head()
What function is used to check the number of rows and columns in a dataset?
df.shape
Returns a tuple (rows, columns).
What method provides an overview of the dataset, including column data types and missing values?
df.info()
Shows each column's data type and non-null count, which makes missing values easy to spot.
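The three inspection calls above can be tried together. This is a minimal sketch on a made-up DataFrame; the column names and values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy dataset, used only for illustration.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Oslo", "Lima", "Oslo", None],
})

first_rows = df.head(2)     # first 2 rows (the default is 5)
n_rows, n_cols = df.shape   # shape is an attribute, not a method
df.info()                   # prints dtypes and non-null counts per column
```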
Fill in the blank:
The method to calculate summary statistics for numerical columns is ___.
df.describe()
This method reports the count, mean, standard deviation, min, max, and quartiles; the 50% quartile is the median.
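As a small sketch of reading the summary table, the example below pulls individual statistics out of the result. The column name and values are assumed for illustration.

```python
import pandas as pd

# Hypothetical numeric column, used only for illustration.
scores = pd.DataFrame({"score": [2, 4, 4, 6, 9]})

summary = scores.describe()                  # rows: count, mean, std, min, 25%, 50%, 75%, max
mean_score = summary.loc["mean", "score"]    # 5.0
median_score = summary.loc["50%", "score"]   # 4.0 (the median)
```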
Which pandas function counts unique values in a categorical column?
df["column"].nunique()
The .nunique() method tells how many unique values exist.
What method checks for missing values in a dataset?
df.isnull().sum()
It returns the count of missing values per column.
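A short sketch of both checks on a made-up DataFrame (the column names are assumptions). Note that `.nunique()` ignores missing values by default.

```python
import pandas as pd

# Hypothetical toy dataset with some missing entries.
df = pd.DataFrame({
    "color": ["red", "blue", "red", None],
    "size": [1.0, None, None, 4.0],
})

n_colors = df["color"].nunique()   # distinct non-missing values
missing = df.isnull().sum()        # Series: missing count per column
```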
True or False:
By default, the .dropna() method removes every row that contains at least one missing value.
TRUE
Be cautious when using .dropna(), as it may remove important data.
How do you replace missing values in a column with the mean?
df["column"] = df["column"].fillna(df["column"].mean())
This fills missing values with the column’s mean.
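The two ways of handling missing values can be compared side by side. This is a sketch on assumed toy data; it assigns the filled column back rather than relying on in-place modification of a column selection, which recent pandas versions discourage.

```python
import pandas as pd

# Hypothetical toy dataset with one missing income.
df = pd.DataFrame({
    "income": [10.0, 20.0, None, 30.0],
    "age": [25, 31, 47, 52],
})

dropped = df.dropna()   # drops the one row that has a missing value

filled = df.copy()
# mean() skips missing values, so the fill value here is 20.0
filled["income"] = filled["income"].fillna(filled["income"].mean())
```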
What function in pandas detects duplicate rows?
df.duplicated()
Returns a Boolean series indicating duplicate rows.
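A minimal sketch on assumed toy data. `drop_duplicates()` is not covered by the card above but is the usual pandas follow-up once duplicates are found.

```python
import pandas as pd

# Hypothetical toy dataset where the third row repeats the second.
df = pd.DataFrame({"id": [1, 2, 2, 3], "val": ["a", "b", "b", "c"]})

dup_mask = df.duplicated()       # True for rows that repeat an earlier row
deduped = df.drop_duplicates()   # keeps the first occurrence of each row
```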
Which visualization is used to check the distribution of a numerical column?
Histogram
Histograms show the frequency distribution of numerical values.
What is the purpose of a boxplot in EDA?
To visualize the spread of data and detect outliers.
Boxplots display quartiles and outliers.
Which measure of central tendency is most affected by outliers?
Mean
The mean is pulled in the direction of extreme values.
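To see the effect, the sketch below adds a single extreme value to a small assumed sample and compares the mean and the median before and after.

```python
import pandas as pd

# Hypothetical sample, then the same sample with one extreme value appended.
values = pd.Series([10, 12, 11, 13, 12])
with_outlier = pd.concat([values, pd.Series([500])], ignore_index=True)

mean_before, mean_after = values.mean(), with_outlier.mean()
median_before, median_after = values.median(), with_outlier.median()
# The mean jumps from 11.6 to 93.0, while the median stays at 12.0.
```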
What statistical measure is used to detect skewness?
Skewness coefficient
A skewness value > 0 indicates right-skewed data; < 0 indicates left-skewed data.
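In pandas, the skewness coefficient is available as `Series.skew()`. The sketch below uses assumed values and checks only the sign.

```python
import pandas as pd

# Hypothetical samples: a long right tail, and its mirror image.
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])
left_skewed = -right_skewed

right_skew = right_skewed.skew()   # positive: tail extends to the right
left_skew = left_skewed.skew()     # negative: tail extends to the left
```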
True or False:
A correlation value of 0 means two variables are unrelated.
FALSE
A correlation of 0 indicates no linear relationship; the variables may still be related in nonlinear ways.
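A small sketch of a perfect nonlinear relationship with zero Pearson correlation, using assumed values where y is exactly x squared.

```python
import pandas as pd

# y is fully determined by x, yet the relationship is not linear.
x = pd.Series([-2, -1, 0, 1, 2])
y = x ** 2

r = x.corr(y)   # Pearson correlation by default; essentially 0 here
```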
What visualization is commonly used to display correlations?
Heatmap
A heatmap visualizes correlation coefficients using colors.
What does a scatter plot show?
The relationship between two numerical variables.
Useful for identifying trends and correlations.
Fill in the blank:
A categorical variable is best visualized using a ___.
Bar plot
A bar plot shows the count or proportion of categories.
How do you create a scatter plot using Seaborn?
sns.scatterplot(x="col1", y="col2", data=df)
Scatter plots help visualize trends between two numerical variables.
Which method groups data by a categorical column?
df.groupby("column")
Allows aggregation of data by categories.
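A minimal sketch of grouping followed by an aggregation. The column names `dept` and `salary` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy dataset.
df = pd.DataFrame({
    "dept": ["a", "b", "a", "b"],
    "salary": [10, 20, 30, 40],
})

# groupby() alone only defines the groups; an aggregation produces the result.
mean_by_dept = df.groupby("dept")["salary"].mean()   # a: 20.0, b: 30.0
```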
What is the difference between a histogram and a bar plot?
A histogram is for numerical data, while a bar plot is for categorical data.
Histograms group data into bins; bar plots show distinct categories.
True or False:
Outliers should always be removed.
FALSE
Outliers should be analyzed before deciding to remove them.
What statistical method detects outliers based on quartiles?
Interquartile Range (IQR)
Outliers are values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
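The IQR rule can be sketched in a few lines. The sample values are assumed, and the 1.5 multiplier is the conventional choice stated on the card.

```python
import pandas as pd

# Hypothetical sample with one obvious extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence
outliers = values[(values < lower) | (values > upper)]
```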
What is the range of correlation values?
-1 to 1
A correlation of -1 is a perfect negative relationship, while 1 is a perfect positive relationship.
Which type of chart is best for showing time-series data?
Line plot
Line plots show trends over time.
Fill in the blank:
___ is a technique used to reduce the number of features while preserving as much information as possible.
Principal Component Analysis (PCA)
PCA helps reduce dimensionality and improve model efficiency.
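As a rough sketch of the idea, PCA can be computed with a singular value decomposition of the centered data. This uses NumPy only and assumed toy data in which the second feature is exactly twice the first, so one component captures essentially all the variance; in practice a library implementation is the usual choice.

```python
import numpy as np

# Hypothetical 2-feature dataset; the features are perfectly collinear.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])

X_centered = X - X.mean(axis=0)   # PCA works on centered data
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_ratio = s**2 / np.sum(s**2)   # share of variance per component
X_reduced = X_centered @ Vt[:1].T       # project onto the first component
```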
Which function in pandas returns the most frequent values in a column?
df["column"].value_counts()
Returns the count of each unique value, sorted from most to least frequent; useful for analyzing categorical variables.
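A short sketch on an assumed categorical column. The most frequent value is the first index entry of the result.

```python
import pandas as pd

# Hypothetical categorical column.
pets = pd.Series(["cat", "dog", "cat", "bird", "cat", "dog"])

counts = pets.value_counts()   # sorted by count, descending
most_common = counts.index[0]  # "cat"
```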
What is the purpose of feature engineering in EDA?
To create new meaningful features from raw data.
Helps improve model performance.
Which function converts categorical variables into numerical format?
pd.get_dummies()
Creates one-hot encoded variables.
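A minimal sketch of one-hot encoding an assumed `color` column. The dtype of the new columns depends on the pandas version (boolean in recent releases), so the example does not rely on it.

```python
import pandas as pd

# Hypothetical toy dataset with one categorical column.
df = pd.DataFrame({"color": ["red", "blue", "red"]})

# Each category becomes its own indicator column, prefixed with the source column name.
encoded = pd.get_dummies(df, columns=["color"])
column_names = list(encoded.columns)   # ["color_blue", "color_red"]
```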
What is multicollinearity?
When two or more independent variables are highly correlated.
Multicollinearity can distort regression models.
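One simple way to spot candidate multicollinearity is to inspect the pairwise correlation matrix. The sketch below uses assumed columns, where `height_in` is just `height_cm` in different units; pairwise correlation is only a first check and will not reveal dependence among three or more variables.

```python
import pandas as pd

# Hypothetical features: two measure the same quantity in different units.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "age": [30, 22, 41, 35, 28],
})
df["height_in"] = df["height_cm"] / 2.54

corr = df.corr()   # pairwise Pearson correlations
redundant = corr.loc["height_cm", "height_in"]   # essentially 1.0
unrelated = corr.loc["height_cm", "age"]         # weak correlation
```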
Which statistical test is used to check normality?
Shapiro-Wilk test
A p-value < 0.05 suggests non-normal data.
What does kurtosis measure?
The “tailedness” of a distribution.
High kurtosis = heavy tails, low kurtosis = light tails.
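In pandas, `Series.kurt()` reports excess kurtosis, so a normal distribution scores about 0. The sketch below compares two assumed samples and checks only the sign.

```python
import pandas as pd

# Hypothetical samples: mostly zeros with two extreme values, versus evenly spread values.
heavy_tailed = pd.Series([-10, 0, 0, 0, 0, 0, 0, 0, 0, 10])
light_tailed = pd.Series(range(10))

heavy_kurt = heavy_tailed.kurt()   # positive: heavier tails than a normal distribution
light_kurt = light_tailed.kurt()   # negative: lighter tails than a normal distribution
```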
How do you detect missing patterns in data?
Using a missing value heatmap (sns.heatmap(df.isnull(), cmap="viridis")).
Helps visualize missing data structure.
What is a QQ plot used for?
Checking if data follows a normal distribution.
Points that fall close to the straight reference line indicate normality.
True or False:
Normalization scales data to have a mean of 0 and standard deviation of 1.
FALSE
Normalization scales data between 0 and 1; standardization sets mean = 0, std = 1.
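The two rescalings can be written directly with pandas arithmetic. This is a sketch on assumed values; it uses the population standard deviation (`ddof=0`), which is one common convention.

```python
import pandas as pd

# Hypothetical numeric column.
x = pd.Series([2.0, 4.0, 6.0, 8.0])

# Min-max normalization: rescales values into the range [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0 and standard deviation 1.
standardized = (x - x.mean()) / x.std(ddof=0)
```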
What is data leakage in EDA?
When training data contains information, such as future values or the target itself, that would not be available at prediction time.
Can cause unrealistic model performance.
What does a pairplot show?
Pairwise relationships between multiple numerical variables.
Created using sns.pairplot(df).
What is the purpose of dimensionality reduction?
To reduce the number of variables while retaining information.
Helps improve computation efficiency.
True or False:
Log transformation can help with skewed data.
TRUE
Log transformations reduce right skewness.
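A small sketch of the effect, using assumed values that double at each step. The log requires strictly positive inputs; `np.log1p` is a common alternative when zeros are present.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: each value is double the previous one.
raw = pd.Series([1, 2, 4, 8, 16, 32, 64, 128])
logged = np.log(raw)   # evenly spaced after the transform

raw_skew = raw.skew()         # clearly positive
logged_skew = logged.skew()   # approximately 0
```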