Exploratory Data Analysis (EDA) Flashcards
This flashcard set reviews common EDA methods.
What is Exploratory Data Analysis (EDA)?
A process of analyzing and summarizing datasets to uncover patterns, detect anomalies, and check assumptions before modeling.
EDA helps understand the structure and quality of data.
Which library is most commonly used for EDA in Python?
Pandas
Pandas provides powerful tools for data manipulation and summary statistics.
What function in pandas displays the first few rows of a dataset?
df.head()
What function is used to check the number of rows and columns in a dataset?
df.shape
Returns a tuple (rows, columns).
What method provides an overview of the dataset, including column data types and missing values?
df.info()
Useful for checking null values and data types.
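The three inspection calls from the cards above can be tried together. This is a minimal sketch on a made-up DataFrame; the column names and values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy dataset, used only for illustration
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Oslo", "Lima", "Oslo", None],
})

first_rows = df.head(2)      # first two rows (the default is five)
n_rows, n_cols = df.shape    # shape is an attribute, not a method
df.info()                    # prints dtypes and non-null counts per column
```

Note that df.info() prints its report rather than returning it, so it is usually called on its own line.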
Fill in the blank:
The method to calculate summary statistics for numerical columns is ___.
df.describe()
This method reports count, mean, standard deviation, min, max, and quartiles; the median appears as the 50% row.
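A short sketch of reading values out of the summary table, assuming a single made-up numeric column named score:

```python
import pandas as pd

# Hypothetical numeric column for illustration
df = pd.DataFrame({"score": [1, 2, 3, 4, 5]})

summary = df.describe()                 # rows: count, mean, std, min, 25%, 50%, 75%, max
median = summary.loc["50%", "score"]    # the 50% row is the median
mean = summary.loc["mean", "score"]
```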
Which pandas function counts unique values in a categorical column?
df["column"].nunique()
The .nunique() method tells how many unique values exist.
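It can help to see .nunique() next to .value_counts(), which answers a related but different question (how often each value occurs). A minimal sketch, assuming a hypothetical color column:

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
df = pd.DataFrame({"color": ["red", "blue", "red", None]})

n_distinct = df["color"].nunique()                    # missing values are ignored by default
n_with_missing = df["color"].nunique(dropna=False)    # count the missing value as its own category
counts = df["color"].value_counts()                   # occurrences of each distinct value
```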
What method checks for missing values in a dataset?
df.isnull().sum()
It returns the count of missing values per column.
True or False:
The .dropna() method removes all rows with missing values.
TRUE
Be cautious when using .dropna(), as it may remove important data.
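The two missing-value cards above can be sketched as follows. The DataFrame is made up for illustration; the how and subset arguments shown are ways to make .dropna() less aggressive than its default.

```python
import pandas as pd

# Hypothetical data: row 1 is entirely missing, row 0 is partly missing
df = pd.DataFrame({"a": [1, None, 3], "b": [None, None, "x"]})

missing_per_column = df.isnull().sum()   # count of missing values in each column

dropped_any = df.dropna()                # default: drop rows with ANY missing value
dropped_all = df.dropna(how="all")       # drop only rows where EVERY value is missing
dropped_a = df.dropna(subset=["a"])      # consider only column "a" when dropping
```

Comparing the lengths of the three results shows how much data each choice discards.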
How do you replace missing values in a column with the mean?
df["column"] = df["column"].fillna(df["column"].mean())
This fills missing values with the column’s mean.
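A minimal sketch of mean imputation on a hypothetical income column. The result is assigned back to the column, which behaves the same across pandas versions; calling fillna with inplace=True on a selected column can trigger warnings or leave the original DataFrame unchanged in recent releases.

```python
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"income": [10.0, None, 30.0]})

col_mean = df["income"].mean()                   # mean skips missing values
df["income"] = df["income"].fillna(col_mean)     # assign the filled column back
```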
What function in pandas detects duplicate rows?
df.duplicated()
Returns a Boolean Series that is True for each row repeating an earlier row.
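A short sketch of finding and removing duplicates, assuming a made-up two-column table:

```python
import pandas as pd

# Hypothetical data: the third row repeats the second
df = pd.DataFrame({"id": [1, 2, 2, 3], "val": ["a", "b", "b", "c"]})

mask = df.duplicated()           # True for rows that repeat an earlier row
n_dupes = int(mask.sum())        # number of duplicate rows
deduped = df.drop_duplicates()   # keep the first occurrence of each row
```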
Which visualization is used to check the distribution of a numerical column?
Histogram
Histograms show the frequency distribution of numerical values.
What is the purpose of a boxplot in EDA?
To visualize the spread of data and detect outliers.
Boxplots display quartiles and outliers.
Which measure of central tendency is most affected by outliers?
Mean
The mean is pulled in the direction of extreme values.
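The effect is easy to see on a small made-up sample with one extreme value:

```python
import pandas as pd

# Hypothetical sample: four typical values and one extreme value
s = pd.Series([10, 12, 11, 13, 1000])

mean = s.mean()       # pulled far upward by the single extreme value
median = s.median()   # stays at the middle of the typical values
```

Here the median stays at 12, while the mean is above 200.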
What statistical measure is used to detect skewness?
Skewness coefficient
A skewness value > 0 indicates right-skewed data; < 0 indicates left-skewed data.
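A minimal sketch of checking the sign of the skewness with the pandas Series.skew() method. Both samples are made up; the second is a mirror image of the first.

```python
import pandas as pd

# Hypothetical samples for illustration
right_tailed = pd.Series([1, 1, 2, 2, 3, 10])    # long tail toward large values
left_tailed = pd.Series([1, 8, 9, 9, 10, 10])    # long tail toward small values

skew_right = right_tailed.skew()   # positive for right-skewed data
skew_left = left_tailed.skew()     # negative for left-skewed data
```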
True or False:
A correlation value of 0 means two variables are unrelated.
FALSE
A correlation of 0 indicates no linear relationship; the variables may still be related in a nonlinear way.
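A small made-up example shows why a correlation of 0 rules out only a linear relationship. Below, y is completely determined by x, yet the Pearson correlation is zero because the relationship is symmetric around x = 0.

```python
import pandas as pd

# Hypothetical data: y depends entirely on x, but not linearly
x = pd.Series([-2, -1, 0, 1, 2])
y = x ** 2

r = x.corr(y)   # Pearson correlation (the default method)
```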
What visualization is commonly used to display correlations?
Heatmap
A heatmap visualizes correlation coefficients using colors.
What does a scatter plot show?
The relationship between two numerical variables.
Useful for identifying trends and correlations.
Fill in the blank:
A categorical variable is best visualized using a ___.
Bar plot
A bar plot shows the count or proportion of categories.
How do you create a scatter plot using Seaborn?
sns.scatterplot(x="col1", y="col2", data=df)
Scatter plots help visualize trends between two numerical variables.
Which method groups data by a categorical column?
df.groupby("column")
Allows aggregation of data by categories.
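A grouping only becomes useful once an aggregation is applied to it. A minimal sketch, assuming hypothetical dept and salary columns:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "dept": ["a", "b", "a", "b"],
    "salary": [10, 20, 30, 40],
})

mean_by_dept = df.groupby("dept")["salary"].mean()   # one mean per category
size_by_dept = df.groupby("dept").size()             # number of rows per category
```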
What is the difference between a histogram and a bar plot?
A histogram is for numerical data, while a bar plot is for categorical data.
Histograms group data into bins; bar plots show distinct categories.
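The difference shows up in the numbers each plot is built from. This sketch computes those numbers without drawing anything; the data and the choice of three bins are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Histogram: numerical values are grouped into equal-width bins
values = [1, 2, 2, 3, 9]
bin_counts, bin_edges = np.histogram(values, bins=3)

# Bar plot: each distinct category gets its own bar
pets = pd.Series(["cat", "dog", "cat", "bird"])
category_counts = pets.value_counts()
```

The bin counts depend on how many bins are chosen, whereas the category counts are fixed by the data.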
True or False:
Outliers should always be removed.
FALSE
Outliers should be analyzed before deciding to remove them.
What statistical method detects outliers based on quartiles?
Interquartile Range (IQR)
Outliers are values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.
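The IQR rule can be sketched in a few lines. The sample is made up, and the quartiles use the default linear interpolation of Series.quantile(), so other quartile conventions may give slightly different fences.

```python
import pandas as pd

# Hypothetical sample with one suspicious value
s = pd.Series([10, 12, 11, 13, 12, 100])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

outliers = s[(s < lower) | (s > upper)]   # values outside the fences
```

Flagged values are candidates for a closer look, not automatic removals.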