Exploratory Data Analysis (EDA) Flashcards

This flashcard set reviews common EDA methods.

1
Q

What is Exploratory Data Analysis (EDA)?

A

A process of analyzing and summarizing datasets to uncover patterns, detect anomalies, and check assumptions before modeling.

EDA helps understand the structure and quality of data.

2
Q

Which library is most commonly used for EDA in Python?

A

Pandas

Pandas provides powerful tools for data manipulation and summary statistics.

3
Q

What function in pandas displays the first few rows of a dataset?

A

df.head()

4
Q

What attribute gives the number of rows and columns in a dataset?

A

df.shape

Returns a tuple (rows, columns).
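
A minimal sketch of the two inspection calls so far, on a made-up frame (the column names `a` and `b` are invented for the example). Note that `shape` is read without parentheses, while `head` is called.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

n_rows, n_cols = df.shape   # attribute: a (rows, columns) tuple
first_two = df.head(2)      # method: first n rows (5 if n is omitted)

print(n_rows, n_cols)
print(first_two)
```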

5
Q

What method provides an overview of the dataset, including column data types and missing values?

A

df.info()

Useful for checking null values and data types.

6
Q

Fill in the blank:

The method to calculate summary statistics for numerical columns is ___.

A

df.describe()

This method reports count, mean, standard deviation, min, max, and quartiles; the median appears as the 50% row.
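
A quick sketch of the summary on a made-up frame; the columns `age` and `income` are invented for the example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 64],
    "income": [28000, 42000, 55000, 61000, 90000],
})

summary = df.describe()   # one column of statistics per numerical column
print(summary)

# Individual statistics can be read by row label.
median_age = summary.loc["50%", "age"]
```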

7
Q

Which pandas function counts unique values in a categorical column?

A

df["column"].nunique()

The .nunique() method tells how many unique values exist.
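
A small sketch, assuming a made-up `color` column that includes one missing value, to show how missing entries are treated.

```python
import pandas as pd

# Hypothetical categorical column, invented for illustration only.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", None]})

n_unique = df["color"].nunique()                 # missing values are ignored by default
n_with_nan = df["color"].nunique(dropna=False)   # count the missing value as its own entry
labels = df["color"].unique()                    # the distinct values themselves

print(n_unique, n_with_nan)
```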

8
Q

What method checks for missing values in a dataset?

A

df.isnull().sum()

It returns the count of missing values per column.
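
A minimal sketch on invented data. Taking the mean of the same Boolean mask gives the fraction missing per column, which is often easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "height": [1.70, np.nan, 1.82, np.nan],
    "city": ["Oslo", "Lima", None, "Pune"],
})

missing_per_column = df.isnull().sum()   # count of missing values per column
missing_share = df.isnull().mean()       # fraction of missing values per column

print(missing_per_column)
```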

9
Q

True or False:

The .dropna() method removes all rows with missing values.

A

TRUE

By default, .dropna() drops every row that has at least one missing value, so use it with caution: it may remove important data.
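
To see how much data each variant discards, here is a sketch on an invented two-column frame. The `how` and `subset` parameters are standard `dropna` options.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [10.0, 20.0, np.nan, np.nan],
})

any_missing = df.dropna()             # default: drop rows with at least one NaN
all_missing = df.dropna(how="all")    # drop only rows where every value is NaN
subset_a = df.dropna(subset=["a"])    # look for NaN in column "a" only

print(len(df), len(any_missing), len(all_missing), len(subset_a))
```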

10
Q

How do you replace missing values in a column with the mean?

A

df["column"] = df["column"].fillna(df["column"].mean())

This fills missing values with the column’s mean.
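
A minimal sketch on an invented `score` column. It assigns the filled result back to the column, which avoids the chained-assignment warnings that recent pandas versions raise for `inplace=True` on a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({"score": [10.0, np.nan, 30.0, np.nan]})

mean_score = df["score"].mean()                 # NaN is skipped: (10 + 30) / 2
df["score"] = df["score"].fillna(mean_score)    # assign the filled column back

print(df)
```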

11
Q

What function in pandas detects duplicate rows?

A

df.duplicated()

Returns a Boolean series indicating duplicate rows.
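
A short sketch on invented data. By default only the repeats are flagged, not the first occurrence; `drop_duplicates()` is the companion method for removing them.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Ann", "Bob", "Bob", "Cy"],
})

dup_mask = df.duplicated()        # True for repeats after the first occurrence
n_duplicates = dup_mask.sum()     # number of duplicate rows
deduped = df.drop_duplicates()    # keeps the first occurrence of each row

print(dup_mask.tolist(), n_duplicates)
```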

12
Q

Which visualization is used to check the distribution of a numerical column?

A

Histogram

Histograms show the frequency distribution of numerical values.

13
Q

What is the purpose of a boxplot in EDA?

A

To visualize the spread of data and detect outliers.

Boxplots display quartiles and outliers.

14
Q

Which measure of central tendency is most affected by outliers?

A

Mean

The mean is pulled in the direction of extreme values.
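
A small numeric illustration with invented values: replacing one value with an extreme one moves the mean a long way while the median stays put.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
salaries = pd.Series([40, 42, 45, 47, 50])
with_outlier = pd.Series([40, 42, 45, 47, 500])   # last value replaced by an outlier

mean_shift = with_outlier.mean() - salaries.mean()
median_shift = with_outlier.median() - salaries.median()

print(mean_shift, median_shift)
```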

15
Q

What statistical measure is used to detect skewness?

A

Skewness coefficient

A skewness value > 0 indicates right-skewed data; < 0 indicates left-skewed data.
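
The sign convention can be checked with pandas' `Series.skew()`, which computes a sample skewness. The two series below are invented, and one is the mirror image of the other.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])        # long tail to the right
left_skewed = pd.Series([-20, -3, -2, -2, -1, -1])   # mirror image: long tail to the left

skew_right = right_skewed.skew()
skew_left = left_skewed.skew()

print(skew_right, skew_left)
```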

16
Q

True or False:

A correlation value of 0 means two variables are unrelated.

A

FALSE

A correlation of 0 indicates no linear relationship; the variables may still be related in other (nonlinear) ways.
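
A sketch of the point in the explanation, using invented values: `y` is completely determined by `x`, yet the Pearson correlation is essentially zero because the relationship is symmetric rather than linear.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
x = pd.Series([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2            # y depends entirely on x, but not linearly

r = x.corr(y)         # Pearson correlation is the default method

print(r)
```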

17
Q

What visualization is commonly used to display correlations?

A

Heatmap

A heatmap visualizes correlation coefficients using colors.

18
Q

What does a scatter plot show?

A

The relationship between two numerical variables.

Useful for identifying trends and correlations.

19
Q

Fill in the blank:

A categorical variable is best visualized using a ___.

A

Bar plot

A bar plot shows the count or proportion of categories.

20
Q

How do you create a scatter plot using Seaborn?

A

sns.scatterplot(x="col1", y="col2", data=df)

Scatter plots help visualize trends between two numerical variables.

21
Q

Which method groups data by a categorical column?

A

df.groupby("column")

Allows aggregation of data by categories.
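
A minimal sketch of grouping followed by aggregation; the `species` and `weight` columns are invented for the example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog", "bird"],
    "weight": [4.0, 20.0, 5.0, 30.0, 0.5],
})

mean_weight = df.groupby("species")["weight"].mean()                    # one value per group
stats = df.groupby("species")["weight"].agg(["count", "mean", "max"])   # several statistics at once

print(stats)
```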

22
Q

What is the difference between a histogram and a bar plot?

A

A histogram is for numerical data, while a bar plot is for categorical data.

Histograms group data into bins; bar plots show distinct categories.

23
Q

True or False:

Outliers should always be removed.

A

FALSE

Outliers should be analyzed before deciding to remove them.

24
Q

What statistical method detects outliers based on quartiles?

A

Interquartile Range (IQR)

Outliers are values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.
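
The rule can be sketched in a few lines of pandas. The values are invented, and the quartiles use pandas' default linear interpolation, so other tools may give slightly different fences.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                    # interquartile range

lower = q1 - 1.5 * iqr           # lower fence
upper = q3 + 1.5 * iqr           # upper fence
outliers = values[(values < lower) | (values > upper)]

print(lower, upper, outliers.tolist())
```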

25
Q

What is the range of correlation values?

A

-1 to 1

A correlation of -1 is a perfect negative relationship, while 1 is a perfect positive relationship.

26
Q

Which type of chart is best for showing time-series data?

A

Line plot

Line plots show trends over time.

27
Q

Fill in the blank:

___ is a technique used to reduce the number of features while preserving as much information (variance) as possible.

A

Principal Component Analysis (PCA)

PCA helps reduce dimensionality and improve model efficiency.
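
The card does not say how PCA is computed. One common route is the singular value decomposition of centered data, sketched here from scratch in NumPy rather than with a library PCA class. The data is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two strongly related features plus a little noise.
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

X_centered = X - X.mean(axis=0)                       # PCA operates on centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_ratio = S**2 / np.sum(S**2)                 # share of variance per component
X_reduced = X_centered @ Vt[:1].T                     # project onto the first component

print(explained_ratio)
```

Because the two features are nearly collinear, almost all of the variance lands on the first component, so keeping one column loses very little.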

28
Q

Which function in pandas returns the most frequent values in a column?

A

df["column"].value_counts()

Useful for analyzing categorical variables.
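
A short sketch on an invented `fruit` column. The result is sorted with the most frequent value first, and `normalize=True` returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({"fruit": ["apple", "pear", "apple", "kiwi", "apple", "pear"]})

counts = df["fruit"].value_counts()                  # sorted, most frequent first
shares = df["fruit"].value_counts(normalize=True)    # proportions instead of counts
most_common = counts.index[0]

print(counts)
```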

29
Q

What is the purpose of feature engineering in EDA?

A

To create new meaningful features from raw data.

Helps improve model performance.

30
Q

Which function converts categorical variables into numerical format?

A

pd.get_dummies()

Creates one-hot encoded variables.
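
A minimal sketch on an invented `size` column. The `drop_first=True` variant drops one indicator per variable, which some linear models prefer because the full set of indicators is redundant.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({"size": ["S", "M", "L", "M"], "price": [5, 7, 9, 7]})

encoded = pd.get_dummies(df, columns=["size"])                      # one indicator per category
reduced = pd.get_dummies(df, columns=["size"], drop_first=True)     # drops the first category

print(encoded.columns.tolist())
```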

31
Q

What is multicollinearity?

A

When two or more independent variables are highly correlated.

Multicollinearity can distort regression models.

32
Q

Which statistical test is used to check normality?

A

Shapiro-Wilk test

A p-value < 0.05 suggests non-normal data.

33
Q

What does kurtosis measure?

A

The “tailedness” of a distribution.

High kurtosis = heavy tails, low kurtosis = light tails.

34
Q

How do you detect missing patterns in data?

A

Using a missing value heatmap (sns.heatmap(df.isnull(), cmap="viridis")).

Helps visualize missing data structure.

35
Q

What is a QQ plot used for?

A

Checking if data follows a normal distribution.

Points that fall close to a straight line indicate approximate normality.

36
Q

True or False:

Normalization scales data to have a mean of 0 and standard deviation of 1.

A

FALSE

Normalization scales data between 0 and 1; standardization sets mean = 0, std = 1.
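
Both rescalings written out by hand on invented values, as a sketch of the distinction. The standardization here uses the population standard deviation (`ddof=0`), which is an assumption; some tools use the sample version.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
s = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

normalized = (s - s.min()) / (s.max() - s.min())    # min-max scaling: values in [0, 1]
standardized = (s - s.mean()) / s.std(ddof=0)       # z-scores: mean 0, std 1

print(normalized.tolist())
print(standardized.round(3).tolist())
```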

37
Q

What is data leakage in EDA?

A

When training data contains information that would not be available at prediction time, such as future values or data derived from the target.

Can cause unrealistic model performance.

38
Q

What does a pairplot show?

A

Pairwise relationships between multiple numerical variables.

Created using sns.pairplot(df).

39
Q

What is the purpose of dimensionality reduction?

A

To reduce the number of variables while retaining information.

Helps improve computation efficiency.

40
Q

True or False:

Log transformation can help with skewed data.

A

TRUE

Log transformations reduce right skewness.
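
A sketch of the effect on invented, strictly positive values. It uses `np.log1p` (the log of 1 + x), a common choice when the data may contain zeros. A plain log needs strictly positive input.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one very large entry, invented for illustration only.
incomes = pd.Series([20, 25, 30, 35, 40, 60, 120, 900], dtype=float)

logged = np.log1p(incomes)        # log(1 + x), compresses large values

skew_before = incomes.skew()
skew_after = logged.skew()

print(skew_before, skew_after)
```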