T4: EDA Flashcards
What is EDA?
- A crucial first step in the data analysis process.
- Helps understand the data’s main characteristics.
Why EDA?
- Offers a clear insight into the underlying structure of the data.
- Helps identify obvious errors and outliers.
- Provides a foundation for subsequent analysis.
IMPORTANCE OF EDA
- Uncovering patterns: Helps detect and visualize patterns in the data.
- Identifying anomalies: Spot potential outliers or mistakes in the data.
- Informing model selection: Understand which models might work best.
- Validating assumptions: Ensure data meets assumptions required by modeling techniques.
STEPS IN EDA
1) Data Collection: Gathering relevant data from various sources (last time)
2) Data Cleaning: Preparing the data for analysis (last time)
3) Data Visualization: Using plots and charts to understand data (today)
4) Statistical Analysis: Applying stats to derive insights (next time).
1) DATA COLLECTION (RECAP)
- Sources of Data: Surveys, databases, logs, etc.
- Diverse and Accurate Data: Ensure varied sources for unbiased results.
- Initial Observations: First look at raw data for obvious issues or patterns.
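A minimal sketch of loading data and taking that first look in R (the file name survey_data.csv is a placeholder, not from the slides):
# Load data from a CSV file and inspect it
data <- read.csv("survey_data.csv")
str(data)      # column types and structure
head(data)     # first few rows
summary(data)  # quick per-column summary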
2) DATA CLEANING (HANDLING MISSING VALUES)
R Code:
# Identify missing values
missing_values <- is.na(data)
# Remove rows with missing values
cleaned_data <- na.omit(data)
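Dropping rows is not the only option; imputation is a common alternative. A minimal sketch, assuming a numeric column named column_name (a placeholder):
# Count missing values per column
colSums(is.na(data))
# Replace missing values in one column with the column mean
data$column_name[is.na(data$column_name)] <- mean(data$column_name, na.rm = TRUE)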
2) DATA CLEANING (DEALING WITH OUTLIERS)
Definition: Data points that differ significantly from others.
Types: Point outliers, contextual outliers, and collective outliers.
R Code:
# Boxplot to visualize outliers
boxplot(data$column_name)
# IQR method to identify outliers
IQR <- IQR(data$column_name)
upper_bound <- quantile(data$column_name, 0.75) + 1.5 * IQR
lower_bound <- quantile(data$column_name, 0.25) - 1.5 * IQR
outliers <- data$column_name[data$column_name > upper_bound |
  data$column_name < lower_bound]
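Once flagged, outliers can be removed or capped (winsorized). A minimal sketch reusing the bounds computed above:
# Keep only rows inside the IQR bounds
no_outliers <- data[data$column_name >= lower_bound &
  data$column_name <= upper_bound, ]
# Or cap extreme values at the bounds instead of dropping them
capped <- pmin(pmax(data$column_name, lower_bound), upper_bound)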
2) DATA TRANSFORMATION AND NORMALIZATION
Why Transform Data? Enhance model performance; meet assumptions of certain algorithms.
Common Transformations: Log, square root, z-score.
R Code:
# Log transformation (e.g., reduce skewness → Normal dist.)
log_data <- log(data$column_name)
# Square root transformation (e.g., reduce skewness in count data → Uniform dist.)
sqrt_data <- sqrt(data$column_name)
# Z-score normalization (e.g., to create mean = 0 and sd = 1)
z_score <- scale(data$column_name)
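A quick way to check whether a transformation helped is to compare the distribution before and after. A minimal sketch in base R, assuming column_name holds positive values:
# Side-by-side histograms: raw vs. log-transformed
par(mfrow = c(1, 2))
hist(data$column_name, main = "Raw")
hist(log(data$column_name), main = "Log-transformed")
par(mfrow = c(1, 1))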
3) DATA VISUALIZATION
Histograms and Box Plots: Understand data distribution.
Scatter Plots: Visualize bivariate relationships.
Heatmaps: Show correlations (a sketch follows the scatter-plot code below).
R Code (for a simple scatter plot):
plot(data$column1, data$column2, main = "Scatter Plot of Column1 vs Column2",
  xlab = "Column1", ylab = "Column2")
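For the correlation heatmap mentioned above, a minimal base-R sketch (selecting numeric columns first is an assumption, not from the slides):
# Correlation matrix of numeric columns, drawn as a heatmap
numeric_cols <- data[sapply(data, is.numeric)]
heatmap(cor(numeric_cols, use = "complete.obs"))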
4) STATISTICAL ANALYSIS (NEXT TIME)
- Descriptive Statistics: Summarize main features of data.
- Inferential Statistics: Make predictions or inferences.
- Testing Hypotheses: Determine validity of certain claims.
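As a brief preview only (standard base-R functions; column_name and the reference value mu = 0 are placeholders):
# Descriptive statistics for every column
summary(data)
# One-sample t-test: compare a column's mean against a hypothesized value
t.test(data$column_name, mu = 0)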
DATA VISUALIZATION
- Line plot
- Bar plots
- Box plots
- Density plots
- Scatter plots
- Word clouds
- Pie chart
- Raincloud plot
- Heatmap
- Animated plots
- (Interactive plots)
LINE PLOT
library(ggplot2)
ggplot(data = df, aes(x = date, y = unemploy)) +
  geom_line()
BAR PLOT
ggplot(data = df, aes(x = class)) +
  geom_bar()
BOX PLOT
ggplot(data = df, aes(x = "Distance measure", y = temperature)) +
  geom_boxplot()
DENSITY PLOT
ggplot(data = df, aes(x = X, fill = cut)) +
geom_density(alpha = 0.5)
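SCATTER PLOT
A ggplot2 counterpart to the base-R scatter plot shown earlier (column1 and column2 are placeholder names):
library(ggplot2)
ggplot(data = df, aes(x = column1, y = column2)) +
  geom_point()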